Continuing on the Big Data theme, Google Cloud Dataflow is the next component I want to take a look at in Google Cloud Platform.
What is Dataflow?
- Dataflow is mainly for batch or stream data processing.
- Good for high-volume computation and embarrassingly parallel workloads.
- Consists of 2 major components:
- Dataflow SDKs: A programming model and SDKs for large-scale cloud data processing.
- Dataflow Service: Ties together and fully manages several different Google Cloud Platform technologies to execute data processing jobs in the cloud.
- Dataflow SDK is being open sourced as Apache Beam.
Dataflow Programming Model
The Dataflow Programming Model consists of 4 concepts:
- Pipelines: Set of operations that reads a source of input data, transforms it, and writes out the output. Contains the data (PCollections) and the processing on the data (Transforms); see the word count sketch after this list.
- PCollections: Inputs and outputs for each step in the pipeline. Immutable after creation. 2 flavors:
- Bounded: Fixed-size data set, such as text files, BigQuery, Datastore or custom data.
- Unbounded: Continuously updating or streaming data set, such as Pub/Sub or custom data (see the streaming sketch after this list).
- Transforms: A data processing operation, or a step, in the pipeline. Takes PCollection as input and produces PCollection as output. 2 flavors:
- Core: You provide the processing logic as a function object. 4 Core transform types: ParDo, GroupByKey, Combine, Flatten.
- Composite: Built from multiple sub-transforms.
- I/O Sources and Sinks: Source APIs to read data into the pipeline, and sink APIs to write output data from your pipeline. APIs for common formats such as:
- Text files
- BigQuery tables
- Avro files
- Bigtable (beta)
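To make these concepts concrete, here is a minimal word count sketch using the Dataflow SDK for Java (the Cloud Storage paths and class name are placeholders). It wires all 4 concepts together: a Pipeline, bounded PCollections, a Core transform (ParDo) and a Composite transform (Count.perElement(), which is built from sub-transforms like ParDo and Combine.perKey), with TextIO as source and sink.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class WordCount {
  public static void main(String[] args) {
    // A Pipeline ties the whole graph of reads, transforms and writes together.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded PCollection read from a text file source (one element per line).
    PCollection<String> lines = p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"));

    // Core transform: ParDo applies a DoFn to every element in parallel.
    PCollection<String> words = lines.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("[^a-zA-Z']+")) {
          if (!word.isEmpty()) {
            c.output(word);
          }
        }
      }
    }));

    // Composite transform: Count.perElement() is built from sub-transforms.
    PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());

    // Format the results and write them to a text file sink.
    counts.apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().getKey() + ": " + c.element().getValue());
      }
    })).apply(TextIO.Write.to("gs://my-bucket/output/counts"));

    // Nothing executes until run() is called.
    p.run();
  }
}
```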
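For the unbounded flavor, here is a streaming sketch, assuming a hypothetical Pub/Sub topic name. Windowing divides the continuously updating PCollection into finite chunks so that grouping and combining transforms can still be applied downstream:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

public class StreamingSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    // Unbounded sources require streaming mode.
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    // Unbounded PCollection: Pub/Sub keeps delivering new elements indefinitely.
    PCollection<String> messages =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"));

    // Window the unbounded collection into fixed 1-minute chunks so that
    // grouping/combining transforms can be applied per window downstream.
    PCollection<String> windowed =
        messages.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

    p.run();
  }
}
```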
Two supported languages:
- Java: Dataflow SDK for Java is fully available.
- Python: Dataflow SDK for Python is in development.
Dataflow Service
- Dataflow Service is a managed service in Google Cloud Platform to deploy and execute Dataflow pipelines as Dataflow jobs (see the sketch after this list).
- Simplifies distributed parallel processing by:
- Automatic partitioning of work and distribution across Compute Engine instances.
- Optimization of the pipeline.
- Automatic scaling of resources as needed.
- Automatically spins up and tears down the necessary resources (Compute Engine, Cloud Storage) to run the Dataflow job.
- Provides tools like the Dataflow Monitoring Interface and the Dataflow Command-line Interface.
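Here is a sketch of pointing a pipeline at the managed service (the project id and staging bucket are hypothetical). Choosing DataflowPipelineRunner is what turns p.run() into a Dataflow job submission; with the default DirectPipelineRunner the same pipeline would execute locally instead, which is handy for testing:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class SubmitToService {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowPipelineRunner.class);      // run on the managed service
    options.setProject("my-gcp-project");                 // hypothetical project id
    options.setStagingLocation("gs://my-bucket/staging"); // SDK jars get staged here

    Pipeline p = Pipeline.create(options);
    // ... apply the same reads, transforms and writes as before ...

    // run() now submits the pipeline to the Dataflow Service as a job
    // instead of executing it locally.
    p.run();
  }
}
```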