Google Cloud Dataflow


Continuing on the Big Data theme, Google Cloud Dataflow is the next Google Cloud Platform component I want to take a look at.

What is Dataflow?

  • Dataflow is mainly for batch or stream data processing.
  • Good for high volume computation and embarrassingly parallel workloads.
  • Consists of 2 major components:
    1. Dataflow SDKs: A programming model and SDKs for large-scale cloud data processing.
    2. Dataflow Service: Ties together and fully manages several different Google Cloud Platform technologies to execute data processing jobs in the cloud.
  • The Dataflow SDK is being open sourced as Apache Beam.

Dataflow Programming Model

The Dataflow programming model consists of 4 concepts (a minimal pipeline sketch in Java follows the list):

  1. Pipelines: A set of operations that reads input data from a source, transforms it, and writes out the output. A pipeline contains both the data (PCollections) and the processing applied to that data (Transforms).
  2. PCollections: The inputs and outputs for each step in the pipeline. Immutable after creation. 2 flavors:
    • Bounded: A fixed-size data set, such as text files, BigQuery tables, Datastore entities, or custom data.
    • Unbounded: A continuously updating (streaming) data set, such as Pub/Sub or custom data.
  3. Transforms: A data processing operation, or step, in the pipeline. Takes a PCollection as input and produces a PCollection as output. 2 flavors:
    • Core: You provide the processing logic as a function object. 4 core transform types: ParDo, GroupByKey, Combine, Flatten.
    • Composite: Built from multiple sub-transforms.
  4. I/O Sources and Sinks: Source APIs to read data into the pipeline, and sink APIs to write output data from your pipeline. APIs for common formats such as:
    • Text files
    • BigQuery tables
    • Avro files
    • Pub/Sub
    • Bigtable (beta)
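
To make these four concepts concrete, here is a minimal word-count sketch written against the Dataflow SDK for Java (1.x). It is only an illustration of the programming model: the class name and the gs:// paths are placeholder assumptions, not something from the original notes.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class MinimalWordCount {
  public static void main(String[] args) {
    // A Pipeline ties together PCollections, Transforms, and I/O sources/sinks.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Source: a bounded PCollection read from text files (the path is a placeholder).
    PCollection<String> lines = p.apply(TextIO.Read.from("gs://my-bucket/input-*.txt"));

    // Core transform: ParDo with a user-supplied function object (a DoFn).
    PCollection<String> words = lines.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("[^a-zA-Z']+")) {
          if (!word.isEmpty()) {
            c.output(word);
          }
        }
      }
    }));

    // Composite transform: Count.perElement() is built from sub-transforms
    // (a GroupByKey/Combine under the hood).
    PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());

    // Another ParDo to format the results, then a text sink to write them out.
    counts
        .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
          }
        }))
        .apply(TextIO.Write.to("gs://my-bucket/output"));

    p.run();
  }
}
```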

Dataflow SDKs

Two supported languages:

  1. Java: The Dataflow SDK for Java is fully available.
  2. Python: The Dataflow SDK for Python is in development.

Dataflow Service

  • The Dataflow Service is a managed service in Google Cloud Platform for deploying and executing Dataflow pipelines (as Dataflow jobs); a sketch of submitting a job to the service follows this list.
  • Simplifies distributed parallel processing by:
    1. Automatic partitioning of work and distribution across Compute Engine instances.
    2. Optimization of the pipeline.
    3. Automatic scaling of resources as needed.
  • Automatically spins up and tears down the necessary resources (Compute Engine, Cloud Storage) to run the Dataflow job.
  • Provides tools like the Dataflow Monitoring Interface and the Dataflow Command-line Interface.
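
As a rough sketch of how a pipeline is pointed at the managed service, the snippet below configures DataflowPipelineOptions from the Dataflow SDK for Java (1.x). The project id and staging bucket are placeholder assumptions; the transforms themselves would be applied exactly as in the word-count sketch above.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class RunOnDataflowService {
  public static void main(String[] args) {
    // Options that tell the SDK to submit the pipeline to the managed Dataflow Service.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowPipelineRunner.class);        // execute as a Dataflow job in the cloud
    options.setProject("my-gcp-project");                   // placeholder project id
    options.setStagingLocation("gs://my-bucket/staging");   // placeholder Cloud Storage path for staged jars

    Pipeline p = Pipeline.create(options);
    // ... apply sources, transforms, and sinks here, as in the word-count sketch ...

    // Submits the job; the service handles partitioning, optimization, and autoscaling.
    p.run();
  }
}
```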

Resources