Google Cloud Dataflow


Continuing on the Big Data theme, Google Cloud Dataflow is the next component I want to take a look in Google Cloud Platform.

What is Dataflow?

  • Dataflow is mainly for batch or stream data processing.
  • Good for high volume computation and embarrassingly parallel workloads.
  • Consists of 2 major components:
    1. Dataflow SDKs: A programming model and SDKs for large-scale cloud data processing.
    2. Dataflow Service: Ties together and fully manages several different Google Cloud Platform technologies to execute data processing jobs in the cloud.
  • Dataflow SDK is being open sourced as Apache Beam.

Dataflow Programming Model

Dataflow Programming Model consists of 4 concepts:

  1. Pipelines: Set of operations that can read a source of input data, transform it and write out the output. Contains data (PCollections) and processing on the data (Transforms)
  2. PCollections: Inputs and outputs for each step in the pipeline. Immutable after creation. 2 flavors:
    • Bounded: Fixed-size data set for text, BigQuery, Datastore or custom data.
    • Unbounded: Continuously updating data set, or streaming data such as Pub/Sub or custom data.
  3. Transforms: A data processing operation, or a step, in the pipeline. Takes PCollection as input and produces PCollection as output. 2 flavors:
    • Core:  You provide the processing logic as a function object. 4 Core transform types: ParDo, GroupByKey, Combine, Flatten.
    • Composite: Built from multiple sub-transforms.
  4. I/O Sources and Sinks: Source APIs to read data into the pipeline, and sink APIs to write output data from your pipeline. APIs for common formats such as:
    • Text files
    • BigQuery tables
    • Avro files
    • Pub/Sub
    • BigTable (beta)

Dataflow SDKs

Two supported languages:

  1. Java: Dataflow SDK for Java is fully available
  2. Python: Dataflow SDK for Python is in development.

Dataflow Service

  • Dataflow Service is a managed service in Google Cloud Platform to deploy and execute Dataflow pipelines (as Dataflow jobs).
  • Simplifies distributed parallel processing by:
    1. Automatic partitioning and distribution of Compute Engine instances.
    2. Optimization of the pipeline.
    3. Automatic scaling of resources as needed.
  • Automatically spins and tears down necessary resources (Compute Engine, Cloud Storage) to run the Dataflow job.
  • Provides tools like Dataflow Monitoring Interface and the Dataflow Command-line Interface.