Google Cloud Platform introduced online resizing of Google Cloud Persistent Disks almost a month ago. When I first read about this feature, I was so amazed that I had to try it right away.
I started with a Compute Engine instance with a 100 GB persistent disk and doubled it to 200 GB with a few clicks and the resize2fs command.
Not only did it work flawlessly, it was also very quick. I documented my experience in this Cloud Minute video.
In this post, I want to take a look at Cloud Functions. It’s still in Alpha but you can already play with it and I really like the idea of deploying functions without having to worry about the underlying infrastructure.
What are Cloud Functions?
In a nutshell, Cloud Functions enable you to write managed functions to respond to events in your cloud environment.
- Events: Cloud Storage events, Cloud Pub/Sub events, or a simple HTTP invocation can act as triggers for Cloud Functions.
- Response: Cloud Functions can respond asynchronously (to Storage and Pub/Sub events) or synchronously (to an HTTP invocation).
Writing Cloud Functions
- Each function must accept context and data as parameters and must signal completion by calling one of the context.success, context.failure, or context.done methods.
- console can be used to log error and debug messages, and the logs can be viewed using the gcloud get-logs command.
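The completion contract described above can be sketched as follows. The functions themselves are not written in Python, so this is only a Python mock of the idea: the Context class is a hypothetical stand-in for the object the service passes in, and treating done as "failure if an error is given, success otherwise" is an assumption made for this sketch.

```python
class Context:
    """Hypothetical stand-in for the context object handed to a function."""

    def __init__(self):
        self.finished = False
        self.result = None
        self.error = None

    def success(self, result=None):
        self._finish(result=result)

    def failure(self, error):
        self._finish(error=error)

    def done(self, error=None, result=None):
        # Assumption: behaves like failure when an error is given,
        # and like success otherwise.
        self._finish(error=error, result=result)

    def _finish(self, error=None, result=None):
        # A function should signal completion exactly once.
        if self.finished:
            raise RuntimeError("completion was already signalled")
        self.finished = True
        self.error = error
        self.result = result


def hello(context, data):
    """A function in the shape the post describes: (context, data)."""
    name = data.get("name")
    if not name:
        print("missing 'name' in event data")  # stands in for console.error
        context.failure("name is required")
        return
    print("greeting", name)  # stands in for console.log
    context.success("Hello, " + name)


ctx = Context()
hello(ctx, {"name": "Ada"})
```

The point of the mock is that every code path ends in exactly one completion call; a path that returns without calling any of the three methods would leave the invocation hanging.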
Deploying Cloud Functions
Cloud Functions can be deployed using gcloud deploy from 2 locations:
- Local filesystem: You can create your function locally and use gcloud to deploy it. (One caveat is that you need to create a Cloud Storage bucket for gcloud to store your function before it can deploy it.)
- Cloud Source repository: You can put your function in a Cloud Source repository (a Git repository hosted on Google Cloud Platform) and deploy it from there using gcloud.
Triggering Cloud Functions
Cloud Functions can be triggered (async or sync) in 3 ways:
- Cloud Pub/Sub: A new message to a specific topic in Cloud Pub/Sub (async).
- Cloud Storage: An object created/deleted/updated in a specific bucket (async).
- HTTP POST: A simple HTTP POST (sync). This requires an HTTP endpoint for the Cloud Function, which is created by specifying the --trigger-http flag when deploying the function.
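As a rough illustration of the synchronous HTTP trigger, here is a minimal Python sketch that builds the POST request a client would send. The endpoint URL and the JSON payload are hypothetical placeholders; the real URL is whatever the deployment reports for your function.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "https://example.invalid/hello"


def build_invocation(url, payload):
    """Build (but do not send) an HTTP POST carrying a JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(url, payload, timeout=10):
    """Send the request and return the response body (needs network access)."""
    request = build_invocation(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_invocation(ENDPOINT, {"name": "Ada"})
```

Because the HTTP trigger is synchronous, invoke blocks until the function signals completion and the response comes back, unlike the Pub/Sub and Storage triggers, where the caller never sees a response.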
Dataproc is the fourth component in the Big Data section of Google Cloud Platform that I took a look at, and here are my short notes on it.
What is Dataproc?
- Managed Spark, Hadoop, Hive, and Pig instances in Google Cloud.
- It’s low cost, managed, and fast to start/scale/shutdown.
- Integrated with the rest of Google Cloud components such as Cloud Storage, BigQuery, Bigtable.
- You can create a Dataproc cluster and then submit jobs (Hadoop, Spark, PySpark, Hive, SparkSql, Pig) from the Dataproc section of the GCP Console, from the command line (gcloud dataproc), or via the Dataproc REST API.
- You can view a job’s output in the Jobs section of the GCP Console's Dataproc page or by using gcloud dataproc jobs wait.
- You can SSH into master and other nodes in the cluster.
- All Cloud Dataproc clusters come with the BigQuery connector for Hadoop to read/write data from BigQuery.
Continuing on the Big Data theme, Google Cloud Dataflow is the next component of Google Cloud Platform that I want to take a look at.
What is Dataflow?
- Dataflow is mainly for batch or stream data processing.
- Good for high volume computation and embarrassingly parallel workloads.
- Consists of 2 major components:
  - Dataflow SDKs: A programming model and SDKs for large-scale cloud data processing.
  - Dataflow Service: Ties together and fully manages several different Google Cloud Platform technologies to execute data processing jobs in the cloud.
- Dataflow SDK is being open sourced as Apache Beam.
Dataflow Programming Model
Dataflow Programming Model consists of 4 concepts:
- Pipelines: A set of operations that reads a source of input data, transforms it, and writes out the output. A pipeline contains data (PCollections) and processing on that data (Transforms).
- PCollections: Inputs and outputs for each step in the pipeline. Immutable after creation. 2 flavors:
  - Bounded: Fixed-size data set from text, BigQuery, Datastore or custom data.
  - Unbounded: Continuously updating data set, or streaming data such as Pub/Sub or custom data.
- Transforms: A data processing operation, or a step, in the pipeline. Takes a PCollection as input and produces a PCollection as output. 2 flavors:
  - Core: You provide the processing logic as a function object. 4 core transform types: ParDo, GroupByKey, Combine, Flatten.
  - Composite: Built from multiple sub-transforms.
- I/O Sources and Sinks: Source APIs to read data into the pipeline, and sink APIs to write output data from your pipeline. APIs for common formats such as:
  - Text files
  - BigQuery tables
  - Avro files
  - Bigtable (beta)
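To get a feel for the four core transforms, here is a minimal in-memory sketch in plain Python. It does not use the Dataflow SDK: the function names are hypothetical, tuples stand in for immutable bounded PCollections, and nothing here is distributed or parallel.

```python
from collections import defaultdict
from functools import reduce


def par_do(pcollection, fn):
    """ParDo: apply fn to every element; fn may emit zero or more outputs."""
    return tuple(out for element in pcollection for out in fn(element))


def group_by_key(pcollection):
    """GroupByKey: turn (key, value) pairs into (key, values) pairs."""
    groups = defaultdict(list)
    for key, value in pcollection:
        groups[key].append(value)
    return tuple((key, tuple(values)) for key, values in groups.items())


def combine_per_key(pcollection, fn):
    """Combine: reduce each key's grouped values to a single value."""
    return tuple((key, reduce(fn, values)) for key, values in pcollection)


def flatten(*pcollections):
    """Flatten: merge several collections of the same type into one."""
    return tuple(element for pc in pcollections for element in pc)


# Two small bounded inputs, standing in for lines read from text files.
lines_a = ("the quick fox",)
lines_b = ("the lazy dog", "the end")

# A word-count pipeline built from the four core transforms.
lines = flatten(lines_a, lines_b)
pairs = par_do(lines, lambda line: ((word, 1) for word in line.split()))
grouped = group_by_key(pairs)
counts = dict(combine_per_key(grouped, lambda a, b: a + b))
```

Each step takes a collection and returns a new one without modifying its input, which mirrors the "immutable after creation" property of PCollections. Wrapping the last three steps in a single function would be the analogue of a composite transform.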
Two supported languages:
- Java: Dataflow SDK for Java is fully available
- Python: Dataflow SDK for Python is in development.
- Dataflow Service is a managed service in Google Cloud Platform to deploy and execute Dataflow pipelines (as Dataflow jobs).
- Simplifies distributed parallel processing by:
  - Automatic partitioning of data and distribution of work across Compute Engine instances.
  - Optimization of the pipeline.
  - Automatic scaling of resources as needed.
- Automatically spins up and tears down the resources (Compute Engine, Cloud Storage) needed to run the Dataflow job.
- Provides tools like Dataflow Monitoring Interface and the Dataflow Command-line Interface.
In my previous post, I started looking into Google Cloud Platform’s Big Data offerings and shared my notes on Pub/Sub. In this post, I want to continue on the Big Data theme and explore BigQuery.
What is BigQuery?
These are the main concepts in a BigQuery project:
- Project: This is the top level construct that every GCP project needs.
- Dataset: This is a grouping of tables with access control.
- Table: This is where your data resides and what you query against using SQL.
- Jobs: Actions (load data, copy data etc.) that BigQuery can run on your behalf.
- Access control (ACL): To manage access to projects and datasets. A table inherits its ACL from its dataset.
- Either load the data directly into BigQuery or set it up as a federated/external data source.
- If loading data into BigQuery, you can bulk load or stream data as individual records.
- Other Google Cloud sources for loading into BigQuery:
  - Cloud Storage.
  - Cloud Datastore.
  - Cloud Dataflow.
  - App Engine log files.
  - Cloud Storage access/storage logs.
  - Cloud Audit Logs.
- 3 data source formats: CSV, newline-delimited JSON, Cloud Datastore backup files.
Data can be exported from BigQuery in 2 ways:
- Files: Export up to 1 GB of data per file; larger exports can be split across multiple files.
- Use Google Cloud Dataflow to read data from BigQuery.
- Queries are written in BigQuery SQL dialect.
- Synchronous and asynchronous query methods.
- Results are saved in either temporary or persistent tables.
- Queries can be interactive (executed ASAP) or batched (executed when possible).
- Query results are cached by default but caching can be disabled.
Pricing and Quotas
- BigQuery charges for data storage, streaming inserts and query data (details).
- Free of charge: Loading and exporting data.
- For queries, you’re charged for the number of bytes processed. 1TB per month is free.
- There are limits on incoming requests (details)
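The query pricing rule above is easy to work through in a few lines of Python. This is only a sketch: the price per terabyte is deliberately left as a parameter because these notes do not state it, the value used in the example is a made-up placeholder, and counting a terabyte as 10^12 bytes is an assumption.

```python
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabytes
FREE_BYTES_PER_MONTH = 1 * BYTES_PER_TB  # from the notes: 1 TB per month is free


def billable_bytes(bytes_processed_this_month):
    """Bytes charged for queries after subtracting the monthly free tier."""
    return max(0, bytes_processed_this_month - FREE_BYTES_PER_MONTH)


def monthly_query_cost(bytes_processed_this_month, price_per_tb):
    """Query cost for the month, given a price per terabyte processed."""
    return billable_bytes(bytes_processed_this_month) / BYTES_PER_TB * price_per_tb


# Example: 3.5 TB processed in a month at a placeholder price of 4.0 per TB.
usage = 3 * BYTES_PER_TB + BYTES_PER_TB // 2
cost = monthly_query_cost(usage, price_per_tb=4.0)
```

Since the charge depends on bytes processed rather than rows returned, selecting only the columns you need is the most direct way to keep this number down. Storage and streaming inserts are billed separately and are not covered by this sketch.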