Google Cloud Dataproc


Dataproc is the fourth component in the Big Data section of Google Cloud Platform that I took a look at, and here are my short notes on it.

What is Dataproc?

  • Managed Spark, Hadoop, Hive, and Pig clusters in Google Cloud.
  • It’s low-cost, managed, and fast to start, scale, and shut down.
  • Integrated with the rest of the Google Cloud components, such as Cloud Storage, BigQuery, and Bigtable.
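The fast start/scale/shutdown cycle is driven by a few `gcloud` commands. A minimal sketch; the cluster name, region, and worker counts are illustrative values, not from these notes:

```shell
# Create a small cluster (name, region, and size are example values)
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-workers=2

# Scale the cluster up when more capacity is needed
gcloud dataproc clusters update demo-cluster \
    --region=us-central1 \
    --num-workers=4

# Delete the cluster when the jobs are done to stop billing
gcloud dataproc clusters delete demo-cluster \
    --region=us-central1
```

Since clusters start in minutes and storage can live in Cloud Storage rather than on the cluster, a common pattern is to treat clusters as disposable: create, run jobs, delete.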

Basics

  • You can create a Dataproc cluster and then submit jobs (Hadoop, Spark, PySpark, Hive, Spark SQL, Pig) from the Dataproc section of the GCP Console, from the command line (gcloud dataproc), or via the Dataproc REST API.
  • You can view a job’s output from the Jobs section of the GCP Console’s Dataproc page, or by using gcloud dataproc jobs wait.
  • You can SSH into the master and worker nodes in the cluster.
  • All Cloud Dataproc clusters come with the BigQuery connector for Hadoop, so jobs can read data from and write data to BigQuery.
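The submit/wait/SSH flow above looks roughly like this from the command line. A sketch; the bucket path, cluster name, region, and zone are assumed example values:

```shell
# Submit a PySpark job to an existing cluster (script path is illustrative)
gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster=demo-cluster \
    --region=us-central1

# Stream a job's driver output; the job ID comes from the submit output
# or from `gcloud dataproc jobs list`
gcloud dataproc jobs wait JOB_ID --region=us-central1

# SSH into the cluster's master node (named <cluster-name>-m by default)
gcloud compute ssh demo-cluster-m --zone=us-central1-a
```

The same submit call works for the other job types (`submit hadoop`, `submit hive`, `submit pig`, and so on), with type-specific flags.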

Resources