Google Cloud Dataproc


Dataproc is the fourth component in the Big Data section of Google Cloud Platform that I took a look at, and here are my short notes on it.

What is Dataproc?

  • Managed Spark, Hadoop, Hive, and Pig instances in Google Cloud.
  • It’s low cost, managed, and fast to start, scale, and shut down.
  • Integrated with other Google Cloud components such as Cloud Storage, BigQuery, and Bigtable.


  • You can create a Dataproc cluster and then submit jobs (Hadoop, Spark, PySpark, Hive, Spark SQL, Pig) from the Dataproc section of the GCP Console, from the command line (gcloud dataproc), or via the Dataproc REST API.
  • You can view a job’s output in the Jobs section of the Dataproc page in the GCP Console, or by using gcloud dataproc jobs wait.
  • You can SSH into the master and other nodes in the cluster.
  • All Cloud Dataproc clusters come with the BigQuery connector for Hadoop, which reads data from and writes data to BigQuery.
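
The command-line workflow above can be sketched as a small Python helper that only builds the gcloud command lines and prints them, so nothing is executed against a real project. The cluster name, region, zone, bucket path, and job ID below are made-up placeholders, and the helper function names are my own, not part of any Google library. The SSH helper assumes a standard single-master cluster, where the master node is named after the cluster with an "-m" suffix.

```python
import shlex


def create_cluster_cmd(cluster, region, num_workers=2):
    """Build the command that creates a Dataproc cluster."""
    return ["gcloud", "dataproc", "clusters", "create", cluster,
            f"--region={region}", f"--num-workers={num_workers}"]


def submit_pyspark_cmd(cluster, region, main_py):
    """Build the command that submits a PySpark job to an existing cluster."""
    return ["gcloud", "dataproc", "jobs", "submit", "pyspark", main_py,
            f"--cluster={cluster}", f"--region={region}"]


def wait_job_cmd(job_id, region):
    """Build the command that waits for a job and streams its output."""
    return ["gcloud", "dataproc", "jobs", "wait", job_id, f"--region={region}"]


def ssh_master_cmd(cluster, zone):
    """Build the command that opens an SSH session on the master node.

    Assumption: a single-master cluster, whose master VM is "<cluster>-m".
    """
    return ["gcloud", "compute", "ssh", f"{cluster}-m", f"--zone={zone}"]


def delete_cluster_cmd(cluster, region):
    """Build the command that deletes the cluster when the work is done."""
    return ["gcloud", "dataproc", "clusters", "delete", cluster,
            f"--region={region}"]


if __name__ == "__main__":
    # All values here are hypothetical placeholders.
    cluster, region, zone = "my-cluster", "us-central1", "us-central1-a"
    steps = [
        create_cluster_cmd(cluster, region),
        submit_pyspark_cmd(cluster, region, "gs://my-bucket/wordcount.py"),
        wait_job_cmd("my-job-id", region),
        ssh_master_cmd(cluster, zone),
        delete_cluster_cmd(cluster, region),
    ]
    for step in steps:
        # shlex.join quotes arguments so each line can be pasted into a shell.
        print(shlex.join(step))
```

To actually run a step, each list could be passed to subprocess.run(step, check=True) on a machine where the Cloud SDK is installed and authenticated. Deleting the cluster at the end matters for cost, since a cluster is billed while it is running.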


