Dataproc is the fourth component in the Big Data section of Google Cloud Platform that I have taken a look at; here are my short notes on it.
What is Dataproc?
- Managed Spark, Hadoop, Hive, and Pig clusters on Google Cloud.
- It’s low-cost, fully managed, and fast to start, scale, and shut down.
- Integrates with other Google Cloud components such as Cloud Storage, BigQuery, and Bigtable.
Basics
- You can create a Dataproc cluster and then submit jobs (Hadoop, Spark, PySpark, Hive, Spark SQL, Pig) from the Dataproc section of the GCP Console, from the command line (gcloud dataproc), or via the Dataproc REST API.
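As a hedged sketch of the command-line route (cluster, bucket, and script names here are hypothetical, and the region/size flags are just example values):

```shell
# Create a small Dataproc cluster (names and sizes are illustrative).
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=2

# Submit a PySpark job to that cluster; the script lives in Cloud Storage.
gcloud dataproc jobs submit pyspark gs://my-bucket/word_count.py \
    --cluster=my-cluster \
    --region=us-central1
```

Other job types follow the same pattern, e.g. `gcloud dataproc jobs submit hive` or `gcloud dataproc jobs submit pig`.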
- You can view a job’s output from the Jobs section of the Dataproc page in the GCP Console, or with gcloud dataproc jobs wait.
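For the CLI route, a sketch (the job ID is a placeholder for whatever ID the submit command returned):

```shell
# Stream a job's driver output to the terminal until the job finishes.
gcloud dataproc jobs wait JOB_ID \
    --region=us-central1
```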
- You can SSH into the master node and the worker nodes of the cluster.
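A quick sketch, assuming a cluster named my-cluster in zone us-central1-a (Dataproc names the master node by appending -m to the cluster name):

```shell
# SSH into the cluster's master node via the Compute Engine tooling.
gcloud compute ssh my-cluster-m \
    --zone=us-central1-a
```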
- All Cloud Dataproc clusters come with the BigQuery connector for Hadoop preinstalled, so jobs can read data from and write data to BigQuery.
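Since the connector already ships with the cluster, a job only needs to reference the BigQuery table it works with. A hedged sketch (the script, bucket, and table names are all hypothetical; arguments after the lone -- are passed through to the job itself):

```shell
# Submit a hypothetical PySpark script that uses the preinstalled BigQuery
# connector; the fully qualified table name is passed as a job argument.
gcloud dataproc jobs submit pyspark gs://my-bucket/bq_word_count.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- my-project:my_dataset.my_table
```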