In my previous post, I started looking into Google Cloud Platform’s Big Data offerings and shared my notes on Pub/Sub. In this post, I want to continue on the Big Data theme and explore BigQuery.
What is BigQuery?
- Cloud database run by Google.
- Basic premise: Enable analyzing terabytes of data, very fast.
- Query language: BigQuery SQL dialect.
- Access via web UI, bq command-line tool, or BigQuery REST APIs.
- There are also client libraries in Java, Python, etc. wrapping REST APIs.
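To make the REST access path concrete, here is a minimal sketch that builds (but does not send) a synchronous query request using only Python's standard library. It assumes the v2 `jobs.query` endpoint; the project ID `my-project` and the access token are placeholders, and in practice a client library would handle authentication for you.

```python
import json
import urllib.request

# Assumed base URL of the BigQuery v2 REST API.
BASE_URL = "https://bigquery.googleapis.com/bigquery/v2"


def build_query_request(project_id, sql, access_token):
    """Build (without sending) an HTTP request for the jobs.query method."""
    body = {"query": sql}
    return urllib.request.Request(
        url=f"{BASE_URL}/projects/{project_id}/queries",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: "my-project" and the token are hypothetical.
request = build_query_request("my-project", "SELECT 1", "PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request would need a valid OAuth 2.0 token, which is why the sketch stops at constructing it.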
BigQuery Projects
These are the main concepts in a BigQuery project:
- Project: This is the top-level container. Everything in BigQuery belongs to a GCP project.
- Dataset: This is a grouping of tables with access control.
- Table: This is where your data resides and what you query against using SQL.
- Jobs: Actions (load data, copy data, etc.) that BigQuery runs on your behalf.
- Access control (ACL): Manages access to projects and datasets. A table inherits its ACL from its dataset.
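The project, dataset, and table hierarchy shows up directly in how a table is named in a query. Here is a small sketch of that naming, covering both the legacy bracket form and the standard SQL backtick form. The `TableRef` class and the names passed to it are hypothetical, for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A table is addressed by the project and dataset that contain it."""

    project: str
    dataset: str
    table: str

    def legacy_sql(self):
        # Legacy BigQuery SQL form: [project:dataset.table]
        return f"[{self.project}:{self.dataset}.{self.table}]"

    def standard_sql(self):
        # Standard SQL form: `project.dataset.table`
        return f"`{self.project}.{self.dataset}.{self.table}`"


# Hypothetical names, for illustration only.
ref = TableRef("my-project", "web_logs", "requests")
print(ref.legacy_sql())
print(ref.standard_sql())
```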
Loading Data
- Either load the data directly into BigQuery or set it up as a federated/external data source.
- If loading data into BigQuery, you can bulk load or stream data as individual records.
- Other Google Cloud sources for loading into BigQuery:
  - Cloud Storage.
  - Cloud Datastore.
  - Cloud Dataflow.
  - AppEngine log files.
  - Cloud Storage access/storage logs.
  - Cloud Audit Logs.
- 3 data source formats: CSV, newline-delimited JSON, Cloud Datastore backup files.
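Of the formats above, newline-delimited JSON is the one most likely to need explaining: each record is a complete JSON object on its own line, with no enclosing array. A minimal sketch of producing such a file body with the standard library follows; the `to_ndjson` helper and the sample rows are made up for illustration.

```python
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON."""
    # json.dumps escapes any newline inside a string value, so each record
    # stays on exactly one line.
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical rows, for illustration only.
rows = [
    {"name": "alice", "visits": 3},
    {"name": "bob", "visits": 5},
]
payload = to_ndjson(rows)
print(payload, end="")
```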
Exporting Data
Data can be exported from BigQuery in 2 ways:
- Files: Export up to 1GB of data per file. Larger exports can be split across multiple files.
- Use Google Cloud Dataflow to read data from BigQuery.
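The 1GB-per-file limit means a large export necessarily ends up in several files. As a rough sketch of the arithmetic, the helper below computes the minimum number of files for a given export size. It assumes "1GB" means 2**30 bytes, and it gives only a lower bound: how BigQuery actually splits the data is up to the service.

```python
# Assumption: "1GB" is taken as 2**30 bytes; the exact unit is not stated above.
MAX_BYTES_PER_FILE = 2**30


def min_export_files(total_bytes, max_bytes_per_file=MAX_BYTES_PER_FILE):
    """Lower bound on the number of files needed to export total_bytes."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    # Integer ceiling division; an empty export still counts as one file here.
    return max(1, (total_bytes + max_bytes_per_file - 1) // max_bytes_per_file)


# A hypothetical export slightly larger than 5GB.
print(min_export_files(5 * 2**30 + 1))
```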
Querying
- Queries are written in the BigQuery SQL dialect.
- Synchronous and asynchronous query methods.
- Results are saved either in temporary or persistent tables.
- Queries can be interactive (executed ASAP) or batch (executed when possible).
- Query results are cached by default but caching can be disabled.
- Supports user-defined functions in JavaScript (basically the Map part of MapReduce).
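Several of the options above (interactive vs. batch, caching, persistent results) are just fields on the query job. The sketch below assembles such a job body as a plain dictionary. It assumes the v2 REST field names `priority`, `useQueryCache`, and `destinationTable`; the `build_query_job` helper and all table names are hypothetical.

```python
def build_query_job(sql, batch=False, use_cache=True, destination=None):
    """Return a query job body using the assumed v2 REST field names."""
    query_config = {
        "query": sql,
        # Interactive queries run as soon as possible; batch queries wait
        # until resources are available.
        "priority": "BATCH" if batch else "INTERACTIVE",
        "useQueryCache": use_cache,
    }
    if destination is not None:
        # A destination table makes the results persistent rather than
        # temporary.
        project_id, dataset_id, table_id = destination
        query_config["destinationTable"] = {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        }
    return {"configuration": {"query": query_config}}


# Hypothetical names, for illustration only.
job = build_query_job(
    "SELECT 1",
    batch=True,
    use_cache=False,
    destination=("my-project", "reports", "daily_summary"),
)
print(job["configuration"]["query"]["priority"])
```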
Pricing and Quotas
- BigQuery charges for data storage, streaming inserts and query data (details).
- Free of charge: Loading and exporting data.
- For queries, you’re charged for the number of bytes processed. 1TB per month is free.
- There are limits on incoming requests (details).
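The query pricing rule is easy to turn into arithmetic: subtract the free 1TB from the month's processed bytes and charge for the rest. Here is a minimal sketch of that calculation. The per-TB price is deliberately a parameter, since the notes above do not state it, and treating 1TB as 2**40 bytes is an assumption too.

```python
# Assumption: 1TB is taken as 2**40 bytes.
BYTES_PER_TB = 2**40
FREE_BYTES_PER_MONTH = 1 * BYTES_PER_TB


def billable_bytes(bytes_processed_this_month):
    """Bytes left to charge for after the free monthly allowance."""
    return max(0, bytes_processed_this_month - FREE_BYTES_PER_MONTH)


def estimate_query_cost(bytes_processed_this_month, price_per_tb):
    """Estimated monthly query cost, in the currency of price_per_tb."""
    return billable_bytes(bytes_processed_this_month) / BYTES_PER_TB * price_per_tb


# Hypothetical usage: 3TB processed at an assumed price of 5.0 per TB.
print(estimate_query_cost(3 * BYTES_PER_TB, price_per_tb=5.0))
```

Check the current pricing page for the real rate before relying on a figure like this.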