BigQuery


In my previous post, I started looking into Google Cloud Platform’s Big Data offerings and shared my notes on Pub/Sub.  In this post, I want to continue on the Big Data theme and explore BigQuery.

What is BigQuery?

BigQuery Projects

These are the main concepts in a BigQuery project:

  • Project: This is the top-level construct. Everything in BigQuery lives inside a GCP project.
  • Dataset: This is a grouping of tables with access control.
  • Table: This is where your data resides and what you query against using SQL.
  • Jobs: Actions (load data, copy data, etc.) that BigQuery runs on your behalf.
  • Access control (ACL): Manages access to projects and datasets. A table inherits its ACL from its dataset.

Loading Data

  • Either load the data directly into BigQuery or set it up as a federated/external data source.
  • If loading data into BigQuery, you can bulk load it or stream it as individual records.
  • Other Google Cloud sources for loading into BigQuery:
    • Cloud Storage.
    • Cloud Datastore.
    • Cloud Dataflow.
    • AppEngine log files.
    • Cloud Storage access/storage logs.
    • Cloud Audit Logs.
  • Three data source formats are supported: CSV, newline-delimited JSON and Cloud Datastore backup files.
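
Newline-delimited JSON simply means one JSON object per line. Here is a minimal sketch of producing such a file body in Python using only the standard library. The function name and the sample rows are made up for illustration and are not part of any BigQuery API.

```python
import json


def to_newline_delimited_json(records):
    """Serialize an iterable of dicts as newline-delimited JSON."""
    # Each record becomes exactly one line. json.dumps escapes newlines
    # inside strings, so a record can never spill onto a second line.
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


# Hypothetical rows, purely for illustration.
rows = [
    {"name": "alice", "visits": 3},
    {"name": "bob", "visits": 5},
]
ndjson = to_newline_delimited_json(rows)
```

The resulting string can be written to a file and used as a bulk load source.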

Exporting Data

Data can be exported from BigQuery in 2 ways:

  1. Files: Export up to 1GB of data per file. Larger exports can be split across multiple files.
  2. Use Google Cloud Dataflow to read data from BigQuery.

Querying

  • Queries are written in the BigQuery SQL dialect.
  • Both synchronous and asynchronous query methods are supported.
  • Results are saved either in temporary or persistent tables.
  • Queries can be interactive (executed ASAP) or batched (execute when possible).
  • Query results are cached by default but caching can be disabled.
  • Supports user-defined functions in JavaScript (basically the Map part of MapReduce).
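
To make the query options above more concrete, here is a rough sketch of how they might map onto a query job configuration. The field names (priority, useQueryCache, destinationTable) follow the BigQuery REST API jobs resource as I understand it, so double-check them against the reference. The helper name, the SQL and the project/dataset/table ids are made up for illustration.

```python
def build_query_job(sql, batch=False, use_cache=True, destination=None):
    """Build a query job configuration as a plain dict (nothing is sent)."""
    query = {
        "query": sql,
        # Interactive queries run as soon as possible, batch queries when possible.
        "priority": "BATCH" if batch else "INTERACTIVE",
        # Caching is on by default but can be switched off per query.
        "useQueryCache": use_cache,
    }
    if destination is not None:
        # Persist the results in a named table instead of a temporary one.
        project_id, dataset_id, table_id = destination
        query["destinationTable"] = {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        }
    return {"configuration": {"query": query}}


# Hypothetical names, for illustration only.
job = build_query_job(
    "SELECT name FROM mydataset.mytable",
    batch=True,
    use_cache=False,
    destination=("my-project", "mydataset", "results"),
)
```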

Pricing and Quotas

  • BigQuery charges for data storage, streaming inserts and query data (details).
  • Free of charge: Loading and exporting data.
  • For queries, you’re charged for the number of bytes processed. 1TB per month is free.
  • There are limits on incoming requests (details).
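
The query pricing rule above is easy to turn into a back-of-the-envelope calculation. This is only a sketch: the price per TB is a parameter you have to supply from the pricing page (the 5.0 below is a placeholder, not the real price), and counting 1TB as 2**40 bytes is my assumption.

```python
TB = 2 ** 40  # assumption: 1TB counted as 2**40 bytes


def billable_query_bytes(bytes_processed_this_month, free_bytes=TB):
    """Bytes that are charged after subtracting the free 1TB per month."""
    return max(0, bytes_processed_this_month - free_bytes)


def estimate_query_cost(bytes_processed_this_month, price_per_tb):
    """Estimated monthly query cost; price_per_tb must come from the pricing page."""
    return billable_query_bytes(bytes_processed_this_month) / TB * price_per_tb


# 3TB processed in a month at a placeholder price of 5.0 per TB:
# the first TB is free, so 2TB are billed.
cost = estimate_query_cost(3 * TB, price_per_tb=5.0)
```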


Google Cloud Pub/Sub


During my time at Adobe, I worked on the Pub/Sub messaging part of Flex/LiveCycle Data Services, and I also worked on the Java Message Service (JMS) integration of Flex apps with JEE backends. So, it’s no surprise that I was pretty excited to learn about Google Cloud Pub/Sub.

What is Google Cloud Pub/Sub?

  • Messaging middleware for Google Cloud Platform. Similar to Java Message Service (JMS) in the Java world but in the cloud.
  • Provides many-to-many, async messaging for loosely coupled senders and receivers.
  • Low-latency, durable messaging with an at-least-once delivery guarantee (i.e. messages can be delivered more than once and out of order).
  • Both push and pull style delivery supported.
  • In a nutshell:
    • A publisher creates a topic.
    • A subscriber creates a subscription to that topic.
    • The publisher sends messages to that topic.
    • The subscriber receives the message via push or pull, depending on the configuration.
    • The subscriber ACKs the receipt and the message is removed from the message queue.
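
The flow above can be sketched with a toy in-memory model. To be clear, this is not the Pub/Sub client API: the class and method names are made up, and redelivery is simplified (here an unacknowledged message comes back on every pull, whereas the real service redelivers after the ack deadline expires).

```python
import itertools


class Topic:
    """Toy topic that fans out each message to all of its subscriptions."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = []

    def publish(self, data):
        for subscription in self.subscriptions:
            subscription.enqueue(data)


class Subscription:
    """Toy pull subscription with at-least-once delivery."""

    def __init__(self, name, topic):
        self.name = name
        self._ids = itertools.count(1)
        self._unacked = {}  # ack id -> message data
        topic.subscriptions.append(self)

    def enqueue(self, data):
        self._unacked[f"{self.name}-{next(self._ids)}"] = data

    def pull(self, max_messages=1):
        # Messages stay in the queue until acknowledged, so the same
        # message can be returned more than once.
        return list(self._unacked.items())[:max_messages]

    def acknowledge(self, ack_id):
        # Acknowledging removes the message from the queue.
        self._unacked.pop(ack_id, None)


topic = Topic("testtopic")
subscription = Subscription("testsubscription", topic)
topic.publish("hello")

first = subscription.pull()
again = subscription.pull()          # same message, delivered a second time
subscription.acknowledge(first[0][0])
after_ack = subscription.pull()      # nothing left once acknowledged
```

The point of the duplicate pull is that a subscriber has to tolerate seeing the same message twice.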

Try Pub/Sub with Google APIs Explorer

The easiest way to try out Pub/Sub without writing a single line of code is via Google APIs Explorer, so let’s try that. First, we need to do some prep work, which involves creating a Google Cloud Platform project and activating the Google Cloud Pub/Sub API. Both of these steps are explained here.

Once you have a project and have activated the Pub/Sub API, go to the APIs Explorer Pub/Sub section and turn on “Authorize requests using OAuth 2.0” with both scopes. This enables the browser to issue Pub/Sub requests.

Now, we’re ready to try out Pub/Sub APIs. We’ll basically do the following:

  1. Create a topic.
  2. List topics to make sure it’s created.
  3. Create a subscription to the topic.
  4. Publish a message to the topic.
  5. Pull the message from the subscription.
  6. Acknowledge the message from step 5 to get it deleted from the queue.

In my case, the GCP project id is meteatameldevcloud. Make sure you change this to whatever your GCP project id is.

Create a topic

Let’s create a topic named testtopic under my project meteatameldevcloud.
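
For reference, this is roughly what the same call looks like as a raw REST request. The sketch only builds the request and does not send it; sending it would also need an OAuth access token, which is not shown. The helper names are mine, and the URL layout (a PUT on the topic’s resource name under the v1 endpoint) is how I understand the Pub/Sub REST API, so verify it against the reference.

```python
PUBSUB_BASE_URL = "https://pubsub.googleapis.com/v1"


def topic_path(project_id, topic_id):
    """Fully qualified topic name, e.g. projects/<project>/topics/<topic>."""
    return f"projects/{project_id}/topics/{topic_id}"


def create_topic_request(project_id, topic_id):
    """Describe the create-topic call as a dict (nothing is sent)."""
    return {
        "method": "PUT",
        "url": f"{PUBSUB_BASE_URL}/{topic_path(project_id, topic_id)}",
        "body": {},  # the topic name is carried in the URL
    }


request = create_topic_request("meteatameldevcloud", "testtopic")
```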

List topics

Let’s list all topics, to make sure testtopic has been created.
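
If you save the list response, checking it for the topic takes only a few lines. The response shape below (a topics array of objects with a name field) is how I understand the list call to respond, and the sample response itself is made up for illustration.

```python
def topic_exists(list_response, project_id, topic_id):
    """Return True if the fully qualified topic name appears in a list response."""
    wanted = f"projects/{project_id}/topics/{topic_id}"
    return any(topic.get("name") == wanted for topic in list_response.get("topics", []))


# Hypothetical response, shaped like the output of the list call.
sample_response = {
    "topics": [
        {"name": "projects/meteatameldevcloud/topics/testtopic"},
    ]
}
found = topic_exists(sample_response, "meteatameldevcloud", "testtopic")
```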

Create a subscription to the topic

Let’s create a subscription named testsubscription to the topic, so we can receive messages.

  • Go to apis-explorer/#p/pubsub/v1/pubsub.projects.subscriptions.create
  • In name field, enter projects/meteatameldevcloud/subscriptions/testsubscription
  • In request body, add topic: projects/meteatameldevcloud/topics/testtopic as a field
  • In request body, add ackDeadlineSeconds: 100. This increases the acknowledgement window from the default 10 seconds to 100 seconds, which gives us more time while demoing.
  • Execute and you should get 200 OK.
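
The steps above boil down to one request with a small JSON body. Here is a sketch that builds it without sending anything. The helper name is made up, authentication is left out, and the URL layout (a PUT on the subscription’s resource name) is my understanding of the REST API rather than something taken from the Explorer screens.

```python
import json

PUBSUB_BASE_URL = "https://pubsub.googleapis.com/v1"


def create_subscription_request(project_id, subscription_id, topic_id,
                                ack_deadline_seconds=10):
    """Return (method, url, json_body) for creating a pull subscription."""
    name = f"projects/{project_id}/subscriptions/{subscription_id}"
    body = {
        # The topic to subscribe to, as a fully qualified name.
        "topic": f"projects/{project_id}/topics/{topic_id}",
        # How long the subscriber has to acknowledge a message.
        "ackDeadlineSeconds": ack_deadline_seconds,
    }
    return "PUT", f"{PUBSUB_BASE_URL}/{name}", json.dumps(body)


method, url, body = create_subscription_request(
    "meteatameldevcloud", "testsubscription", "testtopic", ack_deadline_seconds=100
)
```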

Publish a message to the topic

Now, we’re ready to publish a message to the topic.

{
  "messages": [
    {
      "attributes": {
        "foo": "bar"
      }
    }
  ]
}
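
The body above carries only attributes. A message can also carry a payload in a data field, which the REST API expects as a base64-encoded string. Here is a small sketch of building such a body in Python. The helper name and the "hello" payload are made up for illustration.

```python
import base64


def build_publish_body(text, attributes=None):
    """Build a publish request body with a base64-encoded data payload."""
    message = {
        # REST clients send the payload base64-encoded.
        "data": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    }
    if attributes:
        message["attributes"] = dict(attributes)
    return {"messages": [message]}


publish_body = build_publish_body("hello", {"foo": "bar"})
```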

Pull the message from a subscription

Time to receive the message.

  • Go to apis-explorer/#p/pubsub/v1/pubsub.projects.subscriptions.pull
  • In subscription field, enter projects/meteatameldevcloud/subscriptions/testsubscription
  • In request body, add maxMessages: 1. This sets the maximum number of messages to return in a single pull request.
  • In request body, add returnImmediately: true. This tells the request to return immediately (instead of waiting/long-polling) if there are no messages in the queue.
  • Execute and you should get 200 OK with the message. Make a note of the ackId of the message, as you’ll need it in the next step.
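
If you want to handle the pull response in code, the main things to extract are the ackId and the (base64-encoded) payload. Below is a sketch, assuming the response has a receivedMessages array whose items carry an ackId and a message. The sample response and its ackId value are made up.

```python
import base64


def build_pull_body(max_messages=1, return_immediately=True):
    """Request body for a pull call, mirroring the two fields used above."""
    return {"maxMessages": max_messages, "returnImmediately": return_immediately}


def decode_received_messages(pull_response):
    """Return a list of (ack_id, text, attributes) tuples from a pull response."""
    decoded = []
    for received in pull_response.get("receivedMessages", []):
        message = received.get("message", {})
        # The payload arrives base64-encoded; a missing payload decodes to "".
        text = base64.b64decode(message.get("data", "")).decode("utf-8")
        decoded.append((received["ackId"], text, message.get("attributes", {})))
    return decoded


# Hypothetical response, for illustration only.
sample_pull_response = {
    "receivedMessages": [
        {
            "ackId": "example-ack-id",
            "message": {"data": "aGVsbG8=", "attributes": {"foo": "bar"}},
        }
    ]
}
messages = decode_received_messages(sample_pull_response)
```

An empty response (no messages in the queue) simply yields an empty list.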

Acknowledge the message

Finally, let’s ack that we received the message, so it gets deleted from the message queue.

  • Go to apis-explorer/#p/pubsub/v1/pubsub.projects.subscriptions.acknowledge
  • In subscription field, enter projects/meteatameldevcloud/subscriptions/testsubscription
  • In request body, add ackIds and include the ackId from the previous step.
  • Execute and you should get 200 OK.
  • At this point, if you try to pull the message again, you should not get the message anymore, assuming that you acknowledged the message within 100 seconds 🙂
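
As with the other calls, the acknowledge step is a single request. Here is a sketch that only builds it. The helper name and the ackId value are made up, no token handling is shown, and the :acknowledge URL suffix is how I understand the REST API, so check the reference before relying on it.

```python
PUBSUB_BASE_URL = "https://pubsub.googleapis.com/v1"


def acknowledge_request(project_id, subscription_id, ack_ids):
    """Return (method, url, body) for acknowledging pulled messages."""
    ack_ids = list(ack_ids)
    if not ack_ids:
        raise ValueError("at least one ackId is required")
    subscription = f"projects/{project_id}/subscriptions/{subscription_id}"
    return "POST", f"{PUBSUB_BASE_URL}/{subscription}:acknowledge", {"ackIds": ack_ids}


# "example-ack-id" stands in for the ackId noted during the pull step.
method, url, body = acknowledge_request(
    "meteatameldevcloud", "testsubscription", ["example-ack-id"]
)
```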

Manage Pub/Sub with GCP Console

You can also manage Pub/Sub topics and subscriptions directly from the Pub/Sub section of GCP Console. This is useful when you want to create or delete a topic or a subscription and you don’t want to deal with the API or APIs Explorer.

Google Cloud Pub/Sub Java API

There’s a Java API for Google Cloud Pub/Sub and Java Pub/Sub Samples as well. I will explain how to use Pub/Sub from Java in a future post.
