Google Big-Query!

BigQuery process massive data sets in a super-fast way!

What is  Google BigQuery?

Querying massive datasets can be time-consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast SQL queries against append-only tables using the processing power of Google’s infrastructure.

“Simply move your data into BigQuery and let us handle the hard work.” – What Google says about BigQuery!

You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data. Also, you can access BigQuery by using a web UI or a command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries.

BigQuery fundamentals:

There are four main concepts you should understand when using BigQuery.

  • Projects: each project has a friendly name and a unique ID.
    • Projects are the top-level containers in Google Cloud Platform.
    • They Store information about billing and authorised users.
  • Tables: they contains your data.
    • Each table has a schema that describes field names & types.
    • BigQuery also supports the Views (Database views).
  • Datasets: allow you to organise and control access to your tables.
    • You have to create at least one dataset before loading data into BigQuery.
    • You can share the data with others by setting Access Control Lists on the data-sets itself.
  • Jobs: These are the custom actions you construct & BigQuery handles executing them.
    • Jobs could be (Load, export, query or copy data).
    • The jobs will be executed asynchronously, as they could take a lot of time to complete.

1. Loading and exporting data:

  • Jobs or streams could be used to load the data into the BigQuery or you can skip the loading process by setting up a table as a federated data source (location external to BigQuery).
  • Loading jobs support three data sources:
    • Objects in Google Cloud Storage: for storing immutable blob objects (i.e: images & static files).
    • Data send with the job or streaming insert.
    • Google Cloud Datastore: for storing structured application data that are mutable (i.e: User entity, Blog post, .. etc).
  • Google Cloud Data-flow also could be used to load the data into BigQuery.
  • Data limits:
    • Daily limit: 1000 load jobs per table per day, 10,000 load jobs per project per day.
    • Maximum columns per table: 10,000.
    • Maximum file sizes:
      • CSV : 4 GB (Compressed) & 5 TB (Un-Compressed).
      • JSON: 4 GB (Compressed) & 5 TB (Un-Compressed).
      • Maximum size per load job: 12 TB across all input files for CSV and JSON.
      • A maximum number of files per load job: 10,000.

2. Querying and viewing data:

  • Queries are written in BigQuery’s SQL dialect.
  • BigQuery supports both synchronous and asynchronous query methods.
  • All query results are saved to a table, which can be either persistent or temporary:
    • A temporary table is a randomly named table saved in a special dataset. The table has a lifetime of approximately 24 hours. Also, they are not available for sharing.
    • A persistent table can be a new or existing table in any dataset in which you have WRITE privileges.
  • Querying data limit:
    • Maximum tables per query: 1,000.
    • Maximum query length: 256 KB.
  • Synchronous queries:
    • All synchronous query results are saved to an anonymous table with a lifetime of about 24 hours. If you want to save the data for longer than that, re-run the query asynchronously and specify a permanent destination table.
    • You can call jobs.query with your query string to set timeout period, if set to 0 or not specified, the method will default to 10 seconds.
  • Asynchronous queries:
    • Asynchronous queries are run by calling and including the configuration.query property in the request.
    • The method returns immediately with a job ID, and you must either request the job status periodically and check status or call jobs.getQueryResults, which will return when the job is complete.
    • Asynchronous queries are always saved to a table (new, existing or a temporary table).
  • Query caching:
    • BigQuery caches query results on a best-effort basis for increased performance.
    • Also, it’s really cool because you won’t be charged for the cached queries.
    • You can ensure that you are selecting from the cache, by setting the property createDisposition to CREATE_NEVER.
    • Also, you can check if the returned result is from cache or not by checking cacheHit property if set to true.
    • Results are cached for approximately 24 hours and cache lifetimes are extended when a query returns a cached result.
  • Returning large query results:
    • Maximum response size: 128 MB compressed.
    • You can set allowLargeResults to true in your job configuration, to remove the maximum response size limit.
  • Views: The BigQuery views are logical views, not materialised views. (SQL query run every time).
  • Meta-tables: You can use Meta-tables for better READ only performance.

3. Managing data:

  • Listing projects, jobs, tables and datasets.
  • Getting information about jobs, tables and datasets.
  • Defining, updating or patching tables and datasets.
  • Deleting tables and datasets.

4. Exporting Data From BigQuery:

  • You can export up to 1 GB of data per file, also it supports exporting to multiple files.
  • Exporting data limits:
    • Daily Limit: 1,000 exports per day, up to 10 TB.
    • Multiple Wildcard URI Limit: 500 URIs per export.
  • Configuration options: You can configure two aspects of the exported data (the format, and the compression type).
    • It supports (CSV & JSON). The default value is CSV.
    • Nested or repeated data can only be exported to JSON format.
    • It supports (GZIP or NONE) for compression.
  • Options for the destinationUris property:
    • Single URI: [‘gs://my-bucket/file-name.json’] – It will generate all the data in one file (Maximum of 1 GB file).
    • Single wildcard URI: [‘gs://my-bucket/file-name-*.json’] – BigQuery shards your data into multiple files based on the provided pattern.
    • Multiple wildcard URIs:
      • [‘gs://my-bucket/file-name-1-*.json’, ‘gs://my-bucket/file-name-2-*.json’, ‘gs://my-bucket/file-name-3-*.json’] – Multiple wildcard URIs would be used if the running job is parallel.

You can also check Maksym’s blog for more details on Google BigQuery.