# Google BigQuery

[BigQuery](https://cloud.google.com/bigquery/) is a serverless, highly scalable,
and cost-effective cloud data warehouse with an in-memory BI Engine and machine
learning built in.

The BigQuery connector relies on the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/),
which provides fast access to BigQuery-managed storage using an RPC-based
protocol.

## Prerequisites

In order to use the BigQuery connector, make sure that the Google Cloud SDK is
properly configured and that the BigQuery Storage API is enabled. Depending on
the environment you are using, some prerequisites might already be met.

1. [Select or create a GCP project.](https://console.cloud.google.com/projectselector2/home/dashboard)
2. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)
3. [Set up authentication.](https://cloud.google.com/docs/authentication/#service_accounts)
   If you choose to use [service account](https://cloud.google.com/docs/authentication/production)
   authentication, make sure that the `GOOGLE_APPLICATION_CREDENTIALS`
   environment variable points to the JSON file that contains your service
   account key (one way to set it is shown in the sketch after this list).
4. [Enable the BigQuery Storage API.](https://cloud.google.com/bigquery/docs/reference/storage/#enabling_the_api)
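
If you prefer to set the credentials variable from within your program rather
than in the shell, the following is a minimal sketch; the key file path is a
placeholder, not a real file:

```python
import os

# Assumption: replace the placeholder below with the actual path to your
# service account key JSON file. This must run before any client is created.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```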

## Sample Use

The BigQuery connector mostly follows the [BigQuery Storage API flow](https://cloud.google.com/bigquery/docs/reference/storage/#basic_api_flow),
but hides the complexity associated with decoding serialized data rows into
Tensors.

1. Create a `BigQueryClient` client.
2. Use the `BigQueryClient` to create a `BigQueryReadSession` object
   corresponding to a read session. A read session divides the contents of a
   BigQuery table into one or more streams, which can then be used to read
   data from the table.
3. Call `parallel_read_rows` on the `BigQueryReadSession` object to read from
   multiple BigQuery streams in parallel (streams can also be read one at a
   time, as in the sketch after this list).
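
For finer-grained control, streams can be consumed individually instead of in
parallel. This is a hedged sketch, assuming `get_streams` and `read_rows`
methods on `BigQueryReadSession`; check the connector's Python docstrings for
the exact signatures:

```python
# Assumption: `read_session` was created with client.read_session(...) as in
# the full example below. get_streams() is assumed to return the session's
# stream handles, and read_rows(stream) a tf.data dataset for one stream.
for stream in read_session.get_streams():
  stream_dataset = read_session.read_rows(stream)
  for row in stream_dataset.take(5):
    print(row)
```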

The following example illustrates how to read particular columns from a public
BigQuery dataset.

```python
from absl import app
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

GCP_PROJECT_ID = '<FILL_ME_IN>'
DATASET_GCP_PROJECT_ID = "bigquery-public-data"
DATASET_ID = "samples"
TABLE_ID = "wikipedia"

def main(argv):
  del argv  # Unused.
  ops.enable_eager_execution()
  client = BigQueryClient()
  # Create a read session that selects the listed columns (with matching
  # output dtypes), splits the table into two streams, and filters rows.
  read_session = client.read_session(
      "projects/" + GCP_PROJECT_ID,
      DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
      ["title",
       "id",
       "num_characters",
       "language",
       "timestamp",
       "wp_namespace",
       "contributor_username"],
      [dtypes.string,
       dtypes.int64,
       dtypes.int64,
       dtypes.string,
       dtypes.int64,
       dtypes.int64,
       dtypes.string],
      requested_streams=2,
      row_restriction="num_characters > 1000")
  # Read from all streams in parallel into a single dataset.
  dataset = read_session.parallel_read_rows()

  row_index = 0
  for row in dataset.prefetch(10):
    print("row %d: %s" % (row_index, row))
    row_index += 1

if __name__ == '__main__':
  app.run(main)
```
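
Each element of the resulting dataset holds the selected columns for a single
row, so it can be fed into a standard `tf.data` input pipeline. A hedged
follow-up sketch, assuming elements are dicts keyed by the selected column
names:

```python
# Assumption: `dataset` is the result of read_session.parallel_read_rows()
# above and yields dict-like elements keyed by column name.
batched = dataset.shuffle(10000).batch(32).prefetch(1)
for batch in batched.take(1):
  print(batch["title"], batch["num_characters"])
```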

Please refer to the BigQuery connector Python docstrings and to the
[BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/rpc/)
documentation for more details about each parameter.