
Commit 18b113d

vlasenkoalexey authored and yongtang committed
Patching BigQuery connector in 0.7.0 release and upgrading release to 0.7.1 (#556)
* adding README document describing how to use BigQuery connector (#467)
  * adding README document describing how to use BigQuery connector
  * Fixing BigQuery connector package definition, and updating README.md accordingly
* More changes for BigQuery connector (#490)
  * Fixing Dockerfile
  * Returning dataset in a form of Dictionary from BigQuery connector
  * Adding NULL fields support to BigQuery connector
  * python style tweak
  * more style tweaks
  * Style tweaks, coming from google account
* Properly setting row_restriction in createReadSessionRequest and updating sample accordingly (#529)
* updating version 0.7.0 -> 0.7.1
* locking TF version to 1.14
* linter fix
1 parent b322365 commit 18b113d

File tree: 8 files changed, +133 −2 lines


configure.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -18,7 +18,7 @@ rm -f .bazelrc
 if python -c "import tensorflow as tf" &> /dev/null; then
   echo 'using installed tensorflow'
 else
-  pip install tensorflow
+  pip install --upgrade tensorflow==1.14.0
 fi
 python -m pip install grpcio-tools
 python config_helper.py
```

dev/Dockerfile

Lines changed: 1 addition & 0 deletions

```diff
@@ -44,6 +44,7 @@ RUN /bin/bash -c "source activate tfio-dev && python -m pip install \
     pyarrow==${ARROW_VERSION} \
     pandas \
     fastavro \
+    gast==0.2.2 \
     ${PIP_ADD_PACKAGES} \
     "
```

setup.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -95,7 +95,7 @@ def has_ext_modules(self):
 """

 package = 'tensorflow>=1.14.0,<1.15.0'
-version = '0.7.0'
+version = '0.7.1'
 project = 'tensorflow-io'
 if '--package-version' in sys.argv:
   print(package)
```
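
Both configure.sh and setup.py now pin TensorFlow to the 1.14 line. A minimal sketch (not part of this commit) of a runtime guard that mirrors the `tensorflow>=1.14.0,<1.15.0` requirement:

```python
# Illustrative guard mirroring setup.py's 'tensorflow>=1.14.0,<1.15.0' pin.
import tensorflow as tf

major, minor = tf.__version__.split(".")[:2]
if (int(major), int(minor)) != (1, 14):
    raise ImportError(
        "tensorflow-io 0.7.1 is built against TensorFlow 1.14.x, "
        "found %s" % tf.__version__)
```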

tensorflow_io/bigquery/BUILD

Lines changed: 14 additions & 0 deletions

```diff
@@ -9,6 +9,20 @@ load(
     "tf_io_copts",
 )

+py_library(
+    name = "bigquery",
+    srcs = [
+        "__init__.py",
+        "python/__init__.py",
+        "python/ops/__init__.py",
+        "python/ops/bigquery_api.py",
+    ],
+    data = [
+        ":python/ops/_bigquery.so",
+    ],
+    srcs_version = "PY2AND3",
+)
+
 KERNEL_FILES = [
     "kernels/bigquery_kernels.cc",
     "kernels/bigquery_dataset_op.cc",
```

tensorflow_io/bigquery/README.md

Lines changed: 89 additions & 0 deletions (new file; its contents are shown below)
# Google BigQuery

[BigQuery](https://cloud.google.com/bigquery/) is a serverless, highly scalable,
and cost-effective cloud data warehouse with an in-memory BI Engine and machine
learning built in.

The BigQuery connector relies on the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/),
which provides fast access to BigQuery-managed storage over an RPC-based
protocol.

## Prerequisites

To use the BigQuery connector, make sure that the Google Cloud SDK is
properly configured and that the BigQuery Storage API is enabled. Depending
on the environment you are using, some of these prerequisites may already be
met.

1. [Select or create a GCP project.](https://pantheon.corp.google.com/projectselector2/home/dashboard)
2. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)
3. [Set up authentication.](https://cloud.google.com/docs/authentication/#service_accounts)
   If you choose [service account](https://cloud.google.com/docs/authentication/production)
   authentication, make sure that the `GOOGLE_APPLICATION_CREDENTIALS`
   environment variable points to the JSON file that contains your service
   account key. A quick way to sanity-check this setup is sketched after this
   list.
4. [Enable the BigQuery Storage API.](https://cloud.google.com/bigquery/docs/reference/storage/#enabling_the_api)

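A minimal sketch (not part of the README sample) that checks whether service account credentials are visible before using the connector; it only inspects the environment variable described in step 3:

```python
import os

# Illustrative check: Google client libraries read this variable when
# service account authentication is used.
creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
if creds_path and os.path.isfile(creds_path):
    print("Service account key found at %s" % creds_path)
else:
    print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a "
          "file; application default credentials will be used instead.")
```
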
## Sample Use

The BigQuery connector mostly follows the [BigQuery Storage API flow](https://cloud.google.com/bigquery/docs/reference/storage/#basic_api_flow),
but hides the complexity of decoding serialized data rows into Tensors.

1. Create a `BigQueryClient` client.
2. Use the `BigQueryClient` to create a `BigQueryReadSession` object
   corresponding to a read session. A read session divides the contents of a
   BigQuery table into one or more streams, which can then be used to read
   data from the table.
3. Call `parallel_read_rows` on the `BigQueryReadSession` object to read from
   multiple BigQuery streams in parallel.

The following example illustrates how to read particular columns from a
public BigQuery dataset.

```python
from absl import app
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

GCP_PROJECT_ID = '<FILL_ME_IN>'
DATASET_GCP_PROJECT_ID = "bigquery-public-data"
DATASET_ID = "samples"
TABLE_ID = "wikipedia"

def main(_):
  ops.enable_eager_execution()
  client = BigQueryClient()
  read_session = client.read_session(
      "projects/" + GCP_PROJECT_ID,
      DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
      ["title",
       "id",
       "num_characters",
       "language",
       "timestamp",
       "wp_namespace",
       "contributor_username"],
      [dtypes.string,
       dtypes.int64,
       dtypes.int64,
       dtypes.string,
       dtypes.int64,
       dtypes.int64,
       dtypes.string],
      requested_streams=2,
      row_restriction="num_characters > 1000")
  dataset = read_session.parallel_read_rows()

  row_index = 0
  for row in dataset.prefetch(10):
    print("row %d: %s" % (row_index, row))
    row_index += 1

if __name__ == '__main__':
  app.run(main)
```
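
The commit message notes that the connector now returns each dataset element "in a form of Dictionary", keyed by column name. Under that reading, elements compose with the usual `tf.data` transformations; a short sketch continuing the sample above (inside `main`, reusing its `dataset`):

```python
# Sketch, assuming each element is a dict of column name -> Tensor
# (per "Returning dataset in a form of Dictionary" in the commit message).
lengths = dataset.map(lambda row: row["num_characters"])
for batch in lengths.batch(32).take(1):
  print(batch)  # one batch of up to 32 int64 article lengths
```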

Please refer to the BigQuery connector Python docstrings and to the
[BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/rpc/)
documentation for more details about each parameter.

tensorflow_io/bigquery/kernels/bigquery_kernels.cc

Lines changed: 2 additions & 0 deletions

```diff
@@ -146,6 +146,8 @@ class BigQueryReadSessionOp : public OpKernel {
     *createReadSessionRequest.mutable_read_options()
          ->mutable_selected_fields() = {selected_fields_.begin(),
                                         selected_fields_.end()};
+    createReadSessionRequest.mutable_read_options()->set_row_restriction(
+        row_restriction_);
     createReadSessionRequest.set_requested_streams(requested_streams_);
     createReadSessionRequest.set_format(apiv1beta1::DataFormat::AVRO);
     VLOG(3) << "createReadSessionRequest: "
```

tensorflow_io/bigquery/kernels/bigquery_lib.h

Lines changed: 5 additions & 0 deletions

```diff
@@ -184,6 +184,9 @@ class BigQueryReaderDatasetIterator : public DatasetIterator<Dataset> {
       case avro::AVRO_ENUM:
         dtype = DT_STRING;
         break;
+      case avro::AVRO_NULL:
+        dtype = output_types[i];
+        break;
       default:
         return errors::InvalidArgument("unsupported data type: ",
                                        field.type());
@@ -250,6 +253,8 @@ class BigQueryReaderDatasetIterator : public DatasetIterator<Dataset> {
         ((*out_tensors)[i]).scalar<string>()() =
             field.value<avro::GenericEnum>().symbol();
         break;
+      case avro::AVRO_NULL:  // Fallthrough;
+        break;
       default:
         return errors::InvalidArgument("unsupported data type: ",
                                        field.type());
```
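
With these two `AVRO_NULL` cases, NULL values in a BigQuery column no longer trip the `unsupported data type` error: the column keeps its declared output dtype, and a NULL cell appears to leave the output tensor at its default value (an assumption worth verifying: empty string for `dtypes.string`, 0 for `dtypes.int64`). A sketch of reading a column that may contain NULLs, again reusing the README sample's names:

```python
# Sketch: `contributor_username` is assumed to be nullable in the public
# wikipedia sample table; NULL cells come back as the dtype's default value.
read_session = client.read_session(
    "projects/" + GCP_PROJECT_ID,
    DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
    ["contributor_username"], [dtypes.string],
    requested_streams=2)
for row in read_session.parallel_read_rows().take(5):
  print(row)
```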
Lines changed: 20 additions & 0 deletions

```diff
@@ -0,0 +1,20 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""This module contains the Python API for the Cloud BigQuery integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
```
