Google Cloud Platform notebook runner for nbdev

Allows running any Jupyter notebook function on Google Cloud Platform.

The idea of the project is to let you package code from a Jupyter notebook and run it directly on Google Cloud: on GCE, GKE or Vertex Training. The code that invokes your notebook function looks very similar in every environment.

Assume that you have a function in your Jupyter notebook that you can call as some_function_to_run_on_cloud(arg1, arg2).

With GCP runner you'll be able to run it on GKE as:

import gcp_runner.ai_platform_constants
import gcp_runner.kubernetes_runner

gcp_runner.kubernetes_runner.run_docker_image(
    some_function_to_run_on_cloud,
    arg1,
    arg2,
    # ... some optional args to configure your GKE
)

GCP runner relies on https://nbdev.fast.ai/ to organize projects, and provides an API to package your code as a Docker container and pass through an entry point function. This way you can move from running code locally to running it on Cloud without any changes.
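To make the entry point idea concrete, here is a minimal sketch of how a module name and a function name passed on the command line could be resolved and called. It is not the actual gcp_runner.entry_point code: the helper name run_entry_point is hypothetical, and only the --module-name and --function-name flags are taken from the command lines shown in this README.

```python
import argparse
import importlib


def run_entry_point(argv):
    """Resolve --module-name/--function-name and call the target function.

    Hypothetical helper for illustration; not the gcp_runner implementation.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--module-name', required=True)
    parser.add_argument('--function-name', required=True)
    # Flags the parser does not know (for example --job-dir) are kept
    # aside instead of being rejected.
    args, extra = parser.parse_known_args(argv)

    module = importlib.import_module(args.module_name)
    func = getattr(module, args.function_name)
    print('running entrypoint function: %s.%s'
          % (args.module_name, args.function_name))
    return func(), extra
```

Because the function is looked up by name at run time, the same notebook function can be started from a local Python process, a container, or a remote job without changing its code.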

See https://github.com/vlasenkoalexey/criteo_nbdev for an end-to-end demo of using GCP runner to solve a real ML problem.

Install

git clone https://github.com/vlasenkoalexey/gcp_runner
pip install -e gcp_runner

#TODO: upload to PyPI

How to use

Let's define a function that we want to run in the notebook as well as on Google Cloud. Note that the cell has to be marked with the #export attribute:

#export
import time

def some_function_to_run_on_cloud():
    print('running some_function_to_run_on_cloud')
    print('in main before sleep 1')
    time.sleep(2)
    print('in main after sleep 2')
    time.sleep(5)
    print('in main after sleep 3')
    time.sleep(5)
    print('in main after sleep 4')

Running it in the notebook as usual:

some_function_to_run_on_cloud()

running some_function_to_run_on_cloud
in main before sleep 1
in main after sleep 2
in main after sleep 3
in main after sleep 4

Updating project

If you make any changes, call gcp_runner.core.export_and_reload_all to convert all notebooks to Python and reload all modules. Modules are reloaded in the order defined by notebook names, so that dependencies are processed correctly. If one of the modules has errors, you can ignore them by setting ignore_errors=True, or just restart the kernel.

from gcp_runner.core import export_and_reload_all
export_and_reload_all(silent=True, ignore_errors=False)
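The reload step can be pictured with a minimal sketch like the one below. It is not the export_and_reload_all implementation: reload_in_order is a hypothetical helper, and it assumes that sorting module names reproduces the notebook order (for example when notebooks carry numeric prefixes such as 00_core).

```python
import importlib
import sys


def reload_in_order(module_names, ignore_errors=False):
    """Reload already-imported modules in sorted name order.

    Hypothetical helper for illustration; returns the names reloaded.
    """
    reloaded = []
    for name in sorted(module_names):
        module = sys.modules.get(name)
        if module is None:
            continue  # never imported, so there is nothing to reload
        try:
            importlib.reload(module)
            reloaded.append(name)
        except Exception:
            # Mirror the ignore_errors flag: skip a broken module
            # or stop at the first failure.
            if not ignore_errors:
                raise
    return reloaded
```

Reloading in a fixed order matters because a module reloaded before its dependency would keep references to the old version of that dependency.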

Testing code locally

Test that code can be executed locally as a Python script:

import gcp_runner.local_runner
gcp_runner.local_runner.run_python(some_function_to_run_on_cloud)

in gcp_runner entry point
running entrypoint function: gcp_runner.index.some_function_to_run_on_cloud
running some_function_to_run_on_cloud
in main before sleep 1
in main after sleep 2
in main after sleep 3
in main after sleep 4

You can also test that the code can be executed locally as a Docker container. To build your own image from a Dockerfile, set the build_docker_file argument:

import gcp_runner.local_runner
gcp_runner.local_runner.run_docker(
    some_function_to_run_on_cloud,
    'gcr.io/deeplearning-platform-release/tf2-cpu.2-1',
    build_docker_file=None)

Running in Docker container:  
docker run -v /usr/local/google/home/alekseyv/vlasenkoalexey/gcp_runner/gcp_runner:/gcp_runner gcr.io/deeplearning-platform-release/tf2-cpu.2-1 python -u -m gcp_runner.entry_point --module-name=gcp_runner.index --function-name=some_function_to_run_on_cloud
in gcp_runner entry point  
running entrypoint function: gcp_runner.index.some_function_to_run_on_cloud  
running some_function_to_run_on_cloud  
in main before sleep 1  
in main after sleep 2  
in main after sleep 3  
in main after sleep 4
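The docker run line in the output above can be assembled from the function object itself. The sketch below shows one way to do that; build_docker_run_command and its package_dir parameter are hypothetical names, not part of the gcp_runner API, and the command layout simply copies the one printed above.

```python
def build_docker_run_command(func, image_uri, package_dir):
    """Build a docker run argument list for a Python function.

    Hypothetical helper for illustration. The module and function names
    are read from the function object, so the caller passes the function
    rather than strings.
    """
    return [
        'docker', 'run',
        # Mount the local package directory into the container.
        '-v', '%s:/gcp_runner' % package_dir,
        image_uri,
        'python', '-u', '-m', 'gcp_runner.entry_point',
        '--module-name=%s' % func.__module__,
        '--function-name=%s' % func.__name__,
    ]
```

Returning a list instead of a single string means the result can be handed to subprocess.run without shell quoting concerns.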

For simple use cases you might be able to use existing images. To build your own, set the build_docker_file parameter.

To authenticate your project with the gcr.io container registry, run the following command once:

gcloud auth configure-docker

Running on Google Cloud AI Platform

TODO: describe Google Cloud SDK setup

Running as a package on Google Cloud AI Platform:

import gcp_runner.ai_platform_runner

gcp_runner.ai_platform_runner.run_package(
    some_function_to_run_on_cloud,
    'gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir')

Trimmed output:  
Running training job using package on Google Cloud Platform AI:  
gcloud ai-platform jobs submit training ai_platform_runner_train_package_20200327_131147 \
 --runtime-version=2.1 \
 --python-version=3.7 \
 --stream-logs \
 --module-name=gcp_runner.entry_point \
 --package-path=/usr/local/google/home/alekseyv/vlasenkoalexey/gcp_runner/gcp_runner \
 --job-dir=gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir \
 --scale-tier=basic \
 --use-chief-in-tf-config=True \
 -- \
 --job-dir=gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir \
 --module-name=gcp_runner.index \
 --function-name=some_function_to_run_on_cloud
        
Job [ai_platform_runner_train_package_20200327_131147] submitted successfully.  
INFO	2020-03-27 13:11:51 -0700	service		Validating job requirements...  
INFO	2020-03-27 13:11:51 -0700	service		Job creation request has been successfully validated.  
INFO	2020-03-27 13:11:51 -0700	service		Job ai_platform_runner_train_package_20200327_131147 is queued.  
INFO	2020-03-27 13:11:51 -0700	service		Waiting for job to be provisioned.  
INFO	2020-03-27 13:11:53 -0700	service		Waiting for training program to start.  
...  
INFO	2020-03-27 13:13:23 -0700	master-replica-0		Running command: python3 -m gcp_runner.entry_point --job-dir=gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir --module-name=gcp_runner.index --function-name=some_function_to_run_on_cloud --job-dir gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		in gcp_runner entry point  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		running entrypoint function: gcp_runner.index.some_function_to_run_on_cloud  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		additional args: ['--job-dir=gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir', '--job-dir', 'gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir']  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		running some_function_to_run_on_cloud  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		in main before sleep 1  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		in main after sleep 2  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		in main after sleep 3  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		in main after sleep 4  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		Module completed; cleaning up.  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		Clean up finished.  
INFO	2020-03-27 13:13:35 -0700	master-replica-0		Task completed successfully.  
state: SUCCEEDED
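The submit command in the output above follows a regular pattern: a timestamped job name, the gcloud flags, then a -- separator followed by the arguments for the entry point. Below is a minimal sketch of assembling such a command. build_submit_command is a hypothetical helper, not the gcp_runner implementation; it uses only a subset of the flags that appear in the output above, with the same values hard-coded for brevity.

```python
from datetime import datetime


def build_submit_command(package_path, job_dir, module_name,
                         function_name, now=None):
    """Build a gcloud training-job submit command as an argument list.

    Hypothetical helper for illustration; flag values copy the logged
    command and are not configurable here.
    """
    now = now or datetime.now()
    # Job names must be unique, so a timestamp suffix is appended.
    job_name = ('ai_platform_runner_train_package_'
                + now.strftime('%Y%m%d_%H%M%S'))
    return [
        'gcloud', 'ai-platform', 'jobs', 'submit', 'training', job_name,
        '--runtime-version=2.1',
        '--python-version=3.7',
        '--stream-logs',
        '--module-name=gcp_runner.entry_point',
        '--package-path=' + package_path,
        '--job-dir=' + job_dir,
        '--scale-tier=basic',
        # Everything after "--" is passed to the entry point module.
        '--',
        '--module-name=' + module_name,
        '--function-name=' + function_name,
    ]
```

Note that --module-name appears twice with different meanings: before the separator it names the module that gcloud starts, after it the module that holds your notebook function.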

Running as a custom Docker container on Google Cloud AI Platform:

import gcp_runner.ai_platform_runner

gcp_runner.ai_platform_runner.run_docker_image(
    some_function_to_run_on_cloud,
    'gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir',
    build_docker_file='Dockerfile',
    master_image_uri='gcr.io/alekseyv-scalableai-dev/tf2-cpu.2-1')

Trimmed output:  
INFO	2020-03-27 13:01:57 -0700	master-replica-0		running some_function_to_run_on_cloud  
INFO	2020-03-27 13:01:57 -0700	master-replica-0		in main before sleep 1  
INFO	2020-03-27 13:01:59 -0700	master-replica-0		in main after sleep 2  
INFO	2020-03-27 13:02:04 -0700	master-replica-0		in main after sleep 3  
INFO	2020-03-27 13:02:09 -0700	master-replica-0		in main after sleep 4

TODO: example of how to run distributed training
TODO: example of how to run a distributed hyperparameter tuner

Running on Google Cloud Kubernetes

TODO: describe how to set up and configure Kubernetes
TODO: describe that it is faster to use Kubernetes for iterative work
TODO: distributed training
TODO: distributed HP tuning

Note that Kubernetes doesn't offer a convenient way of streaming logs, so currently the script keeps polling for logs until you terminate it.

import gcp_runner.kubernetes_runner

gcp_runner.kubernetes_runner.run_docker_image(
    some_function_to_run_on_cloud,
    'gs://alekseyv-scalableai-dev-criteo-model-bucket/test-job-dir',
    build_docker_file='Dockerfile',
    image_uri='gcr.io/alekseyv-scalableai-dev/tf2-cpu.2-1')

Trimmed output:  
kubernetes-runner-train-docker-chief-0		running some_function_to_run_on_cloud  
kubernetes-runner-train-docker-chief-0		in main before sleep 1  
kubernetes-runner-train-docker-chief-0		in main after sleep 2  
kubernetes-runner-train-docker-chief-0		in main after sleep 3  
kubernetes-runner-train-docker-chief-0		in main after sleep 4
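The polling behaviour mentioned at the start of this section can be sketched as a loop that repeatedly fetches the full log and prints only the lines it has not shown yet. This is an illustration, not the logic in poll_kubernetes_logs.sh: poll_logs and the fetch_lines callable are hypothetical, and how the lines are fetched (for example by calling kubectl logs) is left to the caller.

```python
import time


def poll_logs(fetch_lines, interval_seconds=5.0, max_polls=None):
    """Print log lines that were not seen on earlier polls.

    fetch_lines is a caller-supplied callable returning the full list of
    log lines so far. With max_polls=None the loop runs until interrupted.
    """
    seen = 0
    polls = 0
    printed = []
    try:
        while max_polls is None or polls < max_polls:
            lines = fetch_lines()
            for line in lines[seen:]:
                print(line)
                printed.append(line)
            seen = max(seen, len(lines))
            polls += 1
            if max_polls is None or polls < max_polls:
                time.sleep(interval_seconds)
    except KeyboardInterrupt:
        pass  # the user terminated polling
    return printed
```

Tracking the number of lines already printed keeps the output free of duplicates even though every poll returns the whole log.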

Worklog/Ideas

Review and normalize API names
- Provide 2 APIs, one for running everything as a package and one as a container, and specify where to run it as an argument, so that moving from a local environment to a remote environment is just a matter of switching a flag
Fix packages setup for Linux
Add tests
Either add a function to show inline tensorboard, or a callback to show training/validation graphs for remote runs
Collect usage stats based on job name
Add callbacks/functionality to bring model/environment variables back to notebook instance
Explore magics instead of function calls for running code in the cloud
Only reload updated modules
Try to update globals, or at least show warnings when reloaded module has global variables
Make logic for pulling Kubernetes logs better
Jupyter lab/notebook extension to run any command when some button is pressed, and attach code for notebook conversions
Add function to set up service account
Add logic to download and configure service account from packages
Support project_id replacement for docker container image uri
Add function to install video drivers on Kubernetes
