This repository was archived by the owner on Feb 20, 2024. It is now read-only.

Commit 317a8a2

Merge pull request #93 from nginyc/dev
[V0.0.9] Add model access rights, downloading of trained models & selecting models for training
2 parents 02e7514 + 472362d commit 317a8a2

56 files changed

+1306
-718
lines changed

.env.sh

Lines changed: 12 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,15 @@
1-
# Core configuration for Rafiki
1+
# Core external configuration for Rafiki
22
export DOCKER_NETWORK=rafiki
3-
export RAFIKI_VERSION=0.0.8
4-
export RAFIKI_IP_ADDRESS=127.0.0.1
3+
export DOCKER_SWARM_ADVERTISE_ADDR=127.0.0.1
4+
export RAFIKI_VERSION=0.0.9
5+
export RAFIKI_ADDR=127.0.0.1
56
export ADMIN_EXT_PORT=3000
67
export ADMIN_WEB_EXT_PORT=3001
78
export ADVISOR_EXT_PORT=3002
9+
export POSTGRES_EXT_PORT=5433
10+
export REDIS_EXT_PORT=6380
11+
export DATA_WORKDIR_PATH=$PWD/data # Shares a data folder with containers
12+
export LOGS_WORKDIR_PATH=$PWD/logs # Shares a folder with containers that stores components' logs
813

914
# Internal credentials for Rafiki's components
1015
export POSTGRES_USER=rafiki
@@ -22,7 +27,8 @@ export REDIS_HOST=rafiki_cache
2227
export REDIS_PORT=6379
2328
export PREDICTOR_PORT=3003
2429
export ADMIN_WEB_HOST=rafiki_admin_web
25-
export LOCAL_WORKDIR_PATH=$PWD
30+
export DATA_DOCKER_WORKDIR_PATH=/root/rafiki/data
31+
export LOGS_DOCKER_WORKDIR_PATH=/root/rafiki/logs
2632
export DOCKER_WORKDIR_PATH=/root/rafiki
2733
export CONDA_ENVIORNMENT=rafiki
2834

@@ -34,8 +40,8 @@ export RAFIKI_IMAGE_WORKER=rafikiai/rafiki_worker
3440
export RAFIKI_IMAGE_PREDICTOR=rafikiai/rafiki_predictor
3541

3642
# Docker images for dependent services
37-
export IMAGE_POSTGRES=postgres:10.5
38-
export IMAGE_REDIS=redis:5.0-rc
43+
export IMAGE_POSTGRES=postgres:10.5-alpine
44+
export IMAGE_REDIS=redis:5.0.3-alpine3.8
3945

4046
# Utility configuration
4147
export PYTHONPATH=$PWD # Ensures that `rafiki` module can be imported at project root

README.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ Prerequisites: MacOS or Linux environment
2424
bash scripts/stop.sh
2525
```
2626
27-
More instructions are available in [Rafiki's Developer Guide](https://nginyc.github.io/rafiki/docs/latest/docs/src/dev/setup.html).
27+
More instructions are available in [Rafiki's Developer Guide](https://nginyc.github.io/rafiki/docs/latest/docs/src/dev).
2828

2929
## Acknowledgements
3030

docs/src/dev/development.rst

Lines changed: 25 additions & 7 deletions
@@ -27,8 +27,8 @@ By default, you can connect to the PostgreSQL DB using a PostgreSQL client (e.g
2727

2828
::
2929

30-
POSTGRES_HOST=localhost
31-
POSTGRES_PORT=5433
30+
RAFIKI_ADDR=127.0.0.1
31+
POSTGRES_EXT_PORT=5433
3232
POSTGRES_USER=rafiki
3333
POSTGRES_DB=rafiki
3434
POSTGRES_PASSWORD=rafiki
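
The settings above can be combined into a single connection URI for a PostgreSQL client. The following is a minimal sketch, assuming the client accepts the standard `postgresql://` URI form; the helper name `postgres_uri` is hypothetical and not part of Rafiki, and the defaults are the values shown in this diff.

```python
import os
from urllib.parse import quote


def postgres_uri(env):
    """Build a postgresql:// URI from Rafiki's external settings.

    `env` is any mapping of variable names to strings (e.g. os.environ).
    The fallback values are the defaults shown in .env.sh in this commit.
    """
    host = env.get("RAFIKI_ADDR", "127.0.0.1")
    port = env.get("POSTGRES_EXT_PORT", "5433")
    user = env.get("POSTGRES_USER", "rafiki")
    password = env.get("POSTGRES_PASSWORD", "rafiki")
    db = env.get("POSTGRES_DB", "rafiki")
    # Percent-encode credentials and database name so that characters
    # such as '@' or ':' do not break the URI structure.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, quote(db, safe="")
    )


if __name__ == "__main__":
    # Uses whatever was exported by `source .env.sh`, else the defaults.
    print(postgres_uri(os.environ))
```

Reading from a mapping rather than directly from `os.environ` keeps the helper easy to exercise with different settings.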
@@ -46,8 +46,8 @@ You can connect to Redis DB with `rebrow <https://github.com/marians/rebrow>`_:
4646

4747
::
4848

49-
REDIS_HOST=rafiki_cache
50-
REDIS_PORT=6379
49+
RAFIKI_ADDR=127.0.0.1
50+
REDIS_EXT_PORT=6380
5151

5252
Building Images Locally
5353
--------------------------------------------------------------------
@@ -85,14 +85,32 @@ Build & view Rafiki's Sphinx documentation on your machine with the following co
8585
Troubleshooting
8686
--------------------------------------------------------------------
8787

88-
While building Rafiki's images locally, if you encounter an error like "No space left on device", you might be running out of space allocated for Docker. Try removing all containers & images:
88+
While building Rafiki's images locally, if you encounter errors like "No space left on device",
89+
you might be running out of space allocated for Docker. Try one of the following:
8990

90-
.. code-block:: shell
91+
::
92+
93+
# Prunes dangling images
94+
docker system prune
95+
96+
::
9197

9298
# Delete all containers
9399
docker rm $(docker ps -a -q)
94100
# Delete all images
95101
docker rmi $(docker images -q)
96102

97103
From Mac Mojave onwards, due to Mac's new `privacy protection feature <https://www.howtogeek.com/361707/how-macos-mojaves-privacy-protection-works/>`_,
98-
you might need to explicitly give Docker *Full Disk Access*, restart Docker, or even do a factory reset of Docker.
104+
you might need to explicitly give Docker *Full Disk Access*, restart Docker, or even do a factory reset of Docker.
105+
106+
107+
Using Rafiki Admin's HTTP interface
108+
--------------------------------------------------------------------
109+
110+
To make calls to the HTTP endpoints of Rafiki Admin, you'll first need to authenticate with email & password
111+
against the `POST /tokens` endpoint to obtain an authentication token `token`,
112+
and subsequently add the `Authorization` header for every other call:
113+
114+
::
115+
116+
Authorization: Bearer {{token}}
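
The authentication flow described above might look roughly like the following sketch. The source only states that `POST /tokens` takes an email & password and yields a `token`; the JSON request body with `email` and `password` fields, the JSON response shape, and all function names here are assumptions for illustration. The base URL would typically be `http://<RAFIKI_ADDR>:<ADMIN_EXT_PORT>`.

```python
import json
import urllib.request


def build_login_request(admin_url, email, password):
    """Prepare (but do not send) the POST /tokens request.

    Assumption: the endpoint accepts a JSON body with `email` and
    `password` fields. Adjust if the server expects form data instead.
    """
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        admin_url.rstrip("/") + "/tokens",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(admin_url, email, password):
    """Send the login request and return the token.

    Assumption: the response is JSON with a top-level `token` key.
    """
    with urllib.request.urlopen(build_login_request(admin_url, email, password)) as resp:
        return json.loads(resp.read().decode("utf-8"))["token"]


def auth_header(token):
    """Header to attach to every other call, as documented above."""
    return {"Authorization": "Bearer {}".format(token)}
```

A caller would obtain the token once with `fetch_token(...)` and then pass `auth_header(token)` as the headers of subsequent requests.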

docs/src/dev/setup.rst

Lines changed: 44 additions & 35 deletions
@@ -31,68 +31,77 @@ To destroy Rafiki's complete stack:
3131
3232
bash scripts/stop.sh
3333
34-
Adding Nodes to Rafiki
34+
Scaling Rafiki
3535
--------------------------------------------------------------------
3636

37+
Rafiki's default setup runs on a single node, and only runs on CPUs.
38+
3739
Rafiki has its dynamic stack (e.g. train workers, inference workers, predictors)
38-
running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
40+
running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
41+
It runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_,
42+
using it for networking amongst nodes.
3943

40-
Horizontal scaling can be done by adding more nodes to the swarm.
44+
Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
45+
and are capable of leveraging on `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_
46+
across Rafiki's nodes.
4147

42-
Perform the following for *each* worker node to be added:
48+
To scale Rafiki horizontally and enable running on GPUs, do the following:
4349

44-
1. Connect the node to the same network as the master, so that the node can `join the master's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_.
50+
1. If Rafiki is running, stop Rafiki, and have the master node leave its Docker Swarm
4551

46-
2. Configure the node with the script:
52+
2. Put every worker node and the master node into a common network,
53+
and change ``DOCKER_SWARM_ADVERTISE_ADDR`` in ``.env.sh`` to the IP address of the master node
54+
in *the network that your worker nodes are in*
4755

48-
.. code-block:: shell
56+
3. For every node, including the master node, ensure the `firewall rules
57+
allow TCP & UDP traffic on ports 2377, 7946 and 4789
58+
<https://docs.docker.com/network/overlay/#operations-for-all-overlay-networks>`_
4959

50-
bash scripts/setup_node.sh
60+
4. For every node that has GPUs:
5161

62+
4.1. `Install NVIDIA drivers <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for CUDA *9.0* or above
5263

53-
Exposing Rafiki Publicly
54-
--------------------------------------------------------------------
64+
4.2. `Install nvidia-docker2 <https://github.com/NVIDIA/nvidia-docker>`_
65+
66+
4.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)
5567

56-
Rafiki runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_, with
57-
Rafiki Admin and Rafiki Admin Web running only on the master node.
68+
5. Start Rafiki with ``bash scripts/start.sh``
5869

59-
Edit the following line in ``.env.sh`` with the IP address of the master node in the network you intend to expose Rafiki:
70+
6. For every worker node, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_
6071

61-
.. code-block:: shell
72+
7. On the *master* node, for *every* node (including the master node), configure it with the script:
6273

63-
export RAFIKI_IP_ADDRESS=127.0.0.1
74+
::
75+
76+
bash scripts/setup_node.sh
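
Before joining worker nodes to the swarm, it can help to confirm that the firewall rules from step 3 are in effect. The sketch below, run from a worker node, checks only the TCP side of ports 2377 and 7946 on the master; port 4789 carries UDP overlay traffic and cannot be verified with a TCP connect. The function names are hypothetical and not part of Rafiki's scripts.

```python
import socket

# TCP ports used by Docker Swarm: 2377 (cluster management) and
# 7946 (node-to-node communication). UDP 7946 and UDP 4789 are not covered.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_swarm_tcp_ports(master_addr, ports=SWARM_TCP_PORTS, timeout=2.0):
    """Map each port to whether it is reachable on the master node."""
    return {port: tcp_port_open(master_addr, port, timeout) for port in ports}
```

For example, `check_swarm_tcp_ports("<DOCKER_SWARM_ADVERTISE_ADDR>")` returns a dictionary keyed by port. A `False` entry after Rafiki has been started on the master suggests that a firewall rule or the network setup needs another look.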
6477

65-
Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address over ports 3000 and 3001,
66-
assuming incoming connections to these ports are allowed.
6778

68-
Enabling GPU for Rafiki's Workers
79+
Exposing Rafiki Publicly
6980
--------------------------------------------------------------------
7081

71-
Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
72-
and are capable of leveraging on `CUBA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_
73-
across Rafiki's nodes.
82+
Rafiki Admin and Rafiki Admin Web run on the master node.
83+
Change ``RAFIKI_ADDR`` in ``.env.sh`` to the IP address of the master node
84+
in the network you intend to expose Rafiki in.
7485

75-
Rafiki's default setup would only configure its workers to run on CPUs across Rafiki's nodes. To allow model
76-
training in workers to run on GPUs, perform the following configuration on *each* node in Rafiki:
86+
Example:
87+
88+
::
7789

78-
1. `Install NVIDIA drivers <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for CUDA *9.0* or above
90+
export RAFIKI_ADDR=172.28.176.35
7991

80-
2. `Install nvidia-docker2 <https://github.com/NVIDIA/nvidia-docker>`_
92+
Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address,
93+
over ports 3000 and 3001 (by default), assuming incoming connections to these ports are allowed.
8194

82-
3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)
8395

8496
Reading Rafiki's logs
8597
--------------------------------------------------------------------
8698

87-
You can read logs of Rafiki Admin, Rafiki Advisor & any of Rafiki's services at in the project's `./logs` directory.
99+
By default, you can read logs of Rafiki Admin, Rafiki Advisor & any of Rafiki's services
100+
in the `./logs` directory at the root of the project's directory on the master node.
88101

89-
Using Rafiki Admin's HTTP interface
90-
--------------------------------------------------------------------
91-
92-
To make calls to the HTTP endpoints of Rafiki Admin, you'll need first authenticate with email & password
93-
against the `POST /tokens` endpoint to obtain an authentication token `token`,
94-
and subsequently add the `Authorization` header for every other call:
95102

96-
::
103+
Troubleshooting
104+
--------------------------------------------------------------------
97105

98-
Authorization: Bearer {{token}}
106+
Q: There seem to be connectivity issues amongst containers across nodes!
107+
A: Ensure that containers are able to communicate with one another through the Docker Swarm overlay network: https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers

docs/src/python/rafiki.constants.rst

Lines changed: 4 additions & 0 deletions
@@ -14,3 +14,7 @@ rafiki.constants
1414
.. autoclass:: rafiki.constants.BudgetType
1515

1616
.. autoclass:: rafiki.constants.UserType
17+
18+
.. autoclass:: rafiki.constants.ModelDependency
19+
20+
.. autoclass:: rafiki.constants.ModelAccessRight

docs/src/user/client-create-inference-job.include.rst

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
1-
To create an model deployment job, you need to submit the app name associated with a *completed* train job.
1+
To create a model deployment job, you need to submit the app name associated with a *stopped* train job.
22
The inference job would be created from the best trials from the train job.
33

44
Example:
@@ -13,8 +13,8 @@ Example:
1313
1414
{'app': 'fashion_mnist_app',
1515
'app_version': 1,
16-
'id': '74b8f43a-c4f8-4ebc-a643-18a879dbbd1d',
17-
'predictor_host': '127.0.0.1:30000',
18-
'train_job_id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a'}
16+
'id': '0477d03c-d312-48c5-8612-f9b37b368949',
17+
'predictor_host': '127.0.0.1:30001',
18+
'train_job_id': 'ec4db479-b9b2-4289-8086-52794ffc71c8'}
1919
2020
.. seealso:: :meth:`rafiki.client.Client.create_inference_job`

docs/src/user/client-create-train-job.include.rst

Lines changed: 3 additions & 3 deletions
@@ -21,9 +21,9 @@ Example:
2121

2222
.. code-block:: python
2323
24-
{'app': 'fashion_mnist_app',
25-
'app_version': 1,
26-
'id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a'}
24+
{'app': 'fashion_mnist_app',
25+
'app_version': 1,
26+
'id': 'ec4db479-b9b2-4289-8086-52794ffc71c8'}
2727
2828
.. note::
2929

docs/src/user/client-list-inference-jobs.include.rst

Lines changed: 4 additions & 4 deletions
@@ -8,13 +8,13 @@ Example:
88

99
.. code-block:: python
1010
11-
[{'app': 'fashion_mnist_app',
11+
{'app': 'fashion_mnist_app',
1212
'app_version': 1,
13-
'datetime_started': 'Sun, 18 Nov 2018 10:04:13 GMT',
13+
'datetime_started': 'Mon, 17 Dec 2018 07:15:12 GMT',
1414
'datetime_stopped': None,
15-
'id': '74b8f43a-c4f8-4ebc-a643-18a879dbbd1d',
15+
'id': '0477d03c-d312-48c5-8612-f9b37b368949',
1616
'predictor_host': '127.0.0.1:30000',
1717
'status': 'RUNNING',
18-
'train_job_id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a'}]
18+
'train_job_id': 'ec4db479-b9b2-4289-8086-52794ffc71c8'}
1919
2020
.. seealso:: :meth:`rafiki.client.Client.get_inference_jobs_of_app`
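
The `datetime_started` and `datetime_stopped` fields in the example output use an HTTP-style date string, with `None` while the job is still running. As a sketch of how a caller might work with them, the hypothetical helper below (not part of `rafiki.client`) computes how long a job has been running using only the standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def job_runtime_seconds(job, now=None):
    """Seconds between a job's start and its stop (or `now` if still running).

    `job` is a dict shaped like the example output above. `now` should be a
    timezone-aware datetime; it defaults to the current UTC time.
    """
    started = parsedate_to_datetime(job["datetime_started"])
    stopped_raw = job.get("datetime_stopped")
    if stopped_raw is None:
        # Job has not stopped yet: measure up to the current moment.
        end = now if now is not None else datetime.now(timezone.utc)
    else:
        end = parsedate_to_datetime(stopped_raw)
    return (end - started).total_seconds()
```

Passing `now` explicitly makes the result reproducible, which is convenient when checking the helper against a recorded response.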

docs/src/user/client-list-models.include.rst

Lines changed: 8 additions & 6 deletions
@@ -8,20 +8,22 @@ Example:
88

99
.. code-block:: python
1010
11-
[{'datetime_created': 'Sun, 18 Nov 2018 09:56:03 GMT',
11+
[{'access_right': 'PRIVATE',
12+
'datetime_created': 'Mon, 17 Dec 2018 07:06:03 GMT',
1213
'dependencies': {'tensorflow': '1.12.0'},
13-
'docker_image': 'rafikiai/rafiki_worker:0.0.7',
14+
'docker_image': 'rafikiai/rafiki_worker:0.0.9',
1415
'model_class': 'TfFeedForward',
1516
'name': 'TfFeedForward',
1617
'task': 'IMAGE_CLASSIFICATION',
17-
'user_id': '9fdefa23-c838-4c56-8eb5-f625ff4245ab'},
18-
{'datetime_created': 'Sun, 18 Nov 2018 09:56:04 GMT',
18+
'user_id': 'fb5671f1-c673-40e7-b53a-9208eb1ccc50'},
19+
{'access_right': 'PRIVATE',
20+
'datetime_created': 'Mon, 17 Dec 2018 07:06:03 GMT',
1921
'dependencies': {'scikit-learn': '0.20.0'},
20-
'docker_image': 'rafikiai/rafiki_worker:0.0.7',
22+
'docker_image': 'rafikiai/rafiki_worker:0.0.9',
2123
'model_class': 'SkDt',
2224
'name': 'SkDt',
2325
'task': 'IMAGE_CLASSIFICATION',
24-
'user_id': '9fdefa23-c838-4c56-8eb5-f625ff4245ab'}]
26+
'user_id': 'fb5671f1-c673-40e7-b53a-9208eb1ccc50'}]
2527
2628
.. seealso:: :meth:`rafiki.client.Client.get_models_of_task`
2729

docs/src/user/client-list-train-jobs.include.rst

Lines changed: 3 additions & 3 deletions
@@ -11,9 +11,9 @@ Example:
1111
[{'app': 'fashion_mnist_app',
1212
'app_version': 1,
1313
'budget': {'MODEL_TRIAL_COUNT': 2},
14-
'datetime_completed': None,
15-
'datetime_started': 'Sun, 18 Nov 2018 09:56:36 GMT',
16-
'id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a',
14+
'datetime_started': 'Mon, 17 Dec 2018 07:08:05 GMT',
15+
'datetime_stopped': None,
16+
'id': 'ec4db479-b9b2-4289-8086-52794ffc71c8',
1717
'status': 'RUNNING',
1818
'task': 'IMAGE_CLASSIFICATION',
1919
'test_dataset_uri': 'https://github.com/nginyc/rafiki-datasets/blob/master/fashion_mnist/fashion_mnist_for_image_classification_test.zip?raw=true',
