This repository was archived by the owner on Feb 20, 2024. It is now read-only.

Commit 883c38e

Merge pull request #104 from nginyc/chore/improve_docs
Improve documentation
2 parents 6744d1d + 1d27ec6 commit 883c38e

18 files changed: +146 -90 lines changed

.env.sh

Lines changed: 4 additions & 0 deletions
@@ -46,3 +46,7 @@ export IMAGE_REDIS=redis:5.0.3-alpine3.8

 # Utility configuration
 export PYTHONPATH=$PWD # Ensures that `rafiki` module can be imported at project root
+
+# Set alias for correct PIP & python
+alias pip='pip3.6'
+alias python='python3.6'

README.md

Lines changed: 7 additions & 5 deletions
@@ -8,17 +8,19 @@ Read Rafiki's full documentation at https://nginyc.github.io/rafiki/docs/latest

 Prerequisites: MacOS or Linux environment

-1. Install Docker 18
+1. Install Docker 18 ([Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/), [MacOS](https://docs.docker.com/docker-for-mac/install/)) and, if required, add your user to `docker` group ([Linux](https://docs.docker.com/install/linux/linux-postinstall/>))

-2. Install Python 3.6
+2. Install Python 3.6 ([Ubuntu](http://ubuntuhandbook.org/index.php/2017/07/install-python-3-6-1-in-ubuntu-16-04-lts/), [MacOS](https://www.python.org/downloads/mac-osx/))

-3. Setup Rafiki's complete stack with the init script:
+3. Clone this project (e.g. with [Git](https://git-scm.com/downloads>))
+
+4. Setup Rafiki's complete stack with the setup script:

 ```sh
 bash scripts/start.sh
 ```

-4. To destroy Rafiki's complete stack:
+To destroy Rafiki's complete stack:

 ```sh
 bash scripts/stop.sh
@@ -29,7 +31,7 @@ More instructions are available in [Rafiki's Developer Guide](https://nginyc.git

 ## Issues

-Report the issues at [JIRA](https://issues.apache.org/jira/browse/SINGA) or [Github](https://github.com/nginyc/rafiki/issues)
+Report any issues at [Apache SINGA's JIRA](https://issues.apache.org/jira/browse/SINGA) or [Rafiki's Github Issues](https://github.com/nginyc/rafiki/issues)


 ## Acknowledgements

conf.py

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
 # -- Project information -----------------------------------------------------

 project = 'Rafiki'
-copyright = '2018, nginyc, cadmusthefounder, nudles'
+copyright = '2019, nginyc, cadmusthefounder, nudles'
 author = 'nginyc, cadmusthefounder, nudles'

 # The short X.Y version

docs/src/dev/architecture.rst

Lines changed: 45 additions & 35 deletions
@@ -1,53 +1,63 @@
 .. _`architecture`:

-Architecture
+Rafiki's Architecture
 ====================================================================

-.. contents:: Table of Contents
+Rafiki’s system architecture consists of 3 static components, 2 central databases, 3 types of dynamic components, and 1 client-side SDK,
+which can be illustrated with a 3-layer architecture diagram.

-User Roles
---------------------------------------------------------------------
-
-.. figure:: ../images/system-context-diagram.jpg
+.. figure:: ../images/container-diagram.png
 :align: center
-:width: 500px
-
-System Context Diagram for Rafiki
+:width: 1200px

-There are 4 user roles:
+Architecture of Rafiki

-- *Rafiki Admin* manages users
-- *Model Developer* manages model templates
-- *App Developer* manages train & inference jobs
-- *App User* makes queries to deployed models

-System Components
---------------------------------------------------------------------
+Static Stack of Rafiki
+---------------------------------------------------------------------

-.. figure:: ../images/container-diagram.jpg
-:align: center
-:width: 1200px
+Rafiki’s static stack consists of the following:
+
+*Rafiki Admin* (*Python/Flask*) is the centrepiece of Rafiki. It is a multi-threaded HTTP server which presents a unified REST API over HTTP that fully administrates the Rafiki instance. When users send requests to Rafiki Admin, it handles these requests by accordingly modifying Rafiki’s Metadata Store or deploying/stopping the dynamic components of Rafiki’s stack (i.e. workers for model training & serving).
+
+*Rafiki Metadata Store* (*PostgreSQL*) is Rafiki’s centralized, persistent database for user metadata, job metadata, worker metadata and model templates.
+
+*Rafiki Advisor* (*Python/Flask*) is Rafiki’s advisor as described in the earlier sections. It is a single-threaded HTTP server. It accepts new advisory sessions from multiple Rafiki Train Workers, generates proposals of Knobs for them, and receives feedback for completed Trials in a Train Job.

-Container Diagram for Rafiki
+*Rafiki Cache* (*Redis*) is Rafiki’s temporary in-memory store for the implementation of fast asynchronous cross-worker communication, in a way that decouples senders from receivers. It synchronizes the back-and-forth of queries & predictions between multiple Rafiki Inference Workers and a single Rafiki Predictor for an Inference Job.

-Static Components of Rafiki
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+*Rafiki Web Admin* (*NodeJS/ExpressJS*) is a HTTP server that serves Rafiki’s web front-end to users, allowing Application Developers to survey their jobs on a friendly web GUI.
+
+*Rafiki Client* (*Python*) is Rafiki’s client-side Python SDK to simplify communication with Admin.
+
+
+Dynamic Stack of Rafiki
+---------------------------------------------------------------------
+
+On the other hand, Rafiki’s dynamic stack consists of a dynamic pool of workers.
+Internally within Rafiki’s architecture, Admin adopts master-slave relationships with these workers, managing the deployment and termination of these workers in real-time depending on Train Job and Inference Job requests, as well as the stream of events it receives from its workers.
+When a worker is deployed, it is configured with the identifier for an associated job, and once it starts running, it would first initialize itself by pulling the job’s metadata from Metadata Store before starting on its task.
+
+The types of workers are as follows:
+
+*Rafiki Train Workers* (*Python*) train models for Train Jobs by conducting Trials. In a single Train Job, there could be multiple Train Workers concurrently training models.
+
+*Rafiki Predictors* (*Python/Flask*) are multi-threaded HTTP servers that receive queries from Application Users and respond with predictions as part of an Inference Job. It does this through producer-consumer relationships with multiple Rafiki Inference Workers. If necessary, it performs model ensembling on predictions received from different workers.
+
+*Rafiki Inference Workers* (*Python*) serve models for Inference Jobs. In a single Inference Job, there could be multiple Inference Workers concurrently making predictions for a single batch of queries.

-These components make up Rafiki's static stack.

-- *Admin* is a HTTP server that handles requests from users, and accordingly updates Rafiki's database or deploys components (e.g workers, predictors) based on these requests
-- *Admin Web* is a HTTP server that serves a Web UI for Admin
-- *Client* is a client-side Python SDK for sending requests to Admin
-- *Advisor* is a HTTP server that generates proposals of knobs during training
-- *Database* is Rafiki's main store for user, train job, inference job, model templates, and trained model data, including model parameters
-- *Cache* is Rafiki's temporary store for queries & predictions during inference
+Container Orchestration Strategy
+---------------------------------------------------------------------

-Dynamic Components of Rafiki
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+All of Rafiki's components' environment and configuration has been fully specified as a replicable, portable Docker image publicly available as Dockerfiles and on `Rafiki’s own Docker Hub account <https://hub.docker.com/u/rafikiai>`__.

-These components are dynamically deployed or stopped by Admin depending on the statuses of train or inference jobs.
+When an instance of Rafiki is deployed on the master node, a `Docker Swarm <https://docs.docker.com/engine/swarm/key-concepts/>`__ is initialized and all of Rafiki's components run within a single `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`__.
+Subsequently, Rafiki can be horizontally scaled by adding more worker nodes to the Docker Swarm. Dynamically-deployed workers run as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`__
+and are placed in a resource-aware manner.

-- Each *Train Worker* is a Python program that trains models associated with a train job,
-- Each *Inference Worker* is a Python program that makes batch predictions with trained models associated with an inference job
-- Each *Predictor* is a HTTP server that receives queries from users and responds with predictions, associated with an inference job

+Distributed File System Strategy
+---------------------------------------------------------------------
+All components depend on a shared file system across multiple nodes, powered by *Network File System* (*NFS*).
+Each component written in Python continually writes logs to this shared file system.
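
Note on the architecture text added above: the producer-consumer relationship between a Predictor and its Inference Workers can be pictured with a small sketch. This is not Rafiki's code. It assumes an in-process ``queue.Queue`` as a stand-in for the Redis-backed Cache, treats each model as a plain callable, and assumes ensembling by simple averaging, which the documentation does not specify. The names ``inference_worker`` and ``predict`` are hypothetical.

```python
import queue
import threading
from statistics import mean


def inference_worker(worker_id, model, query_queue, prediction_queue):
    """Consume query batches until a None sentinel arrives, then stop."""
    while True:
        item = query_queue.get()
        if item is None:
            break
        batch_id, queries = item
        # Each worker predicts the whole batch with its own model.
        predictions = [model(query) for query in queries]
        prediction_queue.put((batch_id, worker_id, predictions))


def predict(queries, models):
    """Send one batch to one worker per model and average the predictions.

    The queues stand in for the Cache (an assumption of this sketch):
    the predictor produces queries and the workers consume them, then
    the roles reverse for the predictions.
    """
    prediction_queue = queue.Queue()
    threads = []
    for worker_id, model in enumerate(models):
        query_queue = queue.Queue()
        thread = threading.Thread(
            target=inference_worker,
            args=(worker_id, model, query_queue, prediction_queue),
        )
        thread.start()
        query_queue.put((0, queries))  # a single batch with id 0
        query_queue.put(None)          # sentinel: no more batches
        threads.append(thread)

    # Collect exactly one result per worker, in whatever order they finish.
    results = [prediction_queue.get() for _ in threads]
    for thread in threads:
        thread.join()

    per_worker = [predictions for _, _, predictions in results]
    # Ensemble by averaging the workers' predictions for each query.
    return [mean(values) for values in zip(*per_worker)]
```

Because the predictor and the workers only share queues, neither side needs to know how many peers exist or when they finish, which is the decoupling the Cache description refers to.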

docs/src/dev/folder-structure.rst

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ Folder Structure

 - `db/`

-Code for Rafiki's *Database* as an abstract data access layer
+Code for Rafiki's *Metadata Store* as an abstract data access layer

 - `cache/`

docs/src/dev/setup.rst

Lines changed: 34 additions & 27 deletions
@@ -13,11 +13,14 @@ Quick Setup

 We assume development or deployment in a MacOS or Linux environment.

-1. Install Docker 18 & Python 3.6
+1. Install Docker 18 (`Ubuntu <https://docs.docker.com/install/linux/docker-ce/ubuntu/>`__, `MacOS <https://docs.docker.com/docker-for-mac/install/>`__)
+and, if required, add your user to ``docker`` group (`Linux <https://docs.docker.com/install/linux/linux-postinstall/>`__).

-2. Clone the project at https://github.com/nginyc/rafiki
+2. Install Python 3.6 (`Ubuntu <http://ubuntuhandbook.org/index.php/2017/07/install-python-3-6-1-in-ubuntu-16-04-lts/>`__, `MacOS <https://www.python.org/downloads/mac-osx/>`__)

-3. Setup Rafiki's complete stack with the init script:
+3. Clone the project at https://github.com/nginyc/rafiki (e.g. with `Git <https://git-scm.com/downloads>`__)
+
+4. Setup Rafiki's complete stack with the setup script:

 .. code-block:: shell

@@ -34,42 +37,45 @@ To destroy Rafiki's complete stack:
 Scaling Rafiki
 --------------------------------------------------------------------

-Rafiki's default setup runs on a single node, and only runs on CPUs.
+Rafiki's default setup runs on a single machine and only runs its workloads on CPUs.
+
+Rafiki's model training workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
+and are capable of leveraging on `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`__
+
+Scaling Rafiki horizontally and enabling GPU usage involves setting up *Network File System* (*NFS*) at a common path across all nodes,
+installing & configuring the default Docker runtime to `nvidia` for each GPU-bearing node, and putting all these nodes into a single Docker Swarm.

-Rafiki has with its dynamic stack (e.g. train workers, inference workes, predictors)
-running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
-It runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_,
-using it for networking amongst nodes.
+.. seealso:: :ref:`architecture`

-Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
-and are capable of leveraging on `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_
-across Rafiki's nodes.
+To run Rafiki on multiple machines with GPUs, do the following:

-To scale Rafiki horizontally and enable running on GPUs, do the following:
+1. If Rafiki is running, stop Rafiki with ``bash scripts/stop.sh``

-1. If Rafiki is running, stop Rafiki, and have the master node leave its Docker Swarm
+2. Have all nodes `leave any Docker Swarm <https://docs.docker.com/engine/reference/commandline/swarm_leave/>`__ they are in

-2. Put every worker node and the master node into a common network,
-and change ``DOCKER_SWARM_ADVERTISE_ADDR`` in ``.env.sh`` to the IP address of the master node
-in *the network that your worker nodes are in*
+3. Set up NFS such that the *master node is a NFS host*, *other nodes are NFS clients*, and the master node *shares an ancestor directory
+containing Rafiki's project directory*. `Here are instructions for Ubuntu <https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04>`__

-3. For every node, including the master node, ensure the `firewall rules
+4. All nodes should be in a common network. On the *master node*, change ``DOCKER_SWARM_ADVERTISE_ADDR`` in the project's ``.env.sh`` to the IP address of the master node
+in *the network that your nodes are in*
+
+5. For *each node* (including the master node), ensure the `firewall rules
 allow TCP & UDP traffic on ports 2377, 7946 and 4789
 <https://docs.docker.com/network/overlay/#operations-for-all-overlay-networks>`_

-4. For every node that has GPUs:
+6. For *each node that has GPUs*:

-4.1. `Install NVIDIA drivers <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for CUDA *9.0* or above
+6.1. `Install NVIDIA drivers <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`__ for CUDA *9.0* or above

-4.2. `Install nvidia-docker2 <https://github.com/NVIDIA/nvidia-docker>`_
+6.2. `Install nvidia-docker2 <https://github.com/NVIDIA/nvidia-docker>`__

-4.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)
+6.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`__)

-5. Start Rafiki with ``bash scripts/start.sh``
+7. On the *master node*, start Rafiki with ``bash scripts/start.sh``

-6. For every worker node, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_
+8. For *each worker node*, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`__

-7. On the *master* node, for *every* node (including the master node), configure it with the script:
+9. On the *master* node, for *each node* (including the master node), configure it with the script:

 ::

@@ -96,12 +102,13 @@ over ports 3000 and 3001 (by default), assuming incoming connections to these po
 Reading Rafiki's logs
 --------------------------------------------------------------------

-By default, you can read logs of Rafiki Admin, Rafiki Advisor & any of Rafiki's services
-in `./logs` directory at the root of the project's directory of the master node.
+By default, you can read logs of Rafiki Admin, Rafiki Advisor & any of Rafiki's workers
+in ``./logs`` directory at the root of the project's directory of the master node.


 Troubleshooting
 --------------------------------------------------------------------

 Q: There seems to be connectivity issues amongst containers across nodes!
-A: Ensure that containers are able to communicate with one another through the Docker Swarm overlay network: https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers
+
+A: `Ensure that containers are able to communicate with one another through the Docker Swarm overlay network <https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers>`__
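
Note on the firewall step and the troubleshooting entry above: a quick way to narrow down cross-node connectivity problems is to probe the master node's Swarm ports from a worker node. The sketch below is only an illustration and is not part of Rafiki. It probes TCP only, so it covers ports 2377 and 7946; port 4789 carries UDP overlay traffic, which a plain connect cannot confirm. The function names are hypothetical, and the probe succeeds only if something is listening on the target, so run it after Rafiki has started on the master node.

```python
import socket

# TCP ports from the setup steps; 4789 is UDP traffic and is not probed here.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_swarm_ports(host, ports=SWARM_TCP_PORTS, timeout=1.0):
    """Map each port to whether it accepted a TCP connection on host."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, ``check_swarm_ports("<master node IP>")`` run on a worker node returns a dictionary keyed by port; a ``False`` entry points at a firewall rule or at the master node not yet running the Swarm.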
-123 KB
Binary file not shown.
445 KB
Lines changed: 3 additions & 8 deletions
@@ -1,15 +1,10 @@
-1. Install Python 3.6
+1. Install Python 3.6 (`Ubuntu <http://ubuntuhandbook.org/index.php/2017/07/install-python-3-6-1-in-ubuntu-16-04-lts/>`__, `MacOS <https://www.python.org/downloads/mac-osx/>`__)

-2. Clone the project at https://github.com/nginyc/rafiki
+2. Clone the project at https://github.com/nginyc/rafiki (e.g. with `Git <https://git-scm.com/downloads>`__)

 3. Within the project's root folder, install Rafiki Client's Python dependencies by running:

 ::

-pip install -r ./rafiki/client/requirements.txt
+pip3.6 install -r ./rafiki/client/requirements.txt

-4. Set ``$PYTHONPATH`` to the project's root folder:
-
-::
-
-export PYTHONPATH=$PWD
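
Note on the change above: with the explicit ``$PYTHONPATH`` step removed, importing ``rafiki`` relies on the environment set elsewhere (``.env.sh`` exports ``PYTHONPATH=$PWD``). A small check like the following can confirm that the active interpreter is recent enough and can locate the package. It is a sketch rather than project code; the helper names are hypothetical, and the 3.6 minimum is taken from the installation steps.

```python
import importlib.util
import sys


def python_at_least(minimum=(3, 6)):
    """Return True if the running interpreter is at least the given version."""
    return sys.version_info[:2] >= tuple(minimum)


def module_available(name):
    """Return True if a top-level module can be found on the current search path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python >= 3.6:", python_at_least())
    print("rafiki importable:", module_available("rafiki"))
```

If the second line prints ``False``, the project root is probably not on the module search path for the current shell.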

docs/src/user/creating-models.rst

Lines changed: 3 additions & 2 deletions
@@ -72,10 +72,11 @@ To illustrate how to write models on Rafiki, we have written the following:
 - Sample pre-processing logic to convert common dataset formats to Rafiki's own dataset formats in `./examples/datasets/ <https://github.com/nginyc/rafiki/tree/master/examples/datasets/>`_
 - Sample models in `./examples/models/ <https://github.com/nginyc/rafiki/tree/master/examples/models/>`_

-To start testing your model, first install the Python dependencies at ``rafiki/model/requirements.txt``:
+To start testing your model, first run the following:

 .. code-block:: shell

+source .env.sh
 pip install -r rafiki/model/requirements.txt


@@ -100,7 +101,7 @@ Example: Testing Models for ``IMAGE_CLASSIFICATION``
 .. code-block:: shell

 python examples/models/image_classification/SkDt.py
-python examples/models/image_classification/TfSingleHiddenLayer.py
+python examples/models/image_classification/TfFeedForward.py


 Example: Testing Models for ``POS_TAGGING``
