This repository was archived by the owner on Feb 20, 2024. It is now read-only.
README.md (7 additions, 5 deletions)
@@ -8,17 +8,19 @@ Read Rafiki's full documentation at https://nginyc.github.io/rafiki/docs/latest
Prerequisites: MacOS or Linux environment
- 1. Install Docker 18
+ 1. Install Docker 18 ([Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/), [MacOS](https://docs.docker.com/docker-for-mac/install/)) and, if required, add your user to the `docker` group ([Linux](https://docs.docker.com/install/linux/linux-postinstall/)) (a command sketch follows these steps)
- 3. Setup Rafiki's complete stack with the init script:
+ 3. Clone this project (e.g. with [Git](https://git-scm.com/downloads))
+ 4. Setup Rafiki's complete stack with the setup script:
```sh
bash scripts/start.sh
```
- 4. To destroy Rafiki's complete stack:
+ To destroy Rafiki's complete stack:
```sh
bash scripts/stop.sh
```
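As a point of reference, here is a minimal end-to-end sketch of the steps above on a fresh Ubuntu machine, assuming Docker 18+ is already installed and the project is cloned over HTTPS:

```sh
# Assumption: Docker 18+ is installed as per step 1.
# Add your user to the `docker` group so Docker runs without sudo,
# then log out and back in (or run `newgrp docker`) for it to take effect.
sudo usermod -aG docker $USER

# Clone the project and enter it.
git clone https://github.com/nginyc/rafiki.git
cd rafiki

# Bring up Rafiki's complete stack...
bash scripts/start.sh

# ...and tear it down when you are done.
bash scripts/stop.sh
```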
@@ -29,7 +31,7 @@ More instructions are available in [Rafiki's Developer Guide](https://nginyc.git
## Issues
- Report the issues at [JIRA](https://issues.apache.org/jira/browse/SINGA) or [Github](https://github.com/nginyc/rafiki/issues)
+ Report any issues at [Apache SINGA's JIRA](https://issues.apache.org/jira/browse/SINGA) or [Rafiki's Github Issues](https://github.com/nginyc/rafiki/issues)
+ *Rafiki Admin* (*Python/Flask*) is the centrepiece of Rafiki. It is a multi-threaded HTTP server that exposes a unified REST API for administering the Rafiki instance. When users send requests to Rafiki Admin, it handles them by modifying Rafiki's Metadata Store accordingly, or by deploying/stopping the dynamic components of Rafiki's stack (i.e. workers for model training & serving).
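For a rough sense of what a unified REST API over HTTP means in practice, a client could talk to Admin with plain `curl`. The port, endpoint path, and payload below are illustrative assumptions rather than Rafiki's documented API; the *Rafiki Client* SDK described further down wraps calls of this kind:

```sh
# Hypothetical call to Admin's REST API -- the port, path and credentials below
# are assumptions for illustration only, not Rafiki's actual endpoints.
ADMIN_HOST=127.0.0.1
ADMIN_PORT=3000   # assumption; check the project's .env.sh for the configured port

curl -s -X POST "http://${ADMIN_HOST}:${ADMIN_PORT}/users/login" \
  -H 'Content-Type: application/json' \
  -d '{"email": "user@example.com", "password": "secret"}'
```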
+ *Rafiki Metadata Store* (*PostgreSQL*) is Rafiki's centralized, persistent database for user metadata, job metadata, worker metadata and model templates.
+ *Rafiki Advisor* (*Python/Flask*) is Rafiki's advisor as described in the earlier sections. It is a single-threaded HTTP server. It accepts new advisory sessions from multiple Rafiki Train Workers, generates proposals of Knobs for them, and receives feedback for completed Trials in a Train Job.
- Container Diagram for Rafiki
+ *Rafiki Cache* (*Redis*) is Rafiki's temporary in-memory store that implements fast, asynchronous cross-worker communication in a way that decouples senders from receivers. It synchronizes the back-and-forth of queries & predictions between multiple Rafiki Inference Workers and a single Rafiki Predictor for an Inference Job.
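The decoupling described above is essentially a queue held in Redis: a sender pushes a message, and whichever receiver is free pops it later, so neither side needs to know about the other. A minimal sketch with `redis-cli`, using a made-up key name and payload rather than Rafiki's actual schema:

```sh
# Illustration of the queue pattern only -- key name and payload are made up.
# A Predictor would push a pending query...
redis-cli LPUSH inference_queries '{"query_id": "q1", "payload": "..."}'

# ...and an Inference Worker would block until a query arrives, then pop it.
redis-cli BRPOP inference_queries 0
```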
+ *Rafiki Web Admin* (*NodeJS/ExpressJS*) is an HTTP server that serves Rafiki's web front-end to users, allowing Application Developers to survey their jobs on a friendly web GUI.
+ *Rafiki Client* (*Python*) is Rafiki's client-side Python SDK that simplifies communication with Admin.
+ On the other hand, Rafiki's dynamic stack consists of a dynamic pool of workers.
+ Internally within Rafiki's architecture, Admin adopts master-slave relationships with these workers, managing their deployment and termination in real time depending on Train Job and Inference Job requests, as well as the stream of events it receives from its workers.
+ When a worker is deployed, it is configured with the identifier of its associated job; once it starts running, it first initializes itself by pulling the job's metadata from the Metadata Store before starting on its task.
+ The types of workers are as follows:
+ *Rafiki Train Workers* (*Python*) train models for Train Jobs by conducting Trials. In a single Train Job, there could be multiple Train Workers concurrently training models.
+ *Rafiki Predictors* (*Python/Flask*) are multi-threaded HTTP servers that receive queries from Application Users and respond with predictions as part of an Inference Job. They do this through producer-consumer relationships with multiple Rafiki Inference Workers, and, if necessary, perform model ensembling on the predictions received from different workers.
+ *Rafiki Inference Workers* (*Python*) serve models for Inference Jobs. In a single Inference Job, there could be multiple Inference Workers concurrently making predictions for a single batch of queries.
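From an Application User's perspective, a Predictor is just an HTTP endpoint: send a query, receive a (possibly ensembled) prediction. A hypothetical query is sketched below; the host, port, path, and payload shape are placeholders, not Rafiki's documented predictor API:

```sh
# Hypothetical query to a running Predictor -- replace host, port and path with
# the values Rafiki reports for your Inference Job; the payload shape is made up.
PREDICTOR_URL=http://127.0.0.1:30000   # placeholder

curl -s -X POST "${PREDICTOR_URL}/predict" \
  -H 'Content-Type: application/json' \
  -d '{"query": [[0.1, 0.2, 0.3]]}'
```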
- These components make up Rafiki's static stack.
- *Admin* is an HTTP server that handles requests from users, and accordingly updates Rafiki's database or deploys components (e.g. workers, predictors) based on these requests
- *Admin Web* is an HTTP server that serves a Web UI for Admin
- *Client* is a client-side Python SDK for sending requests to Admin
- *Advisor* is an HTTP server that generates proposals of knobs during training
- *Database* is Rafiki's main store for user, train job, inference job, model templates, and trained model data, including model parameters
- *Cache* is Rafiki's temporary store for queries & predictions during inference
The environment and configuration of each of Rafiki's components has been fully specified as a replicable, portable Docker image, publicly available as Dockerfiles and on `Rafiki's own Docker Hub account <https://hub.docker.com/u/rafikiai>`__.
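One way to see these images locally is to list what Docker pulled after the stack has been brought up; the account name comes from the link above, while individual image names and tags may differ between releases:

```sh
# After `bash scripts/start.sh`, list the Rafiki images that Docker pulled.
docker images "rafikiai/*"

# Individual images under https://hub.docker.com/u/rafikiai can also be pulled directly:
# docker pull rafikiai/<image-name>:<tag>
```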
- These components are dynamically deployed or stopped by Admin depending on the statuses of train or inference jobs.
+ When an instance of Rafiki is deployed on the master node, a `Docker Swarm <https://docs.docker.com/engine/swarm/key-concepts/>`__ is initialized and all of Rafiki's components run within a single `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`__.
+ Subsequently, Rafiki can be horizontally scaled by adding more worker nodes to the Docker Swarm. Dynamically-deployed workers run as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`__ and are placed in a resource-aware manner.
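For orientation, the Docker-level picture described above looks roughly like the following; these commands are normally run for you (presumably by ``scripts/start.sh``) and are shown only as a sketch:

```sh
# Sketch of the Swarm-level state described above; shown only for orientation.
docker swarm init --advertise-addr <master-ip>   # initialize the Swarm on the master node

docker node ls                                   # nodes participating in the Swarm
docker network ls --filter driver=overlay        # the routing-mesh overlay network(s)
docker service ls                                # dynamically-deployed workers run as Services
```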
- Each *Train Worker* is a Python program that trains models associated with a train job,
- Each *Inference Worker* is a Python program that makes batch predictions with trained models associated with an inference job
- Each *Predictor* is an HTTP server that receives queries from users and responds with predictions, associated with an inference job
- Rafiki's default setup runs on a single node, and only runs on CPUs.
+ Rafiki's default setup runs on a single machine and only runs its workloads on CPUs.
+ Rafiki's model training workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``, and are capable of leveraging `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`__.
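A quick sanity check that a node can actually run containers derived from this image on its GPUs is to ask for ``nvidia-smi`` output through the ``nvidia`` runtime; this assumes the NVIDIA drivers and ``nvidia-docker2`` (or an equivalent ``nvidia`` runtime) are already installed:

```sh
# Sanity check on a GPU-bearing node: run nvidia-smi inside the same base image
# that Rafiki's training workers extend. Requires NVIDIA drivers and the `nvidia` runtime.
docker run --rm --runtime=nvidia nvidia/cuda:9.0-runtime-ubuntu16.04 nvidia-smi
```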
+ Scaling Rafiki horizontally and enabling GPU usage involves setting up a *Network File System* (*NFS*) at a common path across all nodes, installing & configuring the default Docker runtime to `nvidia` on each GPU-bearing node, and putting all these nodes into a single Docker Swarm.
- Rafiki has its dynamic stack (e.g. train workers, inference workers, predictors) running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_. It runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_, using it for networking amongst nodes.
+ .. seealso:: :ref:`architecture`
- Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``, and are capable of leveraging `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_ across Rafiki's nodes.
+ To run Rafiki on multiple machines with GPUs, do the following (a consolidated command sketch follows these steps):
- To scale Rafiki horizontally and enable running on GPUs, do the following:
- 1. If Rafiki is running, stop Rafiki, and have the master node leave its Docker Swarm
+ 1. If Rafiki is running, stop Rafiki with ``bash scripts/stop.sh``
+ 2. Have all nodes `leave any Docker Swarm <https://docs.docker.com/engine/reference/commandline/swarm_leave/>`__ they are in
- 2. Put every worker node and the master node into a common network, and change ``DOCKER_SWARM_ADVERTISE_ADDR`` in ``.env.sh`` to the IP address of the master node in *the network that your worker nodes are in*
+ 3. Set up NFS such that the *master node is an NFS host*, *other nodes are NFS clients*, and the master node *shares an ancestor directory containing Rafiki's project directory*. `Here are instructions for Ubuntu <https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04>`__
- 3. For every node, including the master node, ensure the `firewall rules
+ 4. All nodes should be in a common network. On the *master node*, change ``DOCKER_SWARM_ADVERTISE_ADDR`` in the project's ``.env.sh`` to the IP address of the master node in *the network that your nodes are in*
+ 5. For *each node* (including the master node), ensure the `firewall rules
allow TCP & UDP traffic on ports 2377, 7946 and 4789
- 4.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)
+ 6.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`__)
- 5. Start Rafiki with ``bash scripts/start.sh``
+ 7. On the *master node*, start Rafiki with ``bash scripts/start.sh``
- 6. For every worker node, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_
+ 8. For *each worker node*, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`__
- 7. On the *master* node, for *every* node (including the master node), configure it with the script:
+ 9. On the *master* node, for *each node* (including the master node), configure it with the script:
::
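The steps above involve several pieces of system configuration. The sketch below strings them together for an Ubuntu-based cluster; the paths, IP addresses, and the exact ``.env.sh`` syntax are assumptions, and the NFS and ``daemon.json`` snippets follow the external guides linked from steps 3 and 6.3 rather than anything Rafiki-specific:

```sh
### On the master node (the NFS host) -- paths and IPs below are example assumptions ###
sudo apt-get install -y nfs-kernel-server
# Share an ancestor directory of Rafiki's project directory (here /home/rafiki) with a worker node:
echo '/home/rafiki 192.168.1.11(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo systemctl restart nfs-kernel-server
# Point DOCKER_SWARM_ADVERTISE_ADDR at the master's IP in the nodes' common network
# (assuming .env.sh holds plain shell exports):
#   export DOCKER_SWARM_ADVERTISE_ADDR=192.168.1.10

### On every node (master and workers): open the Swarm ports for TCP & UDP ###
sudo ufw allow 2377 && sudo ufw allow 7946 && sudo ufw allow 4789

### On each GPU-bearing node: make `nvidia` the default Docker runtime ###
# (assumes nvidia-docker2 / nvidia-container-runtime is already installed)
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  }
}
EOF
sudo systemctl restart docker

### On each worker node (an NFS client): mount the share, then join the Swarm ###
sudo apt-get install -y nfs-common
sudo mkdir -p /home/rafiki
sudo mount 192.168.1.10:/home/rafiki /home/rafiki
# After Rafiki has been started on the master, get the join command there with
# `docker swarm join-token worker`, then run it here, e.g.:
docker swarm join --token <worker-join-token> 192.168.1.10:2377
```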
@@ -96,12 +102,13 @@ over ports 3000 and 3001 (by default), assuming incoming connections to these po
Q: There seem to be connectivity issues amongst containers across nodes!
- A: Ensure that containers are able to communicate with one another through the Docker Swarm overlay network: https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers
+ A: `Ensure that containers are able to communicate with one another through the Docker Swarm overlay network <https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers>`__
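The linked tutorial boils down to attaching throwaway containers on two different nodes to a shared, attachable overlay network and checking that they can ping each other by name; a condensed sketch of those commands:

```sh
# On the Swarm manager: create an attachable overlay network and a test container.
docker network create --driver overlay --attachable test-net
docker run -dit --name alpine1 --network test-net alpine

# On a worker node: attach a second container and ping the first by name.
docker run -dit --name alpine2 --network test-net alpine
docker exec -it alpine2 ping -c 3 alpine1

# Clean up: remove each container on its own node, then `docker network rm test-net` on the manager.
```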
docs/src/user/creating-models.rst (3 additions, 2 deletions)
@@ -72,10 +72,11 @@ To illustrate how to write models on Rafiki, we have written the following:
- Sample pre-processing logic to convert common dataset formats to Rafiki's own dataset formats in `./examples/datasets/ <https://github.com/nginyc/rafiki/tree/master/examples/datasets/>`_
- Sample models in `./examples/models/ <https://github.com/nginyc/rafiki/tree/master/examples/models/>`_
- To start testing your model, first install the Python dependencies at ``rafiki/model/requirements.txt``:
+ To start testing your model, first run the following:
.. code-block:: shell
+ source .env.sh
pip install -r rafiki/model/requirements.txt
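If you prefer to keep these dependencies out of your system Python, a virtual environment works as well; using ``venv`` here is a suggestion rather than something the guide prescribes:

```sh
# Optional: isolate the model-testing dependencies in a virtual environment (assumes Python 3).
python3 -m venv .venv
source .venv/bin/activate
source .env.sh
pip install -r rafiki/model/requirements.txt
```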
@@ -100,7 +101,7 @@ Example: Testing Models for ``IMAGE_CLASSIFICATION``