This repository was archived by the owner on Feb 20, 2024. It is now read-only.
-While building Rafiki's images locally, if you encounter an error like "No space left on device", you might be running out of space allocated for Docker. Try removing all containers & images:
+While building Rafiki's images locally, if you encounter errors like "No space left on device",
+you might be running out of space allocated for Docker. Try one of the following:

-.. code-block:: shell
+::
+
+    # Prunes dangling images
+    docker system prune
+
+::

     # Delete all containers
     docker rm $(docker ps -a -q)
     # Delete all images
     docker rmi $(docker images -q)

 From Mac Mojave onwards, due to Mac's new `privacy protection feature <https://www.howtogeek.com/361707/how-macos-mojaves-privacy-protection-works/>`_,
-you might need to explicitly give Docker *Full Disk Access*, restart Docker, or even do a factory reset of Docker.
+you might need to explicitly give Docker *Full Disk Access*, restart Docker, or even do a factory reset of Docker.
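
The Docker cleanup steps above escalate from a mild prune to deleting everything. A minimal sketch of that escalation follows; the helper names ``run`` and ``docker_cleanup`` and the ``DRY_RUN`` switch are illustrative assumptions, not part of Rafiki. By default it only prints the commands it would run. Note that ``docker system prune`` removes stopped containers, unused networks and build cache in addition to dangling images.

```shell
#!/usr/bin/env bash
# Sketch only: escalate Docker cleanup step by step.
# DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

docker_cleanup() {
    # Show how much space images, containers and volumes currently use
    run docker system df
    # Mild step: stopped containers, unused networks, dangling images
    run docker system prune
    # Destructive step, only with --all: delete all containers, then all images
    if [ "${1:-}" = "--all" ]; then
        run sh -c 'docker rm $(docker ps -a -q)'
        run sh -c 'docker rmi $(docker images -q)'
    fi
}
```

After sourcing the file, ``docker_cleanup`` previews the mild step and ``docker_cleanup --all`` previews the full wipe; prefix either with ``DRY_RUN=0`` to execute.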
 Rafiki's default setup runs on a single node, and only runs on CPUs.
+
 Rafiki has its dynamic stack (e.g. train workers, inference workers, predictors)
-running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
+running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
+It runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_,
+using it for networking amongst nodes.

-Horizontal scaling can be done by adding more nodes to the swarm.
+Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
+and are capable of leveraging `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_
+across Rafiki's nodes.

-Perform the following for *each* worker node to be added:
+To scale Rafiki horizontally and enable running on GPUs, do the following:

-1. Connect the node to the same network as the master, so that the node can `join the master's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_.
+1. If Rafiki is running, stop Rafiki, and have the master node leave its Docker Swarm

-2. Configure the node with the script:
+2. Put every worker node and the master node into a common network,
+and change ``DOCKER_SWARM_ADVERTISE_ADDR`` in ``.env.sh`` to the IP address of the master node
+in *the network that your worker nodes are in*

-.. code-block:: shell
+3. For every node, including the master node, ensure the `firewall rules
+allow TCP & UDP traffic on ports 2377, 7946 and 4789
 4.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)

-Rafiki runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_, with
-Rafiki Admin and Rafiki Admin Web running only on the master node.
+5. Start Rafiki with ``bash scripts/start.sh``

-Edit the following line in ``.env.sh`` with the IP address of the master node in the network you intend to expose Rafiki:
+6. For every worker node, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_

-.. code-block:: shell
+7. On the *master* node, for *every* node (including the master node), configure it with the script:

-    export RAFIKI_IP_ADDRESS=127.0.0.1
+::
+
+    bash scripts/setup_node.sh
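
The multi-node steps above can be sketched as a set of shell helpers. This is a sketch under assumptions rather than Rafiki's own tooling: the function names, the ``DRY_RUN`` switch and the placeholder ``MASTER_IP`` are hypothetical, and ``ufw`` is assumed as the firewall front end (other distributions use different tools). Only ``scripts/start.sh`` and ``scripts/setup_node.sh`` come from the text. By default the helpers only print what they would run.

```shell
#!/usr/bin/env bash
# Sketch only: the swarm scale-out steps as dry-run helpers.
# DRY_RUN defaults to 1 (print only). MASTER_IP is a placeholder address.

MASTER_IP="${MASTER_IP:-192.0.2.10}"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1 (master): leave the existing swarm after stopping Rafiki
master_leave_swarm() {
    run docker swarm leave --force
}

# Step 2 is a manual edit: set DOCKER_SWARM_ADVERTISE_ADDR in .env.sh
# to the master's IP address in the network shared with the workers.

# Step 3 (every node): allow TCP and UDP on the swarm ports.
# Assumes ufw; "ufw allow <port>" without a protocol covers both.
open_swarm_ports() {
    for port in 2377 7946 4789; do
        run sudo ufw allow "$port"
    done
}

# Step 5 (master): start Rafiki, then print the join command for workers
master_start() {
    run bash scripts/start.sh
    run docker swarm join-token worker
}

# Step 6 (worker): join using the token printed on the master
worker_join() {
    run docker swarm join --token "$1" "${MASTER_IP}:2377"
}

# Step 7 (master): run the node setup script, once per node
configure_node() {
    run bash scripts/setup_node.sh
}
```

Each helper is meant to be run on the node named in its comment; keeping them separate mirrors the fact that the steps alternate between the master and the workers.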
-Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address over ports 3000 and 3001,
-assuming incoming connections to these ports are allowed.
 Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address,
+over ports 3000 and 3001 (by default), assuming incoming connections to these ports are allowed.

-3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)
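
After re-deploying, whether Rafiki Admin and Rafiki Admin Web are reachable on their default ports (3000 and 3001) can be checked from another machine. The sketch below is one possible way to do so, not something the Rafiki scripts provide; ``port_open`` and ``check_rafiki_ports`` are hypothetical names, and the check relies on bash's ``/dev/tcp`` pseudo-device, so it needs bash rather than a plain POSIX shell.

```shell
#!/usr/bin/env bash
# Sketch only: test TCP reachability of the default Rafiki ports.

# Returns 0 if a TCP connection to host $1 on port $2 succeeds.
# A silently filtered port may block until the OS connect timeout.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Checks ports 3000 and 3001 on the given host; returns 1 if either fails.
check_rafiki_ports() {
    local host="$1" port status=0
    for port in 3000 3001; do
        if port_open "$host" "$port"; then
            echo "port $port on $host: reachable"
        else
            echo "port $port on $host: NOT reachable"
            status=1
        fi
    done
    return "$status"
}
```

Calling ``check_rafiki_ports`` with the address set in ``.env.sh`` reports each port separately, which helps tell a firewall problem on one port from a deployment that is not running at all.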
 Q: There seem to be connectivity issues amongst containers across nodes!
+A: Ensure that containers are able to communicate with one another through the Docker Swarm overlay network: https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers
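
The overlay connectivity check referred to in the answer above can be sketched roughly as follows, loosely following the linked Docker tutorial. The network name ``test-net``, the container name ``alpine1`` and the helper names are illustrative choices, not Rafiki's own. The helpers print the commands by default (``DRY_RUN=1``) instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: verify cross-node container connectivity over an overlay network.
# DRY_RUN defaults to 1 (print only).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# On the master (a swarm manager): create an attachable overlay network
# and start a long-running test container on it
overlay_test_master() {
    run docker network create --driver=overlay --attachable test-net
    run docker run -dit --name alpine1 --network test-net alpine
}

# On a worker: ping the master's container by name across the overlay
overlay_test_worker() {
    run docker run --rm --network test-net alpine ping -c 2 alpine1
}

# On the master: remove the test container and network afterwards
overlay_test_cleanup() {
    run docker rm -f alpine1
    run docker network rm test-net
}
```

If the ping from the worker fails while both nodes are in the swarm, the firewall rules for ports 2377, 7946 and 4789 are a likely place to look first.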