This repository was archived by the owner on Feb 20, 2024. It is now read-only.
README.md (7 additions, 5 deletions)
@@ -8,17 +8,19 @@ Read Rafiki's full documentation at https://nginyc.github.io/rafiki/docs/latest
Prerequisites: MacOS or Linux environment
- 1. Install Docker 18
+ 1. Install Docker 18 ([Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/), [MacOS](https://docs.docker.com/docker-for-mac/install/)) and, if required, add your user to the `docker` group ([Linux](https://docs.docker.com/install/linux/linux-postinstall/)) (a command sketch follows these steps)
- 3. Setup Rafiki's complete stack with the init script:
+ 3. Clone this project (e.g. with [Git](https://git-scm.com/downloads))
+ 4. Setup Rafiki's complete stack with the setup script:
```sh
bash scripts/start.sh
```
- 4. To destroy Rafiki's complete stack:
+ To destroy Rafiki's complete stack:
```sh
bash scripts/stop.sh
```
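As a point of reference, here is a minimal end-to-end sketch of the steps above on a fresh Ubuntu machine, assuming Docker 18+ is already installed and the project is cloned over HTTPS:

```sh
# Assumption: Docker 18+ is installed as per step 1.
# Add your user to the `docker` group so Docker runs without sudo,
# then log out and back in (or run `newgrp docker`) for it to take effect.
sudo usermod -aG docker $USER

# Clone the project and enter it.
git clone https://github.com/nginyc/rafiki.git
cd rafiki

# Bring up Rafiki's complete stack...
bash scripts/start.sh

# ...and tear it down when you are done.
bash scripts/stop.sh
```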
@@ -29,7 +31,7 @@ More instructions are available in [Rafiki's Developer Guide](https://nginyc.git
## Issues
- Report the issues at [JIRA](https://issues.apache.org/jira/browse/SINGA) or [Github](https://github.com/nginyc/rafiki/issues)
+ Report any issues at [Apache SINGA's JIRA](https://issues.apache.org/jira/browse/SINGA) or [Rafiki's Github Issues](https://github.com/nginyc/rafiki/issues)
+ *Rafiki Admin* (*Python/Flask*) is the centrepiece of Rafiki. It is a multi-threaded HTTP server that exposes a unified REST API for administering the Rafiki instance. When users send requests to Rafiki Admin, it handles them by modifying Rafiki's Metadata Store accordingly, or by deploying/stopping the dynamic components of Rafiki's stack (i.e. workers for model training & serving).
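For a rough sense of what a unified REST API over HTTP means in practice, a client could talk to Admin with plain `curl`. The port, endpoint path, and payload below are illustrative assumptions rather than Rafiki's documented API; the *Rafiki Client* SDK described further down wraps calls of this kind:

```sh
# Hypothetical call to Admin's REST API -- the port, path and credentials below
# are assumptions for illustration only, not Rafiki's actual endpoints.
ADMIN_HOST=127.0.0.1
ADMIN_PORT=3000   # assumption; check the project's .env.sh for the configured port

curl -s -X POST "http://${ADMIN_HOST}:${ADMIN_PORT}/users/login" \
  -H 'Content-Type: application/json' \
  -d '{"email": "user@example.com", "password": "secret"}'
```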
+ *Rafiki Metadata Store* (*PostgreSQL*) is Rafiki's centralized, persistent database for user metadata, job metadata, worker metadata and model templates.
+ *Rafiki Advisor* (*Python/Flask*) is Rafiki's advisor as described in the earlier sections. It is a single-threaded HTTP server. It accepts new advisory sessions from multiple Rafiki Train Workers, generates proposals of Knobs for them, and receives feedback for completed Trials in a Train Job.
- Container Diagram for Rafiki
+ *Rafiki Cache* (*Redis*) is Rafiki's temporary in-memory store that implements fast, asynchronous cross-worker communication in a way that decouples senders from receivers. It synchronizes the back-and-forth of queries & predictions between multiple Rafiki Inference Workers and a single Rafiki Predictor for an Inference Job.
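The decoupling described above is essentially a queue held in Redis: a sender pushes a message, and whichever receiver is free pops it later, so neither side needs to know about the other. A minimal sketch with `redis-cli`, using a made-up key name and payload rather than Rafiki's actual schema:

```sh
# Illustration of the queue pattern only -- key name and payload are made up.
# A Predictor would push a pending query...
redis-cli LPUSH inference_queries '{"query_id": "q1", "payload": "..."}'

# ...and an Inference Worker would block until a query arrives, then pop it.
redis-cli BRPOP inference_queries 0
```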
+ *Rafiki Web Admin* (*NodeJS/ExpressJS*) is an HTTP server that serves Rafiki's web front-end to users, allowing Application Developers to survey their jobs on a friendly web GUI.
+ *Rafiki Client* (*Python*) is Rafiki's client-side Python SDK that simplifies communication with Admin.
+ On the other hand, Rafiki's dynamic stack consists of a dynamic pool of workers.
+ Internally within Rafiki's architecture, Admin adopts master-slave relationships with these workers, managing their deployment and termination in real time depending on Train Job and Inference Job requests, as well as the stream of events it receives from its workers.
+ When a worker is deployed, it is configured with the identifier of its associated job; once it starts running, it first initializes itself by pulling the job's metadata from the Metadata Store before starting on its task.
+ The types of workers are as follows:
+ *Rafiki Train Workers* (*Python*) train models for Train Jobs by conducting Trials. In a single Train Job, there could be multiple Train Workers concurrently training models.
+ *Rafiki Predictors* (*Python/Flask*) are multi-threaded HTTP servers that receive queries from Application Users and respond with predictions as part of an Inference Job. They do this through producer-consumer relationships with multiple Rafiki Inference Workers, and, if necessary, perform model ensembling on the predictions received from different workers.
+ *Rafiki Inference Workers* (*Python*) serve models for Inference Jobs. In a single Inference Job, there could be multiple Inference Workers concurrently making predictions for a single batch of queries.
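From an Application User's perspective, a Predictor is just an HTTP endpoint: send a query, receive a (possibly ensembled) prediction. A hypothetical query is sketched below; the host, port, path, and payload shape are placeholders, not Rafiki's documented predictor API:

```sh
# Hypothetical query to a running Predictor -- replace host, port and path with
# the values Rafiki reports for your Inference Job; the payload shape is made up.
PREDICTOR_URL=http://127.0.0.1:30000   # placeholder

curl -s -X POST "${PREDICTOR_URL}/predict" \
  -H 'Content-Type: application/json' \
  -d '{"query": [[0.1, 0.2, 0.3]]}'
```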
- These components make up Rafiki's static stack.
- *Admin* is an HTTP server that handles requests from users, and accordingly updates Rafiki's database or deploys components (e.g. workers, predictors) based on these requests
- *Admin Web* is an HTTP server that serves a Web UI for Admin
- *Client* is a client-side Python SDK for sending requests to Admin
- *Advisor* is an HTTP server that generates proposals of knobs during training
- *Database* is Rafiki's main store for user, train job, inference job, model templates, and trained model data, including model parameters
- *Cache* is Rafiki's temporary store for queries & predictions during inference
The environment and configuration of each of Rafiki's components has been fully specified as a replicable, portable Docker image, publicly available as Dockerfiles and on `Rafiki's own Docker Hub account <https://hub.docker.com/u/rafikiai>`__.
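One way to see these images locally is to list what Docker pulled after the stack has been brought up; the account name comes from the link above, while individual image names and tags may differ between releases:

```sh
# After `bash scripts/start.sh`, list the Rafiki images that Docker pulled.
docker images "rafikiai/*"

# Individual images under https://hub.docker.com/u/rafikiai can also be pulled directly:
# docker pull rafikiai/<image-name>:<tag>
```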
- These components are dynamically deployed or stopped by Admin depending on the statuses of train or inference jobs.
+ When an instance of Rafiki is deployed on the master node, a `Docker Swarm <https://docs.docker.com/engine/swarm/key-concepts/>`__ is initialized and all of Rafiki's components run within a single `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`__.
+ Subsequently, Rafiki can be horizontally scaled by adding more worker nodes to the Docker Swarm. Dynamically-deployed workers run as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`__ and are placed in a resource-aware manner.
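For orientation, the Docker-level picture described above looks roughly like the following; these commands are normally run for you (presumably by ``scripts/start.sh``) and are shown only as a sketch:

```sh
# Sketch of the Swarm-level state described above; shown only for orientation.
docker swarm init --advertise-addr <master-ip>   # initialize the Swarm on the master node

docker node ls                                   # nodes participating in the Swarm
docker network ls --filter driver=overlay        # the routing-mesh overlay network(s)
docker service ls                                # dynamically-deployed workers run as Services
```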
- Each *Train Worker* is a Python program that trains models associated with a train job,
- Each *Inference Worker* is a Python program that makes batch predictions with trained models associated with an inference job
- Each *Predictor* is an HTTP server that receives queries from users and responds with predictions, associated with an inference job
- Rafiki's default setup runs on a single node, and only runs on CPUs.
+ Rafiki's default setup runs on a single machine and only runs its workloads on CPUs.
+ Rafiki's model training workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``, and are capable of leveraging `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`__.
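A quick sanity check that a node can actually run containers derived from this image on its GPUs is to ask for ``nvidia-smi`` output through the ``nvidia`` runtime; this assumes the NVIDIA drivers and ``nvidia-docker2`` (or an equivalent ``nvidia`` runtime) are already installed:

```sh
# Sanity check on a GPU-bearing node: run nvidia-smi inside the same base image
# that Rafiki's training workers extend. Requires NVIDIA drivers and the `nvidia` runtime.
docker run --rm --runtime=nvidia nvidia/cuda:9.0-runtime-ubuntu16.04 nvidia-smi
```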
+ Scaling Rafiki horizontally and enabling GPU usage involves setting up a *Network File System* (*NFS*) at a common path across all nodes, installing & configuring the default Docker runtime to `nvidia` on each GPU-bearing node, and putting all these nodes into a single Docker Swarm.
- Rafiki has its dynamic stack (e.g. train workers, inference workers, predictors) running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_. It runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_, using it for networking amongst nodes.
+ .. seealso:: :ref:`architecture`
- Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``, and are capable of leveraging `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_ across Rafiki's nodes.
+ To run Rafiki on multiple machines with GPUs, do the following (a consolidated command sketch follows these steps):
- To scale Rafiki horizontally and enable running on GPUs, do the following:
- 1. If Rafiki is running, stop Rafiki, and have the master node leave its Docker Swarm
+ 1. If Rafiki is running, stop Rafiki with ``bash scripts/stop.sh``
+ 2. Have all nodes `leave any Docker Swarm <https://docs.docker.com/engine/reference/commandline/swarm_leave/>`__ they are in
- 2. Put every worker node and the master node into a common network, and change ``DOCKER_SWARM_ADVERTISE_ADDR`` in ``.env.sh`` to the IP address of the master node in *the network that your worker nodes are in*
+ 3. Set up NFS such that the *master node is an NFS host*, *other nodes are NFS clients*, and the master node *shares an ancestor directory containing Rafiki's project directory*. `Here are instructions for Ubuntu <https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04>`__
- 3. For every node, including the master node, ensure the `firewall rules
+ 4. All nodes should be in a common network. On the *master node*, change ``DOCKER_SWARM_ADVERTISE_ADDR`` in the project's ``.env.sh`` to the IP address of the master node in *the network that your nodes are in*
+ 5. For *each node* (including the master node), ensure the `firewall rules
allow TCP & UDP traffic on ports 2377, 7946 and 4789
- 4.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)
+ 6.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`__)
- 5. Start Rafiki with ``bash scripts/start.sh``
+ 7. On the *master node*, start Rafiki with ``bash scripts/start.sh``
- 6. For every worker node, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_
+ 8. For *each worker node*, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`__
- 7. On the *master* node, for *every* node (including the master node), configure it with the script:
+ 9. On the *master* node, for *each node* (including the master node), configure it with the script:
::
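The steps above involve several pieces of system configuration. The sketch below strings them together for an Ubuntu-based cluster; the paths, IP addresses, and the exact ``.env.sh`` syntax are assumptions, and the NFS and ``daemon.json`` snippets follow the external guides linked from steps 3 and 6.3 rather than anything Rafiki-specific:

```sh
### On the master node (the NFS host) -- paths and IPs below are example assumptions ###
sudo apt-get install -y nfs-kernel-server
# Share an ancestor directory of Rafiki's project directory (here /home/rafiki) with a worker node:
echo '/home/rafiki 192.168.1.11(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo systemctl restart nfs-kernel-server
# Point DOCKER_SWARM_ADVERTISE_ADDR at the master's IP in the nodes' common network
# (assuming .env.sh holds plain shell exports):
#   export DOCKER_SWARM_ADVERTISE_ADDR=192.168.1.10

### On every node (master and workers): open the Swarm ports for TCP & UDP ###
sudo ufw allow 2377 && sudo ufw allow 7946 && sudo ufw allow 4789

### On each GPU-bearing node: make `nvidia` the default Docker runtime ###
# (assumes nvidia-docker2 / nvidia-container-runtime is already installed)
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  }
}
EOF
sudo systemctl restart docker

### On each worker node (an NFS client): mount the share, then join the Swarm ###
sudo apt-get install -y nfs-common
sudo mkdir -p /home/rafiki
sudo mount 192.168.1.10:/home/rafiki /home/rafiki
# After Rafiki has been started on the master, get the join command there with
# `docker swarm join-token worker`, then run it here, e.g.:
docker swarm join --token <worker-join-token> 192.168.1.10:2377
```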
@@ -96,12 +102,13 @@ over ports 3000 and 3001 (by default), assuming incoming connections to these po
Q: There seem to be connectivity issues amongst containers across nodes!
- A: Ensure that containers are able to communicate with one another through the Docker Swarm overlay network: https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers
+ A: `Ensure that containers are able to communicate with one another through the Docker Swarm overlay network <https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers>`__
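The linked tutorial boils down to attaching throwaway containers on two different nodes to a shared, attachable overlay network and checking that they can ping each other by name; a condensed sketch of those commands:

```sh
# On the Swarm manager: create an attachable overlay network and a test container.
docker network create --driver overlay --attachable test-net
docker run -dit --name alpine1 --network test-net alpine

# On a worker node: attach a second container and ping the first by name.
docker run -dit --name alpine2 --network test-net alpine
docker exec -it alpine2 ping -c 3 alpine1

# Clean up: remove each container on its own node, then `docker network rm test-net` on the manager.
```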
docs/src/user/creating-models.rst (3 additions, 2 deletions)
@@ -72,10 +72,11 @@ To illustrate how to write models on Rafiki, we have written the following:
- Sample pre-processing logic to convert common dataset formats to Rafiki's own dataset formats in `./examples/datasets/ <https://github.com/nginyc/rafiki/tree/master/examples/datasets/>`_
- Sample models in `./examples/models/ <https://github.com/nginyc/rafiki/tree/master/examples/models/>`_
- To start testing your model, first install the Python dependencies at ``rafiki/model/requirements.txt``:
+ To start testing your model, first run the following:
.. code-block:: shell
+ source .env.sh
pip install -r rafiki/model/requirements.txt
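If you prefer to keep these dependencies out of your system Python, a virtual environment works as well; using ``venv`` here is a suggestion rather than something the guide prescribes:

```sh
# Optional: isolate the model-testing dependencies in a virtual environment (assumes Python 3).
python3 -m venv .venv
source .venv/bin/activate
source .env.sh
pip install -r rafiki/model/requirements.txt
```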
@@ -100,7 +101,7 @@ Example: Testing Models for ``IMAGE_CLASSIFICATION``