Executing many runs in parallel with ssh cluster targets
If you have a set of machines accessible via SSH, 3X provides a simple way to execute a large number of runs on them in parallel. All you need to do is define a new target of the ssh-cluster type with a list of hostnames, start the queue, and retrieve results periodically until all runs are done. 3X will take all planned runs in the queue and distribute them across the machines you have listed based on how busy each machine is. Unlike with ssh type targets, runs on an ssh-cluster target are executed remotely in complete isolation, so you no longer need to worry about, for example, the laptop running 3X losing power or its Wi-Fi connection.
Use the following command to create a new target.
3x target TARGET define ssh-cluster REMOTE... SHARED_PATH [NAME[=VALUE]]...
For example,
3x target mycluster define ssh-cluster user@foo.example.org:tmp user@bar.example.org:/localdisk/tmp/user differentuser@baz.example.org:/ssd/tmp/user /shared/users/user/3x-shared
will create a target named mycluster that executes runs on three machines:
- foo.example.org (with login user and keeping temporary files under ~/tmp/)
- bar.example.org (with login user and keeping temporary files under /localdisk/tmp/user/)
- baz.example.org (with login differentuser and keeping temporary files under /ssd/tmp/user/)
All shared data is kept under the path /shared/users/user/3x-shared/ (the last argument), which each machine reads from and writes to when executing individual runs.
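To make the REMOTE syntax concrete, the following sketch splits a `user@host:path` argument into the three parts described above. The function name `describe_remote` is hypothetical, and this is not how 3X itself parses the argument. It assumes the login is always given and, as in the example, that a path without a leading slash is relative to the login's home directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of 3X): split USER@HOST:PATH into its parts.
describe_remote() {
    local remote login rest host path
    remote=$1
    login=${remote%%@*}     # text before the first "@"
    rest=${remote#*@}       # text after the first "@"
    host=${rest%%:*}        # text before the first ":"
    path=${rest#*:}         # text after the first ":"
    # Assumption: a path not starting with "/" is relative to the home directory.
    case $path in
        /*) ;;
        *)  path="~/$path" ;;
    esac
    printf '%s login=%s tmp=%s\n' "$host" "$login" "$path"
}
```

Calling `describe_remote user@foo.example.org:tmp` prints `foo.example.org login=user tmp=~/tmp`, matching the first bullet of the example.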
Any environment variables necessary for execution can be passed as arguments after the shared path, in the form NAME or NAME=VALUE.
Note that 3X must be installed on all the machines of the ssh-cluster target for remote execution to work, i.e., the 3x executable should be on the PATH of each machine when you log into it.
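One way to check this requirement before starting the queue is sketched below. The function name `check_3x_installed` is hypothetical and not part of 3X. It assumes key-based SSH login works for each host, and it uses the OpenSSH option `BatchMode=yes` so that a host asking for a password counts as a failure instead of prompting. A non-interactive SSH command may see a different PATH than an interactive login, so treat a "missing" result as a hint to investigate, not as proof.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of 3X): report which hosts lack 3x on PATH.
# Each argument is a USER@HOST login; "command -v 3x" runs in the remote shell.
check_3x_installed() {
    local login missing
    missing=0
    for login in "$@"; do
        if ssh -o BatchMode=yes "$login" 'command -v 3x' >/dev/null 2>&1; then
            echo "ok       $login"
        else
            echo "missing  $login"
            missing=1
        fi
    done
    # Return non-zero if at least one host did not report a 3x executable.
    return "$missing"
}
```

For the example cluster, this could be invoked as `check_3x_installed user@foo.example.org user@bar.example.org differentuser@baz.example.org`.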
If it is not installed already, you can, for example, use the following commands to copy the current executable to the shared path and configure the target to use the absolute path to it (stored in the 3x-path file under the target's directory). These should be run from the root of the 3X repository.
path_to_3x=/shared/users/user/3x-shared/3x
scp "$(type -p 3x)" user@foo.example.org:$path_to_3x
echo $path_to_3x >run/target/foo/3x-path
Next, use the standard command to set the target of the current queue to the ssh-cluster target just created.
3x target TARGET
Then, start the execution of planned runs in the queue on the target.
3x start
This will first create a clone of the experiment repository under the shared path via one of the machines, then send subsets of runs to all accessible machines in the target and start the execution in parallel. Note that this command returns after setting up and initiating the execution; it does not wait for all the runs to finish.
Finally, use the following command to retrieve results of finished runs.
3x sync
3X will not synchronize automatically, so no status in the GUI or CLI is updated unless this command is run manually. If you want to retrieve the results periodically, say every five minutes or every 30 seconds, use one of the following shell loops:
while :; do 3x sync; sleep 5m; done # every five minutes
while :; do 3x sync; sleep 30s; done # every 30 seconds
Once 3x sync finds that all runs have finished execution, it will perform the necessary cleanup on the cluster and mark the queue as stopped.
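The loops above run until interrupted, even after the queue has stopped. A minimal sketch of a bounded alternative follows. The function name `repeat_every` is hypothetical and not part of 3X, and because this page does not describe a command for detecting that the queue has stopped, the sketch simply limits the number of repetitions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of 3X): run a command a fixed number of times,
# sleeping between runs. Usage: repeat_every INTERVAL COUNT COMMAND [ARG]...
repeat_every() {
    local interval count i
    interval=$1
    count=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        "$@"    # e.g. 3x sync; a failing run does not stop the loop
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"    # no sleep after the last run
        fi
        i=$((i + 1))
    done
}
```

For example, `repeat_every 5m 12 3x sync` would synchronize twelve times, five minutes apart. The `5m` suffix relies on a `sleep` that accepts unit suffixes, as the loops above already do.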