
Commit fafd31e

* added interactive support
* bugfix to only save hparams if rank0
* fixed up examples

1 parent 9c9fb9f

11 files changed: +113 additions, −92 deletions

README.md

Lines changed: 8 additions & 6 deletions

````diff
@@ -33,29 +33,31 @@ But with runx, you would simply define a yaml that defines lists of hyperparams
 
 Start by creating a yaml file called `sweep.yml`:
 ```yml
-cmd: 'python train.py'
+CMD: 'python train.py'
 
-hparams:
+HPARAMS:
   lr: [0.01, 0.02]
   solver: ['sgd', 'adam']
 ```
 
 Now you can run the sweep with runx:
 
 ```bash
-> python -m runx.runx sweep.yml
+> python -m runx.runx sweep.yml -i
 
 python train.py --lr 0.01 --solver sgd
 python train.py --lr 0.01 --solver adam
 python train.py --lr 0.02 --solver sgd
-python train.py --lr 0.02 --solver adam
+python train.py --lr 0.02 --solver adam
 ```
 You can see that runx automatically computes the cross product of all hyperparameters, which in this
 case results in 4 runs. It then builds commandlines by concatenating the hyperparameters with
 the training command.
 
-runx is intended to be used to launch batch jobs to a farm. Because running many training runs
-interactively would take a long time!
+-n - this means don't run, just print the command
+-i - interactive (as opposed to batch)
+
+runx is intended to be used to launch batch jobs to a farm.
 Farm support is simple. Create a .runx file that configures the farm:
 
 ```yaml
````
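The README above notes that runx computes the cross product of all hyperparameters and concatenates each combination onto the training command. A minimal sketch of that expansion (a hypothetical helper, not runx's actual code):

```python
from itertools import product

def expand_sweep(cmd, hparams):
    """Build one commandline per point in the hyperparameter cross product."""
    keys = list(hparams.keys())
    lines = []
    for vals in product(*(hparams[k] for k in keys)):
        args = ' '.join('--{} {}'.format(k, v) for k, v in zip(keys, vals))
        lines.append('{} {}'.format(cmd, args))
    return lines

# Mirrors the sweep.yml example: 2 lrs x 2 solvers -> 4 commandlines
cmds = expand_sweep('python train.py',
                    {'lr': [0.01, 0.02], 'solver': ['sgd', 'adam']})
```

`itertools.product` preserves the listed order, so the four commandlines come out in the same order the README shows.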

examples/.runx

Lines changed: 0 additions & 9 deletions
This file was deleted.

examples/.runx_example

Lines changed: 14 additions & 0 deletions

```diff
@@ -0,0 +1,14 @@
+# Please copy this template to .runx and at least update LOGROOT.
+# If you'd like to use runx to launch jobs to your farm, you need to
+# provide a batch job launcher (SUBMIT_CMD) and list under RESOURCES
+# any arguments for that command, such as the docker image, etc.
+
+LOGROOT: path_to_logs
+FARM: myfarm
+
+myfarm:
+    SUBMIT_CMD: <my_farm_launcher>
+    RESOURCES:
+        image: <some docker image>
+        gpu: 1
+        mem: 32
```
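Once parsed (runx reads this file as YAML), the .runx template above is just a nested mapping: top-level `LOGROOT` and `FARM` keys, plus one section per farm. A hypothetical sketch of selecting the active farm's launcher and resources; the dict literal stands in for the parsed YAML, and the `submit_job` name and resource values are placeholders:

```python
# Parsed form of a filled-in .runx (what a YAML loader would produce).
config = {
    'LOGROOT': 'path_to_logs',
    'FARM': 'myfarm',
    'myfarm': {
        'SUBMIT_CMD': 'submit_job',  # placeholder for your farm launcher
        'RESOURCES': {'image': 'my-image', 'gpu': 1, 'mem': 32},
    },
}

def farm_settings(config, farm=None):
    """Pick the farm section named by FARM (or an explicit override)."""
    farm = farm or config['FARM']
    section = config[farm]
    return section['SUBMIT_CMD'], section['RESOURCES']

submit_cmd, resources = farm_settings(config)
```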

examples/README.md

Lines changed: 8 additions & 28 deletions

````diff
@@ -1,32 +1,12 @@
 Examples of using runx.
+Please create a .runx file from .runx_example first.
 
-In both these examples, we use the `-n` flag so that the commands don't actually execute, but are instead printed to show you what would be normally run.
+Interactive:
+> python -m runx.runx mnist.yml -i -n  -> dry run
+> python -m runx.runx mnist.yml -i     -> real run
 
-```bash
-> python -m runx.runx mnist.yml -n
-
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/imaginary-quoll_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/imaginary-quoll_2020.03.27_10.39/code exec python mnist.py --lr 0.01 --momentum 0.5 --logdir /home/logs/mnist/imaginary-quoll_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/solid-viper_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/solid-viper_2020.03.27_10.39/code exec python mnist.py --lr 0.01 --momentum 0.25 --logdir /home/logs/mnist/solid-viper_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/stereotyped-catfish_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/stereotyped-catfish_2020.03.27_10.39/code exec python mnist.py --lr 0.015 --momentum 0.5 --logdir /home/logs/mnist/stereotyped-catfish_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/expert-okapi_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/expert-okapi_2020.03.27_10.39/code exec python mnist.py --lr 0.015 --momentum 0.25 --logdir /home/logs/mnist/expert-okapi_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/shrewd-ostrich_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/shrewd-ostrich_2020.03.27_10.39/code exec python mnist.py --lr 0.02 --momentum 0.5 --logdir /home/logs/mnist/shrewd-ostrich_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/umber-spoonbill_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/umber-spoonbill_2020.03.27_10.39/code exec python mnist.py --lr 0.02 --momentum 0.25 --logdir /home/logs/mnist/umber-spoonbill_2020.03.27_10.39 '
-```
-
-mnist_multi.yml
-```bash
-> python -m runx.runx mnist_multi.yml -n
-
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/ubiquitous-fulmar_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/ubiquitous-fulmar_2020.03.27_10.41/code exec python mnist.py --TAG_NAME foo --lr 0.01 --momentum 0.5 --logdir /home/logs/mnist_multi/ubiquitous-fulmar_2020.03.27_10.41 '
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/psychedelic-albatross_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/psychedelic-albatross_2020.03.27_10.41/code exec python mnist.py --TAG_NAME foo --lr 0.01 --momentum 0.25 --logdir /home/logs/mnist_multi/psychedelic-albatross_2020.03.27_10.41 '
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/classic-viper_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/classic-viper_2020.03.27_10.41/code exec python mnist.py --TAG_NAME bar --lr 0.02 --momentum 0.25 --logdir /home/logs/mnist_multi/classic-viper_2020.03.27_10.41 '
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/prehistoric-malamute_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/prehistoric-malamute_2020.03.27_10.41/code exec python mnist.py --TAG_NAME bar --lr 0.02 --momentum 0.12 --logdir /home/logs/mnist_multi/prehistoric-malamute_2020.03.27_10.41 '
-```
-
-```bash
-> python -m runx.runx imgnet.yml -n
-
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/imgnet/famous-albatross_2020.03.27_10.38/code; PYTHONPATH=/home/logs/imgnet/famous-albatross_2020.03.27_10.38/code exec python imgnet.py /data/ImageNet --lr 0.1 --logdir /home/logs/imgnet/famous-albatross_2020.03.27_10.38 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/imgnet/piquant-ara_2020.03.27_10.38/code; PYTHONPATH=/home/logs/imgnet/piquant-ara_2020.03.27_10.38/code exec python imgnet.py /data/ImageNet --lr 0.05 --logdir /home/logs/imgnet/piquant-ara_2020.03.27_10.38 '
-```
+Batch runs:
+> python -m runx.runx mnist.yml -n  -> dry run
+> python -m runx.runx mnist.yml     -> real run
 
+Can also try imgnet.yml.
````

examples/imgnet.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-CMD: "cd LOGDIR/code; PYTHONPATH=LOGDIR/code exec python imgnet.py"
+CMD: "python imgnet.py"
 
 HPARAMS:
   lr: [0.1, 0.05]
```

examples/mnist.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -98,7 +98,7 @@ def main():
                         help='input batch size for training (default: 64)')
     parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                         help='input batch size for testing (default: 1000)')
-    parser.add_argument('--epochs', type=int, default=20, metavar='N',
+    parser.add_argument('--epochs', type=int, default=2, metavar='N',
                         help='number of epochs to train (default: 10)')
     parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                         help='learning rate (default: 0.01)')
```

examples/mnist.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-CMD: cd LOGDIR/code; PYTHONPATH=LOGDIR/code exec python mnist.py
+CMD: "python mnist.py"
 
 HPARAMS:
   lr: [0.01, 0.02]
```

examples/mnist_multi.yml

Lines changed: 1 addition & 9 deletions

```diff
@@ -1,12 +1,4 @@
-RESOURCES:
-    image: my-docker-image:1.0
-    partition: batch
-    gpu: 1
-    cpu: 8
-    mem: 64
-    duration: 1
-
-CMD: cd LOGDIR/code; PYTHONPATH=LOGDIR/code exec python mnist.py
+CMD: "python mnist.py"
 
 HPARAMS: [
     {
```

runx/logx.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -110,7 +110,7 @@ def initialize(self, logdir=None, coolname=False, hparams=None,
         if not os.path.isdir(self.logdir):
             os.makedirs(self.logdir, exist_ok=True)
 
-        if hparams is not None:
+        if hparams is not None and self.rank0:
            save_hparams(hparams, self.logdir)
 
         # Tensorboard file
```
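The fix above makes logx save hparams only on rank 0, so a multi-process distributed job writes the file once instead of every rank racing to write it. The general pattern, sketched with a hypothetical helper (the `hparams.json` filename is illustrative, not necessarily what runx writes):

```python
import json
import os

def save_hparams_rank0(hparams, logdir, rank):
    """Write hparams once per job: only the rank-0 process touches disk."""
    if rank != 0:
        return None  # non-zero ranks do nothing
    os.makedirs(logdir, exist_ok=True)
    path = os.path.join(logdir, 'hparams.json')  # illustrative filename
    with open(path, 'w') as f:
        json.dump(hparams, f, indent=2)
    return path
```

The same rank-0 guard applies to any side effect a distributed job should perform exactly once: creating directories, writing config snapshots, or opening a Tensorboard writer.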

runx/runx.py

Lines changed: 69 additions & 27 deletions

```diff
@@ -45,13 +45,18 @@
 
 
 parser = argparse.ArgumentParser(description='Experiment runner')
-parser.add_argument('exp_yml', type=str, help='experiment yaml file')
-parser.add_argument('--tag', type=str, default=None, help='tag label for run')
-parser.add_argument('--no_run', '-n', action='store_true', help='don\'t run')
+parser.add_argument('exp_yml', type=str,
+                    help='experiment yaml file')
+parser.add_argument('--tag', type=str, default=None,
+                    help='tag label for run')
+parser.add_argument('--no_run', '-n', action='store_true',
+                    help='don\'t run')
+parser.add_argument('--interactive', '-i', action='store_true',
+                    help='run interactively instead of submitting to farm')
 parser.add_argument('--no_cooldir', action='store_true',
                     help='no coolname, no datestring')
-parser.add_argument('--farm', type=str, default=None, help=(
-    'Select farm for workstation submission'))
+parser.add_argument('--farm', type=str, default=None,
+                    help='Select farm for workstation submission')
 args = parser.parse_args()
 
 
@@ -83,7 +88,6 @@ def expand_hparams(hparams):
             cmd += '--{} '.format(field)
         elif val != 'None':
             cmd += '--{} {} '.format(field, val)
-    cmd += '\''
     return cmd
 
 
@@ -97,26 +101,45 @@ def exec_cmd(cmd):
     print(message)
 
 
-def construct_cmd(cmd, hparams, resources, job_name, logdir):
+def construct_cmd(cmd, hparams):
     """
-    Expand the hyperparams into a commandline
+    :cmd: farm submission command
+    :hparams: hyperparams for training command
     """
-    ######################################################################
-    # You may wish to customize this function for your own needs
-    ######################################################################
-    if os.environ.get('NVIDIA_INTERNAL'):
-        cmd += '--name {} '.format(job_name)
-        if 'submit_job' in cmd:
-            cmd += '--cd_to_logdir '
-            cmd += '--logdir {}/logs '.format(logdir)
+    cmd += ' ' + expand_hparams(hparams)
+    return cmd
 
-    cmd += expand_resources(resources)
-    cmd += expand_hparams(hparams)
 
-    if args.no_run:
-        print(cmd)
+def make_farm_cmd(submit_cmd, train_cmd, job_name, resources, logdir):
+    """
+    This function builds a farm submission command.
 
-    return cmd
+    We have some custom code here that's important to make this work
+    with Nvidia's farm, but it should be easy to customize this routine
+    for your needs.
+
+    :submit_cmd: The executable to submit your job to the farm
+
+    All of the following inputs are passed as arguments to the submit_cmd:
+
+    :resources: Resources for the submission command, things like
+       node count, GPUs, memory, etc ...
+    :job_name: The name of the job
+    :logdir: the target log directory
+    :train_cmd: the training command itself
+    """
+    preface = f'cd {logdir}/code; PYTHONPATH={logdir}/code; exec '
+    cmd = preface + train_cmd
+
+    if os.environ.get('NVIDIA_INTERNAL'):
+        submit_cmd += '--name {} '.format(job_name)
+        submit_cmd += '--logdir {}/gcf_log '.format(logdir)
+
+    submit_cmd += expand_resources(resources)
+
+    if os.environ.get('NVIDIA_INTERNAL'):
+        submit_cmd += f'--command \' {cmd} \''
+    return submit_cmd
 
 
 def save_cmd(cmd, logdir):
@@ -239,6 +262,9 @@ def hacky_substitutions(hparams, resource_copy, logdir, runroot):
     if 'SUBMIT_JOB.NODES' in hparams:
         resource_copy['nodes'] = hparams['SUBMIT_JOB.NODES']
         del hparams['SUBMIT_JOB.NODES']
+    if 'SUBMIT_JOB.PARTITION' in hparams:
+        resource_copy['partition'] = hparams['SUBMIT_JOB.PARTITION']
+        del hparams['SUBMIT_JOB.PARTITION']
 
     # Record the directory from whence the experiments were launched
     hparams_out['srcdir'] = runroot
@@ -261,6 +287,10 @@ def get_tag(hparams):
     del hparams['RUNX.TAG']
 
 
+def skip_run(hparams):
+    return 'RUNX.SKIP' in hparams and hparams['RUNX.SKIP']
+
+
 def get_code_ignore_patterns(experiment):
     if 'CODE_IGNORE_PATTERNS' in experiment:
         code_ignore_patterns = experiment['CODE_IGNORE_PATTERNS']
@@ -282,7 +312,6 @@ def run_yaml(experiment, exp_name, runroot):
 
     # Build the args that the submit_cmd will see
     yaml_hparams = OrderedDict()
-    yaml_hparams['command'] = '\'{}'.format(experiment['CMD'])
 
     # Add yaml_hparams
     for k, v in experiment['HPARAMS'].items():
@@ -298,29 +327,42 @@ def run_yaml(experiment, exp_name, runroot):
 
         # hparams to use for experiment
         hparams = {k: v for k, v in zip(hparam_keys, hparam_vals)}
+        if skip_run(hparams):
+            continue
         get_tag(hparams)
 
         job_name, logdir, coolname, expdir = make_cool_names(exp_name, logroot)
         resource_copy = resources.copy()
         hparams_out = hacky_substitutions(hparams, resource_copy, logdir,
                                           runroot)
-        cmd = construct_cmd(submit_cmd, hparams,
-                            resource_copy, job_name, logdir)
+        experiment_cmd = experiment['CMD']
+        cmd = construct_cmd(experiment_cmd, hparams)
+
+        if args.interactive:
+            if args.no_run:
+                print(cmd)
+                continue
+            else:
+                exec_cmd(cmd)
+        else:
+            cmd = make_farm_cmd(submit_cmd, cmd, job_name, resource_copy, logdir)
+
+            if args.no_run:
+                print(cmd)
+                continue
 
-        if not args.no_run:
             # copy code to NFS-mounted share
             copy_code(logdir, runroot, code_ignore_patterns)
 
             # save some meta-data from run
             save_cmd(cmd, logdir)
-            save_hparams(hparams_out, logdir)
 
             subprocess.call(['chmod', '-R', 'a+rw', expdir])
             os.chdir(logdir)
 
             print('Submitting job {}'.format(job_name))
             exec_cmd(cmd)
-
+
 
 def run_experiment(exp_fn):
     """
```
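From the hunks above, `expand_hparams` turns each hyperparameter into a `--key value` flag and `construct_cmd` now simply appends that expansion to the training command. A sketch consistent with the visible logic; treating boolean `True` as a bare flag is an assumption, since that branch's condition falls outside the hunk:

```python
def expand_hparams(hparams):
    """Turn a dict of hyperparams into commandline flags."""
    cmd = ''
    for field, val in hparams.items():
        if val is True:
            cmd += '--{} '.format(field)  # assumed: True means a bare flag
        elif val != 'None':
            cmd += '--{} {} '.format(field, val)
    return cmd

def construct_cmd(cmd, hparams):
    """Append the expanded hparams to the training command (per the diff)."""
    return cmd + ' ' + expand_hparams(hparams)
```

With this shape, interactive mode can execute the result directly, while batch mode wraps it in the farm launcher's `--command` argument.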

runx/utils.py

Lines changed: 9 additions & 9 deletions

```diff
@@ -45,14 +45,14 @@ def read_item(config, key):
     return config[key]
 
 
-def read_global_config():
+def read_config_file():
+    local_config_fn = './.runx'
     home = os.path.expanduser('~')
-    cwd_config_fn = './.runx'
-    home_config_fn = '{}/.config/runx.yml'.format(home)
-    if os.path.isfile(cwd_config_fn):
-        config_fn = cwd_config_fn
-    elif os.path.exists(home_config_fn):
-        config_fn = home_config_fn
+    global_config_fn = '{}/.config/runx.yml'.format(home)
+    if os.path.isfile(local_config_fn):
+        config_fn = local_config_fn
+    elif os.path.exists(global_config_fn):
+        config_fn = global_config_fn
     else:
         raise('can\'t find file ./.runx or ~/.config/runx.yml config files')
     if 'FullLoader' in dir(yaml):
@@ -67,7 +67,7 @@ def read_config(args_farm):
     read the global config
     pull the farm portion and merge with global config
     '''
-    global_config = read_global_config()
+    global_config = read_config_file()
 
     merged_config = {}
     merged_config['LOGROOT'] = read_item(global_config, 'LOGROOT')
@@ -92,7 +92,7 @@ def read_config(args_farm):
 
 
 def get_logroot():
-    global_config = read_global_config()
+    global_config = read_config_file()
     return read_item(global_config, 'LOGROOT')
```
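The renamed `read_config_file` above prefers a per-project `./.runx` over the global `~/.config/runx.yml`. That lookup order can be sketched as a standalone helper (hypothetical, with the directories parameterized so the fallback is easy to see):

```python
import os

def find_config(cwd='.', home=None):
    """Return the first runx config that exists: local first, then global."""
    home = home or os.path.expanduser('~')
    local_fn = os.path.join(cwd, '.runx')
    global_fn = os.path.join(home, '.config', 'runx.yml')
    for fn in (local_fn, global_fn):
        if os.path.isfile(fn):
            return fn
    raise FileNotFoundError("can't find ./.runx or ~/.config/runx.yml")
```

Raising `FileNotFoundError` here also sidesteps a latent bug visible in the diff: `raise('...')` raises a `TypeError` in Python 3 because a string is not an exception.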
