
Commit fafd31e

* added interactive support
* bugfix to only save hparams if rank0
* fixed up examples

1 parent 9c9fb9f

11 files changed: +113 additions, −92 deletions

README.md

Lines changed: 8 additions & 6 deletions

````diff
@@ -33,29 +33,31 @@ But with runx, you would simply define a yaml that defines lists of hyperparams
 
 Start by creating a yaml file called `sweep.yml`:
 ```yml
-cmd: 'python train.py'
+CMD: 'python train.py'
 
-hparams:
+HPARAMS:
   lr: [0.01, 0.02]
   solver: ['sgd', 'adam']
 ```
 
 Now you can run the sweep with runx:
 
 ```bash
-> python -m runx.runx sweep.yml
+> python -m runx.runx sweep.yml -i
 
 python train.py --lr 0.01 --solver sgd
 python train.py --lr 0.01 --solver adam
 python train.py --lr 0.02 --solver sgd
-python train.py --lr 0.02 --solver adam
+python train.py --lr 0.02 --solver adam
 ```
 You can see that runx automatically computes the cross product of all hyperparameters, which in this
 case results in 4 runs. It then builds commandlines by concatenating the hyperparameters with
 the training command.
 
-runx is intended to be used to launch batch jobs to a farm. Because running many training runs
-interactively would take a long time!
+-n - this means don't run, just print the command
+-i - interactive (as opposed to batch)
+
+runx is intended to be used to launch batch jobs to a farm.
 Farm support is simple. Create a .runx file that configures the farm:
 
 ```yaml
````
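The README above notes that runx computes the cross product of all hyperparameters and concatenates each combination onto the training command. A minimal sketch of that expansion (a hypothetical helper, not runx's actual code):

```python
from itertools import product

def expand_sweep(cmd, hparams):
    """Build one commandline per point in the hyperparameter cross product."""
    keys = list(hparams.keys())
    lines = []
    for vals in product(*(hparams[k] for k in keys)):
        args = ' '.join('--{} {}'.format(k, v) for k, v in zip(keys, vals))
        lines.append('{} {}'.format(cmd, args))
    return lines

# Mirrors the sweep.yml example: 2 lrs x 2 solvers -> 4 commandlines
cmds = expand_sweep('python train.py',
                    {'lr': [0.01, 0.02], 'solver': ['sgd', 'adam']})
```

`itertools.product` preserves the listed order, so the four commandlines come out in the same order the README shows.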

examples/.runx

Lines changed: 0 additions & 9 deletions
This file was deleted.

examples/.runx_example

Lines changed: 14 additions & 0 deletions

```diff
@@ -0,0 +1,14 @@
+# Please copy this template to .runx and at least update LOGROOT.
+# If you'd like to use runx to launch jobs to your farm, you need to
+# provide a batch job launcher (SUBMIT_CMD) and list under RESOURCES
+# any arguments for that command, such as the docker image, etc.
+
+LOGROOT: path_to_logs
+FARM: myfarm
+
+myfarm:
+    SUBMIT_CMD: <my_farm_launcher>
+    RESOURCES:
+        image: <some docker image>
+        gpu: 1
+        mem: 32
```
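Once parsed (runx reads this file as YAML), the .runx template above is just a nested mapping: top-level `LOGROOT` and `FARM` keys, plus one section per farm. A hypothetical sketch of selecting the active farm's launcher and resources; the dict literal stands in for the parsed YAML, and the `submit_job` name and resource values are placeholders:

```python
# Parsed form of a filled-in .runx (what a YAML loader would produce).
config = {
    'LOGROOT': 'path_to_logs',
    'FARM': 'myfarm',
    'myfarm': {
        'SUBMIT_CMD': 'submit_job',  # placeholder for your farm launcher
        'RESOURCES': {'image': 'my-image', 'gpu': 1, 'mem': 32},
    },
}

def farm_settings(config, farm=None):
    """Pick the farm section named by FARM (or an explicit override)."""
    farm = farm or config['FARM']
    section = config[farm]
    return section['SUBMIT_CMD'], section['RESOURCES']

submit_cmd, resources = farm_settings(config)
```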

examples/README.md

Lines changed: 8 additions & 28 deletions

````diff
@@ -1,32 +1,12 @@
 Examples of using runx.
+Please create a .runx file from .runx_example first.
 
-In both these examples, we use the `-n` flag so that the commands don't actually execute, but are instead printed to show you what would be normally run.
+Interactive:
+> python -m runx.runx mnist.yml -i -n  -> dry run
+> python -m runx.runx mnist.yml -i     -> real run
 
-```bash
-> python -m runx.runx mnist.yml -n
-
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/imaginary-quoll_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/imaginary-quoll_2020.03.27_10.39/code exec python mnist.py --lr 0.01 --momentum 0.5 --logdir /home/logs/mnist/imaginary-quoll_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/solid-viper_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/solid-viper_2020.03.27_10.39/code exec python mnist.py --lr 0.01 --momentum 0.25 --logdir /home/logs/mnist/solid-viper_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/stereotyped-catfish_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/stereotyped-catfish_2020.03.27_10.39/code exec python mnist.py --lr 0.015 --momentum 0.5 --logdir /home/logs/mnist/stereotyped-catfish_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/expert-okapi_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/expert-okapi_2020.03.27_10.39/code exec python mnist.py --lr 0.015 --momentum 0.25 --logdir /home/logs/mnist/expert-okapi_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/shrewd-ostrich_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/shrewd-ostrich_2020.03.27_10.39/code exec python mnist.py --lr 0.02 --momentum 0.5 --logdir /home/logs/mnist/shrewd-ostrich_2020.03.27_10.39 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/mnist/umber-spoonbill_2020.03.27_10.39/code; PYTHONPATH=/home/logs/mnist/umber-spoonbill_2020.03.27_10.39/code exec python mnist.py --lr 0.02 --momentum 0.25 --logdir /home/logs/mnist/umber-spoonbill_2020.03.27_10.39 '
-```
-
-mnist_multi.yml
-```bash
-> python -m runx.runx mnist_multi.yml -n
-
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/ubiquitous-fulmar_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/ubiquitous-fulmar_2020.03.27_10.41/code exec python mnist.py --TAG_NAME foo --lr 0.01 --momentum 0.5 --logdir /home/logs/mnist_multi/ubiquitous-fulmar_2020.03.27_10.41 '
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/psychedelic-albatross_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/psychedelic-albatross_2020.03.27_10.41/code exec python mnist.py --TAG_NAME foo --lr 0.01 --momentum 0.25 --logdir /home/logs/mnist_multi/psychedelic-albatross_2020.03.27_10.41 '
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/classic-viper_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/classic-viper_2020.03.27_10.41/code exec python mnist.py --TAG_NAME bar --lr 0.02 --momentum 0.25 --logdir /home/logs/mnist_multi/classic-viper_2020.03.27_10.41 '
-submit_job --image hw-adlr-docker/atao/superslomo:v2 --partition volta-dcg-short --gpu 1 --cpu 8 --mem 64 --duration 1 --command 'cd /home/logs/mnist_multi/prehistoric-malamute_2020.03.27_10.41/code; PYTHONPATH=/home/logs/mnist_multi/prehistoric-malamute_2020.03.27_10.41/code exec python mnist.py --TAG_NAME bar --lr 0.02 --momentum 0.12 --logdir /home/logs/mnist_multi/prehistoric-malamute_2020.03.27_10.41 '
-```
-
-```bash
-> python -m runx.runx imgnet.yml -n
-
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/imgnet/famous-albatross_2020.03.27_10.38/code; PYTHONPATH=/home/logs/imgnet/famous-albatross_2020.03.27_10.38/code exec python imgnet.py /data/ImageNet --lr 0.1 --logdir /home/logs/imgnet/famous-albatross_2020.03.27_10.38 '
-submit_job --gpu 2 --cpu 16 --mem 128 --command 'cd /home/logs/imgnet/piquant-ara_2020.03.27_10.38/code; PYTHONPATH=/home/logs/imgnet/piquant-ara_2020.03.27_10.38/code exec python imgnet.py /data/ImageNet --lr 0.05 --logdir /home/logs/imgnet/piquant-ara_2020.03.27_10.38 '
-```
+Batch runs:
+> python -m runx.runx mnist.yml -n  -> dry run
+> python -m runx.runx mnist.yml     -> real run
 
+Can also try imgnet.yml.
````

examples/imgnet.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-CMD: "cd LOGDIR/code; PYTHONPATH=LOGDIR/code exec python imgnet.py"
+CMD: "python imgnet.py"
 
 HPARAMS:
   lr: [0.1, 0.05]
```

examples/mnist.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -98,7 +98,7 @@ def main():
                         help='input batch size for training (default: 64)')
     parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                         help='input batch size for testing (default: 1000)')
-    parser.add_argument('--epochs', type=int, default=20, metavar='N',
+    parser.add_argument('--epochs', type=int, default=2, metavar='N',
                         help='number of epochs to train (default: 10)')
     parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                         help='learning rate (default: 0.01)')
```

examples/mnist.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-CMD: cd LOGDIR/code; PYTHONPATH=LOGDIR/code exec python mnist.py
+CMD: "python mnist.py"
 
 HPARAMS:
   lr: [0.01, 0.02]
```

examples/mnist_multi.yml

Lines changed: 1 addition & 9 deletions

```diff
@@ -1,12 +1,4 @@
-RESOURCES:
-    image: my-docker-image:1.0
-    partition: batch
-    gpu: 1
-    cpu: 8
-    mem: 64
-    duration: 1
-
-CMD: cd LOGDIR/code; PYTHONPATH=LOGDIR/code exec python mnist.py
+CMD: "python mnist.py"
 
 HPARAMS: [
     {
```

runx/logx.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -110,7 +110,7 @@ def initialize(self, logdir=None, coolname=False, hparams=None,
         if not os.path.isdir(self.logdir):
             os.makedirs(self.logdir, exist_ok=True)
 
-        if hparams is not None:
+        if hparams is not None and self.rank0:
            save_hparams(hparams, self.logdir)
 
         # Tensorboard file
```
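The fix above makes logx save hparams only on rank 0, so a multi-process distributed job writes the file once instead of every rank racing to write it. The general pattern, sketched with a hypothetical helper (the `hparams.json` filename is illustrative, not necessarily what runx writes):

```python
import json
import os

def save_hparams_rank0(hparams, logdir, rank):
    """Write hparams once per job: only the rank-0 process touches disk."""
    if rank != 0:
        return None  # non-zero ranks do nothing
    os.makedirs(logdir, exist_ok=True)
    path = os.path.join(logdir, 'hparams.json')  # illustrative filename
    with open(path, 'w') as f:
        json.dump(hparams, f, indent=2)
    return path
```

The same rank-0 guard applies to any side effect a distributed job should perform exactly once: creating directories, writing config snapshots, or opening a Tensorboard writer.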

runx/runx.py

Lines changed: 69 additions & 27 deletions

```diff
@@ -45,13 +45,18 @@
 
 
 parser = argparse.ArgumentParser(description='Experiment runner')
-parser.add_argument('exp_yml', type=str, help='experiment yaml file')
-parser.add_argument('--tag', type=str, default=None, help='tag label for run')
-parser.add_argument('--no_run', '-n', action='store_true', help='don\'t run')
+parser.add_argument('exp_yml', type=str,
+                    help='experiment yaml file')
+parser.add_argument('--tag', type=str, default=None,
+                    help='tag label for run')
+parser.add_argument('--no_run', '-n', action='store_true',
+                    help='don\'t run')
+parser.add_argument('--interactive', '-i', action='store_true',
+                    help='run interactively instead of submitting to farm')
 parser.add_argument('--no_cooldir', action='store_true',
                     help='no coolname, no datestring')
-parser.add_argument('--farm', type=str, default=None, help=(
-    'Select farm for workstation submission'))
+parser.add_argument('--farm', type=str, default=None,
+                    help='Select farm for workstation submission')
 args = parser.parse_args()
 
 
@@ -83,7 +88,6 @@ def expand_hparams(hparams):
             cmd += '--{} '.format(field)
         elif val != 'None':
             cmd += '--{} {} '.format(field, val)
-    cmd += '\''
     return cmd
 
 
@@ -97,26 +101,45 @@ def exec_cmd(cmd):
     print(message)
 
 
-def construct_cmd(cmd, hparams, resources, job_name, logdir):
+def construct_cmd(cmd, hparams):
     """
-    Expand the hyperparams into a commandline
+    :cmd: farm submission command
+    :hparams: hyperparams for training command
     """
-    ######################################################################
-    # You may wish to customize this function for your own needs
-    ######################################################################
-    if os.environ.get('NVIDIA_INTERNAL'):
-        cmd += '--name {} '.format(job_name)
-        if 'submit_job' in cmd:
-            cmd += '--cd_to_logdir '
-            cmd += '--logdir {}/logs '.format(logdir)
+    cmd += ' ' + expand_hparams(hparams)
+    return cmd
 
-    cmd += expand_resources(resources)
-    cmd += expand_hparams(hparams)
 
-    if args.no_run:
-        print(cmd)
+def make_farm_cmd(submit_cmd, train_cmd, job_name, resources, logdir):
+    """
+    This function builds a farm submission command.
 
-    return cmd
+    We have some custom code here that's important to make this work
+    with Nvidia's farm, but it should be easy to customize this routine
+    for your needs.
+
+    :submit_cmd: The executable to submit your job to the farm
+
+    All of the following inputs are passed as arguments to the submit_cmd:
+
+    :resources: Resources for the submission command, things like
+       node count, GPUs, memory, etc ...
+    :job_name: The name of the job
+    :logdir: the target log directory
+    :train_cmd: the training command itself
+    """
+    preface = f'cd {logdir}/code; PYTHONPATH={logdir}/code; exec '
+    cmd = preface + train_cmd
+
+    if os.environ.get('NVIDIA_INTERNAL'):
+        submit_cmd += '--name {} '.format(job_name)
+        submit_cmd += '--logdir {}/gcf_log '.format(logdir)
+
+    submit_cmd += expand_resources(resources)
+
+    if os.environ.get('NVIDIA_INTERNAL'):
+        submit_cmd += f'--command \' {cmd} \''
+    return submit_cmd
 
 
 def save_cmd(cmd, logdir):
@@ -239,6 +262,9 @@ def hacky_substitutions(hparams, resource_copy, logdir, runroot):
     if 'SUBMIT_JOB.NODES' in hparams:
         resource_copy['nodes'] = hparams['SUBMIT_JOB.NODES']
         del hparams['SUBMIT_JOB.NODES']
+    if 'SUBMIT_JOB.PARTITION' in hparams:
+        resource_copy['partition'] = hparams['SUBMIT_JOB.PARTITION']
+        del hparams['SUBMIT_JOB.PARTITION']
 
     # Record the directory from whence the experiments were launched
     hparams_out['srcdir'] = runroot
@@ -261,6 +287,10 @@ def get_tag(hparams):
     del hparams['RUNX.TAG']
 
 
+def skip_run(hparams):
+    return 'RUNX.SKIP' in hparams and hparams['RUNX.SKIP']
+
+
 def get_code_ignore_patterns(experiment):
     if 'CODE_IGNORE_PATTERNS' in experiment:
         code_ignore_patterns = experiment['CODE_IGNORE_PATTERNS']
@@ -282,7 +312,6 @@ def run_yaml(experiment, exp_name, runroot):
 
     # Build the args that the submit_cmd will see
     yaml_hparams = OrderedDict()
-    yaml_hparams['command'] = '\'{}'.format(experiment['CMD'])
 
     # Add yaml_hparams
     for k, v in experiment['HPARAMS'].items():
@@ -298,29 +327,42 @@ def run_yaml(experiment, exp_name, runroot):
 
         # hparams to use for experiment
         hparams = {k: v for k, v in zip(hparam_keys, hparam_vals)}
+        if skip_run(hparams):
+            continue
         get_tag(hparams)
 
         job_name, logdir, coolname, expdir = make_cool_names(exp_name, logroot)
         resource_copy = resources.copy()
         hparams_out = hacky_substitutions(hparams, resource_copy, logdir,
                                           runroot)
-        cmd = construct_cmd(submit_cmd, hparams,
-                            resource_copy, job_name, logdir)
+        experiment_cmd = experiment['CMD']
+        cmd = construct_cmd(experiment_cmd, hparams)
+
+        if args.interactive:
+            if args.no_run:
+                print(cmd)
+                continue
+            else:
+                exec_cmd(cmd)
+        else:
+            cmd = make_farm_cmd(submit_cmd, cmd, job_name, resource_copy, logdir)
+
+            if args.no_run:
+                print(cmd)
+                continue
 
-        if not args.no_run:
             # copy code to NFS-mounted share
             copy_code(logdir, runroot, code_ignore_patterns)
 
             # save some meta-data from run
             save_cmd(cmd, logdir)
-            save_hparams(hparams_out, logdir)
 
             subprocess.call(['chmod', '-R', 'a+rw', expdir])
             os.chdir(logdir)
 
             print('Submitting job {}'.format(job_name))
             exec_cmd(cmd)
-
+
 
 def run_experiment(exp_fn):
     """
```
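From the hunks above, `expand_hparams` turns each hyperparameter into a `--key value` flag and `construct_cmd` now simply appends that expansion to the training command. A sketch consistent with the visible logic; treating boolean `True` as a bare flag is an assumption, since that branch's condition falls outside the hunk:

```python
def expand_hparams(hparams):
    """Turn a dict of hyperparams into commandline flags."""
    cmd = ''
    for field, val in hparams.items():
        if val is True:
            cmd += '--{} '.format(field)  # assumed: True means a bare flag
        elif val != 'None':
            cmd += '--{} {} '.format(field, val)
    return cmd

def construct_cmd(cmd, hparams):
    """Append the expanded hparams to the training command (per the diff)."""
    return cmd + ' ' + expand_hparams(hparams)
```

With this shape, interactive mode can execute the result directly, while batch mode wraps it in the farm launcher's `--command` argument.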

runx/utils.py

Lines changed: 9 additions & 9 deletions

```diff
@@ -45,14 +45,14 @@ def read_item(config, key):
     return config[key]
 
 
-def read_global_config():
+def read_config_file():
+    local_config_fn = './.runx'
     home = os.path.expanduser('~')
-    cwd_config_fn = './.runx'
-    home_config_fn = '{}/.config/runx.yml'.format(home)
-    if os.path.isfile(cwd_config_fn):
-        config_fn = cwd_config_fn
-    elif os.path.exists(home_config_fn):
-        config_fn = home_config_fn
+    global_config_fn = '{}/.config/runx.yml'.format(home)
+    if os.path.isfile(local_config_fn):
+        config_fn = local_config_fn
+    elif os.path.exists(global_config_fn):
+        config_fn = global_config_fn
     else:
         raise('can\'t find file ./.runx or ~/.config/runx.yml config files')
     if 'FullLoader' in dir(yaml):
@@ -67,7 +67,7 @@ def read_config(args_farm):
     read the global config
     pull the farm portion and merge with global config
     '''
-    global_config = read_global_config()
+    global_config = read_config_file()
 
     merged_config = {}
     merged_config['LOGROOT'] = read_item(global_config, 'LOGROOT')
@@ -92,7 +92,7 @@ def read_config(args_farm):
 
 
 def get_logroot():
-    global_config = read_global_config()
+    global_config = read_config_file()
     return read_item(global_config, 'LOGROOT')
```
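The renamed `read_config_file` above prefers a per-project `./.runx` over the global `~/.config/runx.yml`. That lookup order can be sketched as a standalone helper (hypothetical, with the directories parameterized so the fallback is easy to see):

```python
import os

def find_config(cwd='.', home=None):
    """Return the first runx config that exists: local first, then global."""
    home = home or os.path.expanduser('~')
    local_fn = os.path.join(cwd, '.runx')
    global_fn = os.path.join(home, '.config', 'runx.yml')
    for fn in (local_fn, global_fn):
        if os.path.isfile(fn):
            return fn
    raise FileNotFoundError("can't find ./.runx or ~/.config/runx.yml")
```

Raising `FileNotFoundError` here also sidesteps a latent bug visible in the diff: `raise('...')` raises a `TypeError` in Python 3 because a string is not an exception.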
