
[BUG] Learning rate is not passed to network scripts #22

Open
shishaochen opened this issue Sep 19, 2017 · 4 comments

@shishaochen

From benchmark.py and configs/*.config, we know dlbench provides the capability of changing the learning rate.
However, only Caffe, Torch, and MXNet accept the learning rate argument, while CNTK and TensorFlow ignore it.

# tools/cntk/cntkbm.py has no lr argument defined.
# tools/tensorflow/tensorflow.py has no lr argument defined.
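For illustration, an argument like the following would need to be added to those scripts so that the learning rate from configs/*.config actually reaches them (the flag name and default here are only an assumption, not existing dlbench code):

import argparse

# Hypothetical sketch only: tensorflow.py and cntkbm.py currently define no such flag.
parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=0.01,
                    help='learning rate forwarded from benchmark.py')
args = parser.parse_args()
print('training would use a learning rate of', args.lr)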

Furthermore, the learning rate behavior is not the same across frameworks when running the benchmark. For example, TensorFlow uses a constant value, while MXNet's learning rate will change during training.

# From tools/mxnet/common/fit.py
steps = [epoch_size * (x-begin_epoch) for x in step_epochs if x-begin_epoch > 0] # Default value of step_epochs is '200,250' from tools/mxnet/train_cifar10.py
return (lr, mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=args.lr_factor))
......
optimizer_params = {'learning_rate': lr,
            'momentum' : args.mom,
            'wd' : args.wd,
            'lr_scheduler': lr_scheduler} # This scheduler will change learning rate during training
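To make this concrete, here is a small standalone sketch of how the quoted logic maps step_epochs to update counts and scales the learning rate at each boundary (all numbers below are made up for illustration, not taken from a real dlbench run):

# Illustration only; values are hypothetical.
num_examples = 50000                       # e.g. CIFAR-10 training set size
batch_size = 128
epoch_size = num_examples // batch_size    # updates per epoch
begin_epoch = 0
step_epochs = [200, 250]                   # default '200,250' in train_cifar10.py
lr, lr_factor = 0.05, 0.1

steps = [epoch_size * (x - begin_epoch) for x in step_epochs if x - begin_epoch > 0]
for boundary in steps:
    lr *= lr_factor
    print('after update %d the learning rate becomes %g' % (boundary, lr))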

Please make all tools support the learning rate parameter, or just remove the learning rate from the config.

@shyhuai
Collaborator

shyhuai commented Sep 19, 2017

The learning rate schedule of MXNet is not used. Please check the code: https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L8. The lr_factor parameter is set to None. For CNTK and TF, we set the learning rate to be fixed.

@shishaochen
Author

shishaochen commented Sep 19, 2017

@shyhuai But you set the default value of lr_factor to 0.1 at https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L63.

train.add_argument('--lr-factor', type=float, default=0.1, help='the ratio to reduce lr on each step')

So, whether or not we explicitly set lr_factor in the command-line arguments, argparse.ArgumentParser will always fill in lr_factor. Check the log of MXNet MNIST:

INFO:root:start with arguments Namespace(batch_size=1024, data_dir='/home/shaocs/dlbench/dataset/mxnet/mnist', disp_batches=100, gpus='0', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=60000, num_layers=None, num_nodes=1, optimizer='sgd', test_io=0, top_k=0, wd=1e-05)
......
INFO:root:Update[586]: Change learning rate to 5.00000e-03 # This is printed after 8 epochs
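A minimal argparse sketch of why this happens: the default is applied even when the flag is never passed on the command line.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--lr-factor', type=float, default=0.1,
                    help='the ratio to reduce lr on each step')
args = parser.parse_args([])   # no --lr-factor given
print(args.lr_factor)          # prints 0.1, so a scheduler is still created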

@shyhuai
Collaborator

shyhuai commented Sep 20, 2017

@shishaochen Thanks for your feedback. Since we set lr_factor=1 in the mxnetbm.py script, the learning rate will not be changed during training. If you use the mxnetbm.py script, there should be no problem. Here is a log for your reference: http://dlbench.comp.hkbu.edu.hk/logs/?f=mxnet-fc-fcn5-gpu0-K80-b512-Tue_Mar__7_10:52:06_2017-gpu20.log. To avoid misunderstanding, I have revised the code to set the default value to None. Thanks again for your report.
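To illustrate that point: with a factor of 1, each scheduled "reduction" multiplies the learning rate by 1 and therefore leaves it unchanged (the values below are just examples):

lr, lr_factor = 0.05, 1.0
steps = [200, 250]             # hypothetical step boundaries
for s in steps:
    lr *= lr_factor            # 0.05 * 1.0 == 0.05
print(lr)                      # still 0.05: the schedule is a no-op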

@shishaochen
Author

@shyhuai Sorry, I cannot find "factor" being set in https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/mxnetbm.py. Maybe you set it locally, but the change has not been committed yet.
