
[BUG] Learning rate is not passed to network scripts #22

Open
shishaochen opened this issue Sep 19, 2017 · 4 comments

@shishaochen

From benchmark.py and configs/*.config, we know dlbench provides the capability of changing the learning rate.
However, only Caffe, Torch, and MXNet accept the learning rate argument, while CNTK and TensorFlow ignore it.

# tools/cntk/cntkbm.py has no lr argument defined.
# tools/tensorflow/tensorflow.py has no lr argument defined.
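For illustration, an argument like the following would need to be added to those scripts so that the learning rate from configs/*.config actually reaches them (the flag name and default here are only an assumption, not existing dlbench code):

import argparse

# Hypothetical sketch only: tensorflow.py and cntkbm.py currently define no such flag.
parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=0.01,
                    help='learning rate forwarded from benchmark.py')
args = parser.parse_args()
print('training would use a learning rate of', args.lr)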

Furthermore, the learning rate behavior is not the same across frameworks when running the benchmark. For example, TensorFlow uses a constant value, while MXNet's learning rate will change during training.

# From tools/mxnet/common/fit.py
steps = [epoch_size * (x-begin_epoch) for x in step_epochs if x-begin_epoch > 0] # Default value of step_epochs is '200,250' from tools/mxnet/train_cifar10.py
return (lr, mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=args.lr_factor))
......
optimizer_params = {'learning_rate': lr,
            'momentum' : args.mom,
            'wd' : args.wd,
            'lr_scheduler': lr_scheduler} # This scheduler will change learning rate during training
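To make this concrete, here is a small standalone sketch of how the quoted logic maps step_epochs to update counts and scales the learning rate at each boundary (all numbers below are made up for illustration, not taken from a real dlbench run):

# Illustration only; values are hypothetical.
num_examples = 50000                       # e.g. CIFAR-10 training set size
batch_size = 128
epoch_size = num_examples // batch_size    # updates per epoch
begin_epoch = 0
step_epochs = [200, 250]                   # default '200,250' in train_cifar10.py
lr, lr_factor = 0.05, 0.1

steps = [epoch_size * (x - begin_epoch) for x in step_epochs if x - begin_epoch > 0]
for boundary in steps:
    lr *= lr_factor
    print('after update %d the learning rate becomes %g' % (boundary, lr))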

Please make all tools support the learning rate parameter, or just remove the learning rate from the config.

@shyhuai
Collaborator

shyhuai commented Sep 19, 2017

The learning rate schedule of MXNet is not used. Please check the code: https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L8. The lr_factor parameter is set to None. For CNTK and TF, we set the learning rate to be fixed.

@shishaochen
Author

shishaochen commented Sep 19, 2017

@shyhuai But you set the default value of lr_factor to 0.1 at https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L63.

train.add_argument('--lr-factor', type=float, default=0.1, help='the ratio to reduce lr on each step')

So, whether or not we explicitly set lr_factor in the command-line arguments, argparse.ArgumentParser will always fill in lr_factor. Check the log of MXNet MNIST:

INFO:root:start with arguments Namespace(batch_size=1024, data_dir='/home/shaocs/dlbench/dataset/mxnet/mnist', disp_batches=100, gpus='0', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=60000, num_layers=None, num_nodes=1, optimizer='sgd', test_io=0, top_k=0, wd=1e-05)
......
INFO:root:Update[586]: Change learning rate to 5.00000e-03 # This is printed after 8 epochs
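A minimal argparse sketch of why this happens: the default is applied even when the flag is never passed on the command line.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--lr-factor', type=float, default=0.1,
                    help='the ratio to reduce lr on each step')
args = parser.parse_args([])   # no --lr-factor given
print(args.lr_factor)          # prints 0.1, so a scheduler is still created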

@shyhuai
Collaborator

shyhuai commented Sep 20, 2017

@shishaochen Thanks for your feedback. Since we set lr_factor=1 in the mxnetbm.py script, the learning rate will not be changed during training. If you use the mxnetbm.py script, there should be no problem. Here is a log for your reference: http://dlbench.comp.hkbu.edu.hk/logs/?f=mxnet-fc-fcn5-gpu0-K80-b512-Tue_Mar__7_10:52:06_2017-gpu20.log. To avoid misunderstanding, I have revised the code to set the default value to None. Thanks again for your report.
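To illustrate that point: with a factor of 1, each scheduled "reduction" multiplies the learning rate by 1 and therefore leaves it unchanged (the values below are just examples):

lr, lr_factor = 0.05, 1.0
steps = [200, 250]             # hypothetical step boundaries
for s in steps:
    lr *= lr_factor            # 0.05 * 1.0 == 0.05
print(lr)                      # still 0.05: the schedule is a no-op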

@shishaochen
Author

@shyhuai Sorry, I cannot find "factor" being set in https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/mxnetbm.py. Maybe you set it locally, but the change has not been committed yet.
