Hi Michael,
I wanted to share a stochastic first-order optimizer idea and implementation for L-BFGS, in case you might find it interesting or even practical in your research (congrats also on your PhD & new role at fb!)
The idea behind the algorithm is pretty simple, in brief:

- when optimizing a stochastic empirical loss function, use bootstrap sampling over mini-batches of data to estimate the function value, its variance, and the sampling bias
- run a first-order solver with a line search that treats the variance estimates from the bootstrapped function/gradient evaluations as numerical uncertainties, such that the line search looks for step sizes that satisfy the Wolfe conditions within the uncertainty level (a rough sketch of both pieces follows below)

I implemented the idea above in the following repository using tensorflow: https://github.com/mbhynes/boots
I admittedly haven't run particularly thorough experiments on this, but I did use a small convnet as a toy problem, and it looks like the bootstrapped L-BFGS algorithm can converge as well as or better than ADAM up to a point, measured by the decrease in training loss vs. the number of function/gradient evaluations [i.e. data points accessed], e.g. here
There's no formal convergence criterion implemented, and empirically it looks like the optimizer can get within a radius of a minimum but then just starts oscillating a bit without making substantive progress.
The real practical downside to all this is that neither tensorflow nor pytorch is particularly efficient at evaluating per-example gradients... so unfortunately, even if the trace above looks sorta neat, in reality you have to wait quite a while to get it from bootstrapping compared to ADAM (which I've just taken as the ~best off-the-shelf algo), since the boots implementation above doesn't have an efficient way of evaluating the Jacobian matrix. (It wouldn't be quite so bad in a map-reduce framework where per-example gradients have to be materialized, mind you...)

Anyway, hope it's of interest, and all the best!
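For what it's worth, here's a toy illustration (again hypothetical, not the boots code) of that cost difference in tensorflow: the batch-averaged gradient is a single backward pass, while the per-example gradients that bootstrapping needs amount to a Jacobian of the per-example loss vector with respect to the parameters:

```python
# Toy illustration (not the boots code) of the per-example-gradient cost:
# a single averaged gradient is one backward pass, while the per-example
# Jacobian materializes one gradient row per example.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
x = tf.random.normal((64, 10))
y = tf.random.normal((64, 1))

with tf.GradientTape(persistent=True) as tape:
    per_example_loss = tf.reduce_mean(tf.square(model(x) - y), axis=1)  # shape (64,)
    mean_loss = tf.reduce_mean(per_example_loss)

# Cheap: one gradient of the batch-averaged loss (all that ADAM needs).
batch_grads = tape.gradient(mean_loss, model.trainable_variables)

# Expensive: one gradient per example (what the bootstrap resampling needs);
# the result has shape [64, *var.shape] for each trainable variable.
per_example_grads = tape.jacobian(per_example_loss, model.trainable_variables)
del tape
```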