<!DOCTYPE html>
<html>
<head>
<title>Optimization for deep learning [Marc Lelarge]</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="./assets/katex.min.css">
<link rel="stylesheet" type="text/css" href="./assets/slides.css">
<link rel="stylesheet" type="text/css" href="./assets/grid.css">
</head>
<body>
<textarea id="source">
class: center, middle, title-slide
count: false
# Module 4:
## Optimization for deep learning
<br/><br/>
.bold[Marc Lelarge]
---
# (1) Optimization and deep learning
## Gradient descent, stochastic gradient descent and mini-batch SGD
# (2) Gradient descent optimization algorithms
## Momentum, Nesterov accelerated gradient, Adagrad, RMSProp, Adam, AMSGrad
# (3) PyTorch optimizers
---
# (1) Some warnings about optimization in deep learning
The objective function of an optimization algorithm is usually a loss function based on the training data set, hence the goal of optimization is to reduce the _training error_.
--
count: false
However, the goal of (deep) learning is to reduce the _generalization error_.
--
count: false
In order to reduce the generalization error, we need to pay attention to _overfitting_ in addition to using the optimization algorithm to reduce the training error.
--
count: false
In this course, we focus specifically on the _performance_ of the optimization algorithm in minimizing the objective function, rather than the model’s generalization error.
In the next lessons, we will see techniques to avoid _overfitting_.
--
count: false
No theorem in this lecture (for theorems, see a convex optimization course). We focus on intuitions because deep learning optimization problems are not convex and we hope our _intuition_ built on convex problems will be useful!
---
# (1) What do we optimize?
Recall the simple linear regression where the goal is to minimize the .bold[cost function]:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-\theta^T x(i))^2.
$$
--
count: false
For a deep neural network $F(.;\theta)$ with parameters $\theta$, we minimize the objective function:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-F(x(i);\theta))^2,
$$
in the parameter $\theta$ on the _training set_.
This problem is not convex anymore, but we can still compute the gradient of the cost function $\nabla J(\theta)$ with respect to the parameters (well, PyTorch does it for us!).
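As a minimal illustration of autograd (a hedged sketch with made-up tensors, not code from the course):
```py
import torch

theta = torch.randn(3, requires_grad=True)  # parameters theta
x, y = torch.randn(3), torch.tensor(1.0)    # one (illustrative) training pair
loss = 0.5 * (y - theta @ x) ** 2           # cost on that single pair
loss.backward()                             # autograd computes the gradient
print(theta.grad)                           # d loss / d theta
```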
---
# (1) Gradient descent variants
Idea: update the parameters in the opposite direction of the gradient.
The _learning rate_ $\eta$ determines the size of the steps.
--
count: false
## Batch gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t)
$$
--
count: false
## Stochastic gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i),y(i))
$$
--
count: false
## Mini-batch gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i:i+b),y(i:i+b)),
$$
where $b$ is the batch size.
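The three variants differ only in how many samples enter each gradient; a minimal sketch (assuming `theta` is a NumPy array and `J_grad(theta, batch)` is a hypothetical helper returning the gradient on the given samples):
```py
import random

def run_epoch(theta, data, eta, b):
    random.shuffle(data)
    for i in range(0, len(data), b):
        # b = 1 gives SGD, b = len(data) gives batch gradient descent,
        # anything in between is mini-batch gradient descent.
        # J_grad is a hypothetical helper computing the mini-batch gradient.
        theta = theta - eta * J_grad(theta, data[i:i + b])
    return theta
```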
---
# (1) Challenges
Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is usually employed even when mini-batches are used!
--
count: false
- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?
- Why apply the same learning rate to all parameter updates?
- How to escape saddle points, where the gradient is close to zero in all dimensions?
.center[
<img src="images/module4/Saddle_point.png" style="width: 400px;" />
]
---
count:false
# (1) Challenges
Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is usually employed even when mini-batches are used!
- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?
- Why apply the same learning rate to all parameter updates?
- How to escape saddle points, where the gradient is close to zero in all dimensions?
In the rest of the lecture, we will introduce _modifications to SGD_.
Although we will apply these modifications to mini-batch gradient descent, we omit the explicit reference to the batch $x(i:i+b),y(i:i+b)$. Hence we write for SGD:
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t),
$$
but we need to keep in mind that updates are performed for every mini-batch of $b$ training samples.
---
# (2) Gradient descent optimization algorithms
A nice [survey](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder
Please run the [jupyter notebook](https://colab.research.google.com/github/dataflowr/notebooks/blob/master/Module4/04_gradient_descent_optimization_algorithms_empty.ipynb) in parallel as you will need to code the optimization algorithms as soon as we see them.
We will deal with the following toy problem: minimize in $x_1, x_2$ the function:
$$
f(x\_1, x\_2) = 0.1 x\_1^2+2 x\_2^2.
$$
--
count: false
You will need to implement all variations of SGD.
Have a look at the `train_2d()` function and the Gradient descent code to understand what this function does.
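As a warm-up, here is a minimal sketch of plain gradient descent on this function (illustrative names, not the notebook's exact `train_2d()` API):
```py
def grad_f(x1, x2):
    # gradient of f(x1, x2) = 0.1 * x1**2 + 2 * x2**2
    return 0.2 * x1, 4 * x2

eta = 0.4
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    x1, x2 = x1 - eta * g1, x2 - eta * g2  # step against the gradient
```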
---
# (2) Momentum
Accelerating SGD by damping oscillations, i.e. by averaging the most recent values of the gradient.
--
count: false
$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t)\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$
--
count: false
Why does it work?
With $g\_t = \nabla J(\theta\_t)$, we have for any $k\geq 0$:
$$
v\_{t+1} = \gamma^{k+1} v\_{t-k} +\eta \underbrace{\sum\_{i=0}^k \gamma^i g\_{t-i}}_{\text{average of last gradients}}
$$
Typical value: $\gamma = 0.9$.
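On the toy problem, the update becomes (a sketch; one velocity component per coordinate, `grad_f` as above):
```py
v1, v2 = 0.0, 0.0
eta, gamma = 0.4, 0.9
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    v1, v2 = gamma * v1 + eta * g1, gamma * v2 + eta * g2  # accumulate velocity
    x1, x2 = x1 - v1, x2 - v2                              # step along it
```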
---
# (2) Nesterov accelerated gradient
With momentum, we compute the gradient at the current position and then make a step combining the momentum and this gradient. Nesterov proposed to first make the step following the momentum, and then to adjust it with the gradient computed at this look-ahead position:
$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t-\gamma v\_{t})\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$
.center[
<img src="images/module4/nesterov_update_vector.png" style="width: 400px;" />
]
Source for the image: [G. Hinton's lecture 6c](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
.citation[Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR 1983]
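In code, the only change from momentum is where the gradient is evaluated (a sketch on the toy problem, `grad_f` as above):
```py
v1, v2 = 0.0, 0.0
eta, gamma = 0.4, 0.9
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1 - gamma * v1, x2 - gamma * v2)  # gradient at look-ahead point
    v1, v2 = gamma * v1 + eta * g1, gamma * v2 + eta * g2
    x1, x2 = x1 - v1, x2 - v2
```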
---
# (2) Adagrad
We would like to adapt our updates to each individual parameter, i.e. have a different decreasing learning rate for each parameter.
$$\begin{aligned}
s\_{t+1,i} &= s\_{t,i} + \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$
--
count: false
No manual tuning of the learning rate.
Typical default values: $\eta=0.01$ and $\epsilon = 10^{-8}$.
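On the toy problem (a sketch, `grad_f` as above; note how each coordinate gets its own effective learning rate):
```py
import math

s1, s2 = 0.0, 0.0
eta, eps = 0.4, 1e-8
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    s1, s2 = s1 + g1 ** 2, s2 + g2 ** 2    # accumulated squared gradients
    x1 -= eta / math.sqrt(s1 + eps) * g1   # per-coordinate step size
    x2 -= eta / math.sqrt(s2 + eps) * g2
```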
.citation[Duchi et al., [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://jmlr.org/papers/v12/duchi11a.html), JMLR 2011]
---
# (2) RMSProp
Problem with Adagrad: since the sum of squared gradients only grows, the learning rate goes to zero, and the algorithm never forgets about the past.
--
count: false
Idea proposed by G. Hinton in his Coursera class: use an exponential moving average.
$$\begin{aligned}
s\_{t+1,i} &= \gamma s\_{t,i} + (1-\gamma) \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$
--
count: false
With a slight abuse of notation, we re-write the update as follows:
$$\begin{aligned}
s\_{t+1} &= \gamma s\_{t} + (1-\gamma) \nabla J(\theta\_t)^2\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{s\_{t+1}+\epsilon}}\nabla J(\theta\_t)
\end{aligned}$$
Typical values: $\gamma = 0.9$ and $\eta = 0.001$.
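The sketch differs from Adagrad only in how $s$ is accumulated:
```py
import math

s1, s2 = 0.0, 0.0
eta, gamma, eps = 0.4, 0.9, 1e-8
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2  # exponential moving average:
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2  # old gradients are forgotten
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
```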
.citation[Hinton [Coursera lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) ]
---
# (2) Adam
Mixing ideas from RMSProp and momentum, we get Adam = Adaptive Moment Estimation.
$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{m}\_{t+1} &= \frac{m\_{t+1}}{1-\beta\_1^{t+1}}\\\\
\hat{v}\_{t+1} &= \frac{v\_{t+1}}{1-\beta\_2^{t+1}}\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} \hat{m}\_{t+1}
\end{aligned}$$
$\hat{m}\_t$ and $\hat{v}\_t$ are estimates for the first and second moments of the gradients. Because $m_0=v_0=0$, these estimates are biased towards $0$; the factors $(1-\beta\_1^{t+1})^{-1}$ and $(1-\beta\_2^{t+1})^{-1}$ are here to counteract these biases.
--
count: false
Typical values: $\beta_1=0.9$, $\beta_2 =0.999$ and $\epsilon=10^{-8}$.
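A sketch on the toy problem (with $t$ starting at $1$ so the bias corrections match the update above):
```py
import math

m1 = m2 = v1 = v2 = 0.0
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
x1, x2 = -5.0, -2.0
for t in range(1, 21):
    g1, g2 = grad_f(x1, x2)
    m1, m2 = b1 * m1 + (1 - b1) * g1, b1 * m2 + (1 - b1) * g2            # first moment
    v1, v2 = b2 * v1 + (1 - b2) * g1 ** 2, b2 * v2 + (1 - b2) * g2 ** 2  # second moment
    mh1, mh2 = m1 / (1 - b1 ** t), m2 / (1 - b1 ** t)  # bias-corrected estimates
    vh1, vh2 = v1 / (1 - b2 ** t), v2 / (1 - b2 ** t)
    x1 -= eta / (math.sqrt(vh1) + eps) * mh1
    x2 -= eta / (math.sqrt(vh2) + eps) * mh2
```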
.citation[Kingma et al. , [Adam: a Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980), ICLR 2015]
---
# (2) AMSGrad
Sometimes Adam forgets too fast; to fix this, we replace the exponential moving average of the second moment by a running $\max$:
$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{v}\_{t+1} &= \max\left(\hat{v}\_{t}, v\_{t+1}\right)\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} m\_{t+1}
\end{aligned}$$
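The sketch keeps Adam's moments but takes a running max (note that the update above uses $m\_{t+1}$ without bias correction):
```py
import math

m1 = m2 = v1 = v2 = vh1 = vh2 = 0.0
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = grad_f(x1, x2)
    m1, m2 = b1 * m1 + (1 - b1) * g1, b1 * m2 + (1 - b1) * g2
    v1, v2 = b2 * v1 + (1 - b2) * g1 ** 2, b2 * v2 + (1 - b2) * g2 ** 2
    vh1, vh2 = max(vh1, v1), max(vh2, v2)  # running max: step sizes can only shrink
    x1 -= eta / (math.sqrt(vh1) + eps) * m1
    x2 -= eta / (math.sqrt(vh2) + eps) * m2
```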
.citation[Reddi et al.,[On the Convergence of Adam and Beyond](https://openreview.net/forum?id=ryQu7f-RZ) ICLR 2018]
---
# (3) [PyTorch optimizers](https://pytorch.org/docs/stable/optim.html)
All have a similar constructor `torch.optim.*(params, lr=..., momentum=...)`.
Default values differ from one optimizer to another; check the doc.
`params` should be an iterable (like a list) containing the parameters to optimize over.
It can be obtained from any module with `module.parameters()`.
The `step` method updates the internal state of the optimizer according to the `grad` attributes of the `params`, and updates the latter according to the internal state.
```py
criterion = nn.NLLLoss()
optimizer_vgg = torch.optim.SGD(model_vgg.classifier[6].parameters(), lr=0.001)

def train_model(model, dataloader, size, epochs=1, optimizer=None):
    model.train()
    for epoch in range(epochs):
        for inputs, classes in dataloader:
            inputs = inputs.to(device)
            classes = classes.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, classes)
            optimizer.zero_grad()  # clear gradients from the previous step
            loss.backward()        # fill the .grad attribute of each parameter
            optimizer.step()       # update the parameters from .grad
```
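Switching optimizers is then a one-line change; for instance (a sketch, with hyper-parameters chosen for illustration):
```py
# Adam with the typical values from the slides
optimizer_vgg = torch.optim.Adam(model_vgg.classifier[6].parameters(), lr=0.001)
# or SGD with momentum and Nesterov's correction
optimizer_vgg = torch.optim.SGD(model_vgg.classifier[6].parameters(),
                                lr=0.001, momentum=0.9, nesterov=True)
```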
---
class: end-slide, center
count: false
The end.
</textarea>
<script src="./assets/remark-latest.min.js"></script>
<script src="./assets/auto-render.min.js"></script>
<script src="./assets/katex.min.js"></script>
<script type="text/javascript">
function getParameterByName(name, url) {
if (!url) url = window.location.href;
name = name.replace(/[\[\]]/g, "\\$&");
var regex = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)"),
results = regex.exec(url);
if (!results) return null;
if (!results[2]) return '';
return decodeURIComponent(results[2].replace(/\+/g, " "));
}
//var options = {sourceUrl: getParameterByName("p"),
// highlightLanguage: "python",
// // highlightStyle: "tomorrow",
// // highlightStyle: "default",
// highlightStyle: "github",
// // highlightStyle: "googlecode",
// // highlightStyle: "zenburn",
// highlightSpans: true,
// highlightLines: true,
// ratio: "16:9"};
var options = {sourceUrl: getParameterByName("p"),
highlightLanguage: "python",
highlightStyle: "github",
highlightSpans: true,
highlightLines: true,
ratio: "16:9",
slideNumberFormat: (current, total) => `
<div class="progress-bar-container">${current}/${total} <br/><br/>
<div class="progress-bar" style="width: ${current/total*100}%"></div>
</div>
`};
var renderMath = function() {
renderMathInElement(document.body, {delimiters: [ // mind the order of delimiters(!?)
{left: "$$", right: "$$", display: true},
{left: "$", right: "$", display: false},
{left: "\\[", right: "\\]", display: true},
{left: "\\(", right: "\\)", display: false},
]});
}
var slideshow = remark.create(options, renderMath);
</script>
</body>
</html>