<!DOCTYPE html>
<html>
<head>
<title>Underactuated Robotics: Policy
Search</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="canonical" href="http://underactuated.mit.edu/policy_search.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('underactuated');">
<div data-type="titlepage">
<header>
<h1><a href="index.html" style="text-decoration:none;">Underactuated Robotics</a></h1>
<p data-type="subtitle">Algorithms for Walking, Running, Swimming, Flying, and Manipulation</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2021<br/>
      Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p><b>Note:</b> These are working notes used for <a
href="http://underactuated.csail.mit.edu/Spring2021/">a course being taught
at MIT</a>. They will be updated throughout the Spring 2021 semester. <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.</p>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=feedback_motion_planning.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=robust.html>Next Chapter</a></td>
</tr></table>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 12"><h1>Policy
Search</h1>
<p>So far, most of our recommendations for control design have been
relatively "local" -- leveraging trajectory planning/optimization as a tool
and our ability to locally stabilize trajectories for even very complex
systems using linear optimal control. This is in stark contrast to the
dynamic programming / value iteration methods that we started with, which
attempt to solve for a control policy for every possible state;
unfortunately, the dynamic programming methods as presented are restricted to
  relatively low-dimensional state spaces. What is missing so far are
algorithms for synthesizing feedback controllers that scale to large state
spaces and produce controllers that are, hopefully, less "local" than
trajectory stabilization.</p>
<p>In this chapter, we will explore another very natural idea: let us
parameterize a controller with some decision variables, and then search over
those decision variables directly in order to achieve a task and/or optimize
a performance objective. We'll refer to this broad class of methods as
"policy search" or, when optimization methods are used, "policy
optimization". </p>
<section><h1>Problem formulation</h1>
<p>Consider a static full-state feedback policy, $$\bu =
\bpi_\balpha(\bx),$$ where $\bpi$ is potentially a nonlinear function, and
  $\balpha$ is the vector of parameters that describe the controller. The
  controller might take time as an input, or might even have its own internal
  state, but let's start with this simple form. </p>
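  <p>To make this concrete, here is a minimal sketch (the function names and the
  two particular forms are only illustrative) of two common parameterizations: a
  linear feedback gain with its entries stacked into $\balpha$, and a small
  neural network whose weights and biases are packed into $\balpha$. Either way,
  the search algorithms in this chapter only ever see the finite vector of
  decision variables $\balpha$.</p>
  <pre><code class="language-python">import numpy as np

def pi_linear(alpha, x):
    """Linear policy u = -K x, with the entries of K stacked into alpha."""
    K = alpha.reshape((1, x.size))   # single-input example
    return -K @ x

def pi_mlp(alpha, x, hidden=8):
    """One-hidden-layer tanh network; alpha packs W1, b1, W2, b2 in order."""
    n = x.size
    W1 = alpha[:hidden * n].reshape((hidden, n))
    b1 = alpha[hidden * n:hidden * (n + 1)]
    W2 = alpha[hidden * (n + 1):hidden * (n + 2)].reshape((1, hidden))
    b2 = alpha[hidden * (n + 2):]
    return W2 @ np.tanh(W1 @ x + b1) + b2
</code></pre>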
<p>How should we write an objective function for optimizing $\balpha$? The
approach that we used for trajectory optimization is quite reasonable --
the objective was typically to minimize an integral cost over some time
horizon (be it finite or infinite). But in trajectory optimization, the
cost is only ever defined based on forward simulation from a single initial
condition. We used the same additive cost structures in dynamic
  programming, where the Hamilton-Jacobi-Bellman equation provided optimality
conditions for optimizing an additive cost from <i>every</i> initial
condition; at least in the idealized equations, we were able to get away
with saying $\forall \bx, \minimize_\bu ...$. </p>
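  <p>As a reminder, for the infinite-horizon additive cost
  $\int_0^\infty \ell(\bx,\bu)\,dt$ those optimality conditions took the form
  $$\forall \bx, \quad 0 = \min_\bu \left[ \ell(\bx,\bu) + \frac{\partial
  J^*}{\partial \bx} f(\bx,\bu) \right],$$ with the minimization required to
  hold at every state simultaneously.</p>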
  <p>But now we are playing a different game. If we are searching over some
  finitely parameterized policy, $\bpi_{\balpha}$, we can almost never expect
  to be optimal for every state -- so we need to somehow define the relative
  importance of different states. For the finite-time case, a natural choice is
  a distribution over initial conditions. For the infinite-horizon case, what
  really matters is the stationary distribution of states under the policy
  (which itself depends on the policy). Let's start with the distribution over
  initial conditions.</p>
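  <p>One concrete way to write such an objective (just a sketch of the
  finite-horizon case, writing the running cost as $\ell(\bx,\bu)$ and the
  distribution over initial conditions as $\rho$) is $$\min_\balpha \;
  E_{\bx(0) \sim \rho} \left[ \int_0^T \ell\big(\bx(t),
  \bpi_\balpha(\bx(t))\big)\, dt \right],$$ where the expectation over initial
  conditions replaces the "for every $\bx$" of dynamic programming with an
  average over the states we care about.</p>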
</section>
<section><h1>Controller parameterizations</h1>
<p><i>more coming soon...</i></p>
    <p>It might be tempting to optimize an LQR problem by searching directly over the feedback parameters, ${\bf K}$. It was shown only quite recently that this can achieve the optimal ${\bf K}$ <elib>Fazel18</elib>. But other, and likely more efficient, parameterizations are also known <elib>Roberts11</elib>.</p>
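    <p>As a small numerical illustration of the first point (a toy
    double-integrator sketch, not the specific algorithm analyzed in
    <elib>Fazel18</elib>), we can evaluate the expected LQR cost of a candidate
    gain ${\bf K}$ in closed form via a Lyapunov equation and descend on it
    directly, then compare against the Riccati solution:</p>
    <pre><code class="language-python">import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Discrete-time double integrator (an arbitrary illustrative system).
dt = 0.1
A = np.array([[1., dt], [0., 1.]])
B = np.array([[0.], [dt]])
Q, R = np.eye(2), np.array([[1.]])

def cost(K, Sigma0=np.eye(2)):
    """Expected infinite-horizon cost of u = -Kx, for x[0] with covariance Sigma0."""
    Acl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        return np.inf  # unstable closed loop
    # P solves Acl' P Acl - P + (Q + K' R K) = 0, so J = trace(P Sigma0).
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    return np.trace(P @ Sigma0)

# Direct search over the entries of K: finite-difference gradient descent
# with a crude backtracking step size.
K = np.array([[0.5, 0.5]])   # a stabilizing initial guess
eps = 1e-5
for i in range(200):
    grad = np.zeros_like(K)
    for idx in np.ndindex(K.shape):
        dK = np.zeros_like(K)
        dK[idx] = eps
        grad[idx] = (cost(K + dK) - cost(K - dK)) / (2 * eps)
    step = 0.1
    while cost(K - step * grad) > cost(K) and step > 1e-8:
        step /= 2
    K = K - step * grad

S = solve_discrete_are(A, B, Q, R)
print("K from direct search:", K)
print("K from Riccati      :", np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A))
</code></pre>
    <p>Note that the cost is not a convex function of ${\bf K}$, which is part
    of what makes the convergence result in <elib>Fazel18</elib> interesting.</p>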
<p>The case of output feedback is not as nice. Searching directly over static output feedback controllers, $\bu = -{\bf K}\by$, is known to be hard -- even the set of stabilizing ${\bf K}$ matrices can be disconnected. We can see that with a simple example (given to me once during a conversation with Alex Megretski).
</p>
<example><h1>Parameterizations of Static Output Feedback</h1>
<p>Consider the single-input, single-output LTI system $$\dot{\bx} = {\bf A}\bx + {\bf B} u, \quad y = {\bf C}\bx,$$ with $${\bf A} = \begin{bmatrix} 0 & 0 & 2 \\ 1 & 0 & 0 \\ 0 & 1 & 0\end{bmatrix}, \quad {\bf B} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad {\bf C} = \begin{bmatrix} 1 & 1 & 3 \end{bmatrix}.$$ Here the linear static-output-feedback policy can be written as $u = -ky$, with a single scalar parameter $k$. </p>
      <p>Go ahead and make a plot of the largest real part of the closed-loop eigenvalues, as a function of $k$. The system is stable only when this value is negative. You'll find that the set of stabilizing $k$'s is disconnected.</p>
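      <p>A short numerical sketch of that plot (the range of $k$ scanned here is
      an arbitrary choice):</p>
      <pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt

A = np.array([[0., 0., 2.], [1., 0., 0.], [0., 1., 0.]])
B = np.array([[1.], [0.], [0.]])
C = np.array([[1., 1., 3.]])

ks = np.linspace(-1., 5., 1200)
max_real = [np.max(np.linalg.eigvals(A - k * B @ C).real) for k in ks]

plt.plot(ks, max_real)
plt.axhline(0., color='gray', linestyle='--')
plt.xlabel('k')
plt.ylabel('max real part of closed-loop eigenvalues')
plt.show()
</code></pre>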
</example>
</section>
<section><h1>Trajectory-based policy search</h1>
</section>
  <section><h1>Lyapunov-based approaches to policy
  search</h1>
</section>
<section><h1>Approximate Dynamic Programming</h1>
</section>
<!-- TODO:
Guided Policy Search, etc (see Robert V's review);
PILCO and PIPPS (arxiv from Doya).
-->
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Fazel18>
<span class="author">Maryam Fazel and Rong Ge and Sham M. Kakade and Mehran Mesbahi</span>,
<span class="title">"Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator"</span>,
        <span class="year">2018</span>.
</li><br>
<li id=Roberts11>
<span class="author">John Roberts and Ian Manchester and Russ Tedrake</span>,
<span class="title">"Feedback Controller Parameterizations for Reinforcement Learning"</span>,
<span class="publisher">Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)</span>, <span class="year">2011</span>.
[ <a href="http://groups.csail.mit.edu/robotics-center/public_papers/Roberts11.pdf">link</a> ]
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=feedback_motion_planning.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=robust.html>Next Chapter</a></td>
</tr></table>
<div id="footer">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2021</td></tr>
</table>
</div>
</body>
</html>