Series Foreword   xiii

Preface   xv

I  The Problem   1
1  Introduction   3
   1.1  Reinforcement Learning   3
   1.2  Examples   6
   1.3  Elements of Reinforcement Learning   7
   1.4  An Extended Example: Tic-Tac-Toe   10
   1.5  Summary   15
   1.6  History of Reinforcement Learning   16
   1.7  Bibliographical Remarks   23
2  Evaluative Feedback   25
   2.1  An n-Armed Bandit Problem   26
   2.2  Action-Value Methods   27
   2.3  Softmax Action Selection   30
   2.4  Evaluation Versus Instruction   31
   2.5  Incremental Implementation   36
   2.6  Tracking a Nonstationary Problem   38
   2.7  Optimistic Initial Values   39
   2.8  Reinforcement Comparison   41
   2.9  Pursuit Methods   43
   2.10 Associative Search   45
   2.11 Conclusions   46
   2.12 Bibliographical and Historical Remarks   48
|
3  The Reinforcement Learning Problem   51
   3.1  The Agent-Environment Interface   51
   3.2  Goals and Rewards   56
   3.3  Returns   57
   3.4  Unified Notation for Episodic and Continuing Tasks   60
   3.5  The Markov Property   61
   3.6  Markov Decision Processes   66
   3.7  Value Functions   68
   3.8  Optimal Value Functions   75
   3.9  Optimality and Approximation   80
   3.10 Summary   81
   3.11 Bibliographical and Historical Remarks   83
II  Elementary Solution Methods   87

4  Dynamic Programming   89
   4.1  Policy Evaluation   90
   4.2  Policy Improvement   93
   4.3  Policy Iteration   97
   4.4  Value Iteration   100
   4.5  Asynchronous Dynamic Programming   103
   4.6  Generalized Policy Iteration   105
   4.7  Efficiency of Dynamic Programming   107
   4.8  Summary   108
   4.9  Bibliographical and Historical Remarks   109
5  Monte Carlo Methods   111
   5.1  Monte Carlo Policy Evaluation   112
   5.2  Monte Carlo Estimation of Action Values   116
   5.3  Monte Carlo Control   118
   5.4  On-Policy Monte Carlo Control   122
   5.5  Evaluating One Policy While Following Another   124
   5.6  Off-Policy Monte Carlo Control   126
   5.7  Incremental Implementation   128
   5.8  Summary   129
   5.9  Bibliographical and Historical Remarks   131
|
6  Temporal-Difference Learning   133
   6.1  TD Prediction   133
   6.2  Advantages of TD Prediction Methods   138
   6.3  Optimality of TD(0)   141
   6.4  Sarsa: On-Policy TD Control   145
   6.5  Q-Learning: Off-Policy TD Control   148
   6.6  Actor-Critic Methods   151
   6.7  R-Learning for Undiscounted Continuing Tasks   153
   6.8  Games, Afterstates, and Other Special Cases   156
   6.9  Summary   157
   6.10 Bibliographical and Historical Remarks   158
III  A Unified View   161

7  Eligibility Traces   163
   7.1  n-Step TD Prediction   164
   7.2  The Forward View of TD(λ)   169
   7.3  The Backward View of TD(λ)   173
   7.4  Equivalence of Forward and Backward Views   176
   7.5  Sarsa(λ)   179
   7.6  Q(λ)   182
   7.7  Eligibility Traces for Actor-Critic Methods   185
   7.8  Replacing Traces   186
   7.9  Implementation Issues   189
   7.10 Variable λ   189
   7.11 Conclusions   190
   7.12 Bibliographical and Historical Remarks   191
|
8  Generalization and Function Approximation   193
   8.1  Value Prediction with Function Approximation   194
   8.2  Gradient-Descent Methods   197
   8.3  Linear Methods   200
   8.4  Control with Function Approximation   210
   8.5  Off-Policy Bootstrapping   216
   8.6  Should We Bootstrap?   220
   8.7  Summary   222
   8.8  Bibliographical and Historical Remarks   223
9  Planning and Learning   227
   9.1  Models and Planning   227
   9.2  Integrating Planning, Acting, and Learning   230
   9.3  When the Model Is Wrong   235
   9.4  Prioritized Sweeping   238
   9.5  Full vs. Sample Backups   242
   9.6  Trajectory Sampling   246
   9.7  Heuristic Search   250
   9.8  Summary   252
   9.9  Bibliographical and Historical Remarks   254
|
10  Dimensions of Reinforcement Learning   255
   10.1 The Unified View   255
   10.2 Other Frontier Dimensions   258
11  Case Studies   261
   11.1 TD-Gammon   261
   11.2 Samuel's Checkers Player   267
   11.3 The Acrobot   270
   11.4 Elevator Dispatching   274
   11.5 Dynamic Channel Allocation   279
   11.6 Job-Shop Scheduling   283
References   291

Summary of Notation   313

Index   315