ORF 544 – Stochastic Optimization and Learning

Professor Warren Powell

Department of Operations Research and Financial Engineering
Spring, 2019

Fridays, 1:30-4:30

If you are interested in the course, please sign up at

ORF 544 signup sheet

Making decisions under uncertainty is a universal human activity, something we have all done our entire lives, and generally something we do every day.  The study of this rich problem area can be organized under a broad umbrella called “stochastic optimization.”  Normally taught as a mathematically deep topic, ORF 544 will be taught with a primary emphasis on  Jungle proper modeling, and the design and analysis of practical algorithms.

The course will present a unified framework for stochastic optimization that cuts across the many communities that contribute to the general problem of the design and control of systems under uncertainty (I call this the “jungle of stochastic optimization”). This modeling framework has been tested in problem domains spanning transportation and logistics, energy, health, finance, internet search, and even the laboratory sciences.

Audience: The course is appropriate for students in operations research, economics, computer science, applied math, and any field of engineering (e.g. for students interested in engineering controls). It is open to undergraduates with a strong interest in models and algorithms.  The course requires a basic background in probability and statistics at the undergraduate level (e.g. ORF 245 – if you have ORF 309, even better). There is a small amount of material where a background in linear programming is useful, but this will not be required on problem sets or exams. I will occasionally bridge to more advanced probabilistic concepts, but this will be aimed at students without this background and is not required.

Readings: The course will be taught from a new book, Stochastic Optimization and Learning: A Unified Framework, that is being written. To download the book:

Click here to download book (updated Jan 7 2018)

Please enter any comments/suggestions at


The book will introduce students to different notational systems, and will cover problems, modeling frameworks and algorithmic strategies from all of the books shown above.

Course themes:

  • Covering the major problem classes in stochastic optimization, differentiated on the nature of
    • State variables
    • Decision variables
    • Exogenous information
    • Transition functions
    • Objective functions:
      • Finite/infinite horizon
      • Derivative-free, derivative based
      • Convexity, continuity, …
      • Online vs. offline
      • Uncertainty operators
  • Proper notation – We will cover different notational systems, and discuss issues such as the proper modeling of states and time, modeling uncertainty, and expressing the objective function.
  • We will provide a unified framework covering:
    • Staging: from static to sequential to contextual
    • Timing: from finite horizon to infinite horizon
    • From offline to online
  • The major solution strategies:
    • Stochastic (policy) search (derivative-free and derivative-based)
      • Analytical policy function approximations
      • Parametric cost function approximations
    • Lookahead approximations
      • Solving an approximate lookahead model
      • Using value function approximations
  • Approximation strategies
    • Function approximations (using machine learning)
    • Sampled approximations
  • Bayesian and frequentist modeling
  • The use of Monte Carlo simulation methods
    • Solution via the use of sampled models
    • Search via sample path methods (e.g. stochastic gradients)
  • Uncertainty operators
    • Types (expectations, risk measures, robust operators)
    • Effect on modeling and computation.

We will illustrate ideas through a combination of standard models and real applications. Some of the standard models will include:

  • Newsvendor problem
  • Inventory/storage problems
  • Optimal stopping
  • Ranking and selection/bandit problems
  • Stochastic shortest paths
  • Stochastic linear programs
  • Statistical model fitting (offline and online)
  • Linear quadratic control
  • Black-box optimization
    • Simulation optimization (optimizing discrete event simulators)
    • Optimization of expensive, deterministic functions

Real applications will be drawn based on the interests of the class.


Course outline:

Week 1 – Overview and canonical problems

Week 2 – Approximation strategies

Week 3 – The basic stochastic optimization problem and derivative-based stochastic optimization

Week 4 – Derivative-free stochastic optimization

Week 5 – State-dependent applications and modeling

Week 6 – Modeling uncertainty and designing policies

Week 7 – Policy search

Week 8 – Discrete Markov decision processes

Week 9 – Structured dynamic programming and backward ADP

Week 10-11 – Forward approximate dynamic programming

Week 12 – Direct lookahead policies

Week 1 – Sept 13

Overview of computational stochastic optimization


  • The major problem classes in stochastic optimization:
    • Types of state variables
      • Initial state, prior, context
      • Physical, informational and knowledge (belief) states
      • Pre- and post-decision states
      • Flat vs. factored representations
    • Types of decisions
      • Scalar
        • Discrete/continuous
      • Vector
        • Discrete/continuous/mixed
      • Subset selection
      • Multiattribute
    • Types of uncertainty that arises in real applications, and how to model them
    • What we know about transition functions
    • Objective functions
  • Sequencing decisions and information
    • Static (decision-information)
    • Contextual-static (information-decision-information)
    • Two-stage (decision-information-decision)
    • Fully sequential
      • Finite horizon
      • Infinite horizon
  • Notational systems:
    • Discrete Markov decision processes
    • Stochastic control
    • Stochastic programming
    • Stochastic search
  • Offline and online optimization
  • Types of uncertainty
    • Classes of uncertainty
    • Types of distributions
  • Transition functions
    • Model-based:
      • Transition matrix known – common on CS/RL community.
      • Transition function known (but transition matrix not computable) – Common in controls community.
    • Model-free
      • Transition function unknown
  • Objectives
    • Finite/infinite horizon
    • Derivative-free, derivative based
    • Inexpensive/expensive
    • Mathematical properties (convexity, continuity, monotonicity, …)
    • Online vs. offline
    • Uncertainty operators (expectation, risk measures, robust,…)
      • Expected cost/reward
        • Discounted
        • Average reward
      • Risk measures
        • Coherent risk measures
        • Time consistency
      • Quantiles
      • Robust optimization
    • Metric
      • Cost/reward
        • Undiscounted
        • Discounted
        • Average
      • Opportunity cost
      • Regret
      • Probability correct selection
      • Entropy minimization
      • ….
  • The myths of stochastic optimization
    • Dynamic programming suffers from the curse of dimensionality
    • Solving a (multistage) stochastic program is “optimal”
    • Markov vs. history-dependent dynamic programs
    • An asymptotically optimal algorithm is provides “good” solutions in practice (that is, finite time)
    • If a policy enjoys a tight regret bound it is therefore “good”
    • Understanding that a variable is F_t-measurable is important


Readings (* indicates required readings)

*Optimization under Uncertainty: A Unified Framework, Chapter 1

*Optimization under Uncertainty: A Unified Framework, Chapter 2



Canonical problems

In this lecture I will cover a number of classical, canonical problems that are familiar to different communities working in stochastic optimization. Examples include:

  • The fundamental stochastic optimization problem (final and cumulative reward formulations)
  • The multiarmed bandit problem
  • Decision trees
  • Inventory/storage problems
  • Two-stage stochastic programming
  • Chance-constrained optimization
  • Stochastic shortest paths
  • Optimal stopping
  • Markov decision problems
  • Optimal control
  • Multistage stochastic programming
  • Dynamic assignment problem

We will close by presenting a simple canonical framework for all of these problems.

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter 2

Week 2 – Sept 27-29 – Approximation strategies

Stochastic optimization often (although not always) depends on using statistical methods to approximate functions. The presentation will cover some of the most popular methods used in statistics and machine learning, with emphasis on adaptive/recursive learning methods.


  • Approximation challenges in stochastic optimization
    • Expectations of (possibly unknown) functions
    • Policies
    • Value function approximations
  • Lookup table approximations
    • Frequentist and Bayesian updating strategies
    • Correlated beliefs
    • Hierarchical representations
  • Parametric models
    • Recursive linear models
    • Sparse additive models (brief)
    • Nonlinear models
    • Neural networks
  • Nonparametric models
    • Kernel methods
    • Deep neural networks (brief)
    • Support vector machines
  • Hybrids
    • Locally linear models
    • Tree regression

Readings (* indicates required readings)

*Optimization under Uncertainty: A Unified Framework, Chapter 3


Week 3

Oct 4 – The quintessential stochastic optimization problem

The “quintessential” stochastic optimization problem is the starting point in stochastic optimization. Here we introduce different flavors of the problem and discuss variations of objective functions.


  • Introduction to the fundamental stochastic optimization problem
  • Deterministic methods for stochastic optimization
    • A “stochastic” shortest path problem
    • Newsvendor with known demands
    • Optimal control (LQR)
  • Solving sampled representations
    • Monte Carlo sampling
    • Creating sampled belief models
    • Sample average approximation
  • Two-stage stochastic programming
  • Adaptive learning
    • Modeling adaptive learning problems
    • Objective functions for learning
    • Designing policies

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter 4


Oct 6 – Derivative-based stochastic optimization

This is arguably the most fundamental of all stochastic optimization problems, which is typically written

\min_x \E F(x,W)

where W is a random variable. This week, we consider the case where we can compute derivatives of the function F(x,W) given W. This is both an important problem in its own right, as well as the foundation for performing policy search for sequential problems.


  • Problem classes
    • Newsvendor
    • General nonlinear (convex) optimization
    • Linear programs
  • Stochastic approximation methods
    • Stochastic gradients
    • Variations
      • Smoothing
      • Mirror descent
  • Asymptotic convergence analysis
  • Finite-time formulation and analysis

Derivative-based stochastic optimization is both an important problem in its own right (static stochastic optimization), as well as the foundational problem for performing policy search for sequential problems. Furthermore, finite-horizon versions can themselves be formulated as dynamic programs.

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter 5 and 6.

W. B. Powell, Approximate Dynamic Programming – Chapter 8 – Policy search – Sections 8.2 and 8.3 and the proofs in section 8.9.

Shapiro, Dentcheva, Ruszczynski, Lecture notes on stochastic programming, 2nd edition (2014). – Section 5.9.

A. George, W. B. Powell, Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning, 2006. – An overview of popular deterministic and stochastic stepsize formulas, and a new “optimal” stepsize formula based on signal processing concepts.

Mark Broadie, Deniz Cicek, Assaf Zeevi – General Bounds and Finite-Time Improvement for the Kiefer-Wolfowitz Stochastic Approximation Algorithm – This paper relaxes a key condition for stepsizes, namely the requirement that the sum of the squares of stepsizes has to be bounded.

Week 4 -Oct 11-13 – Derivative-free stochastic optimization

Although we pointed out that derivative-based stochastic optimization can be formulated as a dynamic program, it is in the setting of derivative-free stochastic optimization where we are going to first have to depend on the concepts of sequential learning systems, with a rich set of belief models that we explore using a full pallete of policies. We are going to use this setting of pure learning problems to introduce these concepts, which we return to later in the setting of sequential problems with physical sates.



  • Optimization models
    • Offline (ranking and selection)
    • Online (“multiarmed bandit problems”)
  • Problem classes
    • Discrete alternatives
    • Continuous actions
    • Contextual problems
    • Stationary/nonstationary
  • Belief models
    • Lookup table
    • Bayesian vs. frequentist
  • Formulation as a dynamic program
    • For offline objective
    • For online objective
  • Policies for offline learning
    • PFAs – Boltzmann (soft max), epsilon-greedy
    • CFAs – Interval estimation, UCB
    • VFAs – Response surface methods, value function approximation of knowledge state
    • Lookaheads:
      • Thompson sampling
      • Value of information policies (EI, knowledge gradient, kriging)
  • Finite time vs. asymptotic analysis
  • More advanced belief models
    • Linear models
    • Nonlinear models
    • Semi-parametric and nonparametric models
  • Approximating the knowledge gradient
    • Sampled belief model
    • Resampling
    • Laplace approximation
  • Exploration vs. exploitation using parametric belief models
  • Balancing optimization and parameter estimation
  • Optimal computing budget allocation (OCBA) for simulation optimization

Derivative-free stochastic optimization is the companion of derivative-based stochastic optimization, and represents an important class of problems, both in their own right, and as a strategy for performing policy search for sequential problems (dynamic programs).

Offline versions of this problem are known in the literature as “ranking and selection” problems. Online versions (where costs/rewards are accumulated as the search progresses), are widely known as “multiarmed bandit problems.” Both can be formulated using the classical language of dynamic programming with appropriate objective functions.

Try to pay attention to whether a method is frequentist or Bayesian, as well as online or offline. Papers within a particular community do not always make these assumptions explicit. For example, “bandit” papers are almost always (99.9 percent) for online problems.

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter 7

Readings for offline learning:

For a general introduction (written at a undergraduate level), see:

Powell and Ryzhov, Optimal Learning – Chapter 4 – The ranking and selection problem [Entire chapter]

Powell and Ryzhov, Optimal Learning – Chapter 5 – The knowledge gradient policy [Exclude section 5.5]

A summary of a number of heuristics and a their experimental evaluation, see:

Wang, Powell – Finite-time analysis for the knowledge-gradient policy and a new testing environment for optimal learning – Focus on section 6 and onward, describing the MOLTE testing environment.

For a nice theoretical analysis of the knowledge gradient policy, see:

P. Frazier, W. Powell, S. Dayanik – A Knowledge Gradient Policy for Sequential Information Collection – This paper provides a rigorous model and the asymptotic analysis of the knowledge gradient policy. This is a good paper for learning how to properly model a Bayesian-based optimal learning problem. Read this paper thoroughly.

Frazier, P. I., and W. B. Powell, “Paradoxes in Learning and the Marginal Value of Information,” Decision Analysis, Vol. 7, No. 4, pp. 378-403, 2010. – Establishes the nonconcavity of the value of information. This happens a lot, so pay attention to this. Think about the limitations of the proposed KG(*) policy.

Recent paper on rate of convergence of expected value of information (EI) policies

Ryzhov – On the convergence rates of expected improvement methods March 2016 (to appear in Operations Research) – Highlights:

  • Convergence analysis is done with sampling ratios rather than regret bounds. This characterizes the relative rate of sampling of different options.
  • Paper develops sampling ratios for several EI variants.
  • For problems with known variance, shows that sampling ratios for EI and OCBA are the same.

Readings for online learning:

Auer, Cesa-Bianchi, Fischer – Finite-time analysis of the Multiarmed Bandit Problem – Very nice paper building on the seminal paper by Lai and Robbins, who had shown that there are policies that asymptotically achieve the minimal regret. This paper shows that there are policies that can uniformly achieve the minimal regret. The paper only considers problems where the outcomes are 0 or 1.

*Bubeck and Cesa-Bianchi – Regret Analysis of Stochastic and Non-Stochastic Multi-Armed Bandit Problems – Read Sections 1, 2 and skim 7.

*Powell and Ryzhov, Optimal Learning – Chapter 6 – Introduction to bandit problems and an overview of the major solution approaches (Gittins indices, knowledge gradient for online learning, UCB policies)

For a sample of the literature for simulation-optimization (frequentist):

Chun-Hung Chen, Donghai He, Michael Fu, Loo Hay Lee, Efficient Simulation Budget Allocation for Selecting an Optimal Subset – This is a sample of the literature for optimizing across a discrete set of designs using Monte Carlo simulation. This is a frequentist methodology, designed for efficiently using finite computing resources to manage expensive simulations of a relatively small set of designs.

Hong, J., & Nelson, B. L. (2006). Discrete Optimization via Simulation Using COMPASS. Operations Research, 54(1), 115–129. doi:10.1287/opre.1050.0237 – Illustrates a frequentist-based method for general stochastic search, popular in the simulation-optimization community.

Hong, L., & Nelson, B. L. (2007). A framework for locally convergent random-search algorithms for discrete optimization via simulation. ACM Transactions on Modeling and Computer Simulation, 17(4), 1–22. doi:10.1145/1276927.1276932 – [Skim the convergence proof.]

More advanced belief models:

D. Negoescu, P. Frazier and W. B. Powell, “The Knowledge Gradient Algorithm for Sequencing Experiments in Drug Discovery,” Informs Journal on Computing, Vol. 23, No. 3, pp. 346-363, 2011. (Online supplement)– The knowledge gradient with a linear belief model. This is the first KG paper using a parametric belief model.

M. Mes, W.B. Powell, P. Frazier, Hierarchical Knowledge Gradient for Sequential Sampling – This illustrates a nonparametric belief model constructed by using hierarchical estimates of the unknown function. Note that the knowledge gradient has to anticipate both the change in the weights, as well as the change in the estimates.

Si Chen, K-R G. Reyes, M. Gupta, M. C. McAlpine, W. B. Powell, “Optimal Learning in Experimental Design Using the Knowledge Gradient Policy with Application to Characterizing Nanoemulsion Stability,” SIAM J. Uncertainty Quantification, Vol. 3, pp. 320-345, 2015. – Nonlinear belief model with discrete prior. [Sections 1-2, but primarily section 2.4, which is where the idea of discrete priors is presented.]

Y. Wang, W.B. Powell, Knowledge Gradient for Sequential Decisions with Stochastic Binary Feedbacks – This paper illustrates using the Laplace approximation to handle a nonlinear belief model. This represents an alternative approximation strategy to the use of discrete priors. Focus on the Laplace approximation in sections 3-4.

Leonora Bianchi, Marco Dorigo, Luca Maria Gambardella, Walter J. Gutjahr, A survey on metaheuristics for stochastic combinatorial optimization, Natural Computing, Vol 8, No. 2, pp. 239-287, 2009.


Week 5 – Oct 18-20 – Physical state applications and modeling

“Dynamic programming” is most often associated with problems which involve a physical state (more precisely, where the cost/contribution function is itself a function of the state, unlikely our earlier learning problems). Given the complexity of these problems, this is where we are really going to start to learn how to model real-world problems.


  • Deterministic dynamic programs
    • Shortest path
    • A budgeting problem
  • Stochastic problems
    • Decision trees
    • Stochastic shortest paths
    • The gambling problem
    • Asset acquisition problems
    • Batch replenishment problem
    • Dynamic assignment problem
  • Information acquisition problems
    • Learning with a physical state
  • Modeling principles
    • Notation
    • Modeling time
  • Elements of a sequential decision problem
    • States
      • Initial state vs. subsequent states
      • Physical/resource states
      • Information
      • State of knowledge/belief states
    • Actions/decisions/controls
      • Different types of decisions (discrete/continuous, scalar/vector/multiattribute)
    • Exogenous information
      • Modeling stochastic information
    • Transition function
      • Different styles of transition functions
      • Model-based vs model-free
    • Objective functions
      • Continuous/monotone/convex
      • Expensive vs. inexpensive
      • Final vs. cumulative reward
  • Energy storage example

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 8-9


Week 6 – Oct 25-27 Modeling uncertainty and designing policies

Any sequential decision problem works from the triplet of starting with a state, making a decision, observing new information, and then transitioning to a new state. We start by discussing how to model uncertainty, and then transition to the challenge of designing policies.

Oct 18 – Modeling uncertainty

Stochastic optimization has traditionally focused on making the best decisions under uncertainty, where the focus is on the decisions, not the uncertainty. In practice, you will spend more time on modeling uncertainty.


  • Uncertainty mechanisms
    • Observational uncertainty
    • Inferential uncertainty
    • Prognostic uncertainty
    • Model uncertainty
    • Implementation uncertainty
  • Sample paths
    • Exogenous processes
    • State/action-dependent processes
  • Types of distributions
  • Monte Carlo simulation
  • Sampling vs. sampled models
  • Efficient sampling
    • Variance reduction methods
    • Importance sampling
    • Quantization methods
    • Sampling in high dimensions
  • Modeling adversarial problems

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 10


Oct 20 – Policies

In deterministic optimization, we look for real-valued decisions (typically vectors). In stochastic optimization, we look for functions called policies, that map states to (feasible) actions. It is here that we truly pull all the different fields of stochastic optimization together.


  • Classes of policies
    • Policy search
    • Lookahead approximations
  • Policy function approximations (PFAs)
  • Cost function approximations (CFAs)
  • Policies based on value function approximations (VFAs)
  • Direct lookahead policies (DLAs)
  • Illustration using an energy storage problem
  • Creating hybrid policies
  • Guidelines for designing policies for a particular problem

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 11


Nov 1-3 – Fall break


Week 7 – Nov 8-10 – Policy search

Policy search is used when we can specify the form of a parameterized policy, where we then have to search for the best value of the parameters. These come in two forms: analytical functions that directly map a state to an action, and parameterized cost functions where we minimize/maximize a parameterized cost function (where we may modify the objective function or the constraints).


  • Lookup table policies
  • Parametric policies
    • Boltzmann policies
    • Affine policies
    • Monotone policies
    • Neural networks
  • Online vs. offline learning
  • Gradient-based policy search
  • Gradient-free policy search
    • Review of learning policies
  • Active vs. passive policy search
  • Cost function approximations (CFAs)
    • Objective function modifications
    • Constraint modifications
    • Gradient-based search

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 12-13


Deisenroth, M. P., Neumann, G., & Peters, J. (2011). A Survey on Policy Search for Robotics. Foundations and Trends in Machine Learning, 2(1-2), 1–142. – A nice introduction to policy search, but primarily in the context of deterministic control problems.

Michael J. Hadjiyiannis, Paul J. Goulart, and Daniel Kuhn, An Efficient Method to Estimate the Suboptimality of Affine Controllers, IEEE Trans. on Automatic Control, Vol. 56, No. 12, 2011 – A peek into the world of “affine controllers” (that is, policies that are linear in the state), in the setting of discrete time optimal control.

Week 8 – Nov 15-17 – Discrete Markov decision processes

No class Nov 15

This is the classic material due to Bellman, which sets up the strategy for designing policies based on calculating the value of being in a state. This material is based on some strong assumptions: relatively small sets of discrete states and actions, with well-defined and computable expectation over the exogenous information, making it possible to calculate one-step transition matrices, the cornerstone of discrete Markov decision processes. This material also lays the foundation for approximation strategies that can be used when these assumptions are not satisfied.



  • Overview of sequential problems (with discrete actions)
    • Inventory/storage problems
    • Stopping problems
    • Stochastic shortest paths
    • Online vs. offline dynamic programs
  • Modeling sequential decision problems
    • State variables
      • Initial state (context)
      • Pre- and post-decision states
    • Decisions/actions/controls
    • Exogenous information
    • Transition function
    • Objective function
  • Major problem classes
    • Horizon
      • Finite/infinite horizon
    • Objectives
      • Expected cost/reward
        • Undiscounted/discounted
        • Average reward
      • Risk measures
      • Min max
    • Online and offline formulations
    • Model-free vs. model-based dynamic programming
  • Bellman’s optimality equations
  • Solution strategies (for discrete MDPs)
    • Backward MDP for finite horizon problems
    • Value/policy iteration for infinite horizon
  • Lookup tables and the curse of dimensionality


Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 14

Week 9

Nov 22 – Structured dynamic programs

Here we describe different classes of dynamic programs which exhibit special structure that we can exploit to find optimal policies.


  • Problems where myopic policies are optimal
  • Special cases with analytic solutions
  • Monotone policies
  • Linear quadratic regulation (optimal control)

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 15

Alberto Bemporad, Manfred Morari, Vivek Dua, Efstratios N. Pistikopoulos, The explicit linear quadratic regulator for constrained systems – A nice tutorial article on optimal control, focusing on constrained systems.

D. P. Bertsekas, “Value and Policy Iteration in Optimal Control and Adaptive Dynamic Programming,” to appear in IEEE Trans. on Neural Networks (2015). – An important overview of value and policy iteration for deterministic optimal control. Helps provide an understanding of the role of approximation error.

Books I have available in the lab include:

Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming (4th ed.). Belmont, MA: Athena Scientific. – The fourth edition of Bertsekas’ popular book on optimal control. This is the most readable treatment I have seen that deals with stochastic control, although the style will be familiar to anyone with a background in classical Markov decision processes.

Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Optimal Control (3rd ed.). Hoboken, NJ: John Wiley & Sons. – Another very popular book, Frank Lewis comes from a classical engineering controls background, which means that he focuses on deterministic models of engineering problems. He only touches on “stochastic control” late in the book, and only in a fairly casual way.

Jiongmin Yong and Xun Yu Zhou, Stochastic Controls: Hamiltonian Systems and HJB Equations, Springer, 1999 – This is a classical, heavy-math treatment of stochastic control. The treatment here focuses on theory rather than modeling and computation.

Some older classics:

Robert F. Stengel, Optimal Control and Estimation – An older, but classic, treatment of optimal control from an engineering perspective (Prof. Stengel is a professor in MAE at Princeton).

Donald E Kirk, Optimal Control Theory: An Introduction.


Nov 29 – Backward approximate dynamic programming

Classical backward dynamic programming is limited to a special class of fairly small problems. But before we give up, it is possible to introduce approximations within a backward DP procedure.


  • Basic idea of backward ADP
  • Approximation strategies
    • Linear models
    • Low-rank approximations
    • Numerical approximation methods

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 16

Because of Thanksgiving, the weeks will transition to Thursday-Tuesday pairs.

Week 10 – Dec 1-Dec 6 Forward approximate dynamic programming


  • Learning the value of a fixed policy
    • TD(0), TD(\lambda)
  • Strategies for approximating value functions
    • Lookup table and variants
    • Linear models
  • Reinforcement learning
    • TD(0) learning
    • TD(\lambda) learning
    • Q-learning
    • On- and off-policy learning
  • Bellman’s equation using a linear model
    • Fixed point results
  • Approximate value iteration
    • On-policy vs. off-policy learning
    • Recursive learning
    • Stepsizes for approximate value iteration
  • Approximate policy iteration
  • Projected Bellman-error minimization
  • Approximate dynamic programming for convex problems
    • Linear value function approximations
    • Piecewise linear, separable VFAs
    • Benders decomposition and SDDP

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapters 17-19

Week 11 – Dec 8-Dec 13- Direct lookahead policies

We finally move to designing policies using direct lookahead models, also known as model predictive control.


  • The principle of using approximate lookahead models
    • Five classes of approximations
    • Model predictive control
  • Deterministic lookahead
  • Sampled approximations
    • Basic idea
    • Issues
      • Computation
      • Generating “good” samples
      • phi-divergence
  • Monte Carlo tree search for discrete actions

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter 20


Readings from earlier editions:

Agrawal, Goyal – Analysis of Thompson Sampling for the Multiarmed bandit problem – Thompson sampling is an example of a sampled lookahead model. Very easy to use, with nice analytical properties. Be able to explain how Thompson sampling works, and the bound (but not the proof of the bound).

Kleywegt, Shapiro, Homem-de-Melo The Sample Average Approximation Method for Stochastic Discrete Optimization

Shapiro, Dentcheva, Ruszczynski, Lecture notes on stochastic programming, 2nd edition (2014). – [Read 5.3, 5.4, 5.5, 5.10]

Nemirovski, Juditsky, Lan and Shapiro – Robust Stochastic Approximation Approach to Stochastic Programming – A comparison of stochastic approximation methods to the sample average approximation method.

Bayraksan, G., & Love, D. K. (2015). INFORMS Tutorials in Operations Research, “Data-Driven Stochastic Programming Using Phi- Divergences.” Informs TutORials in Operations Research 2014, (November), 1–19.– This is a nice tutorial on the idea of phi-divergences, which is a broad class of metrics for measuring the difference between two probability distributions.

Add Romisch paper…


Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter xx

Week 8 – Nov 9-11 – Optimal control

“Stochastic control” is a field that evolved largely from the optimal control of engineering applications. It closely parallels Markov decision processes (in fact, stochastic control preceded Markov decision processes). While the mathematical formulation of stochastic control problems is general, most applications consider stochastics in the form of additive noise in the transition function.

The optimal control literature often uses continuous time formulations, which is reflected in the notational style. It also tends to assume that the underlying problem is continuous and differentiable, but nonconvex.


  • Value function approximations
    • Approximate value iteration using forward DP
    • Approximations using backward DP
  • Approximation strategies
    • Lookup table/nonparametric
    • Parametric
    • Low-rank approximations
  • Approximate policy iteration
  • Actor-critic methods
  • Active learning strategies in ADP
  • Online vs. offline learning

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter xx

Week 9 – Nov 16-18 – Approximate Dynamic Programming

Approximate dynamic programming evolved initially from the computational problems that arose when trying to use the classical methods for solving Markov decision processes (primarily), although there has been significant activity in the optimal control community (motivated by engineering applications). In this very short presentation, we will focus on the methods that evolved to approximate the classical algorithms that evolved from Markov decision processes, namely value iteration and policy iteration.


  • Value function approximations
    • Temporal difference learning (fixed policy)
    • Approximate value iteration using forward DP
    • Approximations using backward DP
  • Approximation strategies
    • Lookup table/nonparametric
    • Parametric
    • Low-rank approximations
  • Approximate policy iteration
  • Actor-critic methods
  • Active learning strategies in ADP
  • Online vs. offline learning

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter xx

*Geramifard et al, A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning – A tutorial on the use of different value function approximations, written in the language of reinforcement learning. [Read 1-3, skim 4]

*J.N. Tsitsiklis, Asynchronous Stochastic Optimization and Q-Learning – This was the first paper to bridge reinforcement learning/approximate dynamic programming with the mathematics of stochastic approximation methods. The “classic theory” this paper refers to the material in section 8.9 of Powell – Approximate Dynamic Programming.

*J.N. Tsitsiklis, Ben van Roy – Analysis of Temporal-Difference Learning with Functional Evaluation – Temporal-difference learning refers to an algorithmic strategy where we learn the value of a fixed policy (no optimization taking place).

*I. O. Ryzhov, P. I. Frazier, W. B. Powell, Optimal Stepsizes for Approximate Dynamic Programming – This paper builds on the George and Powell paper, but uses the setting of a dynamic program with a single state and action to derive an optimal stepsize for approximate value iteration algorithms (including Q-learning). [Up through section III.C]

Van Roy, B., & de Farias, D. P. (2000). On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 589–608.

*D.R. Jiang et al – A Comparison of Approximate Dynamic Programming Techniques on Benchmark Energy Storage Problems: Does Anything Work? – An extensive comparison of a variety of strategies based on approximating value functions. In short, classical machine learning methods estimated using approximate value or policy iteration do not work well when compared against optimal benchmarks. The only methods that work well are those that exploit convexity or monotonicity, but these are basically lookup table representations exploiting structure (vanilla lookup table does not work well either). [Just skim.]

Books and monographs (reference material)

*Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Belmont, MA: Athena Scientific. – This book is the first thorough treatment of approximate dynamic programming using the mathematics of stochastic approximation procedures. Click here for Chapter 5 which illustrates some useful proofs for lookup table representations.

R. Sutton and A. Barto, Reinforcement Learning, MIT Press, 2012 (second edition). – This is the bible of reinforcement learning – the first book to introduce the field. It illustrates the style and vocabulary of the field of reinforcement learning.

Szepesvari – Algorithms for reinforcement learning – A more modern but advanced monograph on reinforcement learning written by one of the top theoreticians in reinforcement learning.

D.P. Bertsekas, Dynamic programming and optimal control, Chapter 6, Approximate Dynamic Programming – This is a pre-publication version of a very thorough treatment of a class of ADP algorithms that later appeared in his update of his optimal control book.

Powell, Approximate dynamic programming, Chapter 8 – Approximating value functions (basically an overview of popular statistical methods used in ADP)

*Powell, Approximate dynamic programming, Chapter 9 – Approximating the value function for a fixed policy – This has some very important material on the use of approximating architectures. Be sure to see the material on projected Bellman minimization.

*Powell, Approximate dynamic programming, Chapter 10 – Approximating value functions while searching for an optimal policy –

J. Si, A. Barto, W.B. Powell and D. Wunsch (eds.), Learning and Approximate Dynamic Programming: Scaling up to the Real World,  John Wiley and Sons, New York, 2004.  – An edited volume from the 2002 NSF workshop, featuring contributions from computer science (reinforcement learning) and engineering controls.

Week 10 – Nov 23-25 – Stochastic programming and Benders decomposition for convex optimization

This week, we cover what could be called approximate dynamic programming for convex problems, and in particular for sequential stochastic linear programs, which is a special type of convexity (e.g. one that can be approximated using a series of cuts).


  • Two-stage stochastic programs
    • Basic model
    • Representing linear programs using cutting planes
    • Stochastic decomposition
    • Regularization methods for two-stage problems
  • Benders decomposition for multistage decomposition
    • Modeling multistage linear programs
    • The exogenous information process
      • Interstage independence
      • Markov models
    • The stochastic dual decomposition method (SDDP)
    • Regularization for multistage

Readings (* indicates required readings)

*Course textbook: Warren B. Powell, Optimization under Uncertainty: A Unified Framework, Chapter xx


  • Modeling lookahead policies
  • Approximation strategies for lookahead models
  • Base models vs. lookahead models
  • Model predictive control (deterministic lookahead)
  • Stochastic DP lookahead using latent variables
  • Sampled lookahead models
  • Monte Carlo tree search