Approximate dynamic programming
Approximate dynamic programming is an algorithmic strategy for solving complex dynamic problems (usually under uncertainty). In congtrast strategy of stepping backward in time, ADP steps forward in time, visiting states and updating the value of being in the state.
At the heart of ADP is typically an approximate of the value of being in a state. We have found that is often useful to use an idea known as the "post-decision" state variable - this is the state immediately after we make a decision, but before we make a new decision. So, if we have a quantity R_t units on hand at time t (blood, money, vaccines, people) and we order x_t units of additional resources, we would write our post-decision state as
R^x_t = R_t + x_t.
Now assume we observe a demand Dhat_{t+1} which is the demand between t and t+1. Our inventory just before we make a decision at time t+1 would be
R_{t+1} = max(0,R^x_t - Dhat_{t+1}).
A peek on my book on this subject (see chapter 4, sections 4.2 and 4.3, and chapter 5, the section on state variables), click here.
Post-decision states take many forms depending on the application. The power of the post-decision state is purely computational. The classical form of Bellman's equation looks like:
V_t(S_t) = max_x(C(S_t,x_t) + E V_{t+1}(S_{t+1})
where S_t is the state of the system at time t, and x_t is our decision at time t. The state S_{t+1} involves random quantities (e.g. random demands, prices, equipment failures, disease outbreaks) and as a result we have to take an expectation. We compute this state using
S_{t+1}=S^M(S_t,x_t,W_{t+1})
where W_{t+1} is a generic variable for new information arriving between t and t+1. The function S(...) is the transition function (or state transition function, or system model, or just model)
For many problems, the expectation in Bellman's equation can be computationally very difficult, especially if there is a vector of random variables. We avoid this by defining the post-decision state. We assume that we have a function that tells us the state of the system immediately after we have made a decison. We write the function using:
S^x_t = S^{M,x}(S_t,x_t)
We then compute the next pre-decision state using
S_{t+1} = S^{M,W}(S^x_t,W_{t+1})
The next step is to approximate the value function. In classical ADP, we would estimate an approximation around the pre-decision state. Instead, we try to find an approximation, call it Vbar, around the post-decision state S^x_t. This means that we would have to solve
V_t(S_t) = max_x(C(S_t,x_t) + Vbar_t(S^x_t(S_t,x_t))
This is a deterministic problem, and we can solve this using any of a vast range of deterministic algorithms developed by the operations research community.
There are, then, three key steps in our ADP strategy:
------------------------
Step 1: Starting with a post-decision state S^x_{t-1}, we simulate our way to the next pre-decision state using
S_t = S^{M,W}(S^x_{t-1},W_t(omega^n))
where W_t(omega^n) is a sample realization of the information arriving between t-1 and t.
Step 2: Deterministic optimization: Solve
vhat^n_t= max_x(C(S_t,x_t) + Vbar_t(S^x_t(S_t,x_t))
using any of a range of deterministic optimization algorithms.
Step 3: Statistics - use the results of our optimization at time t to update the post-decision value function Vbar_{t-1}(S^x_{t-1}).
Return to step 1.
---------------------------------
The strategy, then, involves classical simulation in step 1, classical (deterministic) optimization in step 2, and classical statistics in step 3.
The strategy is extremely simple, but it hides a host of theoretical and computational issues.