5 Tabular RL algorithms
\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\EE}[1]{{\mathbb{E}}\left[#1\right]} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\argmax}{\mathop{\rm arg\,max}} \newcommand{\argmin}{\mathop{\rm arg\,min}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cS}{\mathcal{S}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\S}{\mathcal{S}} \newcommand{\D}{\mathcal{D}} \newcommand{\A}{\mathcal{A}} \newcommand{\cC}{\mathcal{C}} \newcommand{\X}{\mathcal{X}} \newcommand{\F}{\mathcal{F}} \newcommand{\N}{\mathcal{N}} \newcommand{\cN}{\mathcal{N}} \newcommand{\C}{\mathbb{C}} \newcommand{\I}{\mathbb{I}} \newcommand{\cK}{\mathcal{K}} \newcommand{\Z}{\mathcal{Z}} \newcommand{\O}{\mathcal{O}} \newcommand{\P}{\mathbb{P}} \newcommand{\indic}[1]{\mathbb{I}\left\{#1\right\}} \]
In this chapter, we shall look at TD-learning and Q-learning with full state representations. We shall also discuss the asymptotic convergence of these algorithms.
5.1 The policy evaluation problem
Suppose we want to estimate \(J_{\pi}\) for a given policy \(\pi\). A full-state algorithm maintains an estimate \(J_{t}(i)\) for every state \(i \in \chi\) and updates this estimate incrementally using sample transitions. Such an algorithm falls under the tabular case, as it requires maintaining a lookup table for the value estimates.
The disadvantage with the above approach is that, for MDPs with a large state space, maintaining a table for all states may be computationally infeasible. For instance, in the popular board game Go, the number of states is of the order of \(10^{170}\).
To tackle the problem of a large state space, one usually employs a parametric approximation of the value function. Let \(\theta\) be the parameter, \(\phi(i)\) the feature vector corresponding to a given state \(i\), and \(n\) the number of states in the state space. A simple case is when the value function is approximated using a linear architecture, i.e., the value of state \(i\) is approximated by \[\begin{aligned} J_{i} \approx \theta^{T} \phi(i); \quad \theta \in \R^{d}. \end{aligned}\] In the approximation above, unlike the tabular case, we need to store only the parameter \(\theta\), and the value of any state can be computed on demand from its features. If \(d \ll n\), then the approximation is computationally efficient. Further, we can see from the equation above that the value function generated using linear approximation is not exact, but is ideally a close approximation. We shall look at linear function approximation in much more detail in the next chapter, while focusing on the tabular case here.
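To make the storage contrast concrete, here is a minimal sketch; the feature map \(\phi\), the dimensions, and the parameter values are illustrative assumptions rather than anything prescribed in the text.

```python
import numpy as np

n, d = 1000, 10          # number of states and feature dimension (illustrative values)

# Tabular representation: one entry per state.
J_table = np.zeros(n)

# Linear approximation: store only theta; values are computed on demand.
theta = np.zeros(d)

def phi(i):
    """A hypothetical feature map for state i (random-projection style features)."""
    rng = np.random.default_rng(i)       # deterministic features per state, for illustration
    return rng.standard_normal(d)

def J_approx(i):
    """Approximate value of state i under the linear architecture."""
    return theta @ phi(i)

print(J_table[3], J_approx(3))           # both are 0 here, but the storage costs are n vs. d
```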
Previously, when we wanted to estimate the mean \(\mu\) of a random variable \(X\), we employed the update rule below: \[\begin{aligned} r_{m+1}=r_{m}+\beta_{m}(X_{m+1}-r_{m}). \end{aligned}\] Here \(r_{m}\) is the estimate of the mean of \(X\), \(X_{m+1}\) is a sample drawn from the distribution of \(X\), and \(\beta_{m}\) is the step size. As we take more and more samples, our estimate converges to the actual mean of the random variable. Now, we will show that policy evaluation can be framed as the mean estimation problem stated above.
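As a warm-up, the following sketch runs the incremental mean-estimation rule above on i.i.d. samples; the sampling distribution and the step-size choice \(\beta_m = 1/(m+1)\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)  # i.i.d. samples of X (assumed distribution)

r = 0.0                                  # running estimate of E[X]
for m, x in enumerate(samples):
    beta = 1.0 / (m + 1)                 # step sizes with sum(beta) = inf, sum(beta^2) < inf
    r = r + beta * (x - r)               # r_{m+1} = r_m + beta_m (X_{m+1} - r_m)

print(r)                                 # close to the true mean 2.0
```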
Consider an SSP. Suppose we want to estimate \(J_{\pi}\), i.e., \[\begin{aligned} J_{\pi}(i)=\E\left(\sum_{m=0}^{\infty}g(i_{m},i_{m+1})| i_{0} = i \right). \end{aligned}\] In the above, the actions are chosen using policy \(\pi\). For notational simplicity, we have dropped the action component in the single stage cost \(g(\cdot,\cdot)\).
To estimate the value function, there are two popular approaches:
- The method of temporal differences (TD); and
- Monte Carlo policy evaluation (MCPE).
5.2 Monte Carlo policy evaluation (MCPE)
Consider an SSP with state space \(\chi = \{1,2,\ldots,n,T\}\). Fix a proper policy \(\pi\). Generate \(l\) trajectories from the SSP by simulation such that each of them ends in the terminal state \(T\). Let the states in the \(l\) trajectories be denoted by \[\begin{aligned}
\{ i^{j}_{0}, \ldots , i^{j}_{N} \}\quad \text{ for } j = 1,2,\ldots,l.
\end{aligned}\] Here \(N\) is a random variable signifying the length of a trajectory. Using this data, we want to estimate \(J_{\pi}(i)=\E\left(\sum_{m=0}^{\infty}g(i_{m},i_{m+1})\mid i_{0} = i \right)\). Let \(J'(k)\) be the total cost of the \(k^{th}\) trajectory, i.e., \[\begin{aligned}
J'(k) = g( i^{k}_{0},i^{k}_{1} ) + \cdots + g( i^{k}_{N-1},i^{k}_{N} ) \quad \text{ for } k = 1, \ldots , l.
\end{aligned}\] Assume \(i^{k}_{0} = i\ \forall\, k \in \{ 1, \ldots , l \}\). Then, we can estimate \(J_\pi(i)\) as: \[\begin{aligned}
\hat J(i)=\frac{\sum_{k=1}^{l}J'(k)}{l}.
\end{aligned}\] We can compute \(\hat J(i)\) with the initial condition \(J_{0}(i) = 0\) incrementally by employing the following update rule: \[\begin{aligned}
J_{m+1}(i)=J_{m}(i)+\beta_{m}\left(J'(m) - J_{m}(i)\right).
\end{aligned}\] Here, \(\beta_{m}\) is the step size and must satisfy the standard stochastic approximation conditions, i.e., \(\sum\beta_{m}=\infty\) and \(\sum\beta^{2}_{m}<\infty\).
We can reuse the trajectories generated for one state to update the value estimates of other states as well. Moreover, if a trajectory visits a given state multiple times, we can make use of all the resulting sub-trajectories. We shall discuss these variants later.
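The sketch below implements the incremental MCPE estimate above on a small, hypothetical SSP; the transition kernel, the unit costs, and the step sizes are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2                                          # terminal state of an illustrative SSP
P = {0: [(1, 0.5), (T, 0.5)], 1: [(T, 1.0)]}   # next-state distributions under the fixed policy
g = 1.0                                        # unit cost per transition (assumption)

def simulate(start):
    """Return the total cost of one trajectory from `start` until termination."""
    i, total = start, 0.0
    while i != T:
        nxt = [s for s, _ in P[i]]
        prob = [p for _, p in P[i]]
        i = int(rng.choice(nxt, p=prob))
        total += g
    return total

J_hat = 0.0                                    # MCPE estimate of J_pi(0), initialised to 0
for m in range(1, 5001):
    beta = 1.0 / m                             # step size
    J_hat += beta * (simulate(0) - J_hat)      # J_{m+1}(i) = J_m(i) + beta_m (J'(m) - J_m(i))

print(J_hat)                                   # approx 1.5 for this toy SSP
```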
5.3 Temporal difference learning: TD(0) variant
We can write \(J_{\pi}(i)\) as \[\begin{aligned} J_{\pi}(i)=\E\left(g(i,i')+J_{\pi}(i') \right). \end{aligned}\] In the equation above, the expectation is over the next state \(i'\) and the action is implicitly considered for the cost function \(g(\cdot,\cdot)\). Now, in order to estimate the expectation, we take a sample of the expression within the expectation and apply the update rule as shown below: \[\begin{aligned} J_{m+1}(i)=J_{m}(i) + \beta_{m}( g(i,i') + J_{m}(i') - J_{m}(i) ). \end{aligned}\] Note that \(i'\) is sampled from the transition dynamics of the MDP. Also, \(g(i,i') + J_{m}(i')\) acts as a proxy for \(\E(g(i,i') + J_{m}(i'))\). We know that \[\begin{aligned} \E(g(i,i') + J_{m}(i')) = T_{\pi}J_{m}(i). \end{aligned}\] From the equation above, we can modify the update equation to: \[\begin{aligned} J_{m+1}(i)=J_{m}(i) + \beta_{m}(T_{\pi}J_{m}(i) - J_{m}(i) + g(i,i') + J_{m}(i') - T_{\pi}J_{m}(i) ). \end{aligned}\] This is the classic TD(0) algorithm. The reason for the zero argument to TD will be clear after the description of TD(\(\lambda\)) algorithm later.
We can now draw parallels to the general stochastic fixed point iteration algorithm presented in the previous chapter, i.e., we can see that the term \(g(i,i') + J_{m}(i') - T_{\pi}J_{m}(i)\) is a zero mean random variable given \(J_m\), which resembles the noise term \(w_{m}(i)\). Therefore, the iterate \(J_m\) above can be seen to converge to the fixed point \(J_\pi\) that satisfies \(T_{\pi}J_{\pi}(i) = J_{\pi}(i)\). We shall discuss this in detail later.
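A minimal TD(0) sketch on a toy SSP is given below; the transition kernel, the unit costs, and the per-state step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2                                          # terminal state of an illustrative SSP
P = {0: [(1, 0.5), (2, 0.5)], 1: [(2, 1.0)]}   # next-state distributions under the fixed policy
g = 1.0                                        # unit cost per transition (assumption)

J = np.zeros(3)                                # tabular estimates; J[T] stays 0
counts = np.zeros(3)                           # per-state visit counts, for per-state step sizes

for _ in range(5000):                          # episodes starting from state 0
    i = 0
    while i != T:
        nxt = [s for s, _ in P[i]]
        prob = [p for _, p in P[i]]
        j = int(rng.choice(nxt, p=prob))
        counts[i] += 1
        beta = 1.0 / counts[i]
        # J_{m+1}(i) = J_m(i) + beta_m ( g(i,i') + J_m(i') - J_m(i) )
        J[i] += beta * (g + J[j] - J[i])
        i = j

print(J[:2])                                   # approx [1.5, 1.0] for this toy SSP
```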
5.3.1 Understanding MCPE through temporal differences
Given an MDP trajectory \((i_k, i_{k+1}, \ldots, i_N)\) obtained from policy \(\pi\) by simulation or real-world measurements, the MCPE update rule for the value function \(J\) would be \[\begin{aligned}
J_{m+1}(i_k) = J_m(i_k) + \beta_m(g(i_k, i_{k+1}) + g(i_{k+1}, i_{k+2}) + \ldots + g(i_{N-1}, i_N) - J_m(i_k))
\end{aligned}\]
This equation can be rewritten as \[\begin{aligned}
J_{m+1}(i_k) = J_m(i_k) + \beta_m & [ (g(i_k, i_{k+1}) + J_m(i_{k+1}) - J_m(i_k)) \\
&+ (g(i_{k+1}, i_{k+2}) + J_m(i_{k+2}) - J_m(i_{k+1})) \\
&+ (g(i_{k+2}, i_{k+3}) + J_m(i_{k+3}) - J_m(i_{k+2})) \\
&\qquad \qquad \qquad \qquad \qquad \vdots \\
&+ (g(i_{N-1}, i_{N}) + J_m(i_{N}) - J_m(i_{N-1}))] \\
\end{aligned}\] Note that in the equation above, \(J_m(i_N) = 0\) since \(i_N\) is necessarily a terminal state. If we define \(d_l = g(i_l, i_{l+1}) + J_m(i_{l+1}) - J_m(i_l)\) for each \(l=k, \ldots, N-1\), we can again rewrite the MCPE update rule as follows: \[\begin{aligned}
J_{m+1}(i_k) = J_m(i_k) + \beta_m(d_k + d_{k+1} + \ldots + d_{N-1})
\end{aligned}\] However, notice that since an observation of \(g(i_l, i_{l+1})\) becomes available only when an \(i_l \rightarrow i_{l+1}\) transition takes place in the MDP, \(d_l\) also becomes available only at that time. So in many applications, we resort to accumulating the temporal differences incrementally, i.e., performing the update \(J(i_k) \leftarrow J(i_k) + \beta_m d_l\) as each \(d_l\), \(l=k, \ldots, N-1\), becomes available.
In contrast to the MCPE update equation above, recall that the TD(0) update equation was given by \[J_{m+1}(i_k) = J_m(i_k) + \beta_m d_k.\] Notice that in the MCPE update rule, all the subsequent temporal differences, i.e., \(d_k, \ldots, d_{N-1}\), are used to compute the next estimate of \(J_\pi(i_k)\), whereas in the TD(0) update rule, only one temporal difference, \(d_k\), i.e., the one from the next step, is used to perform the update. TD(0) relies on the one-step fixed point equation for \(J_\pi\), whereas MCPE does not.
5.3.2 TD(0) vs. MCPE: A bias-variance tradeoff perspective
Consider a toy MDP in which there are states \(i\) and \(\overline{i}\) such that \(g(i, \overline{i})=1\) and, under some policy \(\pi\), \(P(i, \pi(i), \overline{i}) = 1\). Also assume that there is a terminal state \(T\) in this MDP. Suppose we want to estimate \(J_\pi(i)\). We examine how this estimate would be computed using both MCPE and TD(0). If we were to use MCPE, we would simulate multiple (say \(l\)) trajectories starting from \(i\) and collect the total cost samples into a set \(\{J'(k)\}_{k=1}^{l}\). We would then compute \[J(i)=\frac{\sum_{k=1}^{l}J'(k)}{l}\] and use it as the estimate for \(J_\pi(i)\). On the other hand, if we were to use TD(0), we would use the one-step fixed point equation to compute the estimate. The estimate we would obtain is \[J(i) = J(\overline{i})+1.\]
Notice that MCPE would never use the estimate \(J(\overline{i})\) to compute an estimate for \(J_\pi(i)\). Therefore, in general, TD(0) finds a biased estimate of \(J_\pi(i)\), the source of the bias being the bias in the estimate \(J(\overline{i})\) that it uses.
The unbiased MCPE estimate, however, may suffer from high variance, particularly when \(l\) is small. The estimate that MCPE computes is tied to the particular trajectories that happen to be sampled when the MDP is simulated. So in certain cases, it may be preferable to use TD(0) over MCPE, even though the resulting estimate is biased.
5.4 TD(\(\lambda\)): A middle-ground between TD(0) and MCPE
As we have discussed above, TD(0) arrives at its update rule from the one-step fixed point equation \[\begin{aligned}
J_\pi(i) = \E_{\overline{i}} [g(i, \overline{i}) + J_\pi(\overline{i})]
\end{aligned}\] Generalising this idea, we can think of an algorithm that uses an \((l+1)\)-step fixed point equation \[\begin{aligned}
J_\pi(i) = \E \left[\sum_{m=0}^l g(i_m, i_{m+1}) + J_\pi(i_{l+1}) | i_0 = i \right].
\end{aligned}\]
The update rule corresponding to this equation would look like \[\begin{aligned}
J_{m+1}(i_k) = J_m(i_k) + \beta_m(g(i_k, i_{k+1}) + \ldots + g(i_{k+l}, i_{k+l+1}) + J_m(i_{k+l+1}) - J_m(i_k)).
\end{aligned}\]
Notice that for \(l=0\), this amounts to the TD(0) update rule, and as \(l\rightarrow \infty\), it amounts to the MCPE update rule. For various choices of \(l\), we obtain update rules which present trade-offs between the bias and variance problems of TD(0) and MCPE, respectively.
In the next few sections, we will come up with a generalisation of TD(0) called TD(\(\lambda\)) which can be interpreted as a weighted average of all the fixed-point equations we obtain by setting various values of \(l\).
The TD(\(\lambda\)) Algorithm
Consider the following fixed point equation for a given policy \(\pi\) and state \(i\), which serves as the basis for the TD(\(\lambda\)) algorithm: \[\begin{align} J_{\pi}(i) = (1-\lambda)\mathbb{E}\left[\sum_{l=0}^\infty\lambda^l\Bigg(\sum_{m=0}^l g(i_m,i_{m+1})+J_{\pi}(i_{l+1})\Bigg)\right],\label{eq:tdl1} \end{align}\] where \(\lambda \in [0,1]\), the expectation is over trajectories starting in \(i_0 = i\) and following \(\pi\), and \((1-\lambda)\) is the normalizing factor, since \(\sum_{l=0}^{\infty}\lambda^l = \frac{1}{1-\lambda}\). Hence, equation \(\eqref{eq:tdl1}\) presents a weighted average of the \((l+1)\)-step fixed point targets over all possible values of \(l \geq 0\). The weights are given by \(\lambda^{l}(1-\lambda)\).
By splitting \(\eqref{eq:tdl1}\), we obtain \[\begin{align} J_{\pi}(i) = (1-\lambda)\mathbb{E}\left[\sum_{l=0}^\infty\lambda^l\Bigg(\sum_{m=0}^l g(i_m,i_{m+1})\Bigg)\right]+(1-\lambda)\mathbb{E}\left[\sum_{l=0}^\infty\lambda^l J_{\pi}(i_{l+1})\right].\label{eq:tdl2} \end{align}\] Further, we can interchange the summation in the first term of \(\eqref{eq:tdl2}\) and modify the second term there to obtain \[\begin{align} J_{\pi}(i) = (1-\lambda)\mathbb{E}\left[\sum_{m=0}^\infty\Bigg(\sum_{l=m}^\infty \lambda^l g(i_m,i_{m+1})\Bigg)\right]+\mathbb{E}\left[\sum_{l=0}^\infty(\lambda^l-\lambda^{l+1}) J_{\pi}(i_{l+1})\right].\label{eq:tdl3} \end{align}\] Since \((1-\lambda)\sum_{l=m}^{\infty}\lambda^l = \lambda^m\), we can simplify the first term of \(\eqref{eq:tdl3}\) as follows: \[\begin{align} (1-\lambda)\mathbb{E}\left[\sum_{m=0}^\infty\Bigg(\sum_{l=m}^\infty \lambda^l g(i_m,i_{m+1})\Bigg)\right] = \mathbb{E}\left[\sum_{m=0}^\infty\Bigg(\lambda^m g(i_m,i_{m+1})\Bigg)\right].\label{eq:tdl-first} \end{align}\] We now simplify the second term in \(\eqref{eq:tdl3}\) as follows: \[\begin{align} & \mathbb{E}\left[\sum_{l=0}^\infty(\lambda^l-\lambda^{l+1}) J_{\pi}(i_{l+1})\right] \nonumber\\ &= \mathbb{E}\left[(1-\lambda)J_{\pi}(i_1) + (\lambda-\lambda^2)J_{\pi}(i_2) + (\lambda^2-\lambda^3)J_{\pi}(i_3) + \cdots \right] \nonumber\\ &= \mathbb{E}\left[(J_{\pi}(i_1)-J_{\pi}(i)) + \lambda(J_{\pi}(i_2)-J_{\pi}(i_1))+\lambda^2(J_{\pi}(i_3)-J_{\pi}(i_2))+ \cdots \right] + J_{\pi}(i)\nonumber \\ &= \mathbb{E}\left[\sum_{m=0}^\infty\lambda^m( J_{\pi}(i_{m+1}) - J_{\pi}(i_{m})) \right] + J_{\pi}(i).\label{eq:tdl-second} \end{align}\] Combining \(\eqref{eq:tdl-first}\) and \(\eqref{eq:tdl-second}\), we have \[\begin{align} J_{\pi}(i) = \mathbb{E}\left[\sum_{m=0}^{\infty}\lambda^m\bigg(g(i_m,i_{m+1})+J_{\pi}(i_{m+1}) - J_{\pi}(i_m)\bigg)\right] + J_{\pi}(i).\label{eq:six} \end{align}\] Recall the temporal difference term is \(d_m = g(i_m,i_{m+1})+J_{\pi}(i_{m+1}) - J_{\pi}(i_m).\) Hence, \(\eqref{eq:six}\) can be written as \[\begin{align} J_{\pi}(i) = \mathbb{E}\left[\sum_{m=0}^{\infty}\lambda^md_m\right] + J_{\pi}(i) \label{eq:tdl-fixedpt} \end{align}\] The first term of \(\eqref{eq:tdl-fixedpt}\) is indeed a finite sum with a random number of terms \(N\), with \(i_N=T\). In other words, \(d_m=0\ \forall\ m \geq N\). Further, note that \(\mathbb{E}[d_m] = 0\ \forall\ m\), which makes \(\eqref{eq:tdl-fixedpt}\) valid.
Now, the goal is to solve \(\eqref{eq:tdl-fixedpt}\) using a stochastic approximation algorithm. For simplicity, let \(J_{\pi} = J\), and let \(J_t(i)\) be the cost estimate at the \(t^{th}\) iteration for state \(i\). With a step size \(\beta_t\), we get the following iterative update rule: \[\begin{align} J_{t+1}(i) = J_{t}(i) + \beta_t\sum_{m=0}^{\infty}\lambda^m d_m, \label{eq:tdl-update-gen} \end{align}\] where the temporal differences \(d_m\) are now computed using the current estimate \(J_t\). \(\eqref{eq:tdl-update-gen}\) presents a family of algorithms called TD(\(\lambda\)), one for each choice of \(\lambda\).
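A minimal sketch of the update \(\eqref{eq:tdl-update-gen}\) applied to a single terminated trajectory is given below; the trajectory, the costs, and the parameter values are illustrative assumptions.

```python
def td_lambda_update(J, traj, costs, beta, lam, terminal):
    """One TD(lambda) update of J(i_0) from a single terminated trajectory.

    traj  : list of states (i_0, ..., i_N), with i_N == terminal
    costs : list of single-stage costs (g(i_0,i_1), ..., g(i_{N-1},i_N))
    """
    J = dict(J)                                # work on a copy
    i0 = traj[0]
    total = 0.0
    for m in range(len(traj) - 1):
        i, j = traj[m], traj[m + 1]
        J_next = 0.0 if j == terminal else J[j]
        d_m = costs[m] + J_next - J[i]         # temporal difference d_m
        total += (lam ** m) * d_m              # lambda^m d_m
    J[i0] = J[i0] + beta * total               # J_{t+1}(i_0) = J_t(i_0) + beta_t sum_m lambda^m d_m
    return J

# Example usage on the trajectory (1, 2, 1, T) with unit costs (illustrative).
J0 = {1: 0.0, 2: 0.0}
print(td_lambda_update(J0, [1, 2, 1, 'T'], [1.0, 1.0, 1.0], beta=0.1, lam=0.5, terminal='T'))
```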
Remarks
Let us consider three cases for the possible values of \(\lambda\):
Case 1: \(\lambda=0\):
If \(\lambda=0\) and the initial state is \(i_0\), \(\eqref{eq:tdl-update-gen}\) becomes the TD(\(0\)) update rule, i.e., \[J_{t+1}(i_0) = J_{t}(i_0) + \beta_td_0 = J_{t}(i_0) + \beta_t(g(i_0,i_{1})+J_t(i_{1}) - J_{t}(i_0)).\] Case 2: \(\lambda=1\):
If \(\lambda=1\) and the initial state is \(i_0\), \(\eqref{eq:tdl-update-gen}\) becomes the TD(\(1\)) update rule, i.e., \[J_{t+1}(i_0) = J_{t}(i_0) + \beta_t\sum_{m=0}^{\infty}d_m = J_{t}(i_0) + \beta_t(d_0+d_1+\cdots+d_{N-1}),\] which coincides with the MCPE update written in terms of temporal differences. Case 3: \(0<\lambda<1\):
A value of \(0<\lambda<1\) discounts the effect of the temporal differences of states in the far future on the cost estimate of the current state. In other words, the temporal difference \(d_m\) is weighted by \(\lambda^m\), thereby making future temporal differences less important while updating the cost estimate of the current state; the effect of \(d_0\) is the most prominent when compared to \(d_m,\ m\geq 1\).
NOTE: \(\lambda\) is not to be confused with the discount factor in the discounted MDP setting. \(\lambda\) just weighs the temporal differences.
5.5 Variants of TD(\(\lambda\))
Every Visit and First Visit
Assume we have a trajectory \((i_0,i_1,\cdots,i_N)\), in which a given state \(i\) is visited more than once. Let state \(i\) be visited \(M\) times in the trajectory, and let \((m_1,m_2,\cdots,m_M)\) be the time instants when state \(i\) is visited. We now describe two variants of the TD algorithm for a given state \(i\), namely Every Visit and First Visit.
Every Visit TD(\(\lambda\)):
This case would consider all visits (sub-trajectories) of state \(i\), i.e., \((i_{m_1},\cdots,i_N),(i_{m_2},\cdots,i_N),\cdots,(i_{m_M},\cdots,i_N)\) to update \(J(i)\). Hence, the iterative update equation for this case becomes, \[J_{t+1}(i) = J_{t}(i) + \beta_t\sum_{j=1}^M\sum_{m=m_j}^{\infty}\lambda^{m-m_j}d_m\] The above equation considers every visit of state \(i\) as a starting point, and therefore the term \(\lambda^{m-m_j}\) is justified. Analogously, this can also be shown by considering the fixed point equation for some state \(i_k\) instead of \(i_0\), \[J_{\pi}(i_k) = \mathbb{E}\left[\sum_{m=k}^{\infty}\lambda^{m-k}d_m\right] + J_{\pi}(i_k)\]
First Visit TD(\(\lambda\)):
Unlike the Every Visit case, the First Visit variant of the TD(\(\lambda\)) algorithm considers only the first visit of a given state \(i\) for updating the cost for that state. The update rule for first visit TD(\(\lambda\)) is given by \[J_{t+1}(i) = J_{t}(i) + \beta_t\sum_{m=m_1}^{\infty}\lambda^{m-m_1 }d_m.\] It can be shown that both variants of TD(\(\lambda\)) converge to the true value using arguments commonly used in establishing the Strong Law of Large Numbers.
Offline and Online TD(\(\lambda\))
Offline TD(\(\lambda\)):
The offline method involves simulating the entire trajectory (\(i_0,i_1,i_2,\ldots,i_N\)) and then updating the cost (or value) estimates at the end, i.e., once all the temporal differences \(d_0,d_1,\ldots,d_{N-1}\) are available. Hence, the update rule for the offline algorithm is the every-visit TD(\(\lambda\)) update given earlier, restated below:
\[J_{t+1}(i) = J_t(i) + \beta_t \sum_{j = 1}^{M} \sum_{m = m_j}^{\infty} \lambda^{m - m_j} d_m.\]
Online TD(\(\lambda\)):
The online method, on the other hand, updates the cost estimates in real time after each transition, i.e., as soon as a single temporal difference becomes available.
The online TD(\(\lambda\)) updates, using a constant step size \(\beta\) and initial state \(i_0\), are shown below.
For the (\(i_0, i_1\)) transition, \[J_1(i_0) = J_0(i_0) + \beta d_0.\] Likewise, for the (\(i_1, i_2\)) transition,
\[\begin{gathered}
J_2(i_0) = J_1(i_0) + \beta \lambda d_1, \nonumber\\
J_2(i_1) = J_1(i_1) + \beta d_1.
\end{gathered}\] In general, after \(k+1\) steps in the trajectory, for the (\(i_k, i_{k+1}\)) transition, \[\begin{gathered}
J_{t+1}(i_0) = J_t(i_0) + \beta \lambda^k d_k,\\
J_{t+1}(i_1) = J_t(i_1) + \beta \lambda^{k-1} d_k,\\
\;\;\vdots\\
J_{t+1}(i_k) = J_t(i_k) + \beta d_k.
\end{gathered}\] Note that the cost estimates obtained from Offline TD(\(\lambda\)) and Online TD(\(\lambda\)) can be different if a state \(i\) is visited multiple times.
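The online updates above can be implemented by maintaining, for every state, a trace that accumulates the appropriate powers of \(\lambda\). The sketch below does exactly this for one trajectory; the constant step size and the unit costs are illustrative assumptions.

```python
def online_td_lambda(trajectory, costs, J, beta, lam, terminal):
    """Online (every-visit) TD(lambda) over one trajectory, using accumulated lambda-powers.

    trajectory : (i_0, ..., i_N) with i_N == terminal
    costs      : (g(i_0,i_1), ..., g(i_{N-1},i_N))
    J          : dict of current value estimates (the terminal state has value 0)
    """
    z = {s: 0.0 for s in J}                      # trace: accumulated powers of lambda per state
    for k in range(len(trajectory) - 1):
        i, j = trajectory[k], trajectory[k + 1]
        for s in z:
            z[s] *= lam                          # decay all traces by lambda
        z[i] += 1.0                              # current state gets weight lambda^0 = 1
        J_next = 0.0 if j == terminal else J[j]
        d_k = costs[k] + J_next - J[i]           # temporal difference for this transition
        for s in J:
            J[s] += beta * z[s] * d_k            # J(i_m) += beta * lambda^{k-m} * d_k
    return J

# Runs the online update pattern on the trajectory (1, 2, 1, T) with unit costs (illustrative).
J = {1: 0.0, 2: 0.0}
print(online_td_lambda([1, 2, 1, 'T'], [1.0, 1.0, 1.0], J, beta=0.1, lam=0.5, terminal='T'))
```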
Example 5.1 We illustrate the updates of offline and online TD(\(\lambda\)) for a trajectory given by (1,2,1,T). Let \(J_{0}(1)\), \(J_{0}(2)\) be the initial values and \(J_{0}(T) = 0\).
Offline TD(\(\lambda\)): Denote estimates by \(J_{f}(1)\) and \(J_{f}(2)\).
State 1’s update (done in the every-visit style): We would use both (1,2,1,T) and (1,T) to update \[J_f(1) = J_0(1) + \beta((d_0 + \lambda d_1 + \lambda^2 d_2) + (d_2)).\]
Expanding temporal differences and using \(J_{0}(T) = 0\), \[\begin{aligned} J_f(1) = J_0(1) + \beta((g(1,2) + J_0(2) - J_0(1)) + \lambda (g(2,1) + J_0(1) - J_0(2)) \\ + \lambda^2 (g(1,T) - J_0(1)) + g(1,T) - J_0(1)). \end{aligned}\]
State 2’s update: We would use \((2,1,T)\) to update \[\begin{aligned} J_f(2) = J_0(2) + \beta((g(2,1) + J_0(1) - J_0(2)) + \lambda (g(1,T) - J_0(1))). \end{aligned}\]
Online TD(\(\lambda\)) (Every-visit style):
On the first transition \((1,2)\): \[J_1(1) = J_0(1) + \beta(g(1,2) + J_0(2) - J_0(1))\] \[J_1(2) = J_0(2)\]
On the second transition \((2,1)\): \[J_2(1) = J_1(1) + \beta \lambda(g(2,1) + J_1(1) - J_1(2))\] \[J_2(2) = J_1(2) + \beta(g(2,1) + J_1(1) - J_1(2))\]
On the last transition \((1,T)\): \[J_3(1) = J_2(1) + \beta (\lambda^2(g(1,T) - J_2(1)) + (g(1,T) - J_2(1))),\] \[J_3(2) = J_2(2) + \beta \lambda(g(1,T) - J_2(1)).\] Now, \(J_3(1)\) and \(J_3(2)\) are the online TD(\(\lambda\)) estimates. Comparing the online TD estimates \(J_3(1)\) and \(J_3(2)\) with their offline counterparts \(J_f(1)\) and \(J_f(2)\), we observe that if we replace \(J_1\) and \(J_2\) by \(J_0\) in the online TD update rules, then \((J_3(1),J_3(2))\) = \((J_f(1),J_f(2))\). On the other hand, with the regular online TD update, we have the following observation:
The difference between \(J_1\) and \(J_0\) is \(O(\beta)\), and likewise the differences between \(J_2\), \(J_3\) and \(J_0\) are \(O(\beta)\). Hence, replacing \(J_1\) and \(J_2\) by \(J_0\) inside the online temporal differences perturbs each update term by \(O(\beta^2)\), so the online and offline estimates differ only by \(O(\beta^2)\) terms.
Thus, with a small step size \(\beta\), say \(0< \beta < 1\), the offline and online TD(\(\lambda\)) iterates remain close to each other as the updates proceed.
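The closeness claim can be checked numerically. The sketch below evaluates the offline and online formulas of Example 5.1 for hypothetical costs and initial values, and shows that the gap between the two estimates shrinks roughly like \(\beta^2\).

```python
# Offline vs. online TD(lambda) estimates from Example 5.1, for the trajectory (1, 2, 1, T).
# The costs g12, g21, g1T and the initial values J0 are hypothetical numbers for illustration.
g12, g21, g1T = 1.0, 2.0, 3.0
J0 = {1: 0.5, 2: -0.5}
lam = 0.7

def offline(beta):
    d0 = g12 + J0[2] - J0[1]
    d1 = g21 + J0[1] - J0[2]
    d2 = g1T - J0[1]
    Jf1 = J0[1] + beta * ((d0 + lam * d1 + lam**2 * d2) + d2)
    Jf2 = J0[2] + beta * (d1 + lam * d2)
    return Jf1, Jf2

def online(beta):
    J1_1 = J0[1] + beta * (g12 + J0[2] - J0[1]); J1_2 = J0[2]
    d1 = g21 + J1_1 - J1_2
    J2_1 = J1_1 + beta * lam * d1; J2_2 = J1_2 + beta * d1
    d2 = g1T - J2_1
    J3_1 = J2_1 + beta * (lam**2 * d2 + d2); J3_2 = J2_2 + beta * lam * d2
    return J3_1, J3_2

for beta in (0.1, 0.01, 0.001):
    gap = max(abs(a - b) for a, b in zip(offline(beta), online(beta)))
    print(beta, gap)    # the gap shrinks roughly like beta**2
```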
5.6 Convergence of TD(\(\lambda\)): A high-level sketch
Recall the stochastic fixed point iteration update is given by \[\begin{align} J_{t+1} = J_t + \beta_t (HJ_t + w_t -J_t) \label{eq:First} \end{align}\]
If
(i) \(H\) is a contraction,
(ii) \(\sum_{t} \beta_t = \infty\) and \(\sum_{t} \beta_t^2 < \infty\), and
(iii) \(\E\left(w_t | \mathcal{F}_t\right) = 0\) and \(\E\left(w_t^2 | \mathcal{F}_t\right) \leq A + B{||J_t||}^2\),
then \[\begin{align} J_t \rightarrow J^* \mbox{ a.s. as } t\rightarrow \infty, \text{ where } HJ^*= J^*. \end{align}\] We want to show that TD(\(\lambda\)) satisfies the three conditions listed above. Condition (ii) requires no proof, since the step sizes can be chosen to satisfy it; one such choice is \(\beta_t = \frac{1}{t}\). Next, we skip the verification of condition (iii) for TD(\(\lambda\)); a similar check will be done in the discussion of the convergence of Q-learning in a later section. We shall verify condition (i) for TD(\(\lambda\))’s underlying mapping.
Recall, we are performing policy evaluation using TD(\(\lambda\)) for a proper policy \(\pi\), and the fixed point equation underlying TD(\(\lambda\)) is given by \[\begin{align} J_\pi(i) = (1 - \lambda)E\left[\sum_{l=0}^{\infty}\lambda^l\left(\sum_{m=0}^{l}g(i_m,i_{m+1}) + J_\pi(i_{l+1})\right)\right], \end{align}\] where \(\lambda \in [0,1]\).
In vector-matrix notation, we can rewrite the above equation as follows: \[\begin{align} J_\pi = G + (1-\lambda)\left[\sum_{l=0}^{\infty}\lambda^lP^{l+1}J_\pi\right], \end{align}\] where \(P\) is the transition probability matrix of the Markov chain underlying the policy \(\pi\), and \(G\) is a vector related to the single stage costs, whose \(i^{th}\) component is \[G(i) = (1-\lambda)\sum_{l=0}^{\infty}\lambda^l\,\E\left[\sum_{m=0}^{l}g(i_m,i_{m+1})\,\middle|\, i_0 = i\right].\] For a proper policy \(\pi\), we showed earlier that there exists a positive vector \(\xi\) and some \(\rho \in (0,1)\) such that \[\begin{align} \sum_{j=1}^{n}P_{ij}(\pi(i))\xi(j) \leq \xi(i)-1 \leq \rho\xi(i), \end{align}\] which in turn implies \[{||PJ||}_\xi \leq \rho{||J||}_\xi,\] where \({||\cdot||}_\xi\) is the weighted max-norm defined earlier.
Define \[HJ = G + (1-\lambda)\left[\sum_{l=0}^{\infty}\lambda^lP^{l+1}J\right].\] We want \[\begin{align} {||HJ - HJ'||}_\xi \leq \rho{||J - J'||}_\xi. \end{align}\] If the above relation holds, then the mapping underlying TD(\(\lambda\)) is a contraction and condition (i), mentioned above for stochastic fixed point iterations, holds. Notice that \[\begin{split} {||HJ - HJ'||}_\xi & = {\left\Vert\left(G + (1-\lambda)\left[\sum_{l=0}^{\infty}\lambda^lP^{l+1}J\right]\right) - \left(G + (1-\lambda)\left[\sum_{l=0}^{\infty}\lambda^lP^{l+1}J'\right]\right)\right\Vert}_\xi\\ &\leq (1-\lambda)\sum_{l=0}^{\infty}\lambda^l{||P^{l+1}(J-J')||}_\xi\\ &\leq (1-\lambda)\sum_{l=0}^{\infty}\lambda^l\rho{||(J-J')||}_\xi\\ &= \rho{||(J-J')||}_\xi. \end{split}\] In arriving at the second inequality above, we used \[{||P^{l+1}(J-J')||}_\xi \leq {\rho^{l+1}||(J-J')||}_\xi \leq {\rho||(J-J')||}_\xi.\] Thus, we have \({||HJ - HJ'||}_\xi \leq \rho{||J - J'||}_\xi\), implying contractivity of the operator underlying TD(\(\lambda\)).
Granting that condition (iii) holds for TD(\(\lambda\)), we can claim that \(J_t \rightarrow J_\pi\) a.s. as \(t \rightarrow \infty\), where \(J_t\) is the TD(\(\lambda\)) iterate and \(J_\pi\) the value function of policy \(\pi\).
5.7 Q-Learning
Recall from an earlier chapter that the Q-factors are given by \[\begin{align} Q^*(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + J^*(j)), \end{align}\] where \(J^*(j)\) is the optimal cost starting from state \(j\). Note that we are not considering a discount factor here. Intuitively, \(Q^*(i,a)\) is the expected cumulative cost obtained by taking action \(a\) in state \(i\) and then following an optimal policy from the next state onwards.
In terms of Q-factors, the Bellman equation \(J^* = T J^*\) can be written as \[\begin{align} J^*(i) = \min_{a} Q^*(i,a).\label{eq:jstarqstar} \end{align}\] Combining the above two equations gives us the Q-Bellman equation, which is \[\begin{align} Q^*(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + \min_{b}Q^*(j,b)). \label{eq:qbellman} \end{align}\] Note that \(Q^*\) is the unique solution of \(\eqref{eq:qbellman}\). This can be argued as follows: Suppose that \(Q\) is another solution of \(\eqref{eq:qbellman}\), i.e., \[Q(i,a)= \sum_{j} p_{ij}(a)(g(i,a,j) + \min_{b}Q(j,b)),\,\, \forall i,a.\] Taking the minimum over \(a\) on both sides shows that \(\min_{a}Q(i,a)\), viewed as a function of \(i\), satisfies the Bellman equation \(J = TJ\). Since \(J^*\) is the unique solution of the Bellman equation, we have \[J^*(i) = \min_{a} Q(i,a).\] Substituting this back into the equation satisfied by \(Q\), we obtain \[Q(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + J^*(j)) = Q^*(i,a).\] Thus, \(Q = Q^*\), implying that \(Q^*\) is the unique solution of the Q-Bellman equation \(\eqref{eq:qbellman}\).
We note a few points about the uniqueness proof above.
The argument did not require any contraction property of the underlying operator.
Observing that \(J^* = T J^*\) has a unique solution, we can cast the Q-Bellman equation in a similar form: \(Q^* = H Q^*\), where \(H\) is the operator underlying the Q-Bellman equation. We shall prove later that \(H\) is a contraction.
5.7.1 Value Iteration using Q-factors
The value iteration algorithm using Q-factors for an SSP can be written as follows: \[\begin{aligned} Q_{t+1}(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + \min_{b}Q_{t}(j,b)). \end{aligned}\] This can equivalently be written using the underlying operator \(H\) of the Q-Bellman equation as follows: \[\begin{align} Q_{t+1} = H Q_{t}.\label{eq:viq} \end{align}\] A smoothed variation of \(\eqref{eq:viq}\) is \[\begin{align} Q_{t+1}(i,a) = (1-\beta)Q_{t}(i,a) \space + \space \beta \sum_{j} p_{ij}(a)(g(i,a,j) + \min_{b}Q_{t}(j,b)). \label{eq:q-vi} \end{align}\] We observe that to perform the update in \(\eqref{eq:q-vi}\), we need to know the underlying transition dynamics. However, in a typical RL setting, we do not have this information. A stochastic fixed point iteration version of \(\eqref{eq:q-vi}\), popularly known as Q-learning, performs the following update: \[\begin{align} Q_{t+1}(i,a) = (1-\beta)Q_{t}(i,a) \space + \space \beta (g(i,a,\overline{i}) + \min_{b}Q_{t}(\overline{i},b)), \end{align}\] where \(\overline{i}\) is sampled from \(P_{ij}(a)\).
The step-size \(\beta\) could be time-dependent, i.e., \(\beta_{t}\). For asymptotic convergence, these step sizes need to satisfy the standard stochastic approximation conditions, i.e., \(\sum \beta_{t} = \infty\) and \(\sum \beta_{t}^2 < \infty\).
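A minimal sketch of tabular Q-learning on a small SSP is given below; the environment (states, actions, transition probabilities, and costs), the \(\epsilon\)-greedy action selection, and the step-size schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2                                         # terminal state of an illustrative SSP
P = {                                         # (state, action) -> [(next_state, prob), ...]
    (0, 0): [(1, 1.0)],
    (0, 1): [(T, 0.5), (0, 0.5)],
    (1, 0): [(T, 1.0)],
    (1, 1): [(T, 0.9), (1, 0.1)],
}
cost = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 0.5}   # g(i,a,j) = cost[i,a] (assumption)

Q = np.zeros((2, 2))                          # tabular Q-factors; Q at the terminal state is 0
counts = np.zeros((2, 2))
eps = 0.1                                     # epsilon-greedy exploration (cf. the exploration section below)

for _ in range(20000):
    i = int(rng.integers(2))                  # start each episode in a random non-terminal state
    while i != T:
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmin(Q[i]))
        nxt, prob = zip(*P[(i, a)])
        j = int(rng.choice(nxt, p=prob))
        counts[i, a] += 1
        beta = 1.0 / counts[i, a]             # per-pair step size satisfying the conditions above
        target = cost[(i, a)] + (0.0 if j == T else np.min(Q[j]))
        # Q_{t+1}(i,a) = (1 - beta) Q_t(i,a) + beta (g(i,a,j') + min_b Q_t(j',b))
        Q[i, a] = (1 - beta) * Q[i, a] + beta * target
        i = j

print(Q)   # approximately [[1.56, 2.78], [1.0, 0.56]] for this toy SSP
```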
Remarks:
In principle, Q-learning is similar to TD(0). Both are stochastic fixed point iterations that replace an expectation by a sample: TD(0) does so for the policy evaluation operator \(T_\pi\), while Q-learning does so for the operator \(H\) underlying value iteration.
There is no straightforward variation of Q-learning that is in the spirit of TD(\(\lambda\)). In particular, it is hard to concoct a meaningful multi-step Q-Bellman equation that improves learning: in the case of TD(\(\lambda\)), the policy is fixed, while with Q-learning it is not.
5.7.2 Why use Q-factors for finding the optimal policy?
Why do we need to take the route of Q-factors for finding an optimal policy? In policy evaluation, we used \(J_\pi\), which is a function of the state alone. So why do we need the Q-factors \(Q(i,a)\) and the Q-Bellman equation? Or, why not use a stochastic approximation scheme to find a \(J^*\) that satisfies \(J^* = TJ^*\)?
The answer to this question lies in the formulation underlying stochastic approximation algorithms. In general, a stochastic approximation algorithm attempts to find \(\mu = \E(X)\) by taking samples of \(X\) and performing an iterative update. The Q-Bellman equation is of the form \[Q^{*}(i,a) = \E_{j}\left[g(i,a,j)+\alpha \min_{b}Q^{*}(j,b)\right].\] We can sample \(\left[g(i,a,j)+\alpha \min_{b}Q^{*}(j,b)\right]\) and perform an iterative update. On the other hand, the regular Bellman equation is \[J^{*}(i) = \min_{a \in A}\E_{j}\left[g(i,a,j)+\alpha J^{*}(j)\right].\]
Since the minimum function above is outside the expectation, a stochastic approximation is not possible by just replacing the expectation in the Bellman equation with a sample.
5.8 Convergence analysis of Q-learning
Recall that the Q-learning update rule is given by \[\begin{align} Q_{t+1}(i,a) = (1-\beta_t)Q_{t}(i,a) + \beta_t (g(i,a,\overline{i}) + \min_{b}Q_{t}(\overline{i},b)), \label{eq:qlearning-update} \end{align}\] where \(\overline{i}\) is sampled from \(P_{ij}(a)\).
For the convergence analysis, we make the following assumptions:
(A1) \(\sum \beta_{t} = \infty\) and \(\sum \beta_{t}^2 < \infty\).
(A2) All policies are proper.
A compact version of the Q-learning update is as follows: \[\begin{aligned} Q_{t+1}(i,a) = (1-\beta_t)Q_{t}(i,a) + \beta_t\big((H Q_{t})(i,a) + w_t(i,a)\big), \end{aligned}\] where \[HQ(i,a) = \sum_{j} P_{i,j}(a)(g(i,a,j) + \min_{b}Q(j,b)) \quad \forall i,a,\] and \[w_t(i,a) = g(i,a,\overline{i}) + \min_{b}Q_t(\overline{i},b) - (HQ_t)(i,a).\]
Theorem 5.1 Suppose (A1) and (A2) hold and all state-action pairs are updated infinitely often during Q-learning. Then, \[Q_{t} \rightarrow Q^* \text{ a.s. as } t\rightarrow\infty,\] where \(Q^*\) is the fixed point of \(H\).
To prove the main claim, we establish the following properties:
(B1) H is a contraction
(B2) \(E[w_{t}|F_t] = 0\) and \(E[w_{t}^2|F_t] \le A + B||Q_t||^2\), where \(F_t = \sigma(Q_0,Q_1,\ldots,Q_t,w_0,w_1,\ldots,w_{t-1})\).
(B3) \(\sum \beta_{t} = \infty\) and \(\sum \beta_{t}^2 < \infty\).
We first verify (B2).
Let \(w_t(i,a) = Y_t - E[Y_t|F_t]\), where \(Y_t = g(i,a,\overline{i}) + \min_{b} Q_t(\overline{i},b)\); note that \(E[Y_t|F_t] = (HQ_t)(i,a)\). It is easy to see that \(E[w_t(i,a)|F_t] = 0\) and \[\begin{align} E[{w_t}^2(i,a)|F_t] &= E[(Y_t - E[Y_t|F_t])^2 | F_t] \le E[{Y_t}^2|F_t]. \end{align}\]
Since the single stage cost \(g(\cdot,\cdot,\cdot)\) is bounded, we have \[E[Y_t^2|F_t] = E[(g(i,a,\overline{i}) + \min_{b} Q_t(\overline{i},b))^2|F_t]\le \kappa (1 + \max_{j,b} {Q_t}^2(j,b)),\] for some \(\kappa>0\). Thus, \[ E[{w_t}^2(i,a)|F_t] \le \kappa (1 + \max_{j,b} {Q_t}^2(j,b)),\] and assumption (B2) holds.
(B3) is satisfied if we choose \(\beta_t\) carefully.
We now turn our attention to verifying (B1). Since all policies are proper, from a result in an earlier chapter, there exists a positive vector \(\xi\) and a scalar \(\rho \in (0,1)\) such that \[\begin{align} \sum_{j} P_{i,j}(a)\xi(j) \le \rho\xi(i) \quad \forall\, i,a. \label{eq:FactA} \end{align}\]
Recall that the weighted max-norm is given by \[\left\| Q \right\|_{\xi} = \max_{i,a} \frac{|Q(i,a)|}{\xi(i)}.\]
To prove B1, we need to show that \(\left\| HQ - HQ' \right\|_{\xi} \le \rho \left\| Q - Q' \right\|_{\xi}\), for some \(\rho \in (0,1)\).
\[\begin{align} |(HQ)(i,a) - (HQ')(i,a)| &\le \sum_{j} P_{i,j}(a)| \min_{b} Q(j,b) - \min_b Q'(j,b)|\\ &\le \sum_{j} P_{i,j}(a) \max_b |Q(j,b) - Q'(j,b)|\\ &= \sum_{j} P_{i,j}(a) \left(\max_b \frac{|Q(j,b) - Q'(j,b)|}{\xi(j)}\right) \xi(j)\\ &\le \sum_{j} P_{i,j}(a) \left\| Q - Q' \right\|_{\xi} \xi(j)\\ &\le \left\| Q - Q' \right\|_{\xi} \rho \xi(i), \end{align}\] where the final inequality follows from \(\eqref{eq:FactA}\).
So, we have \[|(HQ)(i,a) - (HQ')(i,a)| \le \left\| Q - Q' \right\|_{\xi} \rho \xi(i)\]
\[\frac{|(HQ)(i,a) - (HQ')(i,a)|}{\xi(i)} \le \rho \left\| Q - Q' \right\|_{\xi}\]
\[\max_{i,a} \frac{|(HQ)(i,a) - (HQ')(i,a)|}{\xi(i)} \le \rho \left\| Q - Q' \right\|_{\xi}\]
\[\implies \left\| HQ - HQ' \right\|_{\xi} \le \rho \left\| Q - Q' \right\|_{\xi}\]
Thus, (B1) holds. The main claim now follows from Theorem 4.3.
Remark A variation can be considered where all policies are not necessarily proper. Instead, we have
(A1’) \(\exists\) a proper policy, and all improper policies have infinite cost.
In this case, \(Q^*\) is the unique solution of the Q-Bellman equation. However, the more important question is whether the Q-learning algorithm still converges to \(Q^*\) in this case.
Under (A1’), it is easy to see that \(H\) is a monotone mapping, i.e., \(Q \le Q' \implies HQ \le HQ'\). Using this fact and Theorem 4.4 from the previous chapter, it is possible to infer the convergence of Q-learning; we omit the details.
As an aside, a sufficient condition for the boundedness of the iterates was stated in Theorem 4.5 of the previous chapter. This condition can be verified for Q-learning, in turn implying the boundedness of the iterates \(\{Q_t\}\).
5.9 The problem of exploration in Q-learning
In MCPE, we want to estimate \[\E\left(\sum_{k=0}^{\infty} \alpha^{k} g(i_k,\pi(i_k),i_{k+1})\,\middle|\, i_0 = i\right),\] where the trajectories are sampled using \(\pi\). If \(\hat{J}(i)\) denotes the estimate of the value function (the expected cost), then \(\hat{J}(i)\rightarrow J_{\pi}(i)\) provided one sees enough trajectories starting from state \(i\). A similar argument also applies to TD and Q-learning. With TD, for \(J_{t}(i)\rightarrow J_{\pi}(i)\) as \(t\rightarrow\infty\), we need to visit \(i\) infinitely often in the trajectories. But with TD, the actions are sampled using a fixed policy \(\pi\). On the other hand, with Q-learning, the algorithm is free to choose the action \(a\) in each state. For \(Q_{t}(i,a)\rightarrow Q^*(i,a)\) as \(t\rightarrow\infty\), an important requirement is that all pairs \((i,a)\) are visited infinitely often. This is the issue of exploration. Recall the Q-learning update: \[Q_{t+1}(i,a) = (1-\beta_{t})Q_{t}(i,a) + \beta_{t}\left[g(i,a,\bar{i})+\alpha \min_{b}Q_{t}(\bar{i},b)\right].\]
The different methodologies to choose actions are discussed below:
Option (1) - Greedy:
In state \(i\), choose \(\argmin_{a}Q_{t}(i,a)\) at time instant \(t\).
If \(Q_{t} \approx Q^{*}\), then this choice makes perfect sense, since the greedy action is nearly optimal. However, if \(Q_{t}\) is not close to \(Q^*\), then the greedy action may be a bad choice. Hence the need to explore, so as to ensure that each entry of the Q-table is updated sufficiently often, which in turn ensures that Q-learning converges.
Option (2) - \(\epsilon\)-Greedy:
Fix an \(\epsilon > 0\). Usually, \(\epsilon\) is a small number. We choose the greedy action with probability \(1-\epsilon\) and pick an action uniformly at random with probability \(\epsilon\).
An alternative approach is to make \(\epsilon\) a function of \(t\), say \(\epsilon_{t}\), such that \(\epsilon_{t} \to 0\) as \(t \to \infty\). This approach amounts to reducing the exploration as the algorithm progresses.
Option (3) - Softmax:
In state \(i\) at time \(t\), action \(a\) is chosen with probability \[\frac{e^{-Q_{t}(i,a)/\tau}}{\sum_{b}e^{-Q_{t}(i,b)/\tau}},\] where \(\tau\) is a temperature parameter that controls the exploration. If \(\tau\) is very small, then the choice is nearly greedy, while a large \(\tau\) makes the choice close to uniform.
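The three options can be sketched as follows; the Q-values and parameter settings are hypothetical, and the softmax rule follows the cost-minimisation convention above, so lower Q-values receive higher probability.

```python
import numpy as np

rng = np.random.default_rng(0)
Q_row = np.array([1.2, 0.8, 2.5])             # hypothetical Q_t(i, .) for some state i (costs)

def greedy(q):
    return int(np.argmin(q))                  # Option (1): action with the smallest cost estimate

def epsilon_greedy(q, eps=0.1):
    if rng.random() < eps:                    # Option (2): explore with probability eps
        return int(rng.integers(len(q)))
    return greedy(q)

def softmax(q, tau=1.0):
    # Option (3): P(a) proportional to exp(-Q(i,a)/tau); small tau -> nearly greedy
    z = np.exp(-(q - q.min()) / tau)          # subtract the min for numerical stability
    p = z / z.sum()
    return int(rng.choice(len(q), p=p))

print(greedy(Q_row), epsilon_greedy(Q_row), softmax(Q_row, tau=0.5))
```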
5.10 Extension of Q-learning to discounted MDPs
5.10.1 Q-factors
The Q-Bellman equation in the discounted setting can be written as: \[\begin{align} Q^*(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + \alpha \min_{b}Q^*(j,b)), \label{eq:qstar-infinite} \end{align}\] where \(\alpha \in (0,1)\) is the discount factor. The value iteration algorithm using Q-factors can be written as: \[\begin{align} Q_{t+1}(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + \alpha \min_{b}Q_{t}(j,b)). \label{eq:qrl-infinite} \end{align}\] The Q-learning update iteration can be written as: \[\begin{align} Q_{t+1}(i,a) = (1-\beta_{t})Q_{t}(i,a) + \beta_{t}(g(i,a,\bar{i}) + \alpha \min_{b}Q_{t}(\bar{i},b)), \label{eq:qrl} \end{align}\] where \(\bar{i}\) is sampled from \(p_{ij}(a)\).
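Relative to the SSP case, the only change in the update is the discount factor \(\alpha\) multiplying the \(\min\) term. A single-transition update function might look as follows; the function name, signature, and the example table are illustrative.

```python
import numpy as np

def q_learning_step(Q, i, a, cost, j, beta, alpha):
    """One discounted Q-learning update from an observed transition (i, a, cost, j).

    Q     : 2-D array of Q-factor estimates, Q[state, action]
    beta  : step size; alpha : discount factor in (0, 1)
    """
    target = cost + alpha * np.min(Q[j])                 # g(i,a,j') + alpha * min_b Q_t(j',b)
    Q[i, a] = (1 - beta) * Q[i, a] + beta * target
    return Q

# Example usage on a hypothetical 3-state, 2-action table.
Q = np.zeros((3, 2))
Q = q_learning_step(Q, i=0, a=1, cost=1.0, j=2, beta=0.5, alpha=0.9)
print(Q)
```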
5.10.2 Convergence of Q-Learning
(A1’) The single stage cost is bounded, i.e., \(|g(i,a,j)| \le M < \infty\) \(\forall\, i,a,j\).
(A2’) Step sizes satisfy \(\sum\beta_{t} = \infty, \sum \beta_{t}^2 < \infty\).
Under the assumptions above, following the proof in the "all policies are proper" case of SSP Q-learning, one can infer that the operator \(H\) underlying Q-learning is an \(\alpha\)-contraction, where \(H\) is defined by \[\begin{align}
HQ(i,a) = \sum_{j} p_{ij}(a)(g(i,a,j) + \alpha \min_{b}Q(j,b)).
\label{eq:H-discountedQ}
\end{align}\] Since \(H\) is an \(\alpha\)-contraction, we have \[\left\Vert HQ - HQ'\right\Vert_{\infty} \le \alpha \left\Vert Q - Q'\right\Vert_{\infty},\] where \(\left\Vert \cdot \right\Vert_{\infty}\) is the max-norm.
Theorem 5.2 Under (A1’) and (A2’), and assuming all state-action pairs are visited infinitely often, we have \(Q_{t}\rightarrow Q^*\) with probability \(1\) as \(t\rightarrow\infty\).
The claim follows by an argument completely parallel to the proof of Theorem 5.1.
5.11 Exercises
Exercise 5.1 A gambler engages in a game of successive i.i.d. flips of a coin that falls heads w.p. \(p \in (0,1)\). When the coin turns up heads, the gambler wins a rupee. When the coin turns up tails, the gambler loses all his/her earnings so far.
Answer the following:
Model this problem as a Markov chain with a state variable that denotes the current earnings of the gambler. Show that this chain is positive recurrent by exhibiting a stationary distribution.
Suppose there are two actions for the gambler in each round. The first action is to stop and take home the current earnings, and the second action is to continue flipping. Formulate this problem as an infinite horizon discounted MDP with discount factor \(\alpha <1\). Write down the Q-values, and the associated Bellman equation.
For the MDP in the part above, specify the update rule of the Q-learning algorithm, using the following template: \[Q_{t+1}(i) = (1-\beta_t) Q_t(i) + \beta_t(HQ_t(i) + w_t(i)).\]
Let \(\xi\) denote the stationary distribution obtained in the first part above, and let \(\left\| \cdot \right\|_\xi\) denote the weighted \(\ell_2\) norm using the stationary distribution \(\xi\). Show that the mapping \(H\) underlying the Q-learning algorithm is a contraction w.r.t. \(\left\| \cdot \right\|_\xi\), i.e., \[\left\| HQ - HQ'\right\|_\xi \le \alpha \left\| Q-Q'\right\|_\xi.\]
Exercise 5.2 Consider the problem of Monte Carlo policy evaluation in the context of a stochastic shortest path (SSP) problem. Fix a proper policy \(\pi\), and obtain \(l\) MDP trajectories using this policy. Use the every visit variant for policy evaluation, i.e., for a given state \(i\), the estimate \(\tilde J_l(i)\) of the expected cost \(J_\pi(i)\) is formed by averaging the cost samples \(\{\hat J(i,m,k), m=1,\ldots,n_k, k=1,\ldots,l \}\). Here \(n_k\) is the number of visits to state \(i\) in trajectory \(k\), for \(k=1,\ldots,l\), and \(\hat J(i,m,k)\) is the total cost accumulated from the \(m\)th visit to state \(i\) until the end of trajectory \(k\).
Answer the following:
Is \(\tilde J_l(i)\) an unbiased estimate of \(J_\pi(i)\)? Justify your answer.
Prove a result in the spirit of the strong law, i.e., \[\tilde J_l(i) \rightarrow J_\pi(i) \textrm{ a.s. as } l \rightarrow \infty.\]
5.12 Notes
TD learning was proposed by Sutton (1988), while a detailed convergence analysis came later. The presentation of TD and Q-learning variants here is based on Chapter 5 of Bertsekas and Tsitsiklis (1996). The reader is also referred to Sutton and Barto (1998) for an understanding of these algorithms.