4  Stochastic approximation

Updated

October 28, 2025

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\EE}[1]{{\mathbb{E}}\left[#1\right]} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\argmax}{\mathop{\rm arg\,max}} \newcommand{\argmin}{\mathop{\rm arg\,min}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cS}{\mathcal{S}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\S}{\mathcal{S}} \newcommand{\D}{\mathcal{D}} \newcommand{\A}{\mathcal{A}} \newcommand{\cC}{\mathcal{C}} \newcommand{\X}{\mathcal{X}} \newcommand{\F}{\mathcal{F}} \newcommand{\N}{\mathcal{N}} \newcommand{\cN}{\mathcal{N}} \newcommand{\C}{\mathbb{C}} \newcommand{\I}{\mathbb{I}} \newcommand{\cK}{\mathcal{K}} \newcommand{\Z}{\mathcal{Z}} \newcommand{\O}{\mathcal{O}} \newcommand{\P}{\mathbb{P}} \newcommand{\indic}[1]{\mathbb{I}\left\{#1\right\}} \]

4.1 Motivation and basic algorithm

An important RL task is to perform policy evaluation given a policy \(\pi\) in any desired MDP (e.g. SSP, discounted MDP, etc.). If we know the transition probabilities, then we know \(T_\pi\) completely and hence, we can solve \[\begin{align} J_\pi = T_\pi J_\pi, \end{align}\] either directly or by applying \(T_\pi\) repeatedly on some \(J \in \R^{|\mathcal{X}|}\) by the process of value iteration.

However, in an RL setting, the transition probabilities (\(p_{ij}(a)\)) are unknown. Instead, we get to observe a sample path or trajectory \[\begin{align} s_0 \xrightarrow{\pi(s_0)} s_1 \xrightarrow{\pi(s_1)} s_2 \xrightarrow{\pi(s_2)} \ldots, \end{align}\] where each \(s_t\) is sampled according to the transition probabilities, i.e., \(\P(s_t = j \mid s_{t-1}) = p_{s_{t-1} j}(\pi(s_{t-1}))\).

To gain a better insight, let us revisit the Bellman equation \[\begin{align} T_\pi J (i) = \sum_j p_{ij}(\pi(i)) \left( g(i, \pi(i), j) + J(j) \right). \end{align}\] In an RL setting, the term \(p_{ij}(\pi(i))\) is unknown, but we get to observe a sample path that follows this distribution. We also assume that the cost function \(g(\cdot, \cdot)\) is known and deterministic (noise-free) and that it depends only on the current state and current action.

One can think along similar lines to solve the problem of control, i.e., finding the optimal policy, using a sample path where the actions are chosen by an RL algorithm. An element of learning is involved here: the goal is to solve the MDP, i.e., to find the optimal policy or to evaluate a given policy, while the RL algorithm receives information only through a sample path.

Recall that policy evaluation as well as control involves solving a fixed point equation, in particular, a system of equations of the following form: \[Hr = r,\] where \(H: \R^n \rightarrow \R^n\) and \(r \in \R^n\). In an RL application, we shall think of \(H\) as the Bellman operator \(T\) or \(T_\pi\) and the vector \(r\) as our value function. But solving fixed point equations is of interest outside of RL applications as well.

A direct approach is to start with some \(r_0 \in \R^n\) and do \(r_{m+1} = H r_m\). A variation to this update rule can be given as \[\begin{align} r_{m+1} = (1-\beta) r_m + \beta H r_m,\label{eq:f12} \end{align}\] where \(\beta \in (0,1)\) is known as the step size of the above iterative scheme. In the ML community, \(\beta\) is referred to as the learning rate. Consider the interesting special case where \(Hr = r - \nabla f(r)\) for some \(f: \R^n \rightarrow \R\). Then, \[\begin{align} Hr = r \Longleftrightarrow \nabla f(r) = 0, \end{align}\] which corresponds to finding a stationary point of \(f\) (in particular, a minimum when \(f\) is convex). Substituting this in \(\eqref{eq:f12}\), we obtain the following gradient descent algorithm: \[\begin{align} r_{m+1} &= (1-\beta) r_m + \beta (r_m - \nabla f(r_m)), \\ \text{(or)} \quad r_{m+1} &= r_m - \beta \nabla f(r_m). \end{align}\] It is worth noting that if \(r_m \rightarrow r^*\) and \(H\) is continuous at \(r^*\), then \(Hr^* = r^*\).
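
To make the connection concrete, here is a minimal sketch (in Python) of the damped fixed-point iteration \(\eqref{eq:f12}\) with \(Hr = r - \nabla f(r)\); the quadratic objective, the step size and the iteration count are illustrative assumptions, not part of the development above.

```python
import numpy as np

# Minimal sketch: damped fixed-point iteration r <- (1 - beta) r + beta H(r)
# with H(r) = r - grad_f(r), which is exactly gradient descent on f.
# The quadratic f below is a hypothetical example.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # positive definite
b = np.array([1.0, -1.0])

def grad_f(r):
    # gradient of f(r) = 0.5 r^T A r - b^T r
    return A @ r - b

def H(r):
    return r - grad_f(r)

beta = 0.2                                 # step size / learning rate
r = np.zeros(2)
for _ in range(200):
    r = (1 - beta) * r + beta * H(r)       # same as r = r - beta * grad_f(r)

print(r)                                   # close to the minimizer of f
print(np.linalg.solve(A, b))
```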

So far we have assumed that \(H\) is perfectly observable for all \(r\); however, this isn’t the case in an RL problem. Hence, we consider the setting where \(H\) is not precisely known, but we have a black box that outputs \(Hr + w\) for input \(r\). For simplicity, assume the noise component \(w\) is zero-mean. The simplest environment involves i.i.d. noise, e.g., \(w \sim \mathcal{N}(0,1)\), while typical RL settings involve correlated noise that is conditionally zero-mean.

Therefore, our iterative scheme along with the noise term can be written as \[\begin{align} r_{m+1} = (1-\beta) r_m + \beta (H r_m + w_m), \label{eq:sto_approx_fixedpoint} \end{align}\] where we get a noisy observation of \(H r_m\). Also for simplicity, we assume that \(\E(w_m) = 0\), \(\E(w_m^2) < \infty\), and \(w_m\) are sampled in an i.i.d. fashion. The above equation is an instance of a stochastic approximation algorithm, also known as a stochastic iterative scheme.
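
The following minimal sketch simulates \(\eqref{eq:sto_approx_fixedpoint}\) for a made-up affine contraction \(H\) and i.i.d. Gaussian noise; the diminishing step size \(\beta_m = 1/m\) used here anticipates the sufficient conditions listed below.

```python
import numpy as np

# Minimal sketch of the stochastic iterative scheme: a noisy observation of
# H r_m is used in place of H r_m, with a diminishing step size beta_m = 1/m.
# The affine contraction H (modulus 0.5, fixed point [2, 4]) is made up.
rng = np.random.default_rng(0)
c = np.array([1.0, 2.0])

def H(r):
    return 0.5 * r + c

r = np.zeros(2)
for m in range(1, 10001):
    beta_m = 1.0 / m
    w = rng.normal(size=2)                 # i.i.d. zero-mean noise
    r = (1 - beta_m) * r + beta_m * (H(r) + w)

print(r)                                   # close to the fixed point [2., 4.]
```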

It would be nice to establish convergence of \(\eqref{eq:sto_approx_fixedpoint}\), in particular, a guarantee of the following form: \[r_m \rightarrow r^* \textrm{ with probability (w.p.) } 1 \textrm { as } m \rightarrow \infty.\] After providing several applications of stochastic approximation in the next section, we shall show that the following conditions are sufficient for the convergence guarantee given above:

  1. \(H\) is contractive

  2. Step sizes \(\beta_m = \frac{1}{m}\)

  3. Noise \(w_m\) is conditionally zero-mean and has bounded variance (this covers the case where \(\E w_m = 0\), \(\E w_m^2 < \infty\), and \(\{w_m\}\) are i.i.d.).

4.2 Stochastic approximation

For finding the zeroes of a function \(h:\R^d\rightarrow\R^d\), given noisy observations \(h(x) + \eta\), Robbins and Monro (1951) proposed the seminal stochastic approximation scheme, which updates as follows: \[\begin{equation} \label{eq:robbinsmonro} x_{n+1} = x_n + \beta_n (h(x_n) + w_n), \end{equation}\] where \(w_n\) is a martingale difference noise and \(\beta_n\) the step size.

We next present a result by Kushner and Clark (1978) to infer asymptotic convergence of \(\eqref{eq:robbinsmonro}\).

4.3 Kushner-Clark lemma

We make the following assumptions.

(A1) \(h:\mathbb{R}^d\rightarrow\mathbb{R}^d\) is a Lipschitz continuous function with Lipschitz constant \(L>0\).

(A2) Step sizes \(\beta_n\) satisfy \(\sum_n \beta_n=\infty\) and \(\sum_n \beta_n^2<\infty\).

(A3) Noise \(\{w_n\}\) is a martingale difference sequence satisfying \(\E[\norm{w_n}^2 \mid \F_n ] \le C(1+ \norm{x_n}^2)\) for some positive scalar \(C\), where \(\F_n=\sigma(x_0,\ldots,x_n)\).

Theorem 4.1 Under (A1)-(A3), the iterate \(x_n\) governed by \(\eqref{eq:robbinsmonro}\) converges almost surely (a.s.) to the set \(K=\{x^*\mid h(x^*)=0\}\).

We next discuss a few popular examples of stochastic approximation.

4.3.1 Mean estimation

Consider a random variable (r.v.) \(X\) with mean \(\mu\) and variance \(\sigma^2\). Suppose we are given independent and identically distributed (i.i.d.) samples \(X_1, \dots, X_m\) from the distribution of \(X\). Let \(x_m\) be the sample mean computed using these \(m\) samples. We now derive an iterative scheme for updating the sample mean. \[\begin{align} x_{m+1} &= \dfrac{1}{m+1} \sum_{k=1}^{m+1} X_k\nonumber\\ &= \dfrac{m}{m+1} \left(\dfrac{1}{m} \sum_{k=1}^m X_k \right) + \dfrac{1}{m+1} X_{m+1}\nonumber\\ &= \dfrac{m}{m+1} x_m + \dfrac{1}{m+1} X_{m+1} \nonumber\\ x_{m+1} &= x_m + \dfrac{1}{m+1} \left(X_{m+1} - x_m\right).\label{eq:sa-mean-est} \end{align}\] The update rule above is a stochastic approximation scheme with step-size \(\alpha_m = \dfrac{1}{m+1}\).
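
A minimal sketch of the incremental update \(\eqref{eq:sa-mean-est}\); the shifted-Gaussian distribution of the samples is assumed purely for illustration.

```python
import numpy as np

# Minimal sketch of the incremental mean estimate:
# x_{m+1} = x_m + (X_{m+1} - x_m) / (m + 1).
# The samples are drawn from N(mu, 1) with mu = 3 purely for illustration.
rng = np.random.default_rng(1)
mu = 3.0

x = 0.0                                    # x_0: initial estimate
for m in range(100000):
    X_next = mu + rng.normal()             # sample X_{m+1}
    x = x + (X_next - x) / (m + 1)         # step size alpha_m = 1/(m+1)

print(x)                                   # close to mu = 3.0
```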

The strong law of large numbers says the following: \[\begin{align} x_m &\rightarrow \mu \text{\; a.s. as \;} m \rightarrow \infty. \end{align}\] Rewriting the update rule \(\eqref{eq:sa-mean-est}\), we obtain \[\begin{align*} x_{m+1} &= x_m + \alpha_m \left(X_{m+1} - x_m\right)\\ x_{m+1} &= x_m + \alpha_m \left[\left(\mu-x_m\right) + \left(X_{m+1}-\mu\right)\right]. \end{align*}\] Letting \(M_{m+1}=X_{m+1}-\mu\), it is easy to see that \(\{M_{m}\}\) is a martingale difference sequence satisfying \(\E M_m^2 < \infty\).

From an application of the Kushner-Clark lemma, we can see that \(x_m \rightarrow \mu\) a.s. as \(m \rightarrow \infty\) for more general step-sizes that satisfy \[\sum_m \alpha_m = \infty, \quad \sum_m \alpha_m^2 < \infty.\]

One can reason about the need for the above conditions using the mean estimation problem as follows. Suppose \(\{M_n\}\) is an i.i.d. sequence with mean zero and variance \(\sigma^2\). Then, the variance of \(x_{m+1}\) is \[\begin{align*} \Var(x_{m+1}) &= \Var\left[x_m + \alpha_m h(x_m) \right] + \alpha_m^2 \Var(M_{m+1})\\ &= \Var\left[x_m + \alpha_m h(x_m) \right] + \alpha_m^2 \sigma^2\\ &\geq \alpha_m^2 \sigma^2. \end{align*}\] If we choose a constant stepsize, i.e., \(\alpha_m = \beta \; \forall m\), then \(\Var(x_{m+1}) \geq \beta^2 \sigma^2\). Thus, with a constant stepsize, \(x_m \not\longrightarrow x^*\) almost surely, motivating the need for a diminishing stepsize that vanishes asymptotically. However, such a stepsize cannot go down too fast, since \[\begin{align*} x_{m+1} &= x_m + \alpha_m (h(x_m) + M_{m+1} ),\\ |x_m - x_0| &\leq \sum_{\tau = 0}^{m-1} \alpha_\tau |h(x_\tau) + M_{\tau+1}|. \end{align*}\] If \(|h(x_\tau) + M_{\tau+1}| \leq C_1\) and if \(\sum_{\tau=0}^\infty \alpha_\tau \leq C_2 < \infty\), then \(|x_m - x_0|\) is bounded above. This implies \(x_m\) stays within a certain radius of the initial point \(x_0\), which is problematic if \(x^*\) lies outside this radius. Hence, we need \(\sum_\tau \alpha_\tau = \infty\).

4.3.2 Stochastic gradient algorithm using a first-order oracle

Consider the following problem: \[\begin{align} x^* \in \arg\min_{x} f(x). \label{eq:basicpb} \end{align}\] A stochastic gradient algorithm for solving \(\eqref{eq:basicpb}\) would update as follows: \[\begin{align} \label{eq:saalg} x_{n+1} = x_n - a(n) \widehat\nabla f(x_n). \end{align}\] In the above, \(\widehat\nabla f(x_n)\) is an estimate of the gradient \(\nabla f(x_n)\), and \(\{a(n)\}\) are (pre-determined) step-sizes satisfying standard stochastic approximation conditions (see (A2) above).
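
A minimal sketch of \(\eqref{eq:saalg}\), assuming a hypothetical quadratic objective and an unbiased (Gaussian-noise) gradient oracle; the step size \(a(n)=1/n\) satisfies (A2).

```python
import numpy as np

# Minimal sketch of stochastic gradient descent with a first-order oracle:
# x_{n+1} = x_n - a(n) (grad f(x_n) + zero-mean noise), a(n) = 1/n.
# The quadratic objective and the Gaussian gradient noise are made up.
rng = np.random.default_rng(2)
A = np.diag([1.0, 2.0])
b = np.array([1.0, -2.0])

def noisy_grad(x):
    # unbiased estimate of the gradient of f(x) = 0.5 x^T A x - b^T x
    return A @ x - b + rng.normal(size=2)

x = np.zeros(2)
for n in range(1, 50001):
    a_n = 1.0 / n                          # satisfies (A2)
    x = x - a_n * noisy_grad(x)

print(x)                                   # close to the minimizer [1., -1.]
```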

In a zeroth-order setting, the gradient information is not directly available, and instead, the optimization algorithm has oracle access to noise-corrupted function measurements. In the next section, we present the simultaneous perturbation trick for estimating gradients from zeroth-order information. Such estimates are not unbiased, but feature a parameter that can reduce the bias at the cost of variance.

4.3.3 Stochastic gradient algorithm using a zeroth-order oracle

In this setting, we do not have unbiased gradient measurements. Instead, we have the so-called “zeroth-order oracle” that outputs noisy function observations, i.e., for an input \(x\), the oracle responds with \(f(x) + \xi\), where \(\xi\) is the measurement noise that is usually assumed to be zero-mean.

Let \(\F_n=\sigma(x_1,\ldots,x_n)\) denote the sigma field underlying the following stochastic gradient algorithm: \[\begin{align} \label{eq:saalg-again} x_{n+1} = x_n - a(n) \widehat\nabla f(x_n), \end{align}\] where \(\widehat\nabla f(x_n)\) is formed as follows: \[\begin{align} \widehat\nabla f(x_n) = \left(\frac{ y_n^+ - y_n^-}{2\delta_n}\right) U,\label{eq:grad-unified-again} \end{align}\] where \(y_n^+= f(x_n+\delta_n U)+\xi_n^+\), and \(y_n^- = f(x_n- \delta_n U)+\xi_n^-\). Here \(U\) is a random vector of independent standard normal random variables, \(\xi_n^\pm\) denotes the measurement noise, and \(\delta_n\) denotes a perturbation constant that can be used to handle the bias-variance tradeoff in the gradient estimate. The measurement noise satisfies \[\begin{align} \E[\xi_n^+-\xi_n^- |\, \F_n] &= 0, \text{~~ and ~~} \E [ (\xi_n^{+} - \xi_n^-)^{2} |\, \F_n] \le \sigma^2 <\infty\,, \,\forall n\ge 1. \label{eq:noiseass-intro} \end{align}\]

It can be shown that the gradient estimate above satisfies

\[\begin{align} \forall n\ge 1, \norm{ \EE{\widehat\nabla f(x_n)} \!-\! \nabla f(x_n) } \le C_1\delta_n^2, \textrm { and } \\ \EE{\norm{ \widehat\nabla f(x_n) - \EE{\widehat\nabla f(x_n)} }^2} \le \frac{C_2}{\delta_n^2}, \end{align}\] for some constants \(C_1\) and \(C_2\). Thus, decreasing \(\delta_n\) leads to a more accurate gradient estimate, but this comes at the cost of increased variance, and vice-versa.
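
A minimal sketch of the estimator \(\eqref{eq:grad-unified-again}\) on a test quadratic, with Gaussian measurement noise; averaging many independent estimates illustrates that the bias is small for a moderate \(\delta\). The test function and noise level are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two-point zeroth-order gradient estimate:
# perturb x along a standard normal direction U and form
#   ((y+ - y-) / (2 delta)) * U
# from noisy function evaluations. The test function f(x) = 0.5 ||x||^2
# (true gradient x) and the noise level are made up.
rng = np.random.default_rng(3)

def f(x):
    return 0.5 * np.sum(x ** 2)

def zo_grad(x, delta, noise_std=0.01):
    U = rng.normal(size=x.shape)           # random perturbation direction
    y_plus = f(x + delta * U) + noise_std * rng.normal()
    y_minus = f(x - delta * U) + noise_std * rng.normal()
    return (y_plus - y_minus) / (2.0 * delta) * U

x = np.array([1.0, -2.0, 3.0])
delta = 0.1
# Averaging many independent estimates shows the bias is small (O(delta^2)),
# while a single estimate has variance of order 1/delta^2.
est = np.mean([zo_grad(x, delta) for _ in range(20000)], axis=0)
print(est)                                 # close to the true gradient [1., -2., 3.]
```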

For the asymptotic convergence analysis of \(\eqref{eq:saalg-again}\), the reader is referred to Chapter 4 of Prashanth and Bhatnagar (2025).

4.3.4 Stochastic fixed point iterations

Consider a function \(f:\R^d\rightarrow\R^d\) that satisfies \[\begin{align} \norm{f(x)-f(y)} \le \alpha \norm{x-y}, \label{eq:contraction} \end{align}\] for any \(x,y\in \R^d\). Here \(\alpha \in (0,1)\), and \(\norm{\cdot}\) is the \(\ell_2\)-norm.

Since the underlying space is Euclidean (and hence complete), by the Banach fixed point theorem, there exists a unique fixed point \(x^*\) of the function \(f\). A first attempt at finding such a fixed point is via the following iterative scheme: start with some \(x_0 \in \R^d\) and do \(x_{n+1} = f (x_n)\). A variation to this update rule can be given as \[\begin{align} x_{n+1} = (1-\beta) x_n + \beta f(x_n), \end{align}\] where \(\beta \in (0,1)\) is the step size.

Note that if \(x_n \rightarrow x^*\) and \(f\) is continuous at \(x^*\), then \(f(x^*) = x^*\).

So far we have assumed that \(f\) is perfectly observable for all \(x\). However, in many learning scenarios, e.g., reinforcement learning, this isn’t the case. In particular, we consider the setting where \(f\) is not precisely known, but we have black box access to \(f\) as outlined in the sections above, i.e., for an input \(x\), we observe \(f(x) + w\), where \(w\) is some zero-mean noise. The simplest noise model would correspond to \(w \sim \mathcal{N}(0,1)\), while a martingale difference noise structure is more general.

For this setting, a stochastic fixed point iteration would update as follows: \[\begin{align} x_{n+1} = (1-\beta_n) x_n + \beta_n (f(x_n) + w_{n+1}), \label{eq:stoFixedPt} \end{align}\] where \(\beta_n\) is the step size. For simplicity, we assume that \(\E(w_n) = 0\), \(\E(w_n^2) < \infty\), and the \(w_n\) are sampled in an i.i.d. fashion. Now, what we want is that \(x_n \rightarrow x^*\) with probability (w.p.) 1 as \(n \rightarrow \infty\). This will occur if

  1. \(f\) is a contraction

  2. step sizes \(\beta_n\) satisfy standard stochastic approximation conditions (see (A2) above)

  3. noise \(w_{n+1}\) is a martingale difference sequence satisfying a linear growth condition on its second moment (see (A3) above).

The stochastic fixed point iteration algorithm discussed above would not necessarily converge if the modulus of contraction is \(\alpha=1\) in \(\eqref{eq:contraction}\). In this case, a fixed point is not even guaranteed to exist, e.g., consider \(f(x)=x+1\). Alternatively, more than one fixed point may exist (e.g., \(f(x)=x\)), or exactly one fixed point may exist (e.g., \(f(x)=-x\)). Under the additional assumption that at least one fixed point exists, the stochastic fixed point iteration \(\eqref{eq:stoFixedPt}\) is guaranteed to converge almost surely to a (sample path dependent) fixed point.

4.3.5 Linear stochastic approximation

Consider the following stochastic approximation algorithm: \[x_{n+1} = x_n + a(n)\left( A_{n+1} x_n + b_{n+1} \right),\] where the step-size \(a(n)\) satisfies \(\sum_n a(n) = \infty\) and \(\sum_n a(n)^2 < \infty\). Further, \(A_n\) and \(b_n\) are random matrices and vectors that satisfy \[\E\left[A_{n+1} \mid x_1,\ldots,x_n\right] = A, \quad \E\left[b_{n+1} \mid x_1,\ldots,x_n\right] = b,\] where \(A\) is a negative-definite matrix. Moreover, \(\E\left[\norm{A_{n} - A}^2\right] \le C_1\) and \(\E\left[ \norm{b_n - b}^2 \right] \le C_2\). In this setting, applying the Kushner-Clark lemma, it can be shown that \[x_n \rightarrow x^* \textrm{ a.s. as } n \rightarrow \infty,\] where the limit \(x^*\) satisfies \(Ax^*+b=0\).
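
A minimal sketch of the linear stochastic approximation recursion above, with a made-up negative-definite \(A\), vector \(b\) and Gaussian perturbations playing the role of \(A_{n+1}, b_{n+1}\).

```python
import numpy as np

# Minimal sketch of linear stochastic approximation:
# x_{n+1} = x_n + a(n) (A_{n+1} x_n + b_{n+1}),
# where A_{n+1}, b_{n+1} are noisy versions of a negative-definite A and a
# vector b (both made up here), and a(n) = 1/n.
rng = np.random.default_rng(4)
A = np.array([[-2.0, 0.5],
              [0.5, -3.0]])                # negative definite
b = np.array([1.0, 1.0])

x = np.zeros(2)
for n in range(1, 100001):
    A_next = A + 0.1 * rng.normal(size=(2, 2))   # E[A_{n+1} | past] = A
    b_next = b + 0.1 * rng.normal(size=2)        # E[b_{n+1} | past] = b
    x = x + (A_next @ x + b_next) / n

print(x)                                   # close to x* solving A x* + b = 0
print(np.linalg.solve(A, -b))
```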

4.3.6 Quantile estimation

Consider the following problem, which is a variant of mean estimation. For a continuous random variable (r.v.) \(X\) with cumulative distribution function \(F\) and for a given \(\alpha \in (0,1)\), define \[q_\alpha(X) = F^{-1}(\alpha).\] Notice that \(q_\alpha(X)\) is the median of the distribution of \(X\) when \(\alpha=0.5\).

Let \(\{X_n\}_{n\ge 1}\) be an independent sequence of r.v.s with common distribution \(F\).

Notice that \(F(q_\alpha(X))=\E[\indic{X\le q_\alpha(X)}]=\alpha\). A stochastic approximation algorithm for estimating \(q_\alpha(X)\) for a pre-specified \(\alpha\) can be arrived at as follows. Let \(q_n\) denote the estimate of \(q_\alpha(X)\) after observing samples \(X_1,\ldots,X_n\). On observing \(X_{n+1}\), \(q_n\) is updated as follows: \[\begin{align} q_{n+1} = q_n - \alpha_n \left( \indic{X_{n+1} \le q_n} - \alpha \right).\label{eq:quantile-sa} \end{align}\] The negative sign ensures that \(q_n\) decreases on average when \(F(q_n)>\alpha\), and increases otherwise. Notice that the update is iterative, i.e., given an estimate \(q_n\) at time instant \(n\) and a new sample \(X_{n+1}\), the algorithm performs an incremental update using \(q_n, X_{n+1}\) to arrive at \(q_{n+1}\).
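
A minimal sketch of \(\eqref{eq:quantile-sa}\), assuming (purely for illustration) that the samples are Uniform\((0,1)\), so that the \(\alpha\)-quantile is simply \(\alpha\).

```python
import numpy as np

# Minimal sketch of the quantile-estimation recursion:
# q_{n+1} = q_n - alpha_n (1{X_{n+1} <= q_n} - alpha), with alpha_n = 1/n.
# The samples are Uniform(0,1), so the alpha-quantile is alpha itself.
rng = np.random.default_rng(5)
alpha = 0.9                                # target quantile level
q = 0.0                                    # initial estimate q_0

for n in range(1, 200001):
    X_next = rng.uniform()                 # sample X_{n+1}
    alpha_n = 1.0 / n
    q = q - alpha_n * (float(X_next <= q) - alpha)

print(q)                                   # close to 0.9
```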

Consider the following alternative observation model: at time instant \(n\), the stochastic approximation algorithm picks a threshold, say \(T\), and the environment returns a boolean that indicates whether \(X_{n+1} \le T\) or not. Quantile estimation in this threshold-based model would follow the same iterative scheme as \(\eqref{eq:quantile-sa}\). To see this, let \[Y_{n+1}=\begin{cases} 1 & \textrm{ if } X_{n+1} \le q_n,\\ 0 & \textrm{ else}. \end{cases}\] Then, the update rule in \(\eqref{eq:quantile-sa}\) is equivalent to \[\begin{align} q_{n+1} = q_n - \alpha_n \left( Y_{n+1} - \alpha \right).\label{eq:quantile-sa1} \end{align}\]

Using a variant of the Kushner-Clark lemma, it is possible to establish almost sure convergence of \(q_n\) to \(q_\alpha(X)\); we omit the details.

In finance literature, a risk measure closely related to quantiles is ‘Value at Risk (VaR)’. For any random variable \(X\), we define the VaR at level \(\alpha\in\left(0,1\right)\) as \[\text{VaR}_{\alpha}(X):=\inf\left\{\xi \mbox{ }|\mbox{ }\mathbb{P}\left(X\leq \xi\right)\geq\alpha\right\}.\] If the distribution of \(X\) is continuous, then VaR is the lowest solution to \(\mathbb{P}\left(X\leq \xi\right)=\alpha.\) VaR as a risk measure has several drawbacks, which precludes using standard stochastic optimization methods. This motivated the definition of coherent risk measures in (Artzner et al. 1999). A risk measure is coherent if it is convex, monotone, positive homogeneous and translation equi-variant. Conditional Value at Risk (CVaR) is one popular risk measure defined by \[\text{CVaR}_{\alpha}(X):=\mathbb{E}\left[X | X \geq \text{VaR}_{\alpha}(X)\right].\] Unlike VaR, the above is a coherent risk measure.

A well-known result from (Rockafellar and Uryasev 2000) is that both VaR and CVaR can be obtained from the solution of a certain convex optimization problem and we recall this result next.

Theorem 4.2 For any random variable \(X\), let \[\begin{align} \label{eq:vV} v(\xi,X):=\xi + \frac{1}{1-\alpha}(X-\xi)_{+} \text{ and } V(\xi)=\E\left[v(\xi,X)\right]. \end{align}\] Then, \(\textnormal{VaR}_{\alpha}(X) \in \argmin_\xi V(\xi) = {\left\{\xi \in \mathbb{R}\ | \ V'(\xi)=0 \right\}}\), where \(V'\) is the derivative of \(V\) w.r.t. \(\xi\). Further, \(\textnormal{CVaR}_{\alpha}(X)=V(\textnormal{VaR}_{\alpha}(X))\).

From the above, it is clear that in order to estimate VaR/CVaR, one needs to find a \(\xi\) that satisfies \(V'(\xi)=0\). Stochastic approximation (SA) is a natural tool to use in this situation. Recall that SA is used to solve the equation \(h(x) = 0\) when the analytical form of \(h\) is not known, but noisy measurements \(h(x_n) + \xi_n\) can be obtained, where \(x_n, n \ge 0\), are the input parameters and \(\xi_n, n \ge 0\), are zero-mean random variables that are not necessarily i.i.d.

Using the stochastic approximation principle and the result in Theorem 4.2, we have the following scheme to estimate the VaR/CVaR simultaneously from the samples \(\{X_1,\ldots,X_n\}\): \[\begin{align} \text{VaR: } & q_{n+1}= q_{n}-\alpha_{n}\left(1-\frac{1}{1-\alpha}\indic{X_{n+1}\geq q_n}\right), \label{eq:RMVaR}\\ \text{CVaR: } & \psi_{n+1}=\psi_{n}-\frac{1}{n+1}\left(\psi_{n} - v(q_{n},X_{n+1})\right). \label{eq:RMCVaR} \end{align}\] In the above, \(\eqref{eq:RMVaR}\) can be seen as a stochastic gradient descent rule applied to \(V\), while \(\eqref{eq:RMCVaR}\) can be seen as a plain averaging update.
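
A minimal sketch of the coupled recursions \(\eqref{eq:RMVaR}\)-\(\eqref{eq:RMCVaR}\), assuming (for illustration) that \(X\) is standard normal and \(\alpha=0.9\); in this case \(\textnormal{VaR}_{0.9}\approx 1.28\) and \(\textnormal{CVaR}_{0.9}\approx 1.75\).

```python
import numpy as np

# Minimal sketch of the coupled VaR/CVaR recursions: a stochastic gradient
# step on V for the VaR estimate q_n, and plain averaging of v(q_n, X_{n+1})
# for the CVaR estimate psi_n. X is standard normal (made up) and alpha = 0.9,
# so VaR ~ 1.28 and CVaR ~ 1.75.
rng = np.random.default_rng(6)
alpha = 0.9
q, psi = 0.0, 0.0

for n in range(1, 1000001):
    X_next = rng.normal()
    v = q + max(X_next - q, 0.0) / (1.0 - alpha)             # v(q_n, X_{n+1})
    psi = psi - (psi - v) / (n + 1)                          # CVaR: averaging
    q = q - (1.0 - float(X_next >= q) / (1.0 - alpha)) / n   # VaR: gradient step

print(q, psi)                              # roughly (1.28, 1.75)
```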

An interesting question is whether the stochastic gradient-based estimation scheme in \(\eqref{eq:RMVaR}\) converges faster than the root-finding estimation scheme in \(\eqref{eq:quantile-sa}\).

4.4 Policy evaluation in a discounted MDP

Fix a policy \(\pi\) in a discounted MDP. Assume that the single stage cost is \(g(i,\pi(i))\). For simplicity, we have assumed that the single stage cost does not depend on the next state. Define \[(HJ)(i) = g(i,\pi(i))+\alpha\sum\limits_{j}p_{ij}(\pi(i))J(j),\, \forall i.\] The problem of policy evaluation is to find a solution to the following fixed point equation: \[\begin{equation} \label{eq:star1} J(i) = HJ(i), \forall i. \end{equation}\]

A stochastic iterative algorithm for solving \(\eqref{eq:star1}\) would update as follows: \[\begin{align} J_{t+1}(i)&=J_t(i)+\beta_t(g(i,\pi(i))+\alpha J_t(\bar{i})-J_t(i))\\ &=J_t(i)+\beta_t(g(i,\pi(i))+\alpha \sum\limits_{j}p_{ij}(\pi(i))J_t(j) - J_t(i) + \nonumber\\&\qquad \{g(i,\pi(i))+\alpha J_t(\bar{i})-g(i,\pi(i))-\alpha \sum\limits_{j}p_{ij}(\pi(i))J_t(j)\}). \end{align}\] Or, equivalently, \[\begin{align} J_{t+1}(i) &= J_t(i)+\beta_t(((HJ_t)(i)-J_t(i))+w_t(i)),\label{eq:t12} \end{align}\] where \(w_t(i)= g(i,\pi(i))+\alpha J_t(\bar{i})-g(i,\pi(i))-\alpha \sum\limits_{j}p_{ij}(\pi(i))J_t(j) = \alpha\big(J_t(\bar{i})- \sum\limits_{j}p_{ij}(\pi(i))J_t(j)\big)\), for all \(i\).

Here \(\bar{i}\) is a state observed along the sample path, i.e., in state \(i\), on action \(\pi(i)\), we observe a transition to \(\bar{i}\), where \(\bar{i}=j\) with probability \(p_{ij}(\pi(i))\).
In the absence of the noise term \(w_t(i)\), it is easy to infer that

\[J_t \rightarrow J_{\pi}, \text{ where } J_{\pi}=HJ_{\pi}.\]

In the presence of \(w_t(i)\), we would like to guarantee asymptotic convergence of \(J_t\) to \(J_\pi\). Observe that \(w_t(i)\) is conditionally zero-mean. More precisely, let \(\mathcal{F}_t=\sigma(J_0,\ldots,J_t,w_0,\ldots,w_{t-1})\). Then, \(\mathbb{E}(w_t(i)\vert\mathcal{F}_t)=0\). In the next section, we make precise the claim that \(J_t\), governed by \(\eqref{eq:t12}\), converges to \(J_\pi\) when

  1. \(H\) is a contraction mapping with respect to a max-norm, with fixed point \(J_\pi\), i.e., \(HJ_\pi=J_\pi\);

  2. \(\mathbb{E}(w_t(i)\vert\mathcal{F}_t)=0\) and \(\mathbb{E}(w_t^2(i)\vert\mathcal{F}_t)\) is bounded above; and

  3. \(\sum_t\beta_t=\infty\), \(\sum_t{\beta_t}^2<\infty\).
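
A minimal sketch of the update \(\eqref{eq:t12}\) on a made-up 3-state discounted MDP; the transition matrix under \(\pi\), the cost vector, the discount factor and the step size \(\beta_t = 1/t^{0.6}\) (which satisfies the conditions above) are illustrative assumptions. Every state is updated in each round using an independently simulated next state.

```python
import numpy as np

# Minimal sketch of the stochastic policy-evaluation update on a made-up
# 3-state discounted MDP: every state i is updated in each round using an
# independently simulated next state drawn from P[i].
rng = np.random.default_rng(7)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])            # P[i, j] = p_ij(pi(i))
g = np.array([1.0, 2.0, 0.5])              # g(i, pi(i))
disc = 0.9                                 # discount factor (alpha in the text)

J = np.zeros(3)
for t in range(1, 20001):
    beta_t = 1.0 / t ** 0.6                # satisfies the step-size conditions
    for i in range(3):
        next_state = rng.choice(3, p=P[i])               # simulated next state
        J[i] += beta_t * (g[i] + disc * J[next_state] - J[i])

print(J)                                   # close to J_pi = (I - disc * P)^{-1} g
print(np.linalg.solve(np.eye(3) - disc * P, g))
```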

4.5 Convergence of stochastic fixed point iterations

For the analysis of RL algorithms in the next chapter, we state (without proof) a result for stochastic fixed point iterations; the reader is referred to Section 4.3 of Bertsekas and Tsitsiklis (1996) for the proof.

For ease of readability, we recall the update iteration below. \[\begin{align} \label{eq:stItAlgo} r_{m+1}(i) = r_{m}(i) + \beta_{m}((H r_{m})(i) + w_{m}(i) - r_m(i)), \text{ for }i = 1,...,n. \end{align}\] It helps to look at the per-component update because RL applications estimate the value function \(J_{\pi}(i)\) for each start state \(i\) using an iterate of the form \(r_{m}(i)\).

The underlying \(\sigma\)-field which captures the history of the algorithm until stage \(m\) is: \[\begin{align} \mathcal{F}_{m} =\sigma(r_{0}(i),...,r_{m}(i),w_{0}(i),...,w_{m-1}(i),i = 1,...,n). \end{align}\]

4.5.1 The contractive case

We first handle the case when \(H\) is contractive and subsequently, relax this requirement to monotonicity.

We define the weighted maximum norm by \[\begin{align} \left\Vert \gamma \right\Vert_{\xi} = \max_{i} \frac {|\gamma(i)|} {\xi(i)}. \end{align}\] If \(\xi(i)=1 \forall i\), then the resulting norm is the max-norm.

We assume that \(H\) is a weighted maximum norm pseudo-contraction, i.e., satisfying the following assumption:

(A1) There exists a positive vector \(\xi = (\xi(1),...,\xi(n))\) and a constant \(\beta \in [0,1)\) such that \[\begin{align} \left\Vert H r - r^{*}\right\Vert_{\xi} \leq \beta \left\Vert r - r^{*}\right\Vert_{\xi} \quad \forall r \in \R^n. \end{align}\] The pseudo-contraction condition can be written in more detail as \[\begin{align} \frac{\left| (H r)(i) - r^{*}(i)\right|}{\xi(i)} \leq \beta \max_j \frac{\left| r(j) - r^{*}(j)\right|}{\xi(j)} \hspace{1cm} \forall i, r. \end{align}\] It can be shown that \(r^{*}\) is the unique fixed point of \(H\), i.e., \(Hr^{*}=r^{*}\).

Contraction mappings trivially satisfy the property given above.

We can see the following connection to MDPs:

  1. In an SSP where all policies are proper, the Bellman operator is a contraction.

  2. In a discounted MDP with bounded single stage costs, the Bellman operator is a contraction.

We make the following assumption on the step size:

(A2) \(\sum_m \beta_m = \infty\) and \(\sum_m \beta_m^2 < \infty\).

We make the following assumption on the noise elements.

(A3) For all \(i, m\), there exist positive scalars \(A,B\) s.t. \[\begin{align} \E(w_m(i)\mid \F_m)=0 \text{ and } \E(w_m(i)^2\mid \F_m) \le A + B \norm{r_m}^2. \end{align}\] The assumption above requires \(w_{m}(i)\) to have zero conditional mean and places a bound on its conditional variance.

Let us consider a rough intuitive argument. Take the special case where \(r^{*}=0\), \(\xi(i)=1\) for all \(i\), \(w_m=0\) for all \(m\), and \(\beta_m=1\), so that \(r_{m+1}(i)=(Hr_{m})(i)\) whenever component \(i\) is updated. Suppose that the initial point \(r_0\) satisfies \(\left|r_{0}(i)\right| \leq C\) for all \(i\). Due to the pseudo-contraction condition, if the \(i\)th component is updated, then \(|r_{m+1}(i)|\leq \beta C< C\). This shows that, starting from the initial cube, any subsequent update of any component keeps us within the cube. Thus, \(r_m\) would eventually converge to the center \(r^*=0\).

Owing to the noise \(w_m\), for the algorithm \(\eqref{eq:stItAlgo}\), the shrinkage is not immediate, but the iterate \(r_m\) enters smaller and smaller boxes as the algorithm proceeds.

Theorem 4.3 Assume (A1)-(A3). Then, \(r_m\) governed by \(\eqref{eq:stItAlgo}\) converges to the fixed point \(r^*\) of \(H\) a.s. as \(m\rightarrow \infty\).

4.5.2 The monotone case

The motivation for this section is as follows: consider an SSP problem where there exists at least one proper policy and every improper policy has infinite cost. In such SSPs, the Bellman operator is not necessarily contractive, but it is still monotone.

Instead of (A1), which required \(H\) to be a pseudo-contraction, we assume the following:

(A1’) \(H\) is monotone, i.e., \(r \le r'\) implies \(Hr \le Hr'\), for all \(r,r'\). In addition, there exists an \(r^*\) s.t. \(Hr^*=r^*\) and \(r^*\) is unique. Letting \(e\) denote the vector of all ones in \(\R^n\), we have \[ Hr - \delta e \le H(r-\delta e) \le H(r+\delta e) \le Hr + \delta e, \] for all \(\delta>0\).

We have the following variant of Theorem 4.3 for the monotone case.

Theorem 4.4 Assume (A1’), (A2) and (A3). In addition, suppose \(r_m\) is bounded w.p. \(1\), i.e., \(\sup_{m,i} |r_m(i)| < \infty\). Then, \(r_m\) governed by \(\eqref{eq:stItAlgo}\) converges to the fixed point \(r^*\) of \(H\) a.s. as \(m\rightarrow \infty\).

Note that unlike Theorem 4.3, for the monotone case, we require the iterate \(r_m\) to satisfy a boundedness requirement.

We provide an additional assumption to ensure boundedness of the iterates.

(A4) \(\exists\) a positive vector \(\xi\), a scalar \(\beta \in [0,1)\) and a scalar \(D>0\) such that \[\begin{align} \left\Vert H r_m \right\Vert_{\xi} \leq \beta \left\Vert r_m \right\Vert_{\xi} + D, \forall m. \end{align}\]

Theorem 4.5 Under (A1’),(A2), (A3), (A4), \(r_m\) is bounded w.p. \(1\).

If \(H\) is a pseudo-contraction, then \[\begin{align*} \left\Vert H r_m \right\Vert_{\xi} = \left\Vert H r_m - r^{*} + r^{*} \right\Vert_{\xi} \leq \left\Vert H r_m - r^{*} \right\Vert_{\xi} + \left\Vert r^{*} \right\Vert_{\xi} \\ \leq \beta \left\Vert r_m - r^{*} \right\Vert_{\xi} + \left\Vert r^{*} \right\Vert_{\xi} \\ \leq \beta \left\Vert r_m \right\Vert_{\xi} + (1+\beta)\left\Vert r^{*} \right\Vert_{\xi} \end{align*}\]

So a pseudo-contractive \(H\) satisfies condition (A4) and hence, the iterate is bounded in the pseudo-contractive case.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. 1999. Coherent measures of risk. Mathematical finance 9, 3, 203–228.
Bertsekas, D. and Tsitsiklis, J. 1996. Neuro-dynamic programming. Athena Scientific.
Kushner, H. and Clark, D. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag.
Prashanth, L.A. and Bhatnagar, S. 2025. Gradient-based algorithms for zeroth-order optimization. Foundations and Trends in Optimization 8, 1–3, 1–332.
Robbins, H. and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
Rockafellar, R.T. and Uryasev, S. 2000. Optimization of conditional value-at-risk. Journal of risk 2, 21–42.