Problem setup

We are interested in solving the problem
$$
p^\star = \left(
\begin{array}{ll}
\underset{x \in \mathbb{R}^d}{\text{minimize}} & f(x)\\
\text{subject to} & x \in C,
\end{array}
\right) \tag{$\mathcal{P}$}
$$
where we have the following assumptions regarding the nature of the problem.
Assumption 1 We assume:
- $f : \mathbb{R}^d \to (-\infty, \infty]$ is a closed, proper, and convex function,
- $C$ is a nonempty, closed, convex set with $C \subseteq \operatorname{int} \operatorname{dom} f$, and
- $X^\star = \operatorname*{argmin}_{x \in C} f(x) \neq \emptyset$.
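For instance (an illustrative example of my own, not part of the original setup), $f(x) = \frac{1}{2n}\|Ax - b\|^2$ with $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, together with $C = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$, satisfies Assumption 1: $f$ is finite everywhere, so $C \subseteq \operatorname{int} \operatorname{dom} f = \mathbb{R}^d$, and the minimum over the compact set $C$ is attained.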
Stochastic gradient descent

The stochastic gradient descent (SGD) algorithm to solve $(\mathcal{P})$ is described in Algorithm 1, where we make the following assumption regarding the nature of the oracle.
Assumption 2 We assume that, given an iterate $x_k$, the stochastic oracle is capable of producing a random vector $g_k$ with the following properties:

- (unbiasedness) for all $k \ge 0$: $\mathbf{E}[g_k \mid x_k] \in \partial f(x_k)$, and
- (bounded second moment) there exists $G > 0$ such that for all $k \ge 0$: $\mathbf{E}[\|g_k\|^2 \mid x_k] \le G^2$.
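To make Assumption 2 concrete, here is a minimal sketch of such an oracle in Python for the finite-sum least-squares example above (my own illustration; `make_oracle` is a hypothetical helper, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_oracle(A, b):
    """Stochastic gradient oracle for f(x) = (1/(2n)) * ||A @ x - b||**2.

    Sampling a row i uniformly and returning A[i] * (A[i] @ x - b[i])
    gives E[g | x] = (1/n) * A.T @ (A @ x - b) = grad f(x) (unbiasedness).
    """
    n = A.shape[0]

    def oracle(x):
        i = rng.integers(n)              # pick a term uniformly at random
        return A[i] * (A[i] @ x - b[i])  # gradient of the sampled term

    return oracle
```

Here unbiasedness holds exactly, and since the iterates of Algorithm 1 stay in the bounded set $C$, the conditional second moment admits a uniform bound $G^2$.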
___________________________________________________________
input: $f$, $C$, iteration limit $K$
___________________________________________________________
algorithm:
1. initialization:
    pick $x_0 \in C$ arbitrarily
2. main iteration:
    for $k = 0, 1, 2, \ldots, K-1$
        pick stepsize $\alpha_k > 0$ and random $g_k \in \mathbb{R}^d$ satisfying Assumption 2
        $x_{k+1} \leftarrow \Pi_C(x_k - \alpha_k g_k)$ /* $\Pi_C$: projection onto the set $C$ */
    end for
3. return $x_K$
___________________________________________________________
Algorithm 1: SGD to solve $(\mathcal{P})$
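Before the analysis, here is a minimal runnable sketch of Algorithm 1 (my own illustration; `project_ball` and the callable `stepsize` are hypothetical names, and the projection is specialized to the unit-ball example rather than a general $C$):

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto C = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def sgd(oracle, project, x0, stepsize, K):
    """Algorithm 1: projected SGD; `stepsize(k)` returns alpha_k > 0."""
    x = x0
    iterates = [x0]
    for k in range(K):
        g = oracle(x)                     # random g_k satisfying Assumption 2
        x = project(x - stepsize(k) * g)  # x_{k+1} = Pi_C(x_k - alpha_k * g_k)
        iterates.append(x)
    return x, iterates
```

For example, `sgd(make_oracle(A, b), project_ball, np.zeros(d), lambda k: 1.0 / (k + 1), 10_000)` runs the method with the stepsizes used later in the analysis.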
Convergence analysis

First, note that, for all $k \ge 0$:

$$
\begin{aligned}
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2 \mid x_k\right]
&= \mathbf{E}\left[\|\Pi_C(x_k - \alpha_k g_k) - \Pi_C(x^\star)\|^2 \mid x_k\right]
&& \triangleright\ x_{k+1} = \Pi_C(x_k - \alpha_k g_k) \text{ and } x^\star = \Pi_C(x^\star)\\
&\le \mathbf{E}\left[\|x_k - \alpha_k g_k - x^\star\|^2 \mid x_k\right]
&& \triangleright\ \text{nonexpansiveness of } \Pi_C\\
&= \mathbf{E}\left[\|x_k - x^\star\|^2 + \alpha_k^2 \|g_k\|^2 - 2\alpha_k \langle x_k - x^\star; g_k \rangle \mid x_k\right]\\
&= \mathbf{E}\left[\|x_k - x^\star\|^2 \mid x_k\right] + \alpha_k^2\, \mathbf{E}\left[\|g_k\|^2 \mid x_k\right] - 2\alpha_k\, \mathbf{E}\left[\langle x_k - x^\star; g_k \rangle \mid x_k\right]
&& \triangleright\ \text{linearity of expectation}\\
&= \|x_k - x^\star\|^2 + \alpha_k^2\, \mathbf{E}\left[\|g_k\|^2 \mid x_k\right] - 2\alpha_k \langle x_k - x^\star; \mathbf{E}[g_k \mid x_k] \rangle
&& \triangleright\ \text{``taking out what's known'': } \mathbf{E}[h(X)Y \mid X] = h(X)\, \mathbf{E}[Y \mid X]\\
&\le \|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \langle x_k - x^\star; \mathbf{E}[g_k \mid x_k] \rangle
&& \triangleright\ \mathbf{E}[\|g_k\|^2 \mid x_k] \le G^2\\
&\le \|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right),
\end{aligned}
$$

where the last step uses $\mathbf{E}[g_k \mid x_k] \in \partial f(x_k)$: the subgradient inequality $f(y) \ge f(x_k) + \langle \mathbf{E}[g_k \mid x_k]; y - x_k \rangle$ holds for all $y$, so setting $y \leftarrow x^\star$ gives $-\langle x_k - x^\star; \mathbf{E}[g_k \mid x_k] \rangle \le f(x^\star) - f(x_k)$. So, we have proved
$$
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2 \mid x_k\right] \le \|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right),
$$
so taking expectation with respect to $x_k$ on both sides, we get:
$$
\begin{aligned}
\mathbf{E}\left[\mathbf{E}\left[\|x_{k+1} - x^\star\|^2 \mid x_k\right]\right]
&= \mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right]
&& \triangleright\ \text{Adam's law } \mathbf{E}[\mathbf{E}[Y \mid X]] = \mathbf{E}[Y]\\
&\le \mathbf{E}\left[\|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right)\right]\\
&= \mathbf{E}\left[\|x_k - x^\star\|^2\right] - 2\alpha_k\, \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2,
\end{aligned}
$$

so
$$
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_k - x^\star\|^2\right] \le -2\alpha_k\, \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2.
$$
Now, let us do a telescoping sum:
$$
\begin{aligned}
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_k - x^\star\|^2\right] &\le -2\alpha_k\, \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2\\
\mathbf{E}\left[\|x_k - x^\star\|^2\right] - \mathbf{E}\left[\|x_{k-1} - x^\star\|^2\right] &\le -2\alpha_{k-1}\, \mathbf{E}\left[f(x_{k-1}) - f(x^\star)\right] + \alpha_{k-1}^2 G^2\\
&\ \ \vdots\\
\mathbf{E}\left[\|x_{m+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_m - x^\star\|^2\right] &\le -2\alpha_m\, \mathbf{E}\left[f(x_m) - f(x^\star)\right] + \alpha_m^2 G^2,
\end{aligned}
$$

and adding the inequalities above (the left-hand sides telescope), we get:
$$
\begin{aligned}
& \mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_m - x^\star\|^2\right] \le -2\sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\\
\Leftrightarrow\ & 0 \le \mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] \le \mathbf{E}\left[\|x_m - x^\star\|^2\right] - 2\sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\\
\Rightarrow\ & 0 \le \mathbf{E}\left[\|x_m - x^\star\|^2\right] - 2\sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\\
\Leftrightarrow\ & \sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] \le \frac{1}{2}\left(\mathbf{E}\left[\|x_m - x^\star\|^2\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\right)\\
\Rightarrow\ & \left(\sum_{i=m}^{k} \alpha_i\right) \left(\min_{i \in \{m,\ldots,k\}} \mathbf{E}\left[f(x_i) - f(x^\star)\right]\right) \le \frac{1}{2}\left(\mathbf{E}\left[\|x_m - x^\star\|^2\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\right)
&& \triangleright\ \text{for } b_i \ge 0:\ (\min_i a_i) \textstyle\sum_i b_i \le \sum_i a_i b_i\\
\Rightarrow\ & \mathbf{E}\left[\min_{i \in \{m,\ldots,k\}} \left\{f(x_i) - f(x^\star)\right\}\right] \le \min_{i \in \{m,\ldots,k\}} \mathbf{E}\left[f(x_i) - f(x^\star)\right] \le \frac{\mathbf{E}\left[\|x_m - x^\star\|^2\right] + G^2 \sum_{i=m}^{k} \alpha_i^2}{2 \sum_{i=m}^{k} \alpha_i}.
&& \triangleright\ \text{using } \mathbf{E}[\min_i X_i] \le \min_i \mathbf{E}[X_i]
\end{aligned}
$$

In the last inequality, $m$ is arbitrary, so set $m \leftarrow 0$, which leads to:
$$
\mathbf{E}\left[\min_{i \in \{0,\ldots,k\}} f(x_i)\right] - f(x^\star) \le \frac{\mathbf{E}\left[\|x_0 - x^\star\|^2\right] + G^2 \sum_{i=0}^{k} \alpha_i^2}{2 \sum_{i=0}^{k} \alpha_i},
$$
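As a quick sanity check of this bound (a worked specialization I am adding, not part of the original derivation): with the constant stepsize $\alpha_i = \alpha > 0$ for $i = 0, \ldots, k$, the right-hand side becomes

$$
\frac{\mathbf{E}\left[\|x_0 - x^\star\|^2\right]}{2\alpha(k+1)} + \frac{G^2 \alpha}{2},
$$

and minimizing over $\alpha$ gives $\alpha = \sqrt{\mathbf{E}[\|x_0 - x^\star\|^2]}\,/\,(G\sqrt{k+1})$ and the familiar $O(1/\sqrt{k+1})$ rate for a fixed horizon $k$; a constant stepsize, however, does not satisfy the summability conditions that follow.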
so if the stepsizes satisfy $\sum_{i=0}^{\infty} \alpha_i^2 < \infty$ and $\sum_{i=0}^{\infty} \alpha_i = \infty$ (square-summable but not summable, e.g., $\alpha_i = 1/(i+1)$), then letting $k \to \infty$ we have
$$
\mathbf{E}\left[\min_{i \in \{0,\ldots,k\}} f(x_i)\right] \to f(x^\star).
$$
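To see the statement numerically, here is a self-contained sketch (my own experiment under the same illustrative least-squares-over-a-ball assumptions; values are indicative, not from the original post) that runs Algorithm 1 with $\alpha_k = 1/(k+1)$ and tracks the best objective value seen:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic instance: f(x) = (1/(2n)) * ||A @ x - b||**2 over the unit ball C.
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
x_true *= 0.5 / np.linalg.norm(x_true)  # ||x_true|| = 0.5 < 1, so x_star = x_true, f(x_star) = 0
b = A @ x_true

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
project = lambda x: x if np.linalg.norm(x) <= 1.0 else x / np.linalg.norm(x)

x = np.zeros(d)   # x_0 in C
best = f(x)       # tracks min_{i <= k} f(x_i)
for k in range(20000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ x - b[i])      # unbiased stochastic gradient
    x = project(x - g / (k + 1))      # alpha_k = 1/(k+1): square-summable, not summable
    best = min(best, f(x))

print(f"min_i f(x_i) = {best:.6f}")   # should drift toward f(x_star) = 0
```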
PDF

A PDF of this blog post is available here.