Convergence of stochastic gradient method

Shuvomoy Das Gupta

December 27, 2022

In this blog, we study the convergence of the stochastic gradient method.

Problem setup

We are interested in solving the problem

$$
p^\star = \left(
\begin{array}{ll}
\textup{minimize}_{x \in \mathbf{R}^d} & f(x) \\
\textup{subject to} & x \in C,
\end{array}
\right) \tag{$\mathcal{P}$}
$$

where we have the following assumptions regarding the nature of the problem.

Assumption 1

We assume:

  • $f : \mathbf{R}^d \to (-\infty, \infty]$ is a closed, proper, and convex function,
  • $C$ is a nonempty, closed, convex set with $C \subseteq \operatorname{int} \operatorname{dom} f$, and
  • the set of minimizers $X^\star = \operatorname*{argmin}_{x \in C} f(x)$ is nonempty; in what follows, $x^\star$ denotes an arbitrary element of $X^\star$.

Stochastic gradient descent

The stochastic gradient descent (SGD) algorithm to solve (𝒫) is described by Algorithm 1, where we make the following assumption regarding the nature of the oracle.

Assumption 2

We assume that, given an iterate $x_k$, the stochastic oracle is capable of producing a random vector $g_k$ with the following properties:

  • (unbiased) for all $k \geq 0$: $\mathbf{E}\left[g_k \mid x_k\right] \in \partial f(x_k)$, and
  • (bounded variance) there exists $G > 0$ such that for all $k \geq 0$: $\mathbf{E}\left[\left\| g_k \right\|^2 \mid x_k\right] \leq G^2$.
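As a concrete illustration (my example, not from the original post), consider the finite-sum objective $f(x) = \frac{1}{n}\sum_{i=1}^{n} |a_i^\top x - b_i|$: sampling an index $i$ uniformly at random and returning a subgradient of $|a_i^\top x - b_i|$ gives an unbiased oracle, and $\|g_k\| \leq \max_i \|a_i\|$, so Assumption 2 holds with $G = \max_i \|a_i\|$. A minimal sketch in Python, with synthetic data:

```python
import numpy as np

# Synthetic problem data (illustrative only): f(x) = (1/n) * sum_i |a_i^T x - b_i|.
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def stochastic_subgradient(x):
    """Stochastic oracle: pick i uniformly, return a subgradient of |a_i^T x - b_i|.

    E[g | x] = (1/n) * sum_i sign(a_i^T x - b_i) * a_i, which lies in the
    subdifferential of f at x, so the oracle is unbiased."""
    i = rng.integers(n)
    return np.sign(A[i] @ x - b[i]) * A[i]

G = np.linalg.norm(A, axis=1).max()  # ||g|| <= max_i ||a_i||, so E[||g||^2 | x] <= G^2
```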

___________________________________________________________

input: $f$, $C$, iteration limit $K$

___________________________________________________________

algorithm:

1. initialization:

pick $x_0 \in C$ arbitrarily

2. main iteration:

for $k = 0, 1, 2, \ldots, K-1$

pick stepsize $\alpha_k > 0$ and random $g_k \in \mathbf{R}^d$ satisfying Assumption 2

$x_{k+1} \leftarrow \Pi_C\left(x_k - \alpha_k g_k\right)$ /* $\Pi_C$: projection onto the set $C$ */

end for

3. return $x_K$

___________________________________________________________

Algorithm 1: SGD to solve (𝒫)
___________________________________________________________
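For reference, here is one way Algorithm 1 could be written in Python; the function and argument names are my own, and the oracle, projection, and stepsizes are passed in by the caller (a sketch under the assumptions above, not code from the post):

```python
import numpy as np

def projected_sgd(subgrad_oracle, project, x0, stepsizes):
    """Algorithm 1 (SGD): x_{k+1} = Pi_C(x_k - alpha_k * g_k), k = 0, ..., K-1.

    subgrad_oracle(x): returns a random g_k satisfying Assumption 2,
    project(y):        returns Pi_C(y), the Euclidean projection onto C,
    x0:                a point in C,
    stepsizes:         positive stepsizes alpha_0, ..., alpha_{K-1}.
    """
    x = np.asarray(x0, dtype=float)
    for alpha in stepsizes:
        g = subgrad_oracle(x)          # stochastic subgradient at x_k
        x = project(x - alpha * g)     # projected step
    return x                           # x_K
```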

Convergence analysis

First, note that, for all $k \geq 0$:

$$
\begin{aligned}
& \mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2 \mid x_k\right] \\
={} & \mathbf{E}\left[\left\| \Pi_C\left(x_k - \alpha_k g_k\right) - \Pi_C(x^\star) \right\|^2 \mid x_k\right] && \text{using } x_{k+1} = \Pi_C\left(x_k - \alpha_k g_k\right) \text{ and } x^\star = \Pi_C(x^\star) \\
\leq{} & \mathbf{E}\left[\left\| x_k - \alpha_k g_k - x^\star \right\|^2 \mid x_k\right] && \text{using nonexpansiveness of } \Pi_C \\
={} & \mathbf{E}\left[\left\| x_k - x^\star \right\|^2 + \alpha_k^2 \left\| g_k \right\|^2 - 2\alpha_k \left\langle x_k - x^\star ; g_k \right\rangle \mid x_k\right] \\
={} & \mathbf{E}\left[\left\| x_k - x^\star \right\|^2 \mid x_k\right] + \alpha_k^2 \mathbf{E}\left[\left\| g_k \right\|^2 \mid x_k\right] - 2\alpha_k \mathbf{E}\left[\left\langle x_k - x^\star ; g_k \right\rangle \mid x_k\right] && \text{using linearity of expectation} \\
={} & \left\| x_k - x^\star \right\|^2 + \alpha_k^2 \mathbf{E}\left[\left\| g_k \right\|^2 \mid x_k\right] - 2\alpha_k \left\langle x_k - x^\star ; \mathbf{E}\left[g_k \mid x_k\right] \right\rangle && \text{using the "taking out what's known" rule } \mathbf{E}\left[h(X) Y \mid X\right] = h(X)\mathbf{E}\left[Y \mid X\right] \\
\leq{} & \left\| x_k - x^\star \right\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left\langle x_k - x^\star ; \mathbf{E}\left[g_k \mid x_k\right] \right\rangle && \text{using } \mathbf{E}\left[\left\| g_k \right\|^2 \mid x_k\right] \leq G^2 \\
\leq{} & \left\| x_k - x^\star \right\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right),
\end{aligned}
$$

where the last inequality uses the subgradient inequality: since $\mathbf{E}\left[g_k \mid x_k\right] \in \partial f(x_k)$, for all $y$ we have $f(y) \geq f(x_k) + \left\langle \mathbf{E}\left[g_k \mid x_k\right] ; y - x_k \right\rangle$, and setting $y = x^\star$ gives $\left\langle \mathbf{E}\left[g_k \mid x_k\right] ; x_k - x^\star \right\rangle \geq f(x_k) - f(x^\star)$.

So, we have proved

$$
\mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2 \mid x_k\right] \leq \left\| x_k - x^\star \right\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right),
$$

so taking expectation with respect to xk on both sides, we get:

$$
\begin{aligned}
\mathbf{E}\left[\mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2 \mid x_k\right]\right] &= \mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2\right] && \text{using Adam's law } \mathbf{E}\left[\mathbf{E}\left[Y \mid X\right]\right] = \mathbf{E}\left[Y\right] \\
&\leq \mathbf{E}\left[\left\| x_k - x^\star \right\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right)\right] \\
&= \mathbf{E}\left[\left\| x_k - x^\star \right\|^2\right] - 2\alpha_k \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2,
\end{aligned}
$$

so

$$
\mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2\right] \leq \mathbf{E}\left[\left\| x_k - x^\star \right\|^2\right] - 2\alpha_k \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2.
$$

Now, let us do a telescoping sum:

$$
\begin{aligned}
\mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2\right] &\leq \mathbf{E}\left[\left\| x_k - x^\star \right\|^2\right] - 2\alpha_k \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2 \\
\mathbf{E}\left[\left\| x_k - x^\star \right\|^2\right] &\leq \mathbf{E}\left[\left\| x_{k-1} - x^\star \right\|^2\right] - 2\alpha_{k-1} \mathbf{E}\left[f(x_{k-1}) - f(x^\star)\right] + \alpha_{k-1}^2 G^2 \\
&\;\;\vdots \\
\mathbf{E}\left[\left\| x_{m+1} - x^\star \right\|^2\right] &\leq \mathbf{E}\left[\left\| x_m - x^\star \right\|^2\right] - 2\alpha_m \mathbf{E}\left[f(x_m) - f(x^\star)\right] + \alpha_m^2 G^2,
\end{aligned}
$$

where $m \in \{0, \ldots, k\}$,

and adding the inequalities above, we get:

$$
\begin{aligned}
& \mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2\right] \leq \mathbf{E}\left[\left\| x_m - x^\star \right\|^2\right] - 2\sum_{i=m}^{k}\alpha_i \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k}\alpha_i^2 \\
\Rightarrow\; & 0 \leq \mathbf{E}\left[\left\| x_{k+1} - x^\star \right\|^2\right] \leq \mathbf{E}\left[\left\| x_m - x^\star \right\|^2\right] - 2\sum_{i=m}^{k}\alpha_i \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k}\alpha_i^2 \\
\Rightarrow\; & \sum_{i=m}^{k}\alpha_i \mathbf{E}\left[f(x_i) - f(x^\star)\right] \leq \frac{1}{2}\left(\mathbf{E}\left[\left\| x_m - x^\star \right\|^2\right] + G^2 \sum_{i=m}^{k}\alpha_i^2\right) \\
\Rightarrow\; & \left(\sum_{i=m}^{k}\alpha_i\right)\left(\min_{i \in \{m,\ldots,k\}} \mathbf{E}\left[f(x_i) - f(x^\star)\right]\right) \leq \frac{1}{2}\left(\mathbf{E}\left[\left\| x_m - x^\star \right\|^2\right] + G^2 \sum_{i=m}^{k}\alpha_i^2\right) && \text{using: for } b_k \geq 0, \left(\min_k a_k\right)\sum_k b_k \leq \sum_k a_k b_k \\
\Rightarrow\; & \mathbf{E}\left[\min_{i \in \{m,\ldots,k\}}\left\{f(x_i) - f(x^\star)\right\}\right] \leq \min_{i \in \{m,\ldots,k\}} \mathbf{E}\left[f(x_i) - f(x^\star)\right] \leq \frac{\mathbf{E}\left[\left\| x_m - x^\star \right\|^2\right] + G^2 \sum_{i=m}^{k}\alpha_i^2}{2\sum_{i=m}^{k}\alpha_i} && \text{using } \mathbf{E}\left[\min_i X_i\right] \leq \min_i \mathbf{E}\left[X_i\right].
\end{aligned}
$$

In the last inequality, $m$ is arbitrary, so set $m = 0$, which leads to:

$$
\mathbf{E}\left[\min_{i \in \{0,\ldots,k\}} f(x_i)\right] - f(x^\star) \leq \frac{\mathbf{E}\left[\left\| x_0 - x^\star \right\|^2\right] + G^2 \sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i},
$$

so if we have $\sum_{i=0}^{\infty}\alpha_i^2 < \infty$ and $\sum_{i=0}^{\infty}\alpha_i = \infty$, then, as $k \to \infty$, we have

$$
\mathbf{E}\left[\min_{i \in \{0,\ldots,k\}} f(x_i)\right] \to f(x^\star).
$$
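For example, the stepsize choice $\alpha_k = \gamma/(k+1)$ with $\gamma > 0$ satisfies both conditions: $\sum_k \alpha_k = \infty$ while $\sum_k \alpha_k^2 < \infty$. Below is a small self-contained sanity check (my own toy instance, not from the post) that runs projected SGD with such stepsizes on $f(x) = \frac{1}{n}\sum_i |a_i^\top x - b_i|$ over a Euclidean ball and tracks the best objective value seen so far, which should drift toward the optimal value as the iteration count grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, R = 200, 10, 5.0                      # toy instance; R is the ball radius
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

f = lambda x: np.mean(np.abs(A @ x - b))    # objective f
def oracle(x):                              # unbiased stochastic subgradient (Assumption 2)
    i = rng.integers(n)
    return np.sign(A[i] @ x - b[i]) * A[i]
def project(y):                             # projection onto C = {x : ||x|| <= R}
    nrm = np.linalg.norm(y)
    return y if nrm <= R else (R / nrm) * y

x = project(rng.standard_normal(d))         # x_0 in C
best = f(x)
for k in range(50_000):
    alpha = 1.0 / (k + 1)                   # sum alpha_k diverges, sum alpha_k^2 converges
    x = project(x - alpha * oracle(x))
    best = min(best, f(x))                  # tracks min_{i <= k} f(x_i)
print(f"best objective value seen: {best:.4f}")
```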

Pdf

A pdf of the blog is available here.