Convergence of stochastic gradient method

Shuvomoy Das Gupta

December 27, 2022

In this blog, we study the convergence of the stochastic gradient method.

Problem setup

We are interested in solving the problem

$$
p^\star = \left(
\begin{array}{ll}
\underset{x \in \mathbf{R}^d}{\text{minimize}} & f(x) \\
\text{subject to} & x \in C,
\end{array}
\right)
\qquad (\mathcal{P})
$$

where we have the following assumptions regarding the nature of the problem.

Assumption 1

We assume:

  • $f: \mathbf{R}^d \to (-\infty, \infty]$ is a closed, proper, and convex function,
  • $C$ is a nonempty, closed, convex set with $C \subseteq \operatorname{int} \operatorname{dom} f$, and
  • the set of minimizers $X^\star := \operatorname{argmin}_{x \in C} f(x)$ is nonempty.

Stochastic gradient descent

The stochastic gradient descent (SGD) algorithm to solve $(\mathcal{P})$ is described in Algorithm 1, where we make the following assumption regarding the nature of the oracle.

Assumption 2

We assume that, given an iterate $x_k$, the stochastic oracle is capable of producing a random vector $g_k$ with the following properties:

  • (unbiased) for all $k \ge 0$: $\mathbf{E}\left[g_k \mid x_k\right] \in \partial f(x_k)$, and
  • (bounded variance) there exists $G > 0$ such that for all $k \ge 0$: $\mathbf{E}\left[\|g_k\|^2 \mid x_k\right] \le G^2$.
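
As a concrete illustration of such an oracle (my example, not from the original post), consider the least-squares objective $f(x) = \frac{1}{2n}\|Ax - b\|^2$: sampling a uniformly random row of $(A, b)$ and taking the gradient of the corresponding single-term loss gives an unbiased estimate of $\nabla f(x)$. A minimal Python sketch, with hypothetical problem data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem data (illustration only): the least-squares
# objective f(x) = (1/2n) * ||A x - b||^2.
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def f(x):
    """Full objective f(x) = (1/2n) * ||A x - b||^2."""
    r = A @ x - b
    return 0.5 * (r @ r) / n

def stochastic_gradient(x):
    """Unbiased oracle: returns the gradient of (1/2) * (a_i^T x - b_i)^2
    for a uniformly sampled row i, so E[g | x] = (1/n) A^T (A x - b) = grad f(x)."""
    i = rng.integers(n)
    return (A[i] @ x - b[i]) * A[i]
```

Since the per-row gradients here are continuous in $x$, their second moment is bounded on any bounded set, so the bounded-variance condition holds once the iterates are confined to a bounded $C$, as they are in Algorithm 1 below.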

___________________________________________________________

input: $f$, $C$, iteration limit $K$

___________________________________________________________

algorithm:

1. initialization:

pick $x_0 \in C$ arbitrarily

2. main iteration:

for $k = 0, 1, 2, \ldots, K-1$

pick stepsize $\alpha_k > 0$ and random $g_k \in \mathbf{R}^d$ satisfying Assumption 2

$x_{k+1} \leftarrow \Pi_C(x_k - \alpha_k g_k)$ /* $\Pi_C$: projection onto the set $C$ */

end for

3. return $x_K$

___________________________________________________________

Algorithm 1: SGD to solve $(\mathcal{P})$
___________________________________________________________
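
To make the pseudocode concrete, here is a minimal Python sketch of Algorithm 1 (my addition, not from the original post), reusing the hypothetical stochastic_gradient oracle above and taking $C$ to be a Euclidean ball of radius $R$, whose projection has the closed form $\Pi_C(y) = \min\{1, R/\|y\|\}\, y$:

```python
def project_ball(y, R=10.0):
    """Pi_C: projection onto the Euclidean ball C = {x : ||x|| <= R}."""
    norm = np.linalg.norm(y)
    return y if norm <= R else (R / norm) * y

def sgd(x0, stepsize, K):
    """Projected SGD (Algorithm 1): x_{k+1} = Pi_C(x_k - alpha_k * g_k)."""
    x = project_ball(np.asarray(x0, dtype=float))
    iterates = [x]
    for k in range(K):
        g = stochastic_gradient(x)            # oracle satisfying Assumption 2
        x = project_ball(x - stepsize(k) * g)
        iterates.append(x)
    return iterates
```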

Convergence analysis

First, note that, for all $k \ge 0$:

$$
\begin{aligned}
\mathbf{E}\left[\|x_{k+1}-x^\star\|^2 \mid x_k\right]
&= \mathbf{E}\left[\|\Pi_C(x_k-\alpha_k g_k)-\Pi_C(x^\star)\|^2 \mid x_k\right] && \text{using } x_{k+1}=\Pi_C(x_k-\alpha_k g_k) \text{ and } x^\star=\Pi_C(x^\star) \\
&\le \mathbf{E}\left[\|x_k-\alpha_k g_k-x^\star\|^2 \mid x_k\right] && \text{nonexpansiveness of } \Pi_C \\
&= \mathbf{E}\left[\|x_k-x^\star\|^2+\alpha_k^2\|g_k\|^2-2\alpha_k\langle x_k-x^\star; g_k\rangle \mid x_k\right] \\
&= \|x_k-x^\star\|^2+\alpha_k^2\,\mathbf{E}\left[\|g_k\|^2 \mid x_k\right]-2\alpha_k\,\mathbf{E}\left[\langle x_k-x^\star; g_k\rangle \mid x_k\right] && \text{using linearity of expectation} \\
&= \|x_k-x^\star\|^2+\alpha_k^2\,\mathbf{E}\left[\|g_k\|^2 \mid x_k\right]-2\alpha_k\langle x_k-x^\star; \mathbf{E}\left[g_k \mid x_k\right]\rangle && \text{``taking out what's known'': } \mathbf{E}[h(X)Y\mid X]=h(X)\,\mathbf{E}[Y\mid X] \\
&\le \|x_k-x^\star\|^2+\alpha_k^2 G^2-2\alpha_k\langle x_k-x^\star; \mathbf{E}\left[g_k \mid x_k\right]\rangle && \text{using } \mathbf{E}\left[\|g_k\|^2\mid x_k\right]\le G^2 \\
&\le \|x_k-x^\star\|^2+\alpha_k^2 G^2-2\alpha_k\left(f(x_k)-f(x^\star)\right),
\end{aligned}
$$

where the last inequality uses the subgradient inequality: since $\mathbf{E}[g_k \mid x_k] \in \partial f(x_k)$, for all $y$ we have $f(y) \ge f(x_k) + \langle \mathbf{E}[g_k \mid x_k]; y-x_k\rangle$, and setting $y := x^\star$ gives $\langle \mathbf{E}[g_k \mid x_k]; x_k-x^\star\rangle \ge f(x_k)-f(x^\star)$.

So, we have proved

$$
\mathbf{E}\left[\|x_{k+1}-x^\star\|^2 \mid x_k\right] \le \|x_k-x^\star\|^2+\alpha_k^2 G^2-2\alpha_k\left(f(x_k)-f(x^\star)\right),
$$

so taking expectation with respect to $x_k$ on both sides, we get:

$$
\begin{aligned}
\mathbf{E}\left[\mathbf{E}\left[\|x_{k+1}-x^\star\|^2 \mid x_k\right]\right] &= \mathbf{E}\left[\|x_{k+1}-x^\star\|^2\right] && \text{Adam's law: } \mathbf{E}\left[\mathbf{E}[Y\mid X]\right]=\mathbf{E}[Y] \\
\mathbf{E}\left[\|x_k-x^\star\|^2+\alpha_k^2 G^2-2\alpha_k\left(f(x_k)-f(x^\star)\right)\right] &= \mathbf{E}\left[\|x_k-x^\star\|^2\right]-2\alpha_k\,\mathbf{E}\left[f(x_k)-f(x^\star)\right]+\alpha_k^2 G^2,
\end{aligned}
$$

so

$$
\mathbf{E}\left[\|x_{k+1}-x^\star\|^2\right] \le \mathbf{E}\left[\|x_k-x^\star\|^2\right]-2\alpha_k\,\mathbf{E}\left[f(x_k)-f(x^\star)\right]+\alpha_k^2 G^2.
$$
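
Before telescoping, one can sanity-check this one-step bound numerically (my addition, not in the original post): the conditional inequality proved above can be estimated by Monte Carlo at a fixed $x_k$, with the empirical second moment of $g_k$ standing in for $G^2$ (a valid, tighter stand-in per the derivation), and with $x^\star$ taken as the unconstrained least-squares solution, assumed to lie inside $C$:

```python
# Monte Carlo sanity check of the one-step inequality at a fixed x_k,
# using the empirical second moment of g_k in place of G^2.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # assumed to lie inside C
x_k, alpha = rng.standard_normal(d), 0.1
samples = [stochastic_gradient(x_k) for _ in range(100_000)]
lhs = np.mean([np.sum((project_ball(x_k - alpha * g) - x_star) ** 2)
               for g in samples])
rhs = (np.sum((x_k - x_star) ** 2)
       + alpha ** 2 * np.mean([g @ g for g in samples])
       - 2 * alpha * (f(x_k) - f(x_star)))
print(lhs <= rhs)   # expected: True (up to Monte Carlo error)
```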

Now, let us do a telescoping sum:

$$
\begin{aligned}
\mathbf{E}\left[\|x_{k+1}-x^\star\|^2\right] &\le \mathbf{E}\left[\|x_k-x^\star\|^2\right]-2\alpha_k\,\mathbf{E}\left[f(x_k)-f(x^\star)\right]+\alpha_k^2 G^2 \\
\mathbf{E}\left[\|x_k-x^\star\|^2\right] &\le \mathbf{E}\left[\|x_{k-1}-x^\star\|^2\right]-2\alpha_{k-1}\,\mathbf{E}\left[f(x_{k-1})-f(x^\star)\right]+\alpha_{k-1}^2 G^2 \\
&\;\,\vdots \\
\mathbf{E}\left[\|x_{m+1}-x^\star\|^2\right] &\le \mathbf{E}\left[\|x_m-x^\star\|^2\right]-2\alpha_m\,\mathbf{E}\left[f(x_m)-f(x^\star)\right]+\alpha_m^2 G^2,
\end{aligned}
$$

and adding the inequalities above, we get:

$$
\begin{aligned}
& \mathbf{E}\left[\|x_{k+1}-x^\star\|^2\right] \le \mathbf{E}\left[\|x_m-x^\star\|^2\right]-2\sum_{i=m}^{k}\alpha_i\,\mathbf{E}\left[f(x_i)-f(x^\star)\right]+G^2\sum_{i=m}^{k}\alpha_i^2 \\
\Rightarrow\; & 0 \le \mathbf{E}\left[\|x_m-x^\star\|^2\right]-2\sum_{i=m}^{k}\alpha_i\,\mathbf{E}\left[f(x_i)-f(x^\star)\right]+G^2\sum_{i=m}^{k}\alpha_i^2 && \text{as } \mathbf{E}\left[\|x_{k+1}-x^\star\|^2\right]\ge 0 \\
\Rightarrow\; & \sum_{i=m}^{k}\alpha_i\,\mathbf{E}\left[f(x_i)-f(x^\star)\right] \le \frac{1}{2}\left(\mathbf{E}\left[\|x_m-x^\star\|^2\right]+G^2\sum_{i=m}^{k}\alpha_i^2\right) \\
\Rightarrow\; & \left(\sum_{i=m}^{k}\alpha_i\right)\left(\min_{i\in\{m,\ldots,k\}}\mathbf{E}\left[f(x_i)-f(x^\star)\right]\right) \le \frac{1}{2}\left(\mathbf{E}\left[\|x_m-x^\star\|^2\right]+G^2\sum_{i=m}^{k}\alpha_i^2\right) && \text{for } b_k\ge 0:\ \left(\min_k a_k\right)\textstyle\sum_k b_k \le \sum_k a_k b_k \\
\Rightarrow\; & \mathbf{E}\left[\min_{i\in\{m,\ldots,k\}}\left\{f(x_i)-f(x^\star)\right\}\right] \le \min_{i\in\{m,\ldots,k\}}\mathbf{E}\left[f(x_i)-f(x^\star)\right] \le \frac{\mathbf{E}\left[\|x_m-x^\star\|^2\right]+G^2\sum_{i=m}^{k}\alpha_i^2}{2\sum_{i=m}^{k}\alpha_i} && \text{using } \mathbf{E}\left[\min_i X_i\right]\le\min_i\mathbf{E}\left[X_i\right]
\end{aligned}
$$

In the last inequality, $m$ is arbitrary, so set $m = 0$, which leads to:

$$
\mathbf{E}\left[\min_{i\in\{0,\ldots,k\}} f(x_i)\right] - f(x^\star) \le \frac{\mathbf{E}\left[\|x_0-x^\star\|^2\right]+G^2\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i},
$$

so if the stepsizes satisfy $\sum_{i=0}^{\infty}\alpha_i^2 < \infty$ and $\sum_{i=0}^{\infty}\alpha_i = \infty$ (e.g., $\alpha_i = 1/(i+1)$), then the right-hand side vanishes as $k \to \infty$, and we have

$$
\mathbf{E}\left[\min_{i\in\{0,\ldots,k\}} f(x_i)\right] \to f(x^\star) \quad \text{as } k \to \infty.
$$
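
As a quick empirical illustration (my addition), running the sketch above with the divergent-but-square-summable stepsizes $\alpha_k = 1/(k+1)$ on the hypothetical least-squares problem shows the best objective value among the iterates approaching the minimum:

```python
# Stepsizes with sum(alpha_k) = inf and sum(alpha_k^2) < inf.
iterates = sgd(x0=np.zeros(d), stepsize=lambda k: 1.0 / (k + 1), K=5000)
best = min(f(x) for x in iterates)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # assumed to lie inside C
print(f"best objective among iterates: {best:.6f}, optimal value: {f(x_star):.6f}")
```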

Pdf

A pdf of the blog is available here.