Problem setup

We are interested in solving the problem
$$
p^\star = \left(
\begin{array}{ll}
\underset{x \in \mathbb{R}^d}{\text{minimize}} & f(x)\\
\text{subject to} & x \in C,
\end{array}
\right) \tag{$\mathcal{P}$}
$$
where we have the following assumptions regarding the nature of the problem.
Assumption 1 We assume:
- $f : \mathbb{R}^d \to (-\infty, \infty]$ is a closed, proper, and convex function,
- $C$ is a nonempty, closed, convex set with $C \subseteq \operatorname{int} \operatorname{dom} f$, and
- $X^\star = \operatorname*{argmin}_{x \in C} f(x) \neq \emptyset$.
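For instance (an illustrative example of my own, not part of the original setup), $f(x) = \frac{1}{2n}\|Ax - b\|^2$ with $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, together with $C = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$, satisfies Assumption 1: $f$ is finite everywhere, so $C \subseteq \operatorname{int} \operatorname{dom} f = \mathbb{R}^d$, and the minimum over the compact set $C$ is attained.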
Stochastic gradient descent

The stochastic gradient descent (SGD) algorithm to solve $(\mathcal{P})$ is described in Algorithm 1, where we make the following assumption regarding the nature of the oracle.
Assumption 2 We assume that, given an iterate $x_k$, the stochastic oracle is capable of producing a random vector $g_k$ with the following properties:

- (unbiasedness) for all $k \ge 0$: $\mathbf{E}[g_k \mid x_k] \in \partial f(x_k)$, and
- (bounded second moment) there exists $G > 0$ such that for all $k \ge 0$: $\mathbf{E}[\|g_k\|^2 \mid x_k] \le G^2$.
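To make Assumption 2 concrete, here is a minimal sketch of such an oracle in Python for the finite-sum least-squares example above (my own illustration; `make_oracle` is a hypothetical helper, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_oracle(A, b):
    """Stochastic gradient oracle for f(x) = (1/(2n)) * ||A @ x - b||**2.

    Sampling a row i uniformly and returning A[i] * (A[i] @ x - b[i])
    gives E[g | x] = (1/n) * A.T @ (A @ x - b) = grad f(x) (unbiasedness).
    """
    n = A.shape[0]

    def oracle(x):
        i = rng.integers(n)              # pick a term uniformly at random
        return A[i] * (A[i] @ x - b[i])  # gradient of the sampled term

    return oracle
```

Here unbiasedness holds exactly, and since the iterates of Algorithm 1 stay in the bounded set $C$, the conditional second moment admits a uniform bound $G^2$.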
___________________________________________________________
input: $f$, $C$, iteration limit $K$
___________________________________________________________
algorithm:
1. initialization:
    pick $x_0 \in C$ arbitrarily
2. main iteration:
    for $k = 0, 1, 2, \ldots, K-1$
        pick stepsize $\alpha_k > 0$ and random $g_k \in \mathbb{R}^d$ satisfying Assumption 2
        $x_{k+1} \leftarrow \Pi_C(x_k - \alpha_k g_k)$ /* $\Pi_C$: projection onto the set $C$ */
    end for
3. return $x_K$
___________________________________________________________
Algorithm 1: SGD to solve $(\mathcal{P})$
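Before the analysis, here is a minimal runnable sketch of Algorithm 1 (my own illustration; `project_ball` and the callable `stepsize` are hypothetical names, and the projection is specialized to the unit-ball example rather than a general $C$):

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto C = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def sgd(oracle, project, x0, stepsize, K):
    """Algorithm 1: projected SGD; `stepsize(k)` returns alpha_k > 0."""
    x = x0
    iterates = [x0]
    for k in range(K):
        g = oracle(x)                     # random g_k satisfying Assumption 2
        x = project(x - stepsize(k) * g)  # x_{k+1} = Pi_C(x_k - alpha_k * g_k)
        iterates.append(x)
    return x, iterates
```

For example, `sgd(make_oracle(A, b), project_ball, np.zeros(d), lambda k: 1.0 / (k + 1), 10_000)` runs the method with the stepsizes used later in the analysis.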
Convergence analysis

First, note that, for all $k \ge 0$:

$$
\begin{aligned}
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2 \mid x_k\right]
&= \mathbf{E}\left[\|\Pi_C(x_k - \alpha_k g_k) - \Pi_C(x^\star)\|^2 \mid x_k\right]
&& \triangleright\ x_{k+1} = \Pi_C(x_k - \alpha_k g_k) \text{ and } x^\star = \Pi_C(x^\star)\\
&\le \mathbf{E}\left[\|x_k - \alpha_k g_k - x^\star\|^2 \mid x_k\right]
&& \triangleright\ \text{nonexpansiveness of } \Pi_C\\
&= \mathbf{E}\left[\|x_k - x^\star\|^2 + \alpha_k^2 \|g_k\|^2 - 2\alpha_k \langle x_k - x^\star; g_k \rangle \mid x_k\right]\\
&= \mathbf{E}\left[\|x_k - x^\star\|^2 \mid x_k\right] + \alpha_k^2\, \mathbf{E}\left[\|g_k\|^2 \mid x_k\right] - 2\alpha_k\, \mathbf{E}\left[\langle x_k - x^\star; g_k \rangle \mid x_k\right]
&& \triangleright\ \text{linearity of expectation}\\
&= \|x_k - x^\star\|^2 + \alpha_k^2\, \mathbf{E}\left[\|g_k\|^2 \mid x_k\right] - 2\alpha_k \langle x_k - x^\star; \mathbf{E}[g_k \mid x_k] \rangle
&& \triangleright\ \text{``taking out what's known'': } \mathbf{E}[h(X)Y \mid X] = h(X)\, \mathbf{E}[Y \mid X]\\
&\le \|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \langle x_k - x^\star; \mathbf{E}[g_k \mid x_k] \rangle
&& \triangleright\ \mathbf{E}[\|g_k\|^2 \mid x_k] \le G^2\\
&\le \|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right),
\end{aligned}
$$

where the last step uses $\mathbf{E}[g_k \mid x_k] \in \partial f(x_k)$: the subgradient inequality $f(y) \ge f(x_k) + \langle \mathbf{E}[g_k \mid x_k]; y - x_k \rangle$ holds for all $y$, so setting $y \leftarrow x^\star$ gives $-\langle x_k - x^\star; \mathbf{E}[g_k \mid x_k] \rangle \le f(x^\star) - f(x_k)$. So, we have proved
$$
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2 \mid x_k\right] \le \|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right),
$$
so taking expectation with respect to $x_k$ on both sides, we get:
$$
\begin{aligned}
\mathbf{E}\left[\mathbf{E}\left[\|x_{k+1} - x^\star\|^2 \mid x_k\right]\right]
&= \mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right]
&& \triangleright\ \text{Adam's law } \mathbf{E}[\mathbf{E}[Y \mid X]] = \mathbf{E}[Y]\\
&\le \mathbf{E}\left[\|x_k - x^\star\|^2 + \alpha_k^2 G^2 - 2\alpha_k \left(f(x_k) - f(x^\star)\right)\right]\\
&= \mathbf{E}\left[\|x_k - x^\star\|^2\right] - 2\alpha_k\, \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2,
\end{aligned}
$$

so
$$
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_k - x^\star\|^2\right] \le -2\alpha_k\, \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2.
$$
Now, let us do a telescoping sum:
$$
\begin{aligned}
\mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_k - x^\star\|^2\right] &\le -2\alpha_k\, \mathbf{E}\left[f(x_k) - f(x^\star)\right] + \alpha_k^2 G^2\\
\mathbf{E}\left[\|x_k - x^\star\|^2\right] - \mathbf{E}\left[\|x_{k-1} - x^\star\|^2\right] &\le -2\alpha_{k-1}\, \mathbf{E}\left[f(x_{k-1}) - f(x^\star)\right] + \alpha_{k-1}^2 G^2\\
&\ \ \vdots\\
\mathbf{E}\left[\|x_{m+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_m - x^\star\|^2\right] &\le -2\alpha_m\, \mathbf{E}\left[f(x_m) - f(x^\star)\right] + \alpha_m^2 G^2,
\end{aligned}
$$

and adding the inequalities above (the left-hand sides telescope), we get:
$$
\begin{aligned}
& \mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] - \mathbf{E}\left[\|x_m - x^\star\|^2\right] \le -2\sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\\
\Leftrightarrow\ & 0 \le \mathbf{E}\left[\|x_{k+1} - x^\star\|^2\right] \le \mathbf{E}\left[\|x_m - x^\star\|^2\right] - 2\sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\\
\Rightarrow\ & 0 \le \mathbf{E}\left[\|x_m - x^\star\|^2\right] - 2\sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\\
\Leftrightarrow\ & \sum_{i=m}^{k} \alpha_i\, \mathbf{E}\left[f(x_i) - f(x^\star)\right] \le \frac{1}{2}\left(\mathbf{E}\left[\|x_m - x^\star\|^2\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\right)\\
\Rightarrow\ & \left(\sum_{i=m}^{k} \alpha_i\right) \left(\min_{i \in \{m,\ldots,k\}} \mathbf{E}\left[f(x_i) - f(x^\star)\right]\right) \le \frac{1}{2}\left(\mathbf{E}\left[\|x_m - x^\star\|^2\right] + G^2 \sum_{i=m}^{k} \alpha_i^2\right)
&& \triangleright\ \text{for } b_i \ge 0:\ (\min_i a_i) \textstyle\sum_i b_i \le \sum_i a_i b_i\\
\Rightarrow\ & \mathbf{E}\left[\min_{i \in \{m,\ldots,k\}} \left\{f(x_i) - f(x^\star)\right\}\right] \le \min_{i \in \{m,\ldots,k\}} \mathbf{E}\left[f(x_i) - f(x^\star)\right] \le \frac{\mathbf{E}\left[\|x_m - x^\star\|^2\right] + G^2 \sum_{i=m}^{k} \alpha_i^2}{2 \sum_{i=m}^{k} \alpha_i}.
&& \triangleright\ \text{using } \mathbf{E}[\min_i X_i] \le \min_i \mathbf{E}[X_i]
\end{aligned}
$$

In the last inequality, $m$ is arbitrary, so set $m \leftarrow 0$, which leads to:
$$
\mathbf{E}\left[\min_{i \in \{0,\ldots,k\}} f(x_i)\right] - f(x^\star) \le \frac{\mathbf{E}\left[\|x_0 - x^\star\|^2\right] + G^2 \sum_{i=0}^{k} \alpha_i^2}{2 \sum_{i=0}^{k} \alpha_i},
$$
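As a quick sanity check of this bound (a worked specialization I am adding, not part of the original derivation): with the constant stepsize $\alpha_i = \alpha > 0$ for $i = 0, \ldots, k$, the right-hand side becomes

$$
\frac{\mathbf{E}\left[\|x_0 - x^\star\|^2\right]}{2\alpha(k+1)} + \frac{G^2 \alpha}{2},
$$

and minimizing over $\alpha$ gives $\alpha = \sqrt{\mathbf{E}[\|x_0 - x^\star\|^2]}\,/\,(G\sqrt{k+1})$ and the familiar $O(1/\sqrt{k+1})$ rate for a fixed horizon $k$; a constant stepsize, however, does not satisfy the summability conditions that follow.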
so if the stepsizes satisfy $\sum_{i=0}^{\infty} \alpha_i^2 < \infty$ and $\sum_{i=0}^{\infty} \alpha_i = \infty$ (square-summable but not summable, e.g., $\alpha_i = 1/(i+1)$), then letting $k \to \infty$ we have
$$
\mathbf{E}\left[\min_{i \in \{0,\ldots,k\}} f(x_i)\right] \to f(x^\star).
$$
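To see the statement numerically, here is a self-contained sketch (my own experiment under the same illustrative least-squares-over-a-ball assumptions; values are indicative, not from the original post) that runs Algorithm 1 with $\alpha_k = 1/(k+1)$ and tracks the best objective value seen:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic instance: f(x) = (1/(2n)) * ||A @ x - b||**2 over the unit ball C.
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
x_true *= 0.5 / np.linalg.norm(x_true)  # ||x_true|| = 0.5 < 1, so x_star = x_true, f(x_star) = 0
b = A @ x_true

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
project = lambda x: x if np.linalg.norm(x) <= 1.0 else x / np.linalg.norm(x)

x = np.zeros(d)   # x_0 in C
best = f(x)       # tracks min_{i <= k} f(x_i)
for k in range(20000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ x - b[i])      # unbiased stochastic gradient
    x = project(x - g / (k + 1))      # alpha_k = 1/(k+1): square-summable, not summable
    best = min(best, f(x))

print(f"min_i f(x_i) = {best:.6f}")   # should drift toward f(x_star) = 0
```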
PDF

A PDF of this blog post is available here.