Study notes on “Stochastic Polyak Step-size, a simple step-size tuner with optimal rates” by F. Pedregosa

Shuvomoy Das Gupta

March 29, 2024

Here are my study notes for Fabian Pedregosa’s amazing blog on the Stochastic Polyak Step-size; the full citation of Pedregosa’s blog is: Fabian Pedregosa, “Stochastic Polyak Step-size, a simple step-size tuner with optimal rates”, 2023, available at https://fa.bianp.net/blog/2023/sps/.
Contents

Problem setup
Notation
Stochastic Gradient Descent with Polyak Stepsize
Assumptions
Convergence analysis
Problem setup
We are interested in solving the problem
\[
p^\star = \operatorname*{minimize}_{x \in \mathbb{R}^d} \; \left\{ f(x) = \frac{1}{n} \sum_{i=1}^{n} f^{[i]}(x) \right\}, \quad \ldots \text{(P)}
\]
where the optimal solution is achieved at $x^\star$. We have the following assumptions regarding the nature of the problem.
Notation
Inner product between vectors $x, y$ is denoted by $\langle x \mid y \rangle$, and the Euclidean norm of $x$ is denoted by $\|x\| = \sqrt{\langle x \mid x \rangle}$. We let $[1:n] = \{1, 2, \ldots, n\}$ and $z_+ = \max\{z, 0\}$. Also, for notational convenience we denote $\mathrm{sqd}(x) = x^2$ and $\mathrm{ReLU}(z) = \max\{z, 0\}$. Comments are enclosed in /* this is a comment */.
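Since $\mathrm{sqd}$ and $\mathrm{ReLU}$ reappear throughout the analysis, here is a tiny Python rendering of the two helpers (my own sketch, nothing more than the definitions above):

def sqd(x):
    """sqd(x) = x^2."""
    return x * x

def relu(z):
    """ReLU(z) = max{z, 0}."""
    return max(z, 0.0)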
Stochastic Gradient Descent with Polyak Stepsize
The algorithm, called Stochastic Gradient Descent with Stochastic Polyak Stepsize (SGD-SPS), that solves (P) is described in Algorithm 1. The uniform distribution with support $\{1, \ldots, n\}$ is denoted by $\mathrm{unif}[1:n]$. One subgradient of the function $f^{[i]}$ evaluated at $x$ is denoted by $\widetilde{\nabla} f^{[i]}(x)$.
Algorithm 1: SGD-SPS to solve (P)

input: the functions $f^{[i]}$ for $i \in [1:n]$, iteration limit $T$

algorithm:

1. initialization: pick $x_0 \in \mathbb{R}^d$ arbitrarily

2. main iteration:
for $t = 0, 1, 2, \ldots, T-1$
    sample a function $f^{[i]}$ uniformly at random: $i \sim \mathrm{unif}[1:n]$
    set Polyak stepsize
    \[
    \gamma_t = \begin{cases} \dfrac{\mathrm{ReLU}\left( f^{[i]}(x_t) - f^{[i]}(x^\star) \right)}{\left\| \widetilde{\nabla} f^{[i]}(x_t) \right\|^2}, & \text{if } \widetilde{\nabla} f^{[i]}(x_t) \neq 0, \\[1ex] 0, & \text{else,} \end{cases}
    \]
    update iterate $x_{t+1} = x_t - \gamma_t \, \widetilde{\nabla} f^{[i]}(x_t)$
end for

3. return $x_T$
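Here is a minimal NumPy sketch of Algorithm 1 (my own, not taken from the blog), assuming each component $f^{[i]}$ is supplied as a triple (f_i, subgrad_i, f_i_star) of a value oracle, a subgradient oracle, and the known optimal value $f^{[i]}(x^\star)$:

# A minimal NumPy sketch of Algorithm 1 (SGD-SPS); my own, not from the blog.
# Assumption: each component is supplied as (f_i, subgrad_i, f_i_star), where
# f_i(x) returns f^[i](x), subgrad_i(x) returns one subgradient of f^[i] at x,
# and f_i_star = f^[i](x_star) is assumed to be known.
import numpy as np

def sgd_sps(components, x0, T, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    n = len(components)
    for t in range(T):
        f_i, subgrad_i, f_i_star = components[rng.integers(n)]  # i ~ unif[1:n]
        g = subgrad_i(x)
        g_norm_sq = float(g @ g)
        # stochastic Polyak stepsize: ReLU(f^[i](x_t) - f^[i](x_star)) / ||g||^2,
        # and 0 whenever the sampled subgradient vanishes
        gamma = max(f_i(x) - f_i_star, 0.0) / g_norm_sq if g_norm_sq > 0 else 0.0
        x = x - gamma * g
    return x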
Assumptions
We assume that, for all $i$, $f^{[i]} : \mathbb{R}^d \to (-\infty, \infty]$ is a nonsmooth, subgradient-bounded, and star-convex function, i.e.,

Star-convexity.
\[
\forall\, i \in [1:n]: \quad f^{[i]} \text{ star-convex around } x^\star \;\overset{\mathrm{def}}{\Longleftrightarrow}\; \forall\, x \in \operatorname{dom} f^{[i]}: \;\; f^{[i]}(x) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x) \mid x - x^\star \right\rangle.
\]

Subgradient-boundedness.
\[
\forall\, i \in [1:n], \;\; \forall\, x \in B = \left\{ y \mid \|y - x^\star\| \leq \|x_0 - x^\star\| \right\}, \;\; \forall\, \widetilde{\nabla} f^{[i]}(x) \in \partial f^{[i]}(x): \quad \left\| \widetilde{\nabla} f^{[i]}(x) \right\| \leq G.
\]
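To make the assumptions concrete, here is a hypothetical example of my own (not from the notes or the blog): absolute-loss components $f^{[i]}(x) = |\langle a_i \mid x \rangle - b_i|$ with $b_i = \langle a_i \mid x^\star \rangle$. Each such $f^{[i]}$ is convex, hence star-convex around $x^\star$; its subgradients are $\mathrm{sign}(\langle a_i \mid x \rangle - b_i)\, a_i$, so $G = \max_i \|a_i\|$ works; and $f^{[i]}(x^\star) = 0$. The sketch below builds such components in the format expected by the sgd_sps sketch above:

# A hypothetical worked example (mine): absolute-loss components
# f^[i](x) = |<a_i | x> - b_i| with b_i = <a_i | x_star>, so f^[i](x_star) = 0.
import numpy as np

def make_components(A, x_star):
    b = A @ x_star
    components = []
    for a_i, b_i in zip(A, b):
        f_i = lambda x, a=a_i, c=b_i: abs(a @ x - c)
        subgrad_i = lambda x, a=a_i, c=b_i: np.sign(a @ x - c) * a
        components.append((f_i, subgrad_i, 0.0))  # f^[i](x_star) = 0 here
    return components

With these components, sgd_sps(make_components(A, x_star), x0, T) should drive the averaged loss toward $f(x^\star) = 0$ in this example.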
Convergence analysis
Consider an arbitrary iteration counter $t$; we want to compute the iterate $x_{t+1}$ from $x_t$. Going from $x_t$ to $x_{t+1}$, the randomness lies in the selection of the function $f^{[i]}$ via $i \sim \mathrm{unif}[1:n]$. We will come up with an inequality that works for any value of $i$. We will use the notation
\[
\widetilde{\nabla} f^{[i]}(x_t) \equiv g^{[i]}_t, \qquad \widetilde{\nabla} f(x_t) \equiv g_t, \qquad f^{[i]}(x_t) \equiv f^{[i]}_t, \qquad f^{[i]}(x^\star) \equiv f^{[i]}_\star.
\]
Consider the case $g^{[i]}_t \neq 0$. We have
\begin{align*}
\|x_{t+1} - x^\star\|^2
&= \|x_t - \gamma_t g^{[i]}_t - x^\star\|^2 \\
&= \|(x_t - x^\star) - \gamma_t g^{[i]}_t\|^2 \\
&= \|x_t - x^\star\|^2 + \gamma_t^2 \|g^{[i]}_t\|^2 - 2 \gamma_t \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle \quad \text{/* expand squares */} \\
&\quad \text{/* we have } f^{[i]}(x) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x) \mid x - x^\star \right\rangle \text{; setting } x := x_t \text{ gives} \\
&\quad f^{[i]}(x_t) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x_t) \mid x_t - x^\star \right\rangle
\;\Leftrightarrow\; f^{[i]}_t - f^{[i]}_\star \leq \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle, \\
&\quad \text{hence } -2\gamma_t \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle \leq -2\gamma_t \left( f^{[i]}_t - f^{[i]}_\star \right) \text{ as } \gamma_t \geq 0 \text{ */} \\
&\leq \|x_t - x^\star\|^2 + \gamma_t^2 \|g^{[i]}_t\|^2 - 2 \gamma_t \left( f^{[i]}_t - f^{[i]}_\star \right) \\
&\quad \text{/* in this case } \gamma_t = \mathrm{ReLU}\left( f^{[i]}_t - f^{[i]}_\star \right) / \|g^{[i]}_t\|^2 \text{ */} \\
&= \|x_t - x^\star\|^2 + \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^4} \|g^{[i]}_t\|^2 - 2 \, \frac{\mathrm{ReLU}\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} \left( f^{[i]}_t - f^{[i]}_\star \right) \\
&\quad \text{/* we have } z \cdot \mathrm{ReLU}(z) = z \cdot \max\{z, 0\} = \begin{cases} z^2, & \text{if } z \geq 0 \\ 0, & \text{else} \end{cases} = \left( \max\{z, 0\} \right)^2 = (\mathrm{sqd} \circ \mathrm{ReLU})(z), \\
&\quad \text{here applied with } z := f^{[i]}_t - f^{[i]}_\star \text{ */} \\
&= \|x_t - x^\star\|^2 + \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} - 2 \, \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} \\
&= \|x_t - x^\star\|^2 - \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2}. \quad \ldots \text{(1)}
\end{align*}
Now consider the case $g^{[i]}_t = 0$; then $x_{t+1} = x_t$ and
\[
\|x_{t+1} - x^\star\|^2 = \|x_t - x^\star\|^2. \quad \ldots \text{(2)}
\]
Thus, from (1) and (2), we have
\[
\|x_{t+1} - x^\star\|^2 \leq \begin{cases} \|x_t - x^\star\|^2 - \dfrac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2}, & \text{with } i \sim \mathrm{unif}[1:n] \text{ and } g^{[i]}_t \neq 0, \\[2ex] \|x_t - x^\star\|^2, & \text{with } i \sim \mathrm{unif}[1:n] \text{ and } g^{[i]}_t = 0. \end{cases} \quad \ldots \text{(3)}
\]
From (3), we see that irrespective of the randomness in selecting $i$, we always have $\|x_{t+1} - x^\star\|^2 \leq \|x_t - x^\star\|^2 \leq \cdots \leq \|x_0 - x^\star\|^2$; hence $x_t \in B = \{ y \mid \|y - x^\star\| \leq \|x_0 - x^\star\| \}$ no matter what. As a result, for the case $g^{[i]}_t \neq 0$, using the subgradient-boundedness assumption we have
\[
\|g^{[i]}_t\|^2 \leq G^2 \;\Leftrightarrow\; \frac{1}{\|g^{[i]}_t\|^2} \geq \frac{1}{G^2} \;\Rightarrow\; -\frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} \leq -\frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2}. \quad \ldots \text{(4)}
\]
Next, for the case $g^{[i]}_t = 0$, using star-convexity we have
\begin{align*}
& f^{[i]}(x) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x) \mid x - x^\star \right\rangle \quad \text{with } x := x_t \\
\Rightarrow\;& f^{[i]}(x_t) - f^{[i]}(x^\star) \leq \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle = 0 \\
\Leftrightarrow\;& f^{[i]}_t - f^{[i]}_\star \leq 0 \\
\Rightarrow\;& \mathrm{ReLU}\left( f^{[i]}_t - f^{[i]}_\star \right) = \max\left\{ f^{[i]}_t - f^{[i]}_\star, \, 0 \right\} = 0 \\
\Rightarrow\;& \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} = 0. \quad \ldots \text{(5)}
\end{align*}
So, using (4) and (5) in the two cases of (3), we get
\[
\|x_{t+1} - x^\star\|^2 \leq \|x_t - x^\star\|^2 - \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} \quad \text{with } i \sim \mathrm{unif}[1:n]. \quad \ldots \text{(6)}
\]
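As a sanity check of (6), my own and not part of the notes, the snippet below runs SGD-SPS steps on the hypothetical absolute-loss components from the earlier sketch (reusing make_components) and verifies the per-iteration inequality along the trajectory; since (6) holds pathwise for every sampled $i$, an assert is appropriate here:

# Empirical check of (6) on the hypothetical absolute-loss components:
# ||x_{t+1} - x*||^2 <= ||x_t - x*||^2 - ReLU(f_t^[i] - f_star^[i])^2 / G^2.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
components = make_components(A, x_star)   # hypothetical helper defined earlier
G = max(np.linalg.norm(a) for a in A)     # subgradient bound for these components

x = rng.normal(size=d)
for t in range(1000):
    f_i, subgrad_i, f_i_star = components[rng.integers(n)]
    g = subgrad_i(x)
    gsq = float(g @ g)
    gap_plus = max(f_i(x) - f_i_star, 0.0)        # ReLU(f_t^[i] - f_star^[i])
    gamma = gap_plus / gsq if gsq > 0 else 0.0    # stochastic Polyak stepsize
    x_next = x - gamma * g
    lhs = float(np.sum((x_next - x_star) ** 2))
    rhs = float(np.sum((x - x_star) ** 2)) - gap_plus ** 2 / G ** 2
    assert lhs <= rhs + 1e-9, (lhs, rhs)          # inequality (6) holds pathwise
    x = x_next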
Next, on both sides of (6), we take the conditional expectation with respect to $i$ given $x_t$, which we denote by
\[
\mathbb{E}\left[ \, \cdot \mid x_t \right] \equiv \mathbb{E}_{i \sim \mathrm{unif}[1:n]}\left[ \, \cdot \mid x_t \right],
\]
and the resultant inequality is:
\begin{align*}
\mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \mid x_t \right]
&\leq \mathbb{E}\left[ \|x_t - x^\star\|^2 - \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} \,\Big|\, x_t \right] \\
&= \mathbb{E}\left[ \|x_t - x^\star\|^2 \mid x_t \right] - \mathbb{E}\left[ \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} \,\Big|\, x_t \right] \quad \text{/* using linearity of expectation */} \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2}\, \mathbb{E}\left[ (\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right) \mid x_t \right] \quad \ldots \text{(7)} \\
&\quad \text{/* using the ``taking out what's known'' rule } \mathbb{E}\left[ h(X)\, Y \mid X \right] = h(X)\, \mathbb{E}\left[ Y \mid X \right] \text{ */}
\end{align*}
Recall now Jensen’s inequality: if $\phi$ is a convex function and $Z$ is a random variable, then $\phi(\mathbb{E}[Z]) \leq \mathbb{E}[\phi(Z)]$. Setting $\phi := \mathrm{sqd} \circ \mathrm{ReLU} = \mathrm{sqd}\left( \mathrm{ReLU}(\cdot) \right)$, which is convex (see Boyd & Vandenberghe, Convex Optimization, Figure 3.7), and $Z := f^{[i]}_t - f^{[i]}_\star$ (with the expectation taken conditionally on $x_t$), we have
\begin{align*}
& (\mathrm{sqd} \circ \mathrm{ReLU})\left( \mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \right) \leq \mathbb{E}\left[ (\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right) \mid x_t \right] \\
\Leftrightarrow\;& \frac{1}{G^2}\, (\mathrm{sqd} \circ \mathrm{ReLU})\left( \mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \right) \leq \frac{1}{G^2}\, \mathbb{E}\left[ (\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right) \mid x_t \right]. \quad \ldots \text{(8)}
\end{align*}
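A quick numerical illustration of this instance of Jensen’s inequality (my own addition, not in the notes): draw a large sample of a random variable $Z$ and compare $\phi(\mathbb{E}[Z])$ with $\mathbb{E}[\phi(Z)]$ for $\phi = \mathrm{sqd} \circ \mathrm{ReLU}$.

# Jensen's inequality for the convex function phi = sqd o ReLU = (max{z,0})^2:
# phi(E[Z]) <= E[phi(Z)] for any random variable Z (checked on an empirical sample).
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(loc=0.3, scale=1.0, size=100_000)   # an arbitrary sample of Z
phi = lambda z: np.maximum(z, 0.0) ** 2            # sqd o ReLU
print(phi(Z.mean()), "<=", phi(Z).mean())          # LHS <= RHS holds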
Now notice the LHS term in (8):
\begin{align*}
\mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right]
&= \mathbb{E}_{i \sim \mathrm{unif}[1:n]}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( f^{[i]}_t - f^{[i]}_\star \right) \\
&= f(x_t) - f(x^\star), \quad \ldots \text{(9)}
\end{align*}
where the last term is a random variable in $x_t$ (recall that $\mathbb{E}[Y \mid X]$ is a random variable in $X$).
From (7), (8), and (9), we have
\begin{align*}
\mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \mid x_t \right]
&\leq \|x_t - x^\star\|^2 - \frac{1}{G^2}\, (\mathrm{sqd} \circ \mathrm{ReLU})\left( \mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \right) \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2}\, (\mathrm{sqd} \circ \mathrm{ReLU})\left( f(x_t) - f(x^\star) \right) \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2} \left( \max\{ f(x_t) - f(x^\star), 0 \} \right)^2 \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2} \left( f(x_t) - f(x^\star) \right)^2 \quad \text{/* as } f(x_t) - f(x^\star) \geq 0 \text{ */} \quad \ldots \text{(10)}
\end{align*}
Now, taking expectation with respect to $x_t$ on both sides of (10) and then using Adam’s law $\mathbb{E}\left[ \mathbb{E}\left[ Y \mid X \right] \right] = \mathbb{E}\left[ Y \right]$, we get:
\begin{align*}
& \mathbb{E}\left[ \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \mid x_t \right] \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 - \frac{1}{G^2} \left( f(x_t) - f(x^\star) \right)^2 \right] \\
\Leftrightarrow\;& \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 \right] - \mathbb{E}\left[ \frac{1}{G^2} \left( f(x_t) - f(x^\star) \right)^2 \right] \quad \text{/* using linearity of expectation on the RHS and Adam's law on the LHS */} \\
\Leftrightarrow\;& \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 \right] - \frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_t) - f(x^\star) \right)^2 \right] \\
\Leftrightarrow\;& \frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_t) - f(x^\star) \right)^2 \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \right]. \quad \ldots \text{(11)}
\end{align*}
Now, let us do a telescoping sum on (11) for $t = 0, \ldots, T$:
\begin{align*}
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_0) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_0 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_1 - x^\star\|^2 \right] \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_1) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_1 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_2 - x^\star\|^2 \right] \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_2) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_2 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_3 - x^\star\|^2 \right] \\
&\;\;\vdots \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_{T-1}) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_{T-1} - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_T - x^\star\|^2 \right] \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_T) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_T - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right]
\end{align*}
which, after adding these inequalities and canceling the telescoping terms, yields:
\begin{align*}
\frac{1}{G^2} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]
&\leq \mathbb{E}\left[ \|x_0 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right] \\
&= \|x_0 - x^\star\|^2 - \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right] \quad \text{/* as } x_0 \text{ is deterministic */} \\
&\leq \|x_0 - x^\star\|^2 \quad \text{/* as } \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right] \geq 0 \text{ */}
\end{align*}
\begin{align*}
\Leftrightarrow\;& \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \leq G^2 \|x_0 - x^\star\|^2 \\
\Leftrightarrow\;& \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \leq \frac{G^2 \|x_0 - x^\star\|^2}{T+1}. \quad \ldots \text{(12)}
\end{align*}
Recall now Jensen’s inequality again: if $\phi$ is a convex function and $Z$ is a random variable, then $\phi(\mathbb{E}[Z]) \leq \mathbb{E}[\phi(Z)]$. Setting $\phi := \mathrm{sqd}$ and $Z := f(x_k) - f(x^\star)$, we have
\begin{align*}
& \mathrm{sqd}\left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right) \leq \mathbb{E}\left[ \mathrm{sqd}\left( f(x_k) - f(x^\star) \right) \right] \\
\Rightarrow\;& \min_{k \in [0:T]} \left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right)^2 \leq \min_{k \in [0:T]} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]. \quad \ldots \text{(13)}
\end{align*}
Also,
\begin{align*}
\sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]
&\geq \sum_{k=0}^{T} \min_{j \in [0:T]} \mathbb{E}\left[ \left( f(x_j) - f(x^\star) \right)^2 \right] \\
&= \left( \min_{j \in [0:T]} \mathbb{E}\left[ \left( f(x_j) - f(x^\star) \right)^2 \right] \right) \sum_{k=0}^{T} 1 \\
&= (T+1) \min_{j \in [0:T]} \mathbb{E}\left[ \left( f(x_j) - f(x^\star) \right)^2 \right]
\end{align*}
\begin{align*}
\Leftrightarrow\;& \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \geq \min_{k \in [0:T]} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]. \quad \ldots \text{(14)}
\end{align*}
From (13) and (14), we have
\[
\min_{k \in [0:T]} \left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right)^2 \leq \min_{k \in [0:T]} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \leq \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]. \quad \ldots \text{(15)}
\]
Now, from (15) and (12), we have
\[
\min_{k \in [0:T]} \left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right)^2 \leq \frac{G^2 \|x_0 - x^\star\|^2}{T+1}.
\]
Let the min be achieved at index $\ell \in [0:T]$; then, using the fact that $\sqrt{\cdot}$ is monotonically increasing on $\mathbb{R}_+$ (hence taking square roots does not change the direction of the inequality when both sides are nonnegative), we have
\begin{align*}
& \left( \mathbb{E}\left[ f(x_\ell) - f(x^\star) \right] \right)^2 \leq \frac{G^2 \|x_0 - x^\star\|^2}{T+1} \\
\Rightarrow\;& \mathbb{E}\left[ f(x_\ell) - f(x^\star) \right] \leq \frac{G \|x_0 - x^\star\|}{\sqrt{T+1}}.
\end{align*}
Thus we have proven that:
\[
\min_{k \in [0:T]} \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \leq \frac{G \|x_0 - x^\star\|}{\sqrt{T+1}}.
\]
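To close, here is a hedged numerical illustration of the final rate, my own construction reusing the hypothetical make_components sketch above: it tracks the best suboptimality $\min_k \left( f(x_k) - f(x^\star) \right)$ along a single SGD-SPS run and prints it next to $G \|x_0 - x^\star\| / \sqrt{T+1}$. The theorem bounds the expectation of this minimum, so a single run is only indicative, not a verification.

# Track min_k (f(x_k) - f(x_star)) along one SGD-SPS run and compare it with the
# proved O(1/sqrt(T+1)) bound. Single-run check only: the theorem is in expectation.
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 20
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
components = make_components(A, x_star)        # hypothetical helper from the earlier sketch
f = lambda x: np.mean([f_i(x) for f_i, _, _ in components])
G = max(np.linalg.norm(a) for a in A)          # subgradient bound for these components

T = 2000
x0 = rng.normal(size=d)
x = np.asarray(x0, dtype=float)
best_gap = f(x) - f(x_star)                    # gap at k = 0
for t in range(T):
    f_i, subgrad_i, f_i_star = components[rng.integers(n)]
    g = subgrad_i(x)
    gsq = float(g @ g)
    gamma = max(f_i(x) - f_i_star, 0.0) / gsq if gsq > 0 else 0.0
    x = x - gamma * g
    best_gap = min(best_gap, f(x) - f(x_star))  # min over k in [0:T] of the gap

bound = G * np.linalg.norm(x0 - x_star) / np.sqrt(T + 1)  # proved rate at horizon T
print(f"best gap over iterates: {best_gap:.3e}   vs   bound: {bound:.3e}")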