Study notes on “Stochastic Polyak Step-size, a simple step-size tuner with optimal rates” by F. Pedregosa

Shuvomoy Das Gupta

March 29, 2024

Here are my study notes for Fabian Pedregosa’s amazing blog on the Stochastic Polyak Step-size; the full citation of Pedregosa’s blog is: Fabian Pedregosa, “Stochastic Polyak Step-size, a simple step-size tuner with optimal rates”, 2023, available at https://fa.bianp.net/blog/2023/sps/.
Contents

Problem setup
Notation
Stochastic Gradient Descent with Polyak Stepsize
Assumptions
Convergence analysis
Problem setup
We are interested in solving the problem
\[
p^\star = \operatorname*{minimize}_{x \in \mathbb{R}^d} \; \left\{ f(x) = \frac{1}{n} \sum_{i=1}^{n} f^{[i]}(x) \right\}, \quad \ldots \text{(P)}
\]
where the optimal solution is achieved at $x^\star$. We have the following assumptions regarding the nature of the problem.
Notation
Inner product between vectors $x, y$ is denoted by $\langle x \mid y \rangle$, and the Euclidean norm of $x$ is denoted by $\|x\| = \sqrt{\langle x \mid x \rangle}$. We let $[1:n] = \{1, 2, \ldots, n\}$ and $z_+ = \max\{z, 0\}$. Also, for notational convenience we denote $\mathrm{sqd}(x) = x^2$ and $\mathrm{ReLU}(z) = \max\{z, 0\}$. Comments are enclosed in /* this is a comment */.
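Since $\mathrm{sqd}$ and $\mathrm{ReLU}$ reappear throughout the analysis, here is a tiny Python rendering of the two helpers (my own sketch, nothing more than the definitions above):

def sqd(x):
    """sqd(x) = x^2."""
    return x * x

def relu(z):
    """ReLU(z) = max{z, 0}."""
    return max(z, 0.0)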
Stochastic Gradient Descent with Polyak Stepsize
The algorithm, called Stochastic Gradient Descent with Stochastic Polyak Stepsize (SGD-SPS), that solves (P) is described in Algorithm 1. The uniform distribution with support $\{1, \ldots, n\}$ is denoted by $\mathrm{unif}[1:n]$. One subgradient of the function $f^{[i]}$ evaluated at $x$ is denoted by $\widetilde{\nabla} f^{[i]}(x)$.
Algorithm 1: SGD-SPS to solve (P)

input: the functions $f^{[i]}$ for $i \in [1:n]$, iteration limit $T$

algorithm:

1. initialization: pick $x_0 \in \mathbb{R}^d$ arbitrarily

2. main iteration:
for $t = 0, 1, 2, \ldots, T-1$
    sample a function $f^{[i]}$ uniformly at random: $i \sim \mathrm{unif}[1:n]$
    set Polyak stepsize
    \[
    \gamma_t = \begin{cases} \dfrac{\mathrm{ReLU}\left( f^{[i]}(x_t) - f^{[i]}(x^\star) \right)}{\left\| \widetilde{\nabla} f^{[i]}(x_t) \right\|^2}, & \text{if } \widetilde{\nabla} f^{[i]}(x_t) \neq 0, \\[1ex] 0, & \text{else,} \end{cases}
    \]
    update iterate $x_{t+1} = x_t - \gamma_t \, \widetilde{\nabla} f^{[i]}(x_t)$
end for

3. return $x_T$
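Here is a minimal NumPy sketch of Algorithm 1 (my own, not taken from the blog), assuming each component $f^{[i]}$ is supplied as a triple (f_i, subgrad_i, f_i_star) of a value oracle, a subgradient oracle, and the known optimal value $f^{[i]}(x^\star)$:

# A minimal NumPy sketch of Algorithm 1 (SGD-SPS); my own, not from the blog.
# Assumption: each component is supplied as (f_i, subgrad_i, f_i_star), where
# f_i(x) returns f^[i](x), subgrad_i(x) returns one subgradient of f^[i] at x,
# and f_i_star = f^[i](x_star) is assumed to be known.
import numpy as np

def sgd_sps(components, x0, T, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    n = len(components)
    for t in range(T):
        f_i, subgrad_i, f_i_star = components[rng.integers(n)]  # i ~ unif[1:n]
        g = subgrad_i(x)
        g_norm_sq = float(g @ g)
        # stochastic Polyak stepsize: ReLU(f^[i](x_t) - f^[i](x_star)) / ||g||^2,
        # and 0 whenever the sampled subgradient vanishes
        gamma = max(f_i(x) - f_i_star, 0.0) / g_norm_sq if g_norm_sq > 0 else 0.0
        x = x - gamma * g
    return x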
Assumptions
We assume that, for all $i$, $f^{[i]} : \mathbb{R}^d \to (-\infty, \infty]$ is a nonsmooth, subgradient-bounded, and star-convex function, i.e.,

Star-convexity.
\[
\forall\, i \in [1:n]: \quad f^{[i]} \text{ star-convex around } x^\star \;\overset{\mathrm{def}}{\Longleftrightarrow}\; \forall\, x \in \operatorname{dom} f^{[i]}: \;\; f^{[i]}(x) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x) \mid x - x^\star \right\rangle.
\]

Subgradient-boundedness.
\[
\forall\, i \in [1:n], \;\; \forall\, x \in B = \left\{ y \mid \|y - x^\star\| \leq \|x_0 - x^\star\| \right\}, \;\; \forall\, \widetilde{\nabla} f^{[i]}(x) \in \partial f^{[i]}(x): \quad \left\| \widetilde{\nabla} f^{[i]}(x) \right\| \leq G.
\]
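To make the assumptions concrete, here is a hypothetical example of my own (not from the notes or the blog): absolute-loss components $f^{[i]}(x) = |\langle a_i \mid x \rangle - b_i|$ with $b_i = \langle a_i \mid x^\star \rangle$. Each such $f^{[i]}$ is convex, hence star-convex around $x^\star$; its subgradients are $\mathrm{sign}(\langle a_i \mid x \rangle - b_i)\, a_i$, so $G = \max_i \|a_i\|$ works; and $f^{[i]}(x^\star) = 0$. The sketch below builds such components in the format expected by the sgd_sps sketch above:

# A hypothetical worked example (mine): absolute-loss components
# f^[i](x) = |<a_i | x> - b_i| with b_i = <a_i | x_star>, so f^[i](x_star) = 0.
import numpy as np

def make_components(A, x_star):
    b = A @ x_star
    components = []
    for a_i, b_i in zip(A, b):
        f_i = lambda x, a=a_i, c=b_i: abs(a @ x - c)
        subgrad_i = lambda x, a=a_i, c=b_i: np.sign(a @ x - c) * a
        components.append((f_i, subgrad_i, 0.0))  # f^[i](x_star) = 0 here
    return components

With these components, sgd_sps(make_components(A, x_star), x0, T) should drive the averaged loss toward $f(x^\star) = 0$ in this example.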
Convergence analysis
Consider an arbitrary iteration counter $t$; we want to compute the iterate $x_{t+1}$ from $x_t$. Going from $x_t$ to $x_{t+1}$, the randomness lies in the selection of the function $f^{[i]}$ via $i \sim \mathrm{unif}[1:n]$. We will come up with an inequality that works for any value of $i$. We will use the notation
\[
\widetilde{\nabla} f^{[i]}(x_t) \equiv g^{[i]}_t, \qquad \widetilde{\nabla} f(x_t) \equiv g_t, \qquad f^{[i]}(x_t) \equiv f^{[i]}_t, \qquad f^{[i]}(x^\star) \equiv f^{[i]}_\star.
\]
Consider the case $g^{[i]}_t \neq 0$. We have
\begin{align*}
\|x_{t+1} - x^\star\|^2
&= \|x_t - \gamma_t g^{[i]}_t - x^\star\|^2 \\
&= \|(x_t - x^\star) - \gamma_t g^{[i]}_t\|^2 \\
&= \|x_t - x^\star\|^2 + \gamma_t^2 \|g^{[i]}_t\|^2 - 2 \gamma_t \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle \quad \text{/* expand squares */} \\
&\quad \text{/* we have } f^{[i]}(x) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x) \mid x - x^\star \right\rangle \text{; setting } x := x_t \text{ gives} \\
&\quad f^{[i]}(x_t) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x_t) \mid x_t - x^\star \right\rangle
\;\Leftrightarrow\; f^{[i]}_t - f^{[i]}_\star \leq \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle, \\
&\quad \text{hence } -2\gamma_t \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle \leq -2\gamma_t \left( f^{[i]}_t - f^{[i]}_\star \right) \text{ as } \gamma_t \geq 0 \text{ */} \\
&\leq \|x_t - x^\star\|^2 + \gamma_t^2 \|g^{[i]}_t\|^2 - 2 \gamma_t \left( f^{[i]}_t - f^{[i]}_\star \right) \\
&\quad \text{/* in this case } \gamma_t = \mathrm{ReLU}\left( f^{[i]}_t - f^{[i]}_\star \right) / \|g^{[i]}_t\|^2 \text{ */} \\
&= \|x_t - x^\star\|^2 + \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^4} \|g^{[i]}_t\|^2 - 2 \, \frac{\mathrm{ReLU}\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} \left( f^{[i]}_t - f^{[i]}_\star \right) \\
&\quad \text{/* we have } z \cdot \mathrm{ReLU}(z) = z \cdot \max\{z, 0\} = \begin{cases} z^2, & \text{if } z \geq 0 \\ 0, & \text{else} \end{cases} = \left( \max\{z, 0\} \right)^2 = (\mathrm{sqd} \circ \mathrm{ReLU})(z), \\
&\quad \text{here applied with } z := f^{[i]}_t - f^{[i]}_\star \text{ */} \\
&= \|x_t - x^\star\|^2 + \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} - 2 \, \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} \\
&= \|x_t - x^\star\|^2 - \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2}. \quad \ldots \text{(1)}
\end{align*}
Now consider the case $g^{[i]}_t = 0$; then $x_{t+1} = x_t$ and
\[
\|x_{t+1} - x^\star\|^2 = \|x_t - x^\star\|^2. \quad \ldots \text{(2)}
\]
Thus, from (1) and (2), we have
\[
\|x_{t+1} - x^\star\|^2 \leq \begin{cases} \|x_t - x^\star\|^2 - \dfrac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2}, & \text{with } i \sim \mathrm{unif}[1:n] \text{ and } g^{[i]}_t \neq 0, \\[2ex] \|x_t - x^\star\|^2, & \text{with } i \sim \mathrm{unif}[1:n] \text{ and } g^{[i]}_t = 0. \end{cases} \quad \ldots \text{(3)}
\]
From (3), we see that irrespective of the randomness in selecting $i$, we always have $\|x_{t+1} - x^\star\|^2 \leq \|x_t - x^\star\|^2 \leq \cdots \leq \|x_0 - x^\star\|^2$; hence $x_t \in B = \{ y \mid \|y - x^\star\| \leq \|x_0 - x^\star\| \}$ no matter what. As a result, for the case $g^{[i]}_t \neq 0$, using the subgradient-boundedness assumption we have
\[
\|g^{[i]}_t\|^2 \leq G^2 \;\Leftrightarrow\; \frac{1}{\|g^{[i]}_t\|^2} \geq \frac{1}{G^2} \;\Rightarrow\; -\frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{\|g^{[i]}_t\|^2} \leq -\frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2}. \quad \ldots \text{(4)}
\]
Next, for the case $g^{[i]}_t = 0$, using star-convexity we have
\begin{align*}
& f^{[i]}(x) - f^{[i]}(x^\star) \leq \left\langle \widetilde{\nabla} f^{[i]}(x) \mid x - x^\star \right\rangle \quad \text{with } x := x_t \\
\Rightarrow\;& f^{[i]}(x_t) - f^{[i]}(x^\star) \leq \left\langle g^{[i]}_t \mid x_t - x^\star \right\rangle = 0 \\
\Leftrightarrow\;& f^{[i]}_t - f^{[i]}_\star \leq 0 \\
\Rightarrow\;& \mathrm{ReLU}\left( f^{[i]}_t - f^{[i]}_\star \right) = \max\left\{ f^{[i]}_t - f^{[i]}_\star, \, 0 \right\} = 0 \\
\Rightarrow\;& \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} = 0. \quad \ldots \text{(5)}
\end{align*}
So, using (4) and (5) in the two cases of (3), we get
\[
\|x_{t+1} - x^\star\|^2 \leq \|x_t - x^\star\|^2 - \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} \quad \text{with } i \sim \mathrm{unif}[1:n]. \quad \ldots \text{(6)}
\]
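As a sanity check of (6), my own and not part of the notes, the snippet below runs SGD-SPS steps on the hypothetical absolute-loss components from the earlier sketch (reusing make_components) and verifies the per-iteration inequality along the trajectory; since (6) holds pathwise for every sampled $i$, an assert is appropriate here:

# Empirical check of (6) on the hypothetical absolute-loss components:
# ||x_{t+1} - x*||^2 <= ||x_t - x*||^2 - ReLU(f_t^[i] - f_star^[i])^2 / G^2.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
components = make_components(A, x_star)   # hypothetical helper defined earlier
G = max(np.linalg.norm(a) for a in A)     # subgradient bound for these components

x = rng.normal(size=d)
for t in range(1000):
    f_i, subgrad_i, f_i_star = components[rng.integers(n)]
    g = subgrad_i(x)
    gsq = float(g @ g)
    gap_plus = max(f_i(x) - f_i_star, 0.0)        # ReLU(f_t^[i] - f_star^[i])
    gamma = gap_plus / gsq if gsq > 0 else 0.0    # stochastic Polyak stepsize
    x_next = x - gamma * g
    lhs = float(np.sum((x_next - x_star) ** 2))
    rhs = float(np.sum((x - x_star) ** 2)) - gap_plus ** 2 / G ** 2
    assert lhs <= rhs + 1e-9, (lhs, rhs)          # inequality (6) holds pathwise
    x = x_next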
Next, on both sides of (6), we take the conditional expectation with respect to $i$ given $x_t$, which we denote by
\[
\mathbb{E}\left[ \, \cdot \mid x_t \right] \equiv \mathbb{E}_{i \sim \mathrm{unif}[1:n]}\left[ \, \cdot \mid x_t \right],
\]
and the resultant inequality is:
\begin{align*}
\mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \mid x_t \right]
&\leq \mathbb{E}\left[ \|x_t - x^\star\|^2 - \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} \,\Big|\, x_t \right] \\
&= \mathbb{E}\left[ \|x_t - x^\star\|^2 \mid x_t \right] - \mathbb{E}\left[ \frac{(\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right)}{G^2} \,\Big|\, x_t \right] \quad \text{/* using linearity of expectation */} \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2}\, \mathbb{E}\left[ (\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right) \mid x_t \right] \quad \ldots \text{(7)} \\
&\quad \text{/* using the ``taking out what's known'' rule } \mathbb{E}\left[ h(X)\, Y \mid X \right] = h(X)\, \mathbb{E}\left[ Y \mid X \right] \text{ */}
\end{align*}
Recall now Jensen’s inequality: if $\phi$ is a convex function and $Z$ is a random variable, then $\phi(\mathbb{E}[Z]) \leq \mathbb{E}[\phi(Z)]$. Setting $\phi := \mathrm{sqd} \circ \mathrm{ReLU} = \mathrm{sqd}\left( \mathrm{ReLU}(\cdot) \right)$, which is convex (see Boyd & Vandenberghe, Convex Optimization, Figure 3.7), and $Z := f^{[i]}_t - f^{[i]}_\star$ (with the expectation taken conditionally on $x_t$), we have
\begin{align*}
& (\mathrm{sqd} \circ \mathrm{ReLU})\left( \mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \right) \leq \mathbb{E}\left[ (\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right) \mid x_t \right] \\
\Leftrightarrow\;& \frac{1}{G^2}\, (\mathrm{sqd} \circ \mathrm{ReLU})\left( \mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \right) \leq \frac{1}{G^2}\, \mathbb{E}\left[ (\mathrm{sqd} \circ \mathrm{ReLU})\left( f^{[i]}_t - f^{[i]}_\star \right) \mid x_t \right]. \quad \ldots \text{(8)}
\end{align*}
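A quick numerical illustration of this instance of Jensen’s inequality (my own addition, not in the notes): draw a large sample of a random variable $Z$ and compare $\phi(\mathbb{E}[Z])$ with $\mathbb{E}[\phi(Z)]$ for $\phi = \mathrm{sqd} \circ \mathrm{ReLU}$.

# Jensen's inequality for the convex function phi = sqd o ReLU = (max{z,0})^2:
# phi(E[Z]) <= E[phi(Z)] for any random variable Z (checked on an empirical sample).
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(loc=0.3, scale=1.0, size=100_000)   # an arbitrary sample of Z
phi = lambda z: np.maximum(z, 0.0) ** 2            # sqd o ReLU
print(phi(Z.mean()), "<=", phi(Z).mean())          # LHS <= RHS holds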
Now notice the LHS term in (8):
\begin{align*}
\mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right]
&= \mathbb{E}_{i \sim \mathrm{unif}[1:n]}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( f^{[i]}_t - f^{[i]}_\star \right) \\
&= f(x_t) - f(x^\star), \quad \ldots \text{(9)}
\end{align*}
where the last term is a random variable in $x_t$ (recall that $\mathbb{E}[Y \mid X]$ is a random variable in $X$).
From (7), (8), and (9), we have
\begin{align*}
\mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \mid x_t \right]
&\leq \|x_t - x^\star\|^2 - \frac{1}{G^2}\, (\mathrm{sqd} \circ \mathrm{ReLU})\left( \mathbb{E}\left[ f^{[i]}_t - f^{[i]}_\star \mid x_t \right] \right) \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2}\, (\mathrm{sqd} \circ \mathrm{ReLU})\left( f(x_t) - f(x^\star) \right) \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2} \left( \max\{ f(x_t) - f(x^\star), 0 \} \right)^2 \\
&= \|x_t - x^\star\|^2 - \frac{1}{G^2} \left( f(x_t) - f(x^\star) \right)^2 \quad \text{/* as } f(x_t) - f(x^\star) \geq 0 \text{ */} \quad \ldots \text{(10)}
\end{align*}
Now, taking expectation with respect to $x_t$ on both sides of (10) and then using Adam’s law $\mathbb{E}\left[ \mathbb{E}\left[ Y \mid X \right] \right] = \mathbb{E}\left[ Y \right]$, we get:
\begin{align*}
& \mathbb{E}\left[ \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \mid x_t \right] \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 - \frac{1}{G^2} \left( f(x_t) - f(x^\star) \right)^2 \right] \\
\Leftrightarrow\;& \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 \right] - \mathbb{E}\left[ \frac{1}{G^2} \left( f(x_t) - f(x^\star) \right)^2 \right] \quad \text{/* using linearity of expectation on the RHS and Adam's law on the LHS */} \\
\Leftrightarrow\;& \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 \right] - \frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_t) - f(x^\star) \right)^2 \right] \\
\Leftrightarrow\;& \frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_t) - f(x^\star) \right)^2 \right] \leq \mathbb{E}\left[ \|x_t - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{t+1} - x^\star\|^2 \right]. \quad \ldots \text{(11)}
\end{align*}
Now, let us do a telescoping sum on (11) for $t = 0, \ldots, T$:
\begin{align*}
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_0) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_0 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_1 - x^\star\|^2 \right] \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_1) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_1 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_2 - x^\star\|^2 \right] \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_2) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_2 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_3 - x^\star\|^2 \right] \\
&\;\;\vdots \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_{T-1}) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_{T-1} - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_T - x^\star\|^2 \right] \\
\frac{1}{G^2}\, \mathbb{E}\left[ \left( f(x_T) - f(x^\star) \right)^2 \right] &\leq \mathbb{E}\left[ \|x_T - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right]
\end{align*}
which, after adding these inequalities and canceling the telescoping terms, yields:
\begin{align*}
\frac{1}{G^2} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]
&\leq \mathbb{E}\left[ \|x_0 - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right] \\
&= \|x_0 - x^\star\|^2 - \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right] \quad \text{/* as } x_0 \text{ is deterministic */} \\
&\leq \|x_0 - x^\star\|^2 \quad \text{/* as } \mathbb{E}\left[ \|x_{T+1} - x^\star\|^2 \right] \geq 0 \text{ */}
\end{align*}
\begin{align*}
\Leftrightarrow\;& \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \leq G^2 \|x_0 - x^\star\|^2 \\
\Leftrightarrow\;& \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \leq \frac{G^2 \|x_0 - x^\star\|^2}{T+1}. \quad \ldots \text{(12)}
\end{align*}
Recall now Jensen’s inequality again: if $\phi$ is a convex function and $Z$ is a random variable, then $\phi(\mathbb{E}[Z]) \leq \mathbb{E}[\phi(Z)]$. Setting $\phi := \mathrm{sqd}$ and $Z := f(x_k) - f(x^\star)$, we have
\begin{align*}
& \mathrm{sqd}\left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right) \leq \mathbb{E}\left[ \mathrm{sqd}\left( f(x_k) - f(x^\star) \right) \right] \\
\Rightarrow\;& \min_{k \in [0:T]} \left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right)^2 \leq \min_{k \in [0:T]} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]. \quad \ldots \text{(13)}
\end{align*}
Also,
\begin{align*}
\sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]
&\geq \sum_{k=0}^{T} \min_{j \in [0:T]} \mathbb{E}\left[ \left( f(x_j) - f(x^\star) \right)^2 \right] \\
&= \left( \min_{j \in [0:T]} \mathbb{E}\left[ \left( f(x_j) - f(x^\star) \right)^2 \right] \right) \sum_{k=0}^{T} 1 \\
&= (T+1) \min_{j \in [0:T]} \mathbb{E}\left[ \left( f(x_j) - f(x^\star) \right)^2 \right]
\end{align*}
\begin{align*}
\Leftrightarrow\;& \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \geq \min_{k \in [0:T]} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]. \quad \ldots \text{(14)}
\end{align*}
From (13) and (14), we have
\[
\min_{k \in [0:T]} \left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right)^2 \leq \min_{k \in [0:T]} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right] \leq \frac{1}{T+1} \sum_{k=0}^{T} \mathbb{E}\left[ \left( f(x_k) - f(x^\star) \right)^2 \right]. \quad \ldots \text{(15)}
\]
Now, from (15) and (12), we have
\[
\min_{k \in [0:T]} \left( \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \right)^2 \leq \frac{G^2 \|x_0 - x^\star\|^2}{T+1}.
\]
Let the min be achieved at index $\ell \in [0:T]$; then, using the fact that $\sqrt{\cdot}$ is monotonically increasing on $\mathbb{R}_+$ (hence taking square roots does not change the direction of the inequality when both sides are nonnegative), we have
\begin{align*}
& \left( \mathbb{E}\left[ f(x_\ell) - f(x^\star) \right] \right)^2 \leq \frac{G^2 \|x_0 - x^\star\|^2}{T+1} \\
\Rightarrow\;& \mathbb{E}\left[ f(x_\ell) - f(x^\star) \right] \leq \frac{G \|x_0 - x^\star\|}{\sqrt{T+1}}.
\end{align*}
Thus we have proven that:
\[
\min_{k \in [0:T]} \mathbb{E}\left[ f(x_k) - f(x^\star) \right] \leq \frac{G \|x_0 - x^\star\|}{\sqrt{T+1}}.
\]
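To close, here is a hedged numerical illustration of the final rate, my own construction reusing the hypothetical make_components sketch above: it tracks the best suboptimality $\min_k \left( f(x_k) - f(x^\star) \right)$ along a single SGD-SPS run and prints it next to $G \|x_0 - x^\star\| / \sqrt{T+1}$. The theorem bounds the expectation of this minimum, so a single run is only indicative, not a verification.

# Track min_k (f(x_k) - f(x_star)) along one SGD-SPS run and compare it with the
# proved O(1/sqrt(T+1)) bound. Single-run check only: the theorem is in expectation.
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 20
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
components = make_components(A, x_star)        # hypothetical helper from the earlier sketch
f = lambda x: np.mean([f_i(x) for f_i, _, _ in components])
G = max(np.linalg.norm(a) for a in A)          # subgradient bound for these components

T = 2000
x0 = rng.normal(size=d)
x = np.asarray(x0, dtype=float)
best_gap = f(x) - f(x_star)                    # gap at k = 0
for t in range(T):
    f_i, subgrad_i, f_i_star = components[rng.integers(n)]
    g = subgrad_i(x)
    gsq = float(g @ g)
    gamma = max(f_i(x) - f_i_star, 0.0) / gsq if gsq > 0 else 0.0
    x = x - gamma * g
    best_gap = min(best_gap, f(x) - f(x_star))  # min over k in [0:T] of the gap

bound = G * np.linalg.norm(x0 - x_star) / np.sqrt(T + 1)  # proved rate at horizon T
print(f"best gap over iterates: {best_gap:.3e}   vs   bound: {bound:.3e}")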