
Online gradient descent and solving saddle point problems

Shuvomoy Das Gupta

April 30, 2025

Study notes based on [1] and [2].

[1] Orabona, Francesco. "A modern introduction to online learning." arXiv preprint arXiv:1912.13213 (2019).
[2] Hazan, Elad. "Introduction to online convex optimization." Foundations and Trends® in Optimization 2, no. 3-4 (2016): 157-325.
Contents

1 Online gradient descent and its regret
2 Saddle point problems and how to solve them via online gradient descent ascent
3 Solving saddle point problems using online alternating gradient descent ascent
1 Online gradient descent and its regret

Notation.

• $\partial f(x)$: set of subgradients of a convex function $f$ at point $x$
• $\|v\|$: Euclidean norm of some vector $v$
• $\Pi_V(x)$: Euclidean projection of $x$ onto the set $V$, i.e., $\Pi_V(x) = \operatorname{argmin}_{y \in V} \|y - x\|$
Online gradient descent.

• Setup. The set $V \subseteq \mathbf{R}^n$ is a compact convex set with diameter $D$, i.e., $\|x - y\| \le D$ for every $x, y \in V$.

• Online gradient descent. The online gradient descent algorithm is as follows.

  – Initialization: At $t = 1$ pick some $x_1 \in V$.
  – Update rule: For $t = 1, 2, \ldots, T$ do:
    1. Play (execute) $x_t$.
    2. A cost function $f_t : \mathbf{R}^n \to \mathbf{R}$ gets revealed, which satisfies $\|g\| \le G$ for all $g \in \partial f_t(x)$ with $x \in V$. Observe the cost $f_t(x_t)$.
    3. Compute $x_{t+1} = \Pi_V\left(x_t - \eta g_t\right)$, where $g_t \in \partial f_t(x_t)$, and store it to be played (executed) later.
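To make the update rule concrete, here is a minimal Python sketch of projected online gradient descent. The one-dimensional cost stream $f_t(x) = |x - z_t|$ on $V = [-1, 1]$ and all constants are my own illustrative choices, not from the notes.

```python
import math

def project_interval(x, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return max(lo, min(hi, x))

def online_gradient_descent(subgrads, x1, eta, lo, hi, T):
    """Run projected online gradient descent for T rounds.

    subgrads: function (t, x) -> a subgradient g_t of the revealed cost f_t at x
    Returns the list of iterates x_1, ..., x_{T+1}.
    """
    xs = [x1]
    for t in range(T):
        g = subgrads(t, xs[-1])
        xs.append(project_interval(xs[-1] - eta * g, lo, hi))
    return xs

# Toy cost stream f_t(x) = |x - z_t| with z_t alternating between 0.3 and 0.5;
# a subgradient of |x - z| is sign(x - z) (take 0 at the kink), so G = 1.
z = lambda t: 0.3 if t % 2 == 0 else 0.5
sg = lambda t, x: (x > z(t)) - (x < z(t))

T = 10000
D, G = 2.0, 1.0                      # diameter of [-1, 1], subgradient bound
eta = D / (G * math.sqrt(T))         # the stepsize from (Avg-Regret-OGD)
xs = online_gradient_descent(sg, x1=-1.0, eta=eta, lo=-1.0, hi=1.0, T=T)
```

Running this, the average cost of the played iterates exceeds the best fixed comparator's average cost by at most $DG/\sqrt{T} = 0.02$, as the regret bound below guarantees.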
Convergence result for online gradient descent. For stepsize $\eta = \frac{D}{G\sqrt{T}}$ we have the following convergence result:

$$\underbrace{\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \min_{x \in V}\frac{1}{T}\sum_{t=1}^{T} f_t(x)}_{\text{Average regret of the player at iteration } T} \;\le\; \frac{\overbrace{D}^{\text{Diameter of } V}\;\overbrace{G}^{\text{Gradient bound on } f_t}}{\sqrt{T}}. \tag{Avg-Regret-OGD}$$
Proof of (Avg-Regret-OGD). Let $x^\star \in \operatorname{argmin}_{x \in V} \sum_{t=1}^{T} f_t(x)$. Because $f_t$ is convex, using the subgradient inequality we have the following.

Subgradient inequality. A convex function $f : \mathbf{R}^n \to \mathbf{R}$ is characterized by the following subgradient inequality: for every $x, y \in \mathbf{R}^n$ and every $g \in \partial f(x)$,
$$f(y) \ge f(x) + \langle g, y - x \rangle.$$

Setting $x = x_t$, $y = x^\star$, and $g = g_t \in \partial f_t(x_t)$ in the last inequality we get:
$$f_t(x^\star) \ge f_t(x_t) + \langle g_t, x^\star - x_t \rangle,$$
leading to:
$$\underbrace{f_t(x_t) - f_t(x^\star)}_{\text{Unit regret}} \;\le\; \underbrace{\langle g_t, x_t - x^\star \rangle}_{\text{Unit regret upper bound}}. \tag{Unit-Reg-UB}$$

The unit regret upper bound term $\langle g_t, x_t - x^\star \rangle$ is the directional derivative of $f_t$ at $x_t$ evaluated in the direction of $x_t - x^\star$.
Next, we will construct a potential function for the unit regret upper bound as follows. Because by definition $x_{t+1} = \Pi_V(x_t - \eta g_t)$, we can write:
$$
\begin{aligned}
\|x_{t+1} - x^\star\|^2 &= \|\Pi_V(x_t - \eta g_t) - x^\star\|^2 && \triangleright \text{ using online gradient descent update rule} \\
&= \|\Pi_V(x_t - \eta g_t) - \Pi_V(x^\star)\|^2 && \triangleright \text{ because } x^\star \in V \text{ we must have } \Pi_V(x^\star) = x^\star \\
&\le \|x_t - \eta g_t - x^\star\|^2 && \triangleright \text{ because } \Pi_V \text{ is nonexpansive for compact convex } V \\
&= \|x_t - x^\star\|^2 - 2\eta \langle g_t, x_t - x^\star \rangle + \eta^2 \|g_t\|^2,
\end{aligned}
$$
and rearranging and dividing by $2\eta$ leads to the following inequality:
$$\underbrace{\langle g_t, x_t - x^\star \rangle}_{\text{Unit regret upper bound at } t} \;\le\; \underbrace{\frac{1}{2\eta}\|x_t - x^\star\|^2}_{\text{Potential at } t} - \underbrace{\frac{1}{2\eta}\|x_{t+1} - x^\star\|^2}_{\text{Potential at } t+1} + \underbrace{\frac{\eta}{2}\|g_t\|^2}_{\text{Some bounded term}}. \tag{Potential-Inequality-Init}$$

Hence from (Unit-Reg-UB) and (Potential-Inequality-Init), using $\|g_t\| \le G$, we can write
$$f_t(x_t) - f_t(x^\star) \le \frac{1}{2\eta}\|x_t - x^\star\|^2 - \frac{1}{2\eta}\|x_{t+1} - x^\star\|^2 + \frac{\eta}{2}\|g_t\|^2,$$
leading to the following potential function inequality that we will telescope over $t$:
$$\underbrace{f_t(x_t) - f_t(x^\star)}_{\text{Unit regret at } t} \;\le\; \underbrace{\frac{1}{2\eta}\|x_t - x^\star\|^2}_{\text{Potential at } t} - \underbrace{\frac{1}{2\eta}\|x_{t+1} - x^\star\|^2}_{\text{Potential at } t+1} + \underbrace{\frac{\eta G^2}{2}}_{\text{Some bounded term}}. \tag{Potential-Inequality}$$
Now we run (Potential-Inequality) for $t = 1, \ldots, T$ with constant stepsize $\eta$ and then add them together:
$$
\begin{aligned}
\sum_{t=1}^{T} \left( f_t(x_t) - f_t(x^\star) \right) &\le \frac{1}{2\eta}\|x_1 - x^\star\|^2 - \frac{1}{2\eta}\|x_{T+1} - x^\star\|^2 + \frac{\eta G^2 T}{2} && \triangleright \text{ summing all } T \text{ inequalities; the potentials telescope} \\
&\le \frac{1}{2\eta}\|x_1 - x^\star\|^2 + \frac{\eta G^2 T}{2} && \triangleright \text{ (because } \|x_{T+1} - x^\star\|^2 \ge 0 \text{)} \\
&\le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}. && \triangleright \text{ (because } x_1, x^\star \in V \text{ and } V \text{ has diameter } D \text{)} \quad (1)
\end{aligned}
$$

Now we can minimize the right-hand side of (1) (which is a convex function in $\eta > 0$; see the margin note) by taking the derivative with respect to $\eta$ and setting it to zero:
$$\frac{d}{d\eta}\left( \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} \right) = -\frac{D^2}{2\eta^2} + \frac{G^2 T}{2} = 0,$$
so the optimal stepsize is $\eta = \frac{D}{G\sqrt{T}}$.

The function $\eta \mapsto \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}$ is convex for $\eta > 0$ because it is the sum of the linear function $\eta \mapsto \frac{\eta G^2 T}{2}$ and the function $\eta \mapsto \frac{D^2}{2\eta}$, which is convex over the domain $\eta > 0$.

For this stepsize the minimum value of the right-hand side is:
$$\frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} \,\bigg|_{\eta = \frac{D}{G\sqrt{T}}} = \frac{DG\sqrt{T}}{2} + \frac{DG\sqrt{T}}{2} = DG\sqrt{T},$$
and hence, for this carefully chosen stepsize, we have shown
$$\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(x^\star) \le \frac{DG}{\sqrt{T}}. \qquad \blacksquare$$
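As a quick numerical sanity check of the stepsize optimization (constants are my own illustrative choices), the stationary point $\eta = D/(G\sqrt{T})$ indeed attains the value $DG\sqrt{T}$ and beats nearby stepsizes:

```python
import math

# Illustrative constants (my own choice), checking the stepsize optimization above.
D, G, T = 2.0, 5.0, 400

def rhs(eta):
    # right-hand side of (1): D^2/(2*eta) + eta*G^2*T/2
    return D**2 / (2 * eta) + eta * G**2 * T / 2

eta_star = D / (G * math.sqrt(T))   # stationary point derived above
min_val = D * G * math.sqrt(T)      # claimed minimum value of the right-hand side
```

Here `rhs(eta_star)` equals `min_val` exactly, while perturbed stepsizes give strictly larger values, consistent with convexity of the right-hand side in $\eta$.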
2 Saddle point problems and how to solve them via online gradient descent ascent

Min-max optimization via online optimization. We want to solve the following saddle point problem
$$\min_{x \in X} \; \max_{y \in Y} \; L(x, y). \tag{2}$$
We call $x$ the primal (minimizing) variable and call $y$ the dual (maximizing) variable for the Lagrangian $L$. We will assume that:

1. $X$ and $Y$ are both compact convex sets,
2. $L$ is convex-concave, i.e., it is convex in the primal variable with the dual variable fixed and concave in the dual variable with the primal variable fixed. I.e.,

(a) $x \mapsto L(x, y)$: a convex function in $x$ with $y$ fixed,
(b) $y \mapsto L(x, y)$: a concave function in $y$ with $x$ fixed.

Mnemonic: the primal variable belongs to the minimizing (infimum) player and the dual variable belongs to the maximizing (supremum) player.

Saddle point. A point $(x^\star, y^\star) \in X \times Y$ is a saddle point of the Lagrangian $L$ if for all $x \in X$ and $y \in Y$:
$$L(x^\star, y) \;\le\; L(x^\star, y^\star) \;\le\; L(x, y^\star).$$

The measure of suboptimality of a point $(\bar{x}, \bar{y}) \in X \times Y$ in terms of finding a saddle point is its duality gap, defined by
$$\max_{y \in Y} L(\bar{x}, y) \;-\; \min_{x \in X} L(x, \bar{y}).$$
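To make the definitions concrete, here is a small worked example of my own (not from [1] or [2]):

```latex
% Example: the bilinear Lagrangian L(x,y) = xy on X = Y = [-1,1].
% It is linear, hence convex, in x and linear, hence concave, in y.
\[
  L(0, y) = 0 \;\le\; L(0, 0) = 0 \;\le\; L(x, 0) = 0
  \qquad \text{for all } x, y \in [-1, 1],
\]
% so (x*, y*) = (0, 0) is a saddle point. The duality gap of any point
% (xbar, ybar) evaluates in closed form:
\[
  \max_{y \in [-1,1]} \bar{x}\, y \;-\; \min_{x \in [-1,1]} x\, \bar{y}
  \;=\; |\bar{x}| + |\bar{y}|,
\]
% which is nonnegative and vanishes exactly at the saddle point (0, 0).
```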
Solving saddle point problems via online gradient descent ascent.

• Setup. The set $X \subseteq \mathbf{R}^m$ is a compact convex set with diameter $D_X$ and the set $Y \subseteq \mathbf{R}^n$ is compact convex with diameter $D_Y$.

• Online gradient descent ascent. The online gradient descent ascent algorithm is as follows.

  – Initialization: At $t = 1$ pick some $x_1 \in X$ and $y_1 \in Y$.
  – Update rule: For $t = 1, 2, \ldots, T$, do
    1. Primal player plays $x_t$ and dual player plays $y_t$ simultaneously.
    2. The primal (minimizing) player receives the convex function $f_t(\cdot) = L(\cdot, y_t)$ and the dual (maximizing) player receives the concave function $h_t(\cdot) = L(x_t, \cdot)$.
    3. Compute $x_{t+1}$ and $y_{t+1}$ based on online gradient descent; the iterate computation rule can be explicitly written as (as if we are running two copies of online gradient descent, one for each player):
$$
\begin{aligned}
x_{t+1} &= \Pi_X\left( x_t - \eta_x g_{x,t} \right), && g_{x,t} \in \partial_x L(x_t, y_t), \\
y_{t+1} &= \Pi_Y\left( y_t + \eta_y g_{y,t} \right), && -g_{y,t} \in \partial_y\left( -L(x_t, y_t) \right), \quad \text{(Online-Grad-Desc-Asc)}
\end{aligned}
$$
where $g_{y,t}$ is a supergradient of the concave function $h_t$ at $y_t$.

Intuitively, the primal (minimizing) player is trying to minimize his convex function and the dual (maximizing) player is trying to maximize his concave function at hand. So the primal player is running online gradient descent in the same manner described earlier, as the function $f_t$ is convex. However, the dual player has a concave function $h_t$ that he wants to maximize, which is equivalent to minimizing $-h_t$, so he is running online gradient descent on $-h_t$.
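As an executable illustration of the simultaneous scheme above (my own toy instance, not from the notes), take the bilinear Lagrangian $L(x, y) = xy$ on $X = Y = [-1, 1]$, whose unique saddle point is $(0, 0)$:

```python
import math

def project(v, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return max(lo, min(hi, v))

# Toy bilinear saddle problem (my own illustration): L(x, y) = x * y
# on X = Y = [-1, 1]. grad_x L = y, grad_y L = x.
T = 10000
DX = DY = 2.0       # diameters of [-1, 1]
GX = GY = 1.0       # |y| <= 1 bounds grad_x L; |x| <= 1 bounds grad_y L
eta_x = DX / (GX * math.sqrt(T))
eta_y = DY / (GY * math.sqrt(T))

x, y = 1.0, -1.0
sx = sy = 0.0
for _ in range(T):
    sx += x; sy += y                # accumulate the played iterates x_t, y_t
    gx, gy = y, x                   # gradients evaluated at the current pair
    x, y = (project(x - eta_x * gx, -1, 1),   # simultaneous descent step
            project(y + eta_y * gy, -1, 1))   # simultaneous ascent step
xbar, ybar = sx / T, sy / T         # averaged iterates

# For bilinear L on the box, the duality gap of (xbar, ybar) is |xbar| + |ybar|.
gap = abs(xbar) + abs(ybar)
```

The averaged iterates' duality gap obeys the $(D_X G_X + D_Y G_Y)/\sqrt{T}$ bound established below, here $4/100 = 0.04$.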
Convergence results for online gradient descent ascent. Define the average iterates $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t$, and the average duality gap at iteration $T$:
$$\sup_{x \in X,\, y \in Y} \left\{ L(\bar{x}_T, y) - L(x, \bar{y}_T) \right\}.$$
We want to show that for online gradient descent ascent, we have
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{D_X G_X + D_Y G_Y}{\sqrt{T}}, \tag{3}$$
where $G_X$ and $G_Y$ are the subgradient bounds defined in the proof below. In other words, the duality gap goes to zero at a rate $O(1/\sqrt{T})$.
Proof of (3). The primal player is running the online gradient descent algorithm on the sequence of functions $f_t(\cdot) = L(\cdot, y_t)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately. Let $D_X$ be the diameter of the set $X$, let $G_X$ be the bound on the subgradients of the $f_t$s, i.e., $\|g\| \le G_X$ for all $g \in \partial_x L(x, y)$ with $x \in X$, $y \in Y$, and let $\eta_x = \frac{D_X}{G_X \sqrt{T}}$; then applying (Avg-Regret-OGD) on the $f_t$s with stepsize $\eta_x$ gives, for every $x \in X$:
$$
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) &\le \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) + \frac{D_X G_X}{\sqrt{T}} && \triangleright \text{ use (Avg-Regret-OGD) with } f_t(\cdot) = L(\cdot, y_t) \\
&\le L(x, \bar{y}_T) + \frac{D_X G_X}{\sqrt{T}}. && \triangleright \text{ use concavity of } L(x, \cdot) \text{ and Jensen's inequality} \quad (4)
\end{aligned}
$$
The dual player is running the online gradient descent algorithm on the sequence of functions $-h_t(\cdot) = -L(x_t, \cdot)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately as well. Let $D_Y$ be the diameter of the set $Y$, let $G_Y$ be the bound on the subgradients of the $-h_t$s, i.e., $\|g\| \le G_Y$ for all $g \in \partial_y(-L(x, y))$ with $x \in X$, $y \in Y$, and let $\eta_y = \frac{D_Y}{G_Y \sqrt{T}}$; then applying (Avg-Regret-OGD) with stepsize $\eta_y$ gives, for every $y \in Y$:
$$\frac{1}{T}\sum_{t=1}^{T} \left( -L(x_t, y_t) \right) \le \frac{1}{T}\sum_{t=1}^{T} \left( -L(x_t, y) \right) + \frac{D_Y G_Y}{\sqrt{T}}, \quad \triangleright \text{ use (Avg-Regret-OGD) with } -h_t(\cdot) = -L(x_t, \cdot)$$
i.e.,
$$
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) &\le \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{D_Y G_Y}{\sqrt{T}}, \\
L(\bar{x}_T, y) &\le \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{D_Y G_Y}{\sqrt{T}}. && \triangleright \text{ use convexity of } L(\cdot, y) \text{ and Jensen's inequality} \quad (5)
\end{aligned}
$$
Hence, for every $x \in X$ and $y \in Y$ we have
$$
\begin{aligned}
L(\bar{x}_T, y) - L(x, \bar{y}_T)
&= L(\bar{x}_T, y) \underbrace{- \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t)}_{\text{add and subtract terms associated with the state } (x_t, y_t)} - L(x, \bar{y}_T) \\
&\le \frac{D_Y G_Y}{\sqrt{T}} + \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) - L(x, \bar{y}_T) && \triangleright \text{ from (5)} \\
&\le \frac{D_Y G_Y}{\sqrt{T}} + \frac{D_X G_X}{\sqrt{T}}. && \triangleright \text{ from (4)}
\end{aligned}
$$
Therefore,
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{D_X G_X + D_Y G_Y}{\sqrt{T}}. \qquad \blacksquare$$
3 Solving saddle point problems using online alternating gradient descent ascent

Different iterate computation schemes give different results. In online gradient descent ascent, we ran two independent copies of online gradient descent for the primal and dual players. The communication between the players was done through the revealed functions: $f_t(\cdot) = L(\cdot, y_t)$ embedded information about $y_t$ for the primal player, and $h_t(\cdot) = L(x_t, \cdot)$ embedded information about $x_t$ for the dual player. However, this is not the only way the communication can be done. We can pick a different communication scheme or a different computation rule for the iterates, leading to a different online algorithm for solving the saddle point problem. For example, we describe the online alternating gradient descent ascent algorithm below.
Online alternating gradient descent ascent algorithm.

• Setup. The set $X \subseteq \mathbf{R}^m$ is a compact convex set with diameter $D_X$ and the set $Y \subseteq \mathbf{R}^n$ is compact convex with diameter $D_Y$.

• Online alternating gradient descent ascent. The algorithm is as follows.

  – Initialization: At $t = 1$ pick some $x_1 \in X$ and $y_1 \in Y$.
  – Update rule: For $t = 1, 2, \ldots, T$, do
    1. Primal part of online alternating gradient descent ascent:
       (a) Primal player plays $x_t$.
       (b) The primal (minimizing) player receives the convex function $f_t(\cdot) = L(\cdot, y_t)$.
       (c) Compute $x_{t+1} = \Pi_X\left( x_t - \eta_x g_{x,t} \right)$, where $g_{x,t} \in \partial_x L(x_t, y_t)$, based on online gradient descent.
    2. Dual part of online alternating gradient descent ascent:
       (a) The dual player plays $y_t$.
       (b) The dual (maximizing) player receives the concave function $h_t(\cdot) = L(x_{t+1}, \cdot)$.
       (c) Compute $y_{t+1} = \Pi_Y\left( y_t + \eta_y g_{y,t} \right)$, where $-g_{y,t} \in \partial_y\left( -L(x_{t+1}, y_t) \right)$, based on online gradient descent applied to $-h_t$.

Note that in each iteration of the previous online gradient descent ascent we ran the primal and dual steps simultaneously, so each iteration could be run in parallel. In this new algorithm, online alternating gradient descent ascent, we do it serially: we run the primal update first and the dual update later, with the dual update using the info about the latest primal update.

Compact iteration scheme for online alternating gradient descent ascent. In a compact manner, we can write the iteration computation rule for online alternating gradient descent ascent as follows. For $t = 1, 2, \ldots, T$ we have:
$$
\begin{aligned}
x_{t+1} &= \Pi_X\left( x_t - \eta_x g_{x,t} \right), && g_{x,t} \in \partial_x L(x_t, y_t), \\
y_{t+1} &= \Pi_Y\left( y_t + \eta_y g_{y,t} \right), && -g_{y,t} \in \partial_y\left( -L(x_{t+1}, y_t) \right). \quad \text{(Online-Alt-Grad-Desc-Asc)}
\end{aligned}
$$
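A minimal Python sketch of the compact alternating scheme above, run on a toy bilinear Lagrangian of my own choosing: $L(x, y) = xy$ on $X = Y = [-1, 1]$, whose saddle point is $(0, 0)$.

```python
import math

def project(v, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return max(lo, min(hi, v))

# Toy bilinear saddle problem (my own illustration): L(x, y) = x * y
# on X = Y = [-1, 1]. grad_x L = y, grad_y L = x.
T = 10000
DX = DY = 2.0       # diameters of [-1, 1]
GX = GY = 1.0       # gradient bounds on the box
eta_x = DX / (GX * math.sqrt(T))
eta_y = DY / (GY * math.sqrt(T))

x, y = 1.0, -1.0
sx = sy = 0.0
for _ in range(T):
    sx += x; sy += y                          # accumulate the played iterates
    x_next = project(x - eta_x * y, -1, 1)    # primal step uses (x_t, y_t)
    y = project(y + eta_y * x_next, -1, 1)    # dual step already sees x_{t+1}
    x = x_next
xbar, ybar = sx / T, sy / T

# For bilinear L on the box, the duality gap of (xbar, ybar) is |xbar| + |ybar|.
gap = abs(xbar) + abs(ybar)
```

The only difference from the simultaneous scheme is that the dual step is computed from $x_{t+1}$ rather than $x_t$; the duality gap of the averaged iterates still shrinks at rate $O(1/\sqrt{T})$.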
Convergence results for online alternating gradient descent ascent. Define the average duality gap at iteration $T$:
$$\sup_{x \in X,\, y \in Y} \left\{ L(\bar{x}_T, y) - L(x, \bar{y}_T) \right\},$$
where $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t$. We want to show that for online alternating gradient descent ascent, we have
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}}. \tag{6}$$
In other words, the duality gap goes to zero at a rate $O(1/\sqrt{T})$ for this algorithm as well.
Proof of (6). Like before, the primal player is running the online gradient descent algorithm on the sequence of functions $f_t(\cdot) = L(\cdot, y_t)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately. Let $D_X$ be the diameter of the set $X$, let $G_X$ be the bound on the subgradients of the $f_t$s, i.e., $\|g\| \le G_X$ for all $g \in \partial_x L(x, y)$ with $x \in X$, $y \in Y$, and let $\eta_x = \frac{D_X}{G_X \sqrt{T}}$; then applying (Avg-Regret-OGD) on the $f_t$s with stepsize $\eta_x$ gives, for every $x \in X$:
$$\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) \le \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) + \frac{D_X G_X}{\sqrt{T}}. \quad \triangleright \text{ use (Avg-Regret-OGD) with } f_t(\cdot) = L(\cdot, y_t) \quad (7)$$

Also note that, because $G_X$ is a bound on the subgradients of $x \mapsto L(x, y)$, the map $x \mapsto L(x, y)$ is $G_X$-Lipschitz for any $y \in Y$. Hence,
$$\|x_{t+1} - x_t\| = \|\Pi_X(x_t - \eta_x g_{x,t}) - \Pi_X(x_t)\| \le \eta_x \|g_{x,t}\| \le \eta_x G_X, \quad \triangleright \text{ using nonexpansiveness of convex projection and } x_t \in X$$
thus leading to
$$\left| L(x_{t+1}, y) - L(x_t, y) \right| \le G_X \|x_{t+1} - x_t\| \le \eta_x G_X^2 = \frac{D_X G_X}{\sqrt{T}} \quad \text{for all } y \in Y. \quad \triangleright \text{ use } \eta_x = \frac{D_X}{G_X\sqrt{T}} \quad (8)$$
The dual player is running the online gradient descent algorithm on the sequence of functions $-h_t(\cdot) = -L(x_{t+1}, \cdot)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately as well. Let $D_Y$ be the diameter of the set $Y$, let $G_Y$ be the bound on the subgradients of the $-h_t$s, i.e., $\|g\| \le G_Y$ for all $g \in \partial_y(-L(x, y))$ with $x \in X$, $y \in Y$, and let $\eta_y = \frac{D_Y}{G_Y \sqrt{T}}$; then applying (Avg-Regret-OGD) with stepsize $\eta_y$ gives, for every $y \in Y$:
$$\frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y) \le \frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y_t) + \frac{D_Y G_Y}{\sqrt{T}}. \quad \triangleright \text{ use (Avg-Regret-OGD) with } -h_t(\cdot) = -L(x_{t+1}, \cdot) \quad (9)$$
First, note that for every $x \in X$ and $y \in Y$,
$$
\begin{aligned}
L(\bar{x}_T, y) - L(x, \bar{y}_T)
&\le \frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ Jensen: convexity of } L(\cdot, y) \text{ and concavity of } L(x, \cdot) \\
&\le \underbrace{\frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y) + \frac{D_X G_X}{\sqrt{T}}}_{\text{terms connecting } x_t \text{ to } x_{t+1}} - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ using (8)} \\
&\le \frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y_t) + \frac{D_Y G_Y}{\sqrt{T}} + \frac{D_X G_X}{\sqrt{T}} - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ using (9)} \\
&\le \underbrace{\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{2 D_X G_X}{\sqrt{T}}}_{\text{terms connecting } x_{t+1} \text{ back to } x_t} + \frac{D_Y G_Y}{\sqrt{T}} - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ using (8) again} \\
&\le \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}}, && \triangleright \text{ using (7)}
\end{aligned}
$$
thus having
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}},$$
which completes the proof. $\blacksquare$