Online gradient descent and solving saddle point problems
Shuvomoy Das Gupta
April 30, 2025
Study notes based on [1] and [2].

[1] Orabona, Francesco. "A modern introduction to online learning." arXiv preprint arXiv:1912.13213 (2019).

[2] Hazan, Elad. "Introduction to online convex optimization." Foundations and Trends® in Optimization 2, no. 3-4 (2016): 157-325.
Contents

1 Online gradient descent and its regret
2 Saddle point problems and how to solve them via online gradient descent ascent
3 Solving saddle point problems using online alternating gradient descent ascent
1 Online gradient descent and its regret
Notation. $\partial f(x)$: set of subgradients of a convex function $f$ at point $x$. $\|v\|$: Euclidean norm of some vector $v$. $\Pi_X(v)$: Euclidean projection of $v$ onto the set $X$.
Online gradient descent.

Setup. The set $X \subseteq \mathbf{R}^n$ is a compact convex set with diameter $D$, i.e., $\|x - y\| \leq D$ for every $x, y \in X$.

Online gradient descent. The online gradient descent algorithm is as follows.

Initialization: At $t = 1$, pick some $x_1 \in X$.

Update rule: For $t = 1, 2, \ldots, T$, do:

1. Play (execute) $x_t$.

2. A cost function $f_t: \mathbf{R}^n \to \mathbf{R}$ gets revealed, which satisfies $\|g\| \leq G$ for all $g \in \partial f_t(x)$ and all $x \in \mathbf{R}^n$. Observe the cost $f_t(x_t)$.

3. Compute
$$x_{t+1} = \Pi_X\left(x_t - \eta g_t\right), \quad g_t \in \partial f_t(x_t),$$
and store it to be played (executed) later.
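To make the update rule concrete, here is a minimal Python sketch (entirely my own illustration, not from [1] or [2]; `project_ball` specializes $\Pi_X$ to a Euclidean ball, and `subgradients` stands for a hypothetical oracle returning some $g_t \in \partial f_t(x_t)$):

```python
import numpy as np

def project_ball(v, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def online_gradient_descent(subgradients, x1, eta, T, project=project_ball):
    """Run online gradient descent for T rounds.

    subgradients: callable (t, x) -> a subgradient of the cost f_t at x,
                  available only after x_t has been played.
    Returns the played iterates x_1, ..., x_T.
    """
    x = x1
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x)            # 1. play (execute) x_t
        g = subgradients(t, x)        # 2. f_t revealed; take g_t in subdiff of f_t at x_t
        x = project(x - eta * g)      # 3. x_{t+1} = proj_X(x_t - eta * g_t)
    return iterates
```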
Convergence result for online gradient descent. For stepsize
$$\eta = \frac{D}{G\sqrt{T}}$$
we have the following convergence result:
$$\underbrace{\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \frac{1}{T}\sum_{t=1}^{T} f_t(x)}_{\text{average regret of the player at iteration } T} \;\leq\; \frac{\overbrace{D}^{\text{diameter of } X}\ \overbrace{G}^{\text{gradient bound on } f_t}}{\sqrt{T}}. \tag{Avg-Regret-OGD}$$
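Before proving this bound, a quick numerical sanity check may be useful: the sketch below (a toy instance of my own with linear costs $f_t(x) = \langle g_t, x \rangle$ on a unit ball, so $D = 2$ and $G = 1$) runs online gradient descent and compares the observed average regret against $DG/\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, radius = 5, 10_000, 1.0
D, G = 2 * radius, 1.0                 # diameter of the ball X, gradient bound

gs = rng.normal(size=(T, n))
gs /= np.linalg.norm(gs, axis=1, keepdims=True)   # enforce ||g_t|| = G = 1

eta = D / (G * np.sqrt(T))             # the stepsize from (Avg-Regret-OGD)
x = np.zeros(n)
player_cost = 0.0
for g in gs:                           # linear costs f_t(x) = <g_t, x>
    player_cost += g @ x               # observe the cost of playing x_t
    x = x - eta * g                    # gradient step ...
    norm = np.linalg.norm(x)
    if norm > radius:                  # ... followed by projection onto X
        x *= radius / norm

best_fixed = -radius * np.linalg.norm(gs.sum(axis=0))   # min over X of sum_t <g_t, x>
print((player_cost - best_fixed) / T)  # observed average regret
print(D * G / np.sqrt(T))              # the bound: 0.02
```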
Proof of (Avg-Regret-OGD). Because $f_t$ is convex, we use the subgradient inequality. (Subgradient inequality: a convex function $f: \mathbf{R}^n \to \mathbf{R}$ is characterized by the following subgradient inequality:
$$f(y) \geq f(x) + \langle g, y - x \rangle \quad \text{for all } x, y \in \mathbf{R}^n \text{ and } g \in \partial f(x).)$$
Setting $x = x_t$ and $y = x$ (with $g = g_t \in \partial f_t(x_t)$) in the last inequality, we get
$$f_t(x) \geq f_t(x_t) + \langle g_t, x - x_t \rangle,$$
leading to:
$$\underbrace{f_t(x_t) - f_t(x)}_{\text{unit regret}} \;\leq\; \underbrace{\langle g_t, x_t - x \rangle}_{\text{unit regret upper bound}}. \tag{Unit-Reg-UB}$$
(The unit regret upper bound term $\langle g_t, x_t - x \rangle$ is the directional derivative of $f_t$ at $x_t$ evaluated in the direction of $x_t - x$.)
Next, we will construct a potential function for the unit regret upper bound as follows. Because, by definition, $x_{t+1} = \Pi_X(x_t - \eta g_t)$, we can write:
\begin{align*}
\|x_{t+1} - x\|^2 &= \|\Pi_X(x_t - \eta g_t) - x\|^2 && \text{using the online gradient descent update rule}\\
&= \|\Pi_X(x_t - \eta g_t) - \Pi_X(x)\|^2 && \text{because } x \in X \text{, we must have } \Pi_X(x) = x\\
&\leq \|(x_t - \eta g_t) - x\|^2 && \text{because } \Pi_X \text{ is nonexpansive for compact convex } X\\
&= \|x_t - x\|^2 - 2\eta \langle g_t, x_t - x \rangle + \eta^2 \|g_t\|^2\\
&\leq \|x_t - x\|^2 - 2\eta \langle g_t, x_t - x \rangle + \eta^2 G^2 && \text{because } \|g_t\| \leq G,
\end{align*}
leading to:
$$\underbrace{\langle g_t, x_t - x \rangle}_{\text{unit regret upper bound at } t} \;\leq\; \frac{1}{2\eta}\bigg(\underbrace{\|x_t - x\|^2}_{\text{potential at } t} - \underbrace{\|x_{t+1} - x\|^2}_{\text{potential at } t+1}\bigg) + \underbrace{\frac{\eta}{2} G^2}_{\text{some bounded term}}. \tag{Potential-Inequality-Init}$$
Hence, from (Unit-Reg-UB) we can write:
\begin{align*}
f_t(x_t) - f_t(x) &\leq \langle g_t, x_t - x \rangle && \text{from (Unit-Reg-UB)}\\
&\leq \frac{1}{2\eta}\left(\|x_t - x\|^2 - \|x_{t+1} - x\|^2\right) + \frac{\eta}{2} G^2 && \text{from (Potential-Inequality-Init)},
\end{align*}
leading to the following potential function inequality that we will telescope-sum over $t = 1, \ldots, T$:
$$\underbrace{f_t(x_t) - f_t(x)}_{\text{unit regret at } t} \;\leq\; \frac{1}{2\eta}\bigg(\underbrace{\|x_t - x\|^2}_{\text{potential at } t} - \underbrace{\|x_{t+1} - x\|^2}_{\text{potential at } t+1}\bigg) + \underbrace{\frac{\eta}{2} G^2}_{\text{some bounded term}}. \tag{Potential-Inequality}$$
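Before telescoping, here is a quick numerical check of (Potential-Inequality-Init) (and hence, via (Unit-Reg-UB), of (Potential-Inequality)) on random instances; the unit-ball feasible set and the parameter values are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
radius, eta, G = 1.0, 0.1, 1.0

def proj(v):                              # projection onto X = unit ball
    nv = np.linalg.norm(v)
    return v if nv <= radius else (radius / nv) * v

for _ in range(1000):
    x_t = proj(rng.normal(size=3))        # any iterate in X
    x = proj(rng.normal(size=3))          # any comparator in X
    g = rng.normal(size=3)
    g *= min(1.0, G / np.linalg.norm(g))  # enforce ||g_t|| <= G
    x_next = proj(x_t - eta * g)          # the OGD update
    lhs = g @ (x_t - x)                   # unit regret upper bound at t
    rhs = (np.sum((x_t - x) ** 2) - np.sum((x_next - x) ** 2)) / (2 * eta) \
          + eta * G ** 2 / 2
    assert lhs <= rhs + 1e-12             # (Potential-Inequality-Init) holds
print("inequality held on all random instances")
```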
Now we run (Potential-Inequality) for $t = 1, 2, \ldots, T$ with constant stepsize $\eta$ and then add them together:
\begin{align*}
\sum_{t=1}^{T}\left(f_t(x_t) - f_t(x)\right) &\leq \frac{1}{2\eta}\sum_{t=1}^{T}\left(\|x_t - x\|^2 - \|x_{t+1} - x\|^2\right) + \sum_{t=1}^{T}\frac{\eta}{2} G^2 && \text{summing all } T \text{ inequalities}\\
&= \frac{1}{2\eta}\left(\|x_1 - x\|^2 - \|x_{T+1} - x\|^2\right) + \frac{\eta T}{2} G^2 && \text{telescoping the potentials}\\
&\leq \frac{1}{2\eta}\|x_1 - x\|^2 + \frac{\eta T}{2} G^2 && \text{(because } \|x_{T+1} - x\|^2 \geq 0)\\
&\leq \frac{D^2}{2\eta} + \frac{\eta T G^2}{2} && \text{(because } \|x_1 - x\| \leq D). \tag{1}
\end{align*}
Now we can minimize the right-hand side of (1) (which is a convex function in $\eta$; see the note below) by taking the derivative with respect to $\eta$ and setting it to zero:
$$-\frac{D^2}{2\eta^2} + \frac{T G^2}{2} = 0, \quad \text{so the optimal stepsize is } \eta = \frac{D}{G\sqrt{T}}.$$
(The function $\eta \mapsto \frac{D^2}{2\eta} + \frac{\eta T G^2}{2}$ is convex for $\eta > 0$ because it is the sum of the linear function $\eta \mapsto \frac{\eta T G^2}{2}$ and the function $\eta \mapsto \frac{D^2}{2\eta}$, which is convex over the domain $\eta > 0$.)
The minimum value of the right-hand side is:
$$\frac{D^2}{2\eta} + \frac{\eta T G^2}{2}\,\bigg|_{\eta = D/(G\sqrt{T})} = D G \sqrt{T},$$
and hence, for this carefully chosen stepsize, we have shown
$$\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \frac{1}{T}\sum_{t=1}^{T} f_t(x) \;\leq\; \frac{D G}{\sqrt{T}}. \qquad \blacksquare$$
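The stepsize optimization above can also be verified symbolically; a small sketch using sympy (my own illustration):

```python
import sympy as sp

eta, D, G, T = sp.symbols('eta D G T', positive=True)
rhs = D**2 / (2 * eta) + eta * T * G**2 / 2        # right-hand side of (1)

eta_star = sp.solve(sp.diff(rhs, eta), eta)[0]     # stationarity in eta
print(eta_star)                                    # D/(G*sqrt(T))
print(sp.simplify(rhs.subs(eta, eta_star)))        # D*G*sqrt(T)
```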

2 Saddle point problems and how to solve them via online gradient descent ascent
Min-max optimization via online optimization. We want to solve the following saddle point problem:
$$\min_{x \in X}\ \max_{y \in Y}\ L(x, y). \tag{2}$$
We call $x$ the primal (minimizing) variable and call $y$ the dual (maximizing) variable for the Lagrangian $L$. (Mnemonic: $x$ is the minimizing (infimum) player's variable and $y$ is the maximizing (supremum) player's variable.) We will assume that:

1. $X$ and $Y$ are both compact convex sets,

2. $L$ is convex-concave, i.e., it is convex in the primal variable with the dual variable fixed and concave in the dual variable with the primal variable fixed. I.e.,

(a) $L(\cdot, y)$: a convex function in $x$ with $y \in Y$ fixed,

(b) $L(x, \cdot)$: a concave function in $y$ with $x \in X$ fixed.

Saddle point. A point $(x^\star, y^\star) \in X \times Y$ is a saddle point of the Lagrangian $L$ if
$$L(x^\star, y) \leq L(x^\star, y^\star) \leq L(x, y^\star) \quad \text{for all } x \in X,\ y \in Y.$$
The measure of suboptimality of a point $(\bar{x}, \bar{y}) \in X \times Y$ in terms of finding a saddle point is its duality gap, defined by
$$\operatorname{gap}(\bar{x}, \bar{y}) = \max_{y \in Y} L(\bar{x}, y) - \min_{x \in X} L(x, \bar{y}).$$
Solving saddle point problems via online gradient descent ascent.

Setup. The set $X \subseteq \mathbf{R}^n$ is a compact convex set with diameter $D_X$ and the set $Y \subseteq \mathbf{R}^m$ is a compact convex set with diameter $D_Y$.

Online gradient descent ascent. The online gradient descent ascent algorithm is as follows.

Initialization: At $t = 1$, pick some $x_1 \in X$ and $y_1 \in Y$.

Update rule: For $t = 1, 2, \ldots, T$, do:

1. The primal player plays $x_t$ and the dual player plays $y_t$ simultaneously.

2. The primal (minimizing) player receives the convex function $f_t(\cdot) = L(\cdot, y_t)$ and the dual (maximizing) player receives the concave function $h_t(\cdot) = L(x_t, \cdot)$.

3. Compute $x_{t+1}$ and $y_{t+1}$ based on online gradient descent; the iterate computation rule can be explicitly written as (as if we are running two copies of online gradient descent, one for each player):
$$x_{t+1} = \Pi_X\left(x_t - \eta_x g^x_t\right), \quad g^x_t \in \partial f_t(x_t) = \partial_x L(x_t, y_t),$$
$$y_{t+1} = \Pi_Y\left(y_t + \eta_y g^y_t\right), \quad g^y_t \text{ a supergradient of } h_t = L(x_t, \cdot) \text{ at } y_t. \tag{Online-Grad-Desc-Asc}$$
(Intuitively, the primal (minimizing) player is trying to minimize his convex function and the dual (maximizing) player is trying to maximize his concave function at hand. So the primal player is running online gradient descent in the same manner described earlier, as the function $f_t$ is convex. However, the dual player has a concave function $h_t$ that he wants to maximize, which is equivalent to minimizing the convex function $-h_t$, so he is running online gradient descent on $-h_t$; the ascent step above with a supergradient of $h_t$ is exactly that descent step on $-h_t$.)
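A minimal Python sketch of (Online-Grad-Desc-Asc) (my own illustration; the oracles `grad_x`, `grad_y` and the projections `proj_X`, `proj_Y` are assumed to be supplied by the user):

```python
import numpy as np

def ogda(grad_x, grad_y, proj_X, proj_Y, x1, y1, eta_x, eta_y, T):
    """Online gradient descent ascent with simultaneous updates.

    grad_x(x, y): a subgradient of L(., y) at x
    grad_y(x, y): a supergradient of L(x, .) at y
    Returns the averaged iterates (x_bar_T, y_bar_T).
    """
    x, y = x1, y1
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(y)
        gx = grad_x(x, y)              # both players read the same state (x_t, y_t)
        gy = grad_y(x, y)
        x = proj_X(x - eta_x * gx)     # descent step on f_t = L(., y_t)
        y = proj_Y(y + eta_y * gy)     # ascent step on h_t = L(x_t, .)
    return np.mean(xs, axis=0), np.mean(ys, axis=0)
```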
Convergence results for online gradient descent ascent. Define the average duality gap at iteration $T$:
$$\sup_{x \in X,\ y \in Y}\ \left(\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t)\right).$$
We want to show that for online gradient descent ascent, we have, for all $x \in X$ and $y \in Y$,
$$\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) \;\leq\; \frac{D_X G_X + D_Y G_Y}{\sqrt{T}}. \tag{3}$$
In other words, the duality gap goes to zero at a rate $O(1/\sqrt{T})$: writing $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t$, Jensen's inequality (convexity of $L(\cdot, y)$ and concavity of $L(x, \cdot)$) gives $L(\bar{x}_T, y) - L(x, \bar{y}_T) \leq \frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t)$, so the duality gap of the averaged iterates $(\bar{x}_T, \bar{y}_T)$ obeys the same bound.
Proof of (3). The primal player is running the online gradient descent algorithm on the sequence of functions $f_t(\cdot) = L(\cdot, y_t)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately. Let $D_X$ be the diameter of the set $X$, $G_X$ be the bound on the subgradients of the $f_t$s, i.e., $\|g\| \leq G_X$ for all $g \in \partial f_t(x)$, and $\eta_x = \frac{D_X}{G_X \sqrt{T}}$; then applying (Avg-Regret-OGD) on the $f_t$s with stepsize $\eta_x$ gives, for all $x \in X$:
$$\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(x) \leq \frac{D_X G_X}{\sqrt{T}}
\quad\overset{\text{use } f_t(\cdot) = L(\cdot,\, y_t)}{\Longleftrightarrow}\quad
\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) \leq \frac{D_X G_X}{\sqrt{T}}. \tag{4}$$
The dual player is running the online gradient descent algorithm on the sequence of functions $-h_t(\cdot) = -L(x_t, \cdot)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately as well. Let $D_Y$ be the diameter of the set $Y$, $G_Y$ be the bound on the subgradients of the $-h_t$s, i.e., $\|g\| \leq G_Y$ for all $g \in \partial(-h_t)(y)$, and $\eta_y = \frac{D_Y}{G_Y \sqrt{T}}$; then applying (Avg-Regret-OGD) with stepsize $\eta_y$ gives, for all $y \in Y$:
$$\frac{1}{T}\sum_{t=1}^{T} \left(-h_t(y_t)\right) - \frac{1}{T}\sum_{t=1}^{T} \left(-h_t(y)\right) \leq \frac{D_Y G_Y}{\sqrt{T}}
\quad\overset{\text{use } h_t(\cdot) = L(x_t,\, \cdot)}{\Longleftrightarrow}\quad
\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) \leq \frac{D_Y G_Y}{\sqrt{T}}. \tag{5}$$
Hence, for any $x \in X$ and $y \in Y$, we have
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y) - L(x, y_t) \Big)
&= \underbrace{\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y) - L(x_t, y_t) \Big)}_{\leq\, D_Y G_Y/\sqrt{T} \text{ from (5)}}
+ \underbrace{\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y_t) - L(x, y_t) \Big)}_{\leq\, D_X G_X/\sqrt{T} \text{ from (4)}},
\end{align*}
where we have added and subtracted the terms associated with the state $(x_t, y_t)$. Therefore,
$$\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) \;\leq\; \frac{D_X G_X + D_Y G_Y}{\sqrt{T}}. \qquad \blacksquare$$
3 Solving saddle point problems using online alternating gradient descent ascent

Different iterate computation schemes lead to different algorithms. In online gradient descent ascent, we ran two independent copies of online gradient descent, one for the primal player and one for the dual player. The communication between the players was done through the revealed functions: $f_t(\cdot) = L(\cdot, y_t)$ embedded information about $y_t$ for the primal player, and $h_t(\cdot) = L(x_t, \cdot)$ embedded information about $x_t$ for the dual player. However, this is not the only way the communication can be done. We can pick a different communication scheme or a different computation rule for the iterates, leading to a different online algorithm for solving the saddle point problem. As an example, we describe the online alternating gradient descent ascent algorithm below.
Online alternating gradient descent ascent algorithm.

Setup. The set $X \subseteq \mathbf{R}^n$ is a compact convex set with diameter $D_X$ and the set $Y \subseteq \mathbf{R}^m$ is a compact convex set with diameter $D_Y$.

Online alternating gradient descent ascent. The algorithm is as follows.

Initialization: At $t = 1$, pick some $x_1 \in X$ and $y_1 \in Y$.

Update rule: For $t = 1, 2, \ldots, T$, do:

1. Primal part of online alternating gradient descent ascent:

(a) The primal player plays $x_t$.

(b) The primal (minimizing) player receives the convex function $f_t(\cdot) = L(\cdot, y_t)$.

(c) Compute $x_{t+1}$ based on online gradient descent:
$$x_{t+1} = \Pi_X\left(x_t - \eta_x g^x_t\right), \quad g^x_t \in \partial_x L(x_t, y_t).$$

2. Dual part of online alternating gradient descent ascent:

(a) The dual player plays $y_t$.

(b) The dual (maximizing) player receives the concave function $h_t(\cdot) = L(x_{t+1}, \cdot)$.

(c) Compute $y_{t+1}$ based on online gradient descent (applied to $-h_t$):
$$y_{t+1} = \Pi_Y\left(y_t + \eta_y g^y_t\right), \quad g^y_t \text{ a supergradient of } L(x_{t+1}, \cdot) \text{ at } y_t.$$

(Note that for each iteration of the previous online gradient descent ascent, we ran the primal and dual steps simultaneously, so each iteration could be run in parallel. For this new algorithm, online alternating gradient descent ascent, we do it serially: we run the primal update first and the dual update later, with the information about the latest primal update.)
Compact iteration scheme for online alternating gradient descent ascent. In a compact manner, we can write the iterate computation rule for online alternating gradient descent ascent as follows. For $t = 1, 2, \ldots, T$, we have:
$$x_{t+1} = \Pi_X\left(x_t - \eta_x g^x_t\right), \quad g^x_t \in \partial_x L(x_t, y_t),$$
$$y_{t+1} = \Pi_Y\left(y_t + \eta_y g^y_t\right), \quad g^y_t \text{ a supergradient of } L(x_{t+1}, \cdot) \text{ at } y_t. \tag{Online-Alt-Grad-Desc-Asc}$$
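Compared with the earlier simultaneous sketch, the only change is that the dual oracle is queried at the fresh primal iterate; a minimal Python sketch (again my own illustration, with the same assumed oracles):

```python
import numpy as np

def alt_ogda(grad_x, grad_y, proj_X, proj_Y, x1, y1, eta_x, eta_y, T):
    """Online alternating gradient descent ascent, as in
    (Online-Alt-Grad-Desc-Asc): the dual step sees the fresh primal iterate."""
    x, y = x1, y1
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(y)
        x_new = proj_X(x - eta_x * grad_x(x, y))   # primal step at (x_t, y_t)
        y = proj_Y(y + eta_y * grad_y(x_new, y))   # dual step at (x_{t+1}, y_t)
        x = x_new
    return np.mean(xs, axis=0), np.mean(ys, axis=0)
```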
Convergence results for online alternating gradient descent ascent. Define the average duality gap at iteration $T$:
$$\sup_{x \in X,\ y \in Y}\ \left(\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t)\right).$$
We want to show that for online alternating gradient descent ascent, we have, for all $x \in X$ and $y \in Y$,
$$\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) \;\leq\; \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}}. \tag{6}$$
In other words, the duality gap goes to zero at a rate $O(1/\sqrt{T})$ for this algorithm as well.
Proof of (6). Like before, the primal player is running the online gradient descent algorithm on the sequence of functions $f_t(\cdot) = L(\cdot, y_t)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately. Let $D_X$ be the diameter of the set $X$, $G_X$ be the bound on the subgradients of the $f_t$s, i.e., $\|g\| \leq G_X$ for all $g \in \partial f_t(x)$, and $\eta_x = \frac{D_X}{G_X \sqrt{T}}$; then applying (Avg-Regret-OGD) on the $f_t$s with stepsize $\eta_x$ gives, for all $x \in X$ (using $f_t(\cdot) = L(\cdot, y_t)$):
$$\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) \leq \frac{D_X G_X}{\sqrt{T}}. \tag{7}$$
Also note that, because $G_X$ is a bound on the subgradients of $L(\cdot, y)$, we have
$$\left| L(x, y) - L(x', y) \right| \leq G_X \|x - x'\| \quad \text{for any } x, x' \in X \text{ and } y \in Y.$$
Hence,
\begin{align*}
\|x_{t+1} - x_t\| &= \left\| \Pi_X\left(x_t - \eta_x g^x_t\right) - x_t \right\|\\
&= \left\| \Pi_X\left(x_t - \eta_x g^x_t\right) - \Pi_X(x_t) \right\| && \text{because } x_t \in X \text{, so } \Pi_X(x_t) = x_t\\
&\leq \left\| \left(x_t - \eta_x g^x_t\right) - x_t \right\| && \text{using nonexpansiveness of convex projection}\\
&= \eta_x \|g^x_t\| \leq \eta_x G_X,
\end{align*}
thus leading to
$$\frac{1}{T}\sum_{t=1}^{T} \left| L(x_{t+1}, y) - L(x_t, y) \right| \;\leq\; G_X \cdot \eta_x G_X = \eta_x G_X^2 = \frac{D_X G_X}{\sqrt{T}}, \quad \text{using } \eta_x = \frac{D_X}{G_X \sqrt{T}}. \tag{8}$$
The dual player is running the online gradient descent algorithm on the sequence of functions $-h_t(\cdot) = -L(x_{t+1}, \cdot)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately as well. Let $D_Y$ be the diameter of the set $Y$, $G_Y$ be the bound on the subgradients of the $-h_t$s, i.e., $\|g\| \leq G_Y$ for all $g \in \partial(-h_t)(y)$, and $\eta_y = \frac{D_Y}{G_Y \sqrt{T}}$; then applying (Avg-Regret-OGD) with stepsize $\eta_y$ gives, for all $y \in Y$ (using $h_t(\cdot) = L(x_{t+1}, \cdot)$):
$$\frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y) - \frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y_t) \leq \frac{D_Y G_Y}{\sqrt{T}}. \tag{9}$$
First, note that for any $x \in X$ and $y \in Y$:
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y) - L(x, y_t) \Big)
&= \underbrace{\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y) - L(x_{t+1}, y) \Big)}_{\text{terms connecting } x_t \text{ and } x_{t+1}}
+ \frac{1}{T}\sum_{t=1}^{T} \Big( L(x_{t+1}, y) - L(x_{t+1}, y_t) \Big)\\
&\quad + \underbrace{\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_{t+1}, y_t) - L(x_t, y_t) \Big)}_{\text{terms connecting } x_{t+1} \text{ and } x_t}
+ \frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y_t) - L(x, y_t) \Big),
\end{align*}
thus having
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T} \Big( L(x_t, y) - L(x, y_t) \Big)
&\leq \frac{D_X G_X}{\sqrt{T}} + \frac{D_Y G_Y}{\sqrt{T}} + \frac{D_X G_X}{\sqrt{T}} + \frac{D_X G_X}{\sqrt{T}} && \text{using (9), (7), (8)}\\
&= \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}},
\end{align*}
which completes the proof. $\blacksquare$
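To see the alternating scheme in action end to end, the following sketch (a toy bilinear problem of my own choosing, $L(x, y) = x^\top A y$ over unit balls) runs (Online-Alt-Grad-Desc-Asc) and prints the duality gap of the averaged iterates, which should shrink roughly like $1/\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 4, 3, 1.0
A = rng.normal(size=(n, m))            # L(x, y) = x^T A y over unit balls

def proj(v):                           # Euclidean projection onto a ball of radius r
    nv = np.linalg.norm(v)
    return v if nv <= r else (r / nv) * v

def gap(x_bar, y_bar):                 # closed-form duality gap for this instance
    return r * np.linalg.norm(A.T @ x_bar) + r * np.linalg.norm(A @ y_bar)

G_X = G_Y = np.linalg.norm(A, 2) * r   # bounds on ||A y|| and ||A^T x||
D = 2 * r                              # diameter of both balls
for T in (100, 400, 1600, 6400):
    eta_x = D / (G_X * np.sqrt(T))
    eta_y = D / (G_Y * np.sqrt(T))
    x, y = proj(rng.normal(size=n)), proj(rng.normal(size=m))
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(y)
        x_new = proj(x - eta_x * (A @ y))     # primal step: grad_x L = A y
        y = proj(y + eta_y * (A.T @ x_new))   # dual step at the fresh x_{t+1}
        x = x_new
    print(T, gap(np.mean(xs, axis=0), np.mean(ys, axis=0)))
# Quadrupling T should roughly halve the printed gap, matching the 1/sqrt(T) rate.
```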