
Online gradient descent and solving saddle point problems

Shuvomoy Das Gupta

April 30, 2025

Study notes based on [1] and [2].

[1] Orabona, Francesco. "A modern introduction to online learning." arXiv preprint arXiv:1912.13213 (2019).
[2] Hazan, Elad. "Introduction to online convex optimization." Foundations and Trends® in Optimization 2, no. 3-4 (2016): 157-325.
Contents

1 Online gradient descent and its regret
2 Saddle point problems and how to solve them via online gradient descent ascent
3 Solving saddle point problems using online alternating gradient descent ascent
1 Online gradient descent and its regret

Notation.

• $\partial f(x)$: set of subgradients of a convex function $f$ at point $x$
• $\|v\|$: Euclidean norm of some vector $v$
• $\Pi_V(x)$: Euclidean projection of $x$ onto the set $V$, i.e., $\Pi_V(x) = \operatorname{argmin}_{y \in V} \|y - x\|$
Online gradient descent.

• Setup. The set $V \subseteq \mathbf{R}^n$ is a compact convex set with diameter $D$, i.e., $\|x - y\| \le D$ for every $x, y \in V$.

• Online gradient descent. The online gradient descent algorithm is as follows.

  – Initialization: At $t = 1$ pick some $x_1 \in V$.
  – Update rule: For $t = 1, 2, \ldots, T$ do:
    1. Play (execute) $x_t$.
    2. A cost function $f_t : \mathbf{R}^n \to \mathbf{R}$ gets revealed, which satisfies $\|g\| \le G$ for all $g \in \partial f_t(x)$ with $x \in V$. Observe the cost $f_t(x_t)$.
    3. Compute $x_{t+1} = \Pi_V\left(x_t - \eta g_t\right)$, where $g_t \in \partial f_t(x_t)$, and store it to be played (executed) later.
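To make the update rule concrete, here is a minimal Python sketch of projected online gradient descent. The one-dimensional cost stream $f_t(x) = |x - z_t|$ on $V = [-1, 1]$ and all constants are my own illustrative choices, not from the notes.

```python
import math

def project_interval(x, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return max(lo, min(hi, x))

def online_gradient_descent(subgrads, x1, eta, lo, hi, T):
    """Run projected online gradient descent for T rounds.

    subgrads: function (t, x) -> a subgradient g_t of the revealed cost f_t at x
    Returns the list of iterates x_1, ..., x_{T+1}.
    """
    xs = [x1]
    for t in range(T):
        g = subgrads(t, xs[-1])
        xs.append(project_interval(xs[-1] - eta * g, lo, hi))
    return xs

# Toy cost stream f_t(x) = |x - z_t| with z_t alternating between 0.3 and 0.5;
# a subgradient of |x - z| is sign(x - z) (take 0 at the kink), so G = 1.
z = lambda t: 0.3 if t % 2 == 0 else 0.5
sg = lambda t, x: (x > z(t)) - (x < z(t))

T = 10000
D, G = 2.0, 1.0                      # diameter of [-1, 1], subgradient bound
eta = D / (G * math.sqrt(T))         # the stepsize from (Avg-Regret-OGD)
xs = online_gradient_descent(sg, x1=-1.0, eta=eta, lo=-1.0, hi=1.0, T=T)
```

Running this, the average cost of the played iterates exceeds the best fixed comparator's average cost by at most $DG/\sqrt{T} = 0.02$, as the regret bound below guarantees.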
Convergence result for online gradient descent. For stepsize $\eta = \frac{D}{G\sqrt{T}}$ we have the following convergence result:

$$\underbrace{\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \min_{x \in V}\frac{1}{T}\sum_{t=1}^{T} f_t(x)}_{\text{Average regret of the player at iteration } T} \;\le\; \frac{\overbrace{D}^{\text{Diameter of } V}\;\overbrace{G}^{\text{Gradient bound on } f_t}}{\sqrt{T}}. \tag{Avg-Regret-OGD}$$
Proof of (Avg-Regret-OGD). Let $x^\star \in \operatorname{argmin}_{x \in V} \sum_{t=1}^{T} f_t(x)$. Because $f_t$ is convex, using the subgradient inequality we have the following.

Subgradient inequality. A convex function $f : \mathbf{R}^n \to \mathbf{R}$ is characterized by the following subgradient inequality: for every $x, y \in \mathbf{R}^n$ and every $g \in \partial f(x)$,
$$f(y) \ge f(x) + \langle g, y - x \rangle.$$

Setting $x = x_t$, $y = x^\star$, and $g = g_t \in \partial f_t(x_t)$ in the last inequality we get:
$$f_t(x^\star) \ge f_t(x_t) + \langle g_t, x^\star - x_t \rangle,$$
leading to:
$$\underbrace{f_t(x_t) - f_t(x^\star)}_{\text{Unit regret}} \;\le\; \underbrace{\langle g_t, x_t - x^\star \rangle}_{\text{Unit regret upper bound}}. \tag{Unit-Reg-UB}$$

The unit regret upper bound term $\langle g_t, x_t - x^\star \rangle$ is the directional derivative of $f_t$ at $x_t$ evaluated in the direction of $x_t - x^\star$.
Next, we will construct a potential function for the unit regret upper bound as follows. Because by definition $x_{t+1} = \Pi_V(x_t - \eta g_t)$, we can write:
$$
\begin{aligned}
\|x_{t+1} - x^\star\|^2 &= \|\Pi_V(x_t - \eta g_t) - x^\star\|^2 && \triangleright \text{ using online gradient descent update rule} \\
&= \|\Pi_V(x_t - \eta g_t) - \Pi_V(x^\star)\|^2 && \triangleright \text{ because } x^\star \in V \text{ we must have } \Pi_V(x^\star) = x^\star \\
&\le \|x_t - \eta g_t - x^\star\|^2 && \triangleright \text{ because } \Pi_V \text{ is nonexpansive for compact convex } V \\
&= \|x_t - x^\star\|^2 - 2\eta \langle g_t, x_t - x^\star \rangle + \eta^2 \|g_t\|^2,
\end{aligned}
$$
and rearranging and dividing by $2\eta$ leads to the following inequality:
$$\underbrace{\langle g_t, x_t - x^\star \rangle}_{\text{Unit regret upper bound at } t} \;\le\; \underbrace{\frac{1}{2\eta}\|x_t - x^\star\|^2}_{\text{Potential at } t} - \underbrace{\frac{1}{2\eta}\|x_{t+1} - x^\star\|^2}_{\text{Potential at } t+1} + \underbrace{\frac{\eta}{2}\|g_t\|^2}_{\text{Some bounded term}}. \tag{Potential-Inequality-Init}$$

Hence from (Unit-Reg-UB) and (Potential-Inequality-Init), using $\|g_t\| \le G$, we can write
$$f_t(x_t) - f_t(x^\star) \le \frac{1}{2\eta}\|x_t - x^\star\|^2 - \frac{1}{2\eta}\|x_{t+1} - x^\star\|^2 + \frac{\eta}{2}\|g_t\|^2,$$
leading to the following potential function inequality that we will telescope over $t$:
$$\underbrace{f_t(x_t) - f_t(x^\star)}_{\text{Unit regret at } t} \;\le\; \underbrace{\frac{1}{2\eta}\|x_t - x^\star\|^2}_{\text{Potential at } t} - \underbrace{\frac{1}{2\eta}\|x_{t+1} - x^\star\|^2}_{\text{Potential at } t+1} + \underbrace{\frac{\eta G^2}{2}}_{\text{Some bounded term}}. \tag{Potential-Inequality}$$
Now we run (Potential-Inequality) for $t = 1, \ldots, T$ with constant stepsize $\eta$ and then add them together:
$$
\begin{aligned}
\sum_{t=1}^{T} \left( f_t(x_t) - f_t(x^\star) \right) &\le \frac{1}{2\eta}\|x_1 - x^\star\|^2 - \frac{1}{2\eta}\|x_{T+1} - x^\star\|^2 + \frac{\eta G^2 T}{2} && \triangleright \text{ summing all } T \text{ inequalities; the potentials telescope} \\
&\le \frac{1}{2\eta}\|x_1 - x^\star\|^2 + \frac{\eta G^2 T}{2} && \triangleright \text{ (because } \|x_{T+1} - x^\star\|^2 \ge 0 \text{)} \\
&\le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}. && \triangleright \text{ (because } x_1, x^\star \in V \text{ and } V \text{ has diameter } D \text{)} \quad (1)
\end{aligned}
$$

Now we can minimize the right-hand side of (1) (which is a convex function in $\eta > 0$; see the margin note) by taking the derivative with respect to $\eta$ and setting it to zero:
$$\frac{d}{d\eta}\left( \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} \right) = -\frac{D^2}{2\eta^2} + \frac{G^2 T}{2} = 0,$$
so the optimal stepsize is $\eta = \frac{D}{G\sqrt{T}}$.

The function $\eta \mapsto \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}$ is convex for $\eta > 0$ because it is the sum of the linear function $\eta \mapsto \frac{\eta G^2 T}{2}$ and the function $\eta \mapsto \frac{D^2}{2\eta}$, which is convex over the domain $\eta > 0$.

For this stepsize the minimum value of the right-hand side is:
$$\frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} \,\bigg|_{\eta = \frac{D}{G\sqrt{T}}} = \frac{DG\sqrt{T}}{2} + \frac{DG\sqrt{T}}{2} = DG\sqrt{T},$$
and hence, for this carefully chosen stepsize, we have shown
$$\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(x^\star) \le \frac{DG}{\sqrt{T}}. \qquad \blacksquare$$
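As a quick numerical sanity check of the stepsize optimization (constants are my own illustrative choices), the stationary point $\eta = D/(G\sqrt{T})$ indeed attains the value $DG\sqrt{T}$ and beats nearby stepsizes:

```python
import math

# Illustrative constants (my own choice), checking the stepsize optimization above.
D, G, T = 2.0, 5.0, 400

def rhs(eta):
    # right-hand side of (1): D^2/(2*eta) + eta*G^2*T/2
    return D**2 / (2 * eta) + eta * G**2 * T / 2

eta_star = D / (G * math.sqrt(T))   # stationary point derived above
min_val = D * G * math.sqrt(T)      # claimed minimum value of the right-hand side
```

Here `rhs(eta_star)` equals `min_val` exactly, while perturbed stepsizes give strictly larger values, consistent with convexity of the right-hand side in $\eta$.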
2 Saddle point problems and how to solve them via online gradient descent ascent

Min-max optimization via online optimization. We want to solve the following saddle point problem
$$\min_{x \in X} \; \max_{y \in Y} \; L(x, y). \tag{2}$$
We call $x$ the primal (minimizing) variable and call $y$ the dual (maximizing) variable for the Lagrangian $L$. We will assume that:

1. $X$ and $Y$ are both compact convex sets,
2. $L$ is convex-concave, i.e., it is convex in the primal variable with the dual variable fixed and concave in the dual variable with the primal variable fixed. I.e.,

(a) $x \mapsto L(x, y)$: a convex function in $x$ with $y$ fixed,
(b) $y \mapsto L(x, y)$: a concave function in $y$ with $x$ fixed.

Mnemonic: the primal variable belongs to the minimizing (infimum) player and the dual variable belongs to the maximizing (supremum) player.

Saddle point. A point $(x^\star, y^\star) \in X \times Y$ is a saddle point of the Lagrangian $L$ if for all $x \in X$ and $y \in Y$:
$$L(x^\star, y) \;\le\; L(x^\star, y^\star) \;\le\; L(x, y^\star).$$

The measure of suboptimality of a point $(\bar{x}, \bar{y}) \in X \times Y$ in terms of finding a saddle point is its duality gap, defined by
$$\max_{y \in Y} L(\bar{x}, y) \;-\; \min_{x \in X} L(x, \bar{y}).$$
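To make the definitions concrete, here is a small worked example of my own (not from [1] or [2]):

```latex
% Example: the bilinear Lagrangian L(x,y) = xy on X = Y = [-1,1].
% It is linear, hence convex, in x and linear, hence concave, in y.
\[
  L(0, y) = 0 \;\le\; L(0, 0) = 0 \;\le\; L(x, 0) = 0
  \qquad \text{for all } x, y \in [-1, 1],
\]
% so (x*, y*) = (0, 0) is a saddle point. The duality gap of any point
% (xbar, ybar) evaluates in closed form:
\[
  \max_{y \in [-1,1]} \bar{x}\, y \;-\; \min_{x \in [-1,1]} x\, \bar{y}
  \;=\; |\bar{x}| + |\bar{y}|,
\]
% which is nonnegative and vanishes exactly at the saddle point (0, 0).
```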
Solving saddle point problems via online gradient descent ascent.

• Setup. The set $X \subseteq \mathbf{R}^m$ is a compact convex set with diameter $D_X$ and the set $Y \subseteq \mathbf{R}^n$ is compact convex with diameter $D_Y$.

• Online gradient descent ascent. The online gradient descent ascent algorithm is as follows.

  – Initialization: At $t = 1$ pick some $x_1 \in X$ and $y_1 \in Y$.
  – Update rule: For $t = 1, 2, \ldots, T$, do
    1. Primal player plays $x_t$ and dual player plays $y_t$ simultaneously.
    2. The primal (minimizing) player receives the convex function $f_t(\cdot) = L(\cdot, y_t)$ and the dual (maximizing) player receives the concave function $h_t(\cdot) = L(x_t, \cdot)$.
    3. Compute $x_{t+1}$ and $y_{t+1}$ based on online gradient descent; the iterate computation rule can be explicitly written as (as if we are running two copies of online gradient descent, one for each player):
$$
\begin{aligned}
x_{t+1} &= \Pi_X\left( x_t - \eta_x g_{x,t} \right), && g_{x,t} \in \partial_x L(x_t, y_t), \\
y_{t+1} &= \Pi_Y\left( y_t + \eta_y g_{y,t} \right), && -g_{y,t} \in \partial_y\left( -L(x_t, y_t) \right), \quad \text{(Online-Grad-Desc-Asc)}
\end{aligned}
$$
where $g_{y,t}$ is a supergradient of the concave function $h_t$ at $y_t$.

Intuitively, the primal (minimizing) player is trying to minimize his convex function and the dual (maximizing) player is trying to maximize his concave function at hand. So the primal player is running online gradient descent in the same manner described earlier, as the function $f_t$ is convex. However, the dual player has a concave function $h_t$ that he wants to maximize, which is equivalent to minimizing $-h_t$, so he is running online gradient descent on $-h_t$.
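As an executable illustration of the simultaneous scheme above (my own toy instance, not from the notes), take the bilinear Lagrangian $L(x, y) = xy$ on $X = Y = [-1, 1]$, whose unique saddle point is $(0, 0)$:

```python
import math

def project(v, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return max(lo, min(hi, v))

# Toy bilinear saddle problem (my own illustration): L(x, y) = x * y
# on X = Y = [-1, 1]. grad_x L = y, grad_y L = x.
T = 10000
DX = DY = 2.0       # diameters of [-1, 1]
GX = GY = 1.0       # |y| <= 1 bounds grad_x L; |x| <= 1 bounds grad_y L
eta_x = DX / (GX * math.sqrt(T))
eta_y = DY / (GY * math.sqrt(T))

x, y = 1.0, -1.0
sx = sy = 0.0
for _ in range(T):
    sx += x; sy += y                # accumulate the played iterates x_t, y_t
    gx, gy = y, x                   # gradients evaluated at the current pair
    x, y = (project(x - eta_x * gx, -1, 1),   # simultaneous descent step
            project(y + eta_y * gy, -1, 1))   # simultaneous ascent step
xbar, ybar = sx / T, sy / T         # averaged iterates

# For bilinear L on the box, the duality gap of (xbar, ybar) is |xbar| + |ybar|.
gap = abs(xbar) + abs(ybar)
```

The averaged iterates' duality gap obeys the $(D_X G_X + D_Y G_Y)/\sqrt{T}$ bound established below, here $4/100 = 0.04$.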
Convergence results for online gradient descent ascent. Define the average iterates $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t$, and the average duality gap at iteration $T$:
$$\sup_{x \in X,\, y \in Y} \left\{ L(\bar{x}_T, y) - L(x, \bar{y}_T) \right\}.$$
We want to show that for online gradient descent ascent, we have
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{D_X G_X + D_Y G_Y}{\sqrt{T}}, \tag{3}$$
where $G_X$ and $G_Y$ are the subgradient bounds defined in the proof below. In other words, the duality gap goes to zero at a rate $O(1/\sqrt{T})$.
Proof of (3). The primal player is running the online gradient descent algorithm on the sequence of functions $f_t(\cdot) = L(\cdot, y_t)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately. Let $D_X$ be the diameter of the set $X$, let $G_X$ be the bound on the subgradients of the $f_t$s, i.e., $\|g\| \le G_X$ for all $g \in \partial_x L(x, y)$ with $x \in X$, $y \in Y$, and let $\eta_x = \frac{D_X}{G_X \sqrt{T}}$; then applying (Avg-Regret-OGD) on the $f_t$s with stepsize $\eta_x$ gives, for every $x \in X$:
$$
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) &\le \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) + \frac{D_X G_X}{\sqrt{T}} && \triangleright \text{ use (Avg-Regret-OGD) with } f_t(\cdot) = L(\cdot, y_t) \\
&\le L(x, \bar{y}_T) + \frac{D_X G_X}{\sqrt{T}}. && \triangleright \text{ use concavity of } L(x, \cdot) \text{ and Jensen's inequality} \quad (4)
\end{aligned}
$$
The dual player is running the online gradient descent algorithm on the sequence of functions $-h_t(\cdot) = -L(x_t, \cdot)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately as well. Let $D_Y$ be the diameter of the set $Y$, let $G_Y$ be the bound on the subgradients of the $-h_t$s, i.e., $\|g\| \le G_Y$ for all $g \in \partial_y(-L(x, y))$ with $x \in X$, $y \in Y$, and let $\eta_y = \frac{D_Y}{G_Y \sqrt{T}}$; then applying (Avg-Regret-OGD) with stepsize $\eta_y$ gives, for every $y \in Y$:
$$\frac{1}{T}\sum_{t=1}^{T} \left( -L(x_t, y_t) \right) \le \frac{1}{T}\sum_{t=1}^{T} \left( -L(x_t, y) \right) + \frac{D_Y G_Y}{\sqrt{T}}, \quad \triangleright \text{ use (Avg-Regret-OGD) with } -h_t(\cdot) = -L(x_t, \cdot)$$
i.e.,
$$
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} L(x_t, y) &\le \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{D_Y G_Y}{\sqrt{T}}, \\
L(\bar{x}_T, y) &\le \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{D_Y G_Y}{\sqrt{T}}. && \triangleright \text{ use convexity of } L(\cdot, y) \text{ and Jensen's inequality} \quad (5)
\end{aligned}
$$
Hence, for every $x \in X$ and $y \in Y$ we have
$$
\begin{aligned}
L(\bar{x}_T, y) - L(x, \bar{y}_T)
&= L(\bar{x}_T, y) \underbrace{- \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t)}_{\text{add and subtract terms associated with the state } (x_t, y_t)} - L(x, \bar{y}_T) \\
&\le \frac{D_Y G_Y}{\sqrt{T}} + \frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) - L(x, \bar{y}_T) && \triangleright \text{ from (5)} \\
&\le \frac{D_Y G_Y}{\sqrt{T}} + \frac{D_X G_X}{\sqrt{T}}. && \triangleright \text{ from (4)}
\end{aligned}
$$
Therefore,
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{D_X G_X + D_Y G_Y}{\sqrt{T}}. \qquad \blacksquare$$
3 Solving saddle point problems using online alternating gradient descent ascent

Different iterate computation schemes give different results. In online gradient descent ascent, we ran two independent copies of online gradient descent for the primal and dual players. The communication between the players was done through the revealed functions: $f_t(\cdot) = L(\cdot, y_t)$ embedded information about $y_t$ for the primal player, and $h_t(\cdot) = L(x_t, \cdot)$ embedded information about $x_t$ for the dual player. However, this is not the only way the communication can be done. We can pick a different communication scheme or a different computation rule for the iterates, leading to a different online algorithm for solving the saddle point problem. For example, we describe the online alternating gradient descent ascent algorithm below.
Online alternating gradient descent ascent algorithm.

• Setup. The set $X \subseteq \mathbf{R}^m$ is a compact convex set with diameter $D_X$ and the set $Y \subseteq \mathbf{R}^n$ is compact convex with diameter $D_Y$.

• Online alternating gradient descent ascent. The algorithm is as follows.

  – Initialization: At $t = 1$ pick some $x_1 \in X$ and $y_1 \in Y$.
  – Update rule: For $t = 1, 2, \ldots, T$, do
    1. Primal part of online alternating gradient descent ascent:
       (a) Primal player plays $x_t$.
       (b) The primal (minimizing) player receives the convex function $f_t(\cdot) = L(\cdot, y_t)$.
       (c) Compute $x_{t+1} = \Pi_X\left( x_t - \eta_x g_{x,t} \right)$, where $g_{x,t} \in \partial_x L(x_t, y_t)$, based on online gradient descent.
    2. Dual part of online alternating gradient descent ascent:
       (a) The dual player plays $y_t$.
       (b) The dual (maximizing) player receives the concave function $h_t(\cdot) = L(x_{t+1}, \cdot)$.
       (c) Compute $y_{t+1} = \Pi_Y\left( y_t + \eta_y g_{y,t} \right)$, where $-g_{y,t} \in \partial_y\left( -L(x_{t+1}, y_t) \right)$, based on online gradient descent applied to $-h_t$.

Note that in each iteration of the previous online gradient descent ascent we ran the primal and dual steps simultaneously, so each iteration could be run in parallel. In this new algorithm, online alternating gradient descent ascent, we do it serially: we run the primal update first and the dual update later, with the dual update using the info about the latest primal update.

Compact iteration scheme for online alternating gradient descent ascent. In a compact manner, we can write the iteration computation rule for online alternating gradient descent ascent as follows. For $t = 1, 2, \ldots, T$ we have:
$$
\begin{aligned}
x_{t+1} &= \Pi_X\left( x_t - \eta_x g_{x,t} \right), && g_{x,t} \in \partial_x L(x_t, y_t), \\
y_{t+1} &= \Pi_Y\left( y_t + \eta_y g_{y,t} \right), && -g_{y,t} \in \partial_y\left( -L(x_{t+1}, y_t) \right). \quad \text{(Online-Alt-Grad-Desc-Asc)}
\end{aligned}
$$
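A minimal Python sketch of the compact alternating scheme above, run on a toy bilinear Lagrangian of my own choosing: $L(x, y) = xy$ on $X = Y = [-1, 1]$, whose saddle point is $(0, 0)$.

```python
import math

def project(v, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return max(lo, min(hi, v))

# Toy bilinear saddle problem (my own illustration): L(x, y) = x * y
# on X = Y = [-1, 1]. grad_x L = y, grad_y L = x.
T = 10000
DX = DY = 2.0       # diameters of [-1, 1]
GX = GY = 1.0       # gradient bounds on the box
eta_x = DX / (GX * math.sqrt(T))
eta_y = DY / (GY * math.sqrt(T))

x, y = 1.0, -1.0
sx = sy = 0.0
for _ in range(T):
    sx += x; sy += y                          # accumulate the played iterates
    x_next = project(x - eta_x * y, -1, 1)    # primal step uses (x_t, y_t)
    y = project(y + eta_y * x_next, -1, 1)    # dual step already sees x_{t+1}
    x = x_next
xbar, ybar = sx / T, sy / T

# For bilinear L on the box, the duality gap of (xbar, ybar) is |xbar| + |ybar|.
gap = abs(xbar) + abs(ybar)
```

The only difference from the simultaneous scheme is that the dual step is computed from $x_{t+1}$ rather than $x_t$; the duality gap of the averaged iterates still shrinks at rate $O(1/\sqrt{T})$.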
Convergence results for online alternating gradient descent ascent. Define the average duality gap at iteration $T$:
$$\sup_{x \in X,\, y \in Y} \left\{ L(\bar{x}_T, y) - L(x, \bar{y}_T) \right\},$$
where $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t$. We want to show that for online alternating gradient descent ascent, we have
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}}. \tag{6}$$
In other words, the duality gap goes to zero at a rate $O(1/\sqrt{T})$ for this algorithm as well.
Proof of (6). Like before, the primal player is running the online gradient descent algorithm on the sequence of functions $f_t(\cdot) = L(\cdot, y_t)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately. Let $D_X$ be the diameter of the set $X$, let $G_X$ be the bound on the subgradients of the $f_t$s, i.e., $\|g\| \le G_X$ for all $g \in \partial_x L(x, y)$ with $x \in X$, $y \in Y$, and let $\eta_x = \frac{D_X}{G_X \sqrt{T}}$; then applying (Avg-Regret-OGD) on the $f_t$s with stepsize $\eta_x$ gives, for every $x \in X$:
$$\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) \le \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) + \frac{D_X G_X}{\sqrt{T}}. \quad \triangleright \text{ use (Avg-Regret-OGD) with } f_t(\cdot) = L(\cdot, y_t) \quad (7)$$

Also note that, because $G_X$ is a bound on the subgradients of $x \mapsto L(x, y)$, the map $x \mapsto L(x, y)$ is $G_X$-Lipschitz for any $y \in Y$. Hence,
$$\|x_{t+1} - x_t\| = \|\Pi_X(x_t - \eta_x g_{x,t}) - \Pi_X(x_t)\| \le \eta_x \|g_{x,t}\| \le \eta_x G_X, \quad \triangleright \text{ using nonexpansiveness of convex projection and } x_t \in X$$
thus leading to
$$\left| L(x_{t+1}, y) - L(x_t, y) \right| \le G_X \|x_{t+1} - x_t\| \le \eta_x G_X^2 = \frac{D_X G_X}{\sqrt{T}} \quad \text{for all } y \in Y. \quad \triangleright \text{ use } \eta_x = \frac{D_X}{G_X\sqrt{T}} \quad (8)$$
The dual player is running the online gradient descent algorithm on the sequence of functions $-h_t(\cdot) = -L(x_{t+1}, \cdot)$ for $t = 1, \ldots, T$. So, we can invoke the result (Avg-Regret-OGD) on his algorithm separately as well. Let $D_Y$ be the diameter of the set $Y$, let $G_Y$ be the bound on the subgradients of the $-h_t$s, i.e., $\|g\| \le G_Y$ for all $g \in \partial_y(-L(x, y))$ with $x \in X$, $y \in Y$, and let $\eta_y = \frac{D_Y}{G_Y \sqrt{T}}$; then applying (Avg-Regret-OGD) with stepsize $\eta_y$ gives, for every $y \in Y$:
$$\frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y) \le \frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y_t) + \frac{D_Y G_Y}{\sqrt{T}}. \quad \triangleright \text{ use (Avg-Regret-OGD) with } -h_t(\cdot) = -L(x_{t+1}, \cdot) \quad (9)$$
First, note that for every $x \in X$ and $y \in Y$,
$$
\begin{aligned}
L(\bar{x}_T, y) - L(x, \bar{y}_T)
&\le \frac{1}{T}\sum_{t=1}^{T} L(x_t, y) - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ Jensen: convexity of } L(\cdot, y) \text{ and concavity of } L(x, \cdot) \\
&\le \underbrace{\frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y) + \frac{D_X G_X}{\sqrt{T}}}_{\text{terms connecting } x_t \text{ to } x_{t+1}} - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ using (8)} \\
&\le \frac{1}{T}\sum_{t=1}^{T} L(x_{t+1}, y_t) + \frac{D_Y G_Y}{\sqrt{T}} + \frac{D_X G_X}{\sqrt{T}} - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ using (9)} \\
&\le \underbrace{\frac{1}{T}\sum_{t=1}^{T} L(x_t, y_t) + \frac{2 D_X G_X}{\sqrt{T}}}_{\text{terms connecting } x_{t+1} \text{ back to } x_t} + \frac{D_Y G_Y}{\sqrt{T}} - \frac{1}{T}\sum_{t=1}^{T} L(x, y_t) && \triangleright \text{ using (8) again} \\
&\le \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}}, && \triangleright \text{ using (7)}
\end{aligned}
$$
thus having
$$\max_{y \in Y} L(\bar{x}_T, y) - \min_{x \in X} L(x, \bar{y}_T) \le \frac{3 D_X G_X + D_Y G_Y}{\sqrt{T}},$$
which completes the proof. $\blacksquare$