Consider a large-scale convex optimization problem of the form

$$p^\star = \min_{x \in \mathbb{R}^n} f(x), \tag{P}$$

where $x \in \mathbb{R}^n$ is the decision variable and $n$ is a very large number, which is common in today's machine learning problems. The function $f$ is assumed to be convex with its subgradient norm bounded by some constant $L > 0$, and $x_\star$ is an optimal solution to $(P)$.
We assume that information on this objective function is gained through a first-order oracle. Given an input point $x \in \mathbb{R}^n$, this oracle returns

$$f(x) \quad \text{and} \quad f'(x) \in \partial f(x),$$

i.e., the function value and one subgradient of $f$ at $x$.
Suppose we want to find a solution to $(P)$, so we run some iterative algorithm initialized at some point $x_0$ and then, using some algorithmic rules, compute $x_1, \ldots, x_N$, i.e., $N$ additional iterates. Because we have a large-scale setup, it is safe to assume that $N \le n - 1$ (if the decision variable is a vector with 10 billion components, it is not realistic to compute 10 billion iterates of any algorithm!). What type of algorithm should we consider?
To solve this optimization problem, we consider any algorithm that satisfies the condition

$$x_k \in x_0 + \operatorname{span}\{ f'(x_0), f'(x_1), \ldots, f'(x_{k-1}) \}, \quad k = 1, \ldots, N, \tag{Span-Cond}$$

where $f'(x_i) \in \partial f(x_i)$ is the subgradient returned by the oracle at $x_i$. This condition holds for the majority of practical first-order methods; for example, the subgradient method $x_{k+1} = x_k - t_k f'(x_k)$ satisfies it.
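As a quick illustration (this demo is mine, not part of the original argument), here is a minimal Python sketch: the plain subgradient method adds a multiple of the newest oracle subgradient at every step, so its iterates satisfy (Span-Cond) by construction. The test function $f(x) = \|x\|_1$ and the step-size rule are assumptions chosen just for the demo; we verify the span condition numerically with a least-squares projection.

```python
import numpy as np

# Demo (assumed setup): the subgradient method x_{k+1} = x_k - t_k * g_k,
# with g_k a subgradient at x_k, satisfies (Span-Cond) by construction.
# We check this on f(x) = ||x||_1, whose subgradient is sign(x).

rng = np.random.default_rng(0)
n, N = 20, 10
x0 = rng.standard_normal(n)

x, iterates, subgrads = x0.copy(), [], []
for k in range(N):
    g = np.sign(x)                      # one subgradient of ||x||_1 at x
    subgrads.append(g)
    x = x - 0.1 / np.sqrt(k + 1) * g    # diminishing step size t_k
    iterates.append(x.copy())

# Verify x_k - x_0 lies in span{g_0, ..., g_{k-1}} via least squares.
max_residual = 0.0
for k, xk in enumerate(iterates, start=1):
    G = np.column_stack(subgrads[:k])   # basis of the span built so far
    coef, *_ = np.linalg.lstsq(G, xk - x0, rcond=None)
    max_residual = max(max_residual, np.linalg.norm(G @ coef - (xk - x0)))

print("max span residual:", max_residual)
```

The residual is at machine-precision level, confirming each iterate stays in the span of the past oracle outputs.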
For any $L > 0$, $R > 0$, any positive integers $n$ and $N$ with $N \le n - 1$, and any starting point $x_0 \in \mathbb{R}^n$, there exist (i) a function $f : \mathbb{R}^n \to \mathbb{R}$, which is convex and $L$-Lipschitz continuous, and (ii) a first-order oracle providing subgradient and function value information, such that we have the lower bound on the objective inaccuracy:

$$f(x_N) - f(x_\star) \ge \frac{LR}{1 + \sqrt{N+1}}$$

for any first-order method satisfying (Span-Cond), where $x_\star$ is a minimizer of $f$ satisfying $\|x_0 - x_\star\| \le R$.
The proof proceeds by constructing a "worst-case" function, on which any first-order method satisfying (Span-Cond) will not be able to improve its initial objective value during the first $N$ computed iterations.
First, we define two convex functions $f_1$ and $f_2$ as follows. Define the convex function

$$f_1(x) = L \max_{1 \le i \le N+1} x^{(i)},$$

where $x^{(i)}$ denotes the $i$-th component of the vector $x$. Using subdifferential calculus, we can write down the closed-form expression of $f_1$'s subdifferential at a point $x$, given by

$$\partial f_1(x) = \left\{ L \sum_{i \in I(x)} \lambda_i e_i \;\middle|\; \lambda_i \ge 0 \text{ for } i \in I(x), \; \sum_{i \in I(x)} \lambda_i = 1 \right\},$$

where (recall from notation)

$$I(x) = \left\{ i \in \{1, \ldots, N+1\} \;\middle|\; x^{(i)} = \max_{1 \le j \le N+1} x^{(j)} \right\},$$

i.e., any element of $I(x)$ corresponds to an index of a maximal component of the vector $x$, searched over its first $N+1$ components. Next, define the function

$$f_2(x) = L \left( \|x\| - R \right),$$

which has subdifferential

$$\partial f_2(x) = \begin{cases} \left\{ L \, x / \|x\| \right\}, & x \ne 0, \\ \left\{ L g \mid \|g\| \le 1 \right\}, & x = 0. \end{cases}$$
Finally, define the function $f$, which is going to be our worst-case function:

$$f(x) = \max\{ f_1(x), f_2(x) \}.$$

Note that $f$ is a convex function because it is the point-wise maximum of two convex functions. Using subdifferential calculus, we can write down the closed-form expression of $f$'s subdifferential at a point $x$, given by:

$$\partial f(x) = \begin{cases} \partial f_1(x), & f_1(x) > f_2(x), \\ \operatorname{conv}\left( \partial f_1(x) \cup \partial f_2(x) \right), & f_1(x) = f_2(x), \\ \partial f_2(x), & f_1(x) < f_2(x). \end{cases}$$
Hence, if we consider a point $x$ and any subgradient of $f_1$ at $x$, denoted by $g_1$, it can be expressed as

$$g_1 = L \sum_{i \in I(x)} \lambda_i e_i, \qquad \lambda_i \ge 0, \quad \sum_{i \in I(x)} \lambda_i = 1,$$

with the norm satisfying:

$$\|g_1\| = L \left\| \sum_{i \in I(x)} \lambda_i e_i \right\| \overset{(a)}{\le} L \sum_{i \in I(x)} \lambda_i \|e_i\| = L, \tag{1}$$

where $(a)$ uses the triangle inequality and the fact that $\|e_i\| = 1$ for all $i$. Also, it is clear from the subdifferential of $f_2$ that any $g_2 \in \partial f_2(x)$ would satisfy

$$\|g_2\| \le L. \tag{2}$$

Hence, from the subdifferential of $f$ above, and (1), (2), we find that any $g \in \partial f(x)$, being a convex combination of elements of $\partial f_1(x)$ and $\partial f_2(x)$, will satisfy:

$$\|g\| \le L. \tag{3}$$
Let us now verify that $f$ is indeed $L$-Lipschitz continuous. For any $x, y \in \mathbb{R}^n$ and any $g \in \partial f(x)$, from the subgradient inequality we have

$$f(y) \ge f(x) + \langle g, y - x \rangle \ge f(x) - \|g\| \|x - y\| \overset{(a)}{\ge} f(x) - L \|x - y\|,$$

so that $f(x) - f(y) \le L \|x - y\|$, where $(a)$ uses (3). The last inequality is symmetric with respect to $x$ and $y$, hence for any $x, y \in \mathbb{R}^n$ we also have:

$$f(y) - f(x) \le L \|x - y\|.$$

Thus, combining the last two inequalities, for any $x, y \in \mathbb{R}^n$ we have

$$|f(x) - f(y)| \le L \|x - y\|,$$

i.e., $f$ is $L$-Lipschitz continuous.
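As a sanity check on this property (a sketch of mine, with assumed values for $L$, $R$, $n$, $N$; the functions mirror the definitions above), we can sample random pairs of points and confirm that the observed difference quotient never exceeds $L$:

```python
import numpy as np

# Numeric check (assumed parameters) that f(x) = max{f1(x), f2(x)} with
#   f1(x) = L * max(x[0], ..., x[N]),   f2(x) = L * (||x|| - R)
# is L-Lipschitz continuous.
L, R, n, N = 2.0, 1.0, 50, 9

def f(x):
    f1 = L * np.max(x[: N + 1])
    f2 = L * (np.linalg.norm(x) - R)
    return max(f1, f2)

rng = np.random.default_rng(1)
worst_ratio = 0.0
for _ in range(10_000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    ratio = abs(f(x) - f(y)) / np.linalg.norm(x - y)
    worst_ratio = max(worst_ratio, ratio)

print("worst |f(x)-f(y)| / ||x-y|| observed:", worst_ratio)
```

By the argument above, the printed value can never exceed $L$, no matter which random pairs are drawn.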
Now let us define the oracle. When asked for first-order information about the function at a point $x$, the oracle returns the function value $f(x)$, and one subgradient of $f$ at $x$ given by (note that it lies in $\partial f(x)$ defined above)

$$f'(x) = \begin{cases} L e_{i_\star(x)}, & f_1(x) \ge f_2(x), \\ L \, x / \|x\|, & f_1(x) < f_2(x), \end{cases}$$

where (recall from notation) $i_\star(x)$ is the first coordinate index of $x$ that attains the maximum over the first $N+1$ components, i.e.,

$$i_\star(x) = \min \left\{ i \;\middle|\; i \in I(x) \right\}.$$

(Note that $f_1(x) < f_2(x)$ forces $x \ne 0$, so the second branch is well defined.) We will soon see that this oracle provides the worst possible subgradient for reducing the function value along the iterates; this type of oracle is called a *resisting oracle*.
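The resisting oracle is short to implement; here is a sketch (the parameter values are my own assumptions for illustration), relying on the fact that `np.argmax` returns the *first* maximizing index:

```python
import numpy as np

# A sketch of the resisting oracle (assumed parameters L, R, n, N).
L, R, n, N = 2.0, 1.0, 50, 9

def oracle(x):
    """Return f(x) and the 'resisting' subgradient f'(x)."""
    f1 = L * np.max(x[: N + 1])
    f2 = L * (np.linalg.norm(x) - R)
    if f1 >= f2:
        # i_star(x): first index attaining the max of the first N+1 components
        i_star = int(np.argmax(x[: N + 1]))
        g = np.zeros(n)
        g[i_star] = L
    else:
        g = L * x / np.linalg.norm(x)   # here x != 0, since f2(0) < f1(0)
    return max(f1, f2), g

fval, g = oracle(np.zeros(n))
print(fval, g[:3])   # f(0) = 0 and f'(0) = L * e_1
```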
Define the point $x_\star$ as:

$$x_\star = -\frac{R}{1 + \sqrt{N+1}} \sum_{i=1}^{N+1} e_i,$$

which has norm

$$\|x_\star\| = \frac{R}{1 + \sqrt{N+1}} \sqrt{N+1} = \frac{\sqrt{N+1}}{1 + \sqrt{N+1}} R.$$

As a result,

$$\|x_\star\| \le R.$$
We claim that $x_\star$ is a global minimum of $f$. First, observe that:

$$f_1(x_\star) = -\frac{LR}{1 + \sqrt{N+1}} = L \left( \|x_\star\| - R \right) = f_2(x_\star),$$

as all the first $N+1$ components of $x_\star$ are the same. Next note that, at $x_\star$, $f$ has the subdifferential

$$\partial f(x_\star) = \operatorname{conv}\left( \partial f_1(x_\star) \cup \partial f_2(x_\star) \right) = \left\{ \lambda L \sum_{i=1}^{N+1} \lambda_i e_i + (1 - \lambda) L \frac{x_\star}{\|x_\star\|} \;\middle|\; \lambda \in [0, 1], \; \lambda_i \ge 0, \; \sum_{i=1}^{N+1} \lambda_i = 1 \right\}.$$

In the last equation, if we set $\lambda = \alpha := \left( 1 + \frac{1}{\sqrt{N+1}} \right)^{-1}$ and $\lambda_i = \frac{1}{N+1}$ for all $i \in \{1, \ldots, N+1\}$, then we have one subgradient of $f$ at $x_\star$ as follows:

$$g = \frac{\alpha L}{N+1} \sum_{i=1}^{N+1} e_i - \frac{(1 - \alpha) L}{\sqrt{N+1}} \sum_{i=1}^{N+1} e_i = L \sum_{i=1}^{N+1} \left( \frac{\alpha}{N+1} - \frac{1 - \alpha}{\sqrt{N+1}} \right) e_i = 0,$$

where in the last line we have used the value of $\alpha$. Hence $0 \in \partial f(x_\star)$, i.e., $x_\star$ is a global minimum of $f$, with optimal value $f(x_\star) = -\frac{LR}{1 + \sqrt{N+1}}$.
The following Wolfram Language snippet verifies symbolically that this coefficient vanishes (we write lowercase `n` for $N$, since `N` is a protected symbol in Mathematica):

```mathematica
alpha = (1 + 1/Sqrt[n + 1])^-1;
alpha/(n + 1) - (1 - alpha)/Sqrt[n + 1] // Simplify
```
**Computing iterate $x_1$.** Let us initialize our algorithm with the initial value $x_0 = 0$, which satisfies $\|x_0 - x_\star\| = \|x_\star\| \le R$. For this $x_0$ we have

$$f(x_0) = \max\{ f_1(0), f_2(0) \} = \max\{ 0, -LR \} = 0.$$

We ask the resisting oracle about first-order information about the function at $x_0$. It returns

$$f(x_0) = 0 \quad \text{and} \quad f'(x_0) = L e_{i_\star(x_0)} = L e_1,$$

because $f_1(x_0) \ge f_2(x_0)$ and $i_\star(x_0) = 1$. So $x_1$, which is the first computed iterate by our first-order method, is going to satisfy (Span-Cond):

$$x_1 \in x_0 + \operatorname{span}\{ f'(x_0) \} = \operatorname{span}\{ e_1 \}, \quad \text{i.e.,} \quad x_1 = \xi_1 e_1 \text{ for some } \xi_1 \in \mathbb{R}.$$

For the point $x_1$ we have

$$f(x_1) \ge f_1(x_1) = L \max\{ \xi_1, 0 \} \ge 0 = f(x_0).$$
**Computing iterate $x_2$.** Consider the case $\xi_1 > 0$; then

$$f_1(x_1) = L \xi_1 > L (\xi_1 - R) = f_2(x_1), \quad \text{and} \quad i_\star(x_1) = 1.$$

Now we ask the oracle about first-order information about the function at $x_1$. It returns

$$f(x_1) = L \xi_1 \quad \text{and} \quad f'(x_1) = L e_{i_\star(x_1)} = L e_1.$$

Hence, for the case $\xi_1 > 0$, the second iterate is going to satisfy (Span-Cond):

$$x_2 \in x_0 + \operatorname{span}\{ f'(x_0), f'(x_1) \} = \operatorname{span}\{ e_1 \}.$$

Now consider the case $\xi_1 \le 0$; then the maximum of the first $N+1$ components of $x_1$ is $0$, so $i_\star(x_1) \in \{1, 2\}$, and the oracle returns either $f'(x_1) = L e_{i_\star(x_1)} \in \{ L e_1, L e_2 \}$ (when $f_1(x_1) \ge f_2(x_1)$) or $f'(x_1) = L x_1 / \|x_1\| = -L e_1$ (when $f_1(x_1) < f_2(x_1)$, which can happen only if $\xi_1 < -R$). So for the case $\xi_1 \le 0$, we have

$$x_2 \in x_0 + \operatorname{span}\{ f'(x_0), f'(x_1) \} \subseteq \operatorname{span}\{ e_1, e_2 \}.$$

So, irrespective of $\xi_1 > 0$ or $\xi_1 \le 0$, we have

$$x_2 \in \operatorname{span}\{ e_1, e_2 \}.$$
**Generalization about the iterates.** Continuing in this manner, we can show that for any $k \in \{1, \ldots, N\}$ we have

$$x_k \in \operatorname{span}\{ e_1, e_2, \ldots, e_k \},$$

and as a result, the components of $x_k$ corresponding to indices $k+1, \ldots, n$ will be zero for all $k \in \{0, 1, \ldots, N\}$, i.e.,

$$x_k^{(k+1)} = x_k^{(k+2)} = \cdots = x_k^{(n)} = 0.$$
Hence, the objective value of the iterates is going to satisfy, for all $k \in \{0, 1, \ldots, N\}$,

$$f(x_k) \ge f_1(x_k) = L \max_{1 \le i \le N+1} x_k^{(i)} \overset{(a)}{\ge} L x_k^{(N+1)} = 0 = f(x_0), \tag{ObjNotImprv}$$

where in $(a)$ we have used the simple fact that the maximum element of a list is greater than or equal to any element of the list. This is interesting: it says that no matter what we do, we cannot improve the initial objective value during the iterations $k = 1, \ldots, N$!
**Final lower bound.** Finally, applying (ObjNotImprv) for the $N$-th iterate, we get

$$f(x_N) - f(x_\star) \ge f(x_0) - f(x_\star) = 0 - \left( -\frac{LR}{1 + \sqrt{N+1}} \right) = \frac{LR}{1 + \sqrt{N+1}},$$

thus completing our proof.
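As an end-to-end sanity check (a sketch with assumed parameter values, not part of the proof), we can run a span-respecting method, e.g. the subgradient method, against the resisting oracle and confirm both (ObjNotImprv) and the final bound numerically:

```python
import numpy as np

# End-to-end check (assumed L, R, n, N): run the subgradient method,
# which satisfies (Span-Cond), against the resisting oracle and verify
# that f(x_k) >= f(x_0) = 0 for all k <= N, plus the final lower bound.
L, R, n, N = 2.0, 1.0, 50, 9

def oracle(x):
    f1 = L * np.max(x[: N + 1])
    f2 = L * (np.linalg.norm(x) - R)
    if f1 >= f2:
        g = np.zeros(n)
        g[int(np.argmax(x[: N + 1]))] = L   # first maximizing index
    else:
        g = L * x / np.linalg.norm(x)
    return max(f1, f2), g

x = np.zeros(n)                              # x_0 = 0
values = []
for k in range(N + 1):
    fval, g = oracle(x)
    values.append(fval)                      # record f(x_k)
    x = x - (R / (L * np.sqrt(N + 1))) * g   # a standard constant step

f_star = -L * R / (1 + np.sqrt(N + 1))       # f(x_star) from the construction
bound = L * R / (1 + np.sqrt(N + 1))

print(min(values) >= 0.0)                    # (ObjNotImprv) holds
print(values[N] - f_star >= bound)           # final lower bound holds
```

Whatever step sizes are used, the proof guarantees both printed checks evaluate to `True`: the oracle keeps steering the method into coordinates it has not yet explored.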
Drori, Yoel, and Marc Teboulle. "An optimal variant of Kelley's cutting-plane method." Mathematical Programming 160.1 (2016): 321-351.
Beck, Amir. *First-Order Methods in Optimization.* Society for Industrial and Applied Mathematics, 2017.