A lower-complexity bound for the class of convex and Lipschitz continuous functions

Shuvomoy Das Gupta

November 8, 2021

In this blog, we discuss how to derive a lower-complexity bound for the class of convex and Lipschitz continuous functions. This blog is based on [1, Section 3.2.1].

[1] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers, 2004.




Background

Large-scale convex optimization problem

Consider a large-scale convex optimization problem of the form

$$\begin{array}{ll} \textrm{minimize} & f(x),\quad(P)\end{array}$$

where $x\in\mathbf{R}^{d}$ is the decision variable and $d$ is a very large number, as is common in today's machine learning problems. The function $f$ is assumed to be convex with its subgradient norm bounded by some constant $L>0$ over a large enough ball of radius $R$ around $x_{\star}$, where $x_{\star}$ is an optimal solution to $(P)$.

The concept of an oracle

We assume that information on this objective function $f$ is gained through a first-order oracle. Given an input point $x\in\mathbf{R}^{d}$, this oracle returns

  • $f(x)$, which is the value of the objective function, and

  • $f^{\prime}(x)$, which is one element in its subdifferential set $\partial f(x)$.

Algorithms considered to solve $(P)$

Suppose we want to find a solution to $(P)$, so we have to run some iterative algorithm initialized at some point $x_{0}\in\mathbf{R}^{d}$ and then, using some algorithmic rules, compute $x_{1},x_{2},\ldots,x_{N}$, i.e., $N$ additional iterates. Because we have a large-scale setup, it is safe to assume that $N\leq d-1$ (if the decision variable is a vector with 10 billion components, it is not realistic to compute 10 billion iterates of any algorithm!). What type of algorithm should we consider?

To solve this optimization problem, we consider algorithms that satisfy the following condition (we use the notation $[1:N]=\{1,2,\ldots,N\}$):

$$\left(\forall i\in[1:N]\right)\quad x_{i}\in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0}),\ldots,f^{\prime}(x_{i-1})\right\},\quad\textrm{(Span-Cond)}$$

where $f^{\prime}(x_{j})\in\partial f(x_{j})$ denotes the subgradient returned by the oracle at $x_{j}$. This condition holds for the majority of practical methods; a simple example follows.
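
For instance, the classical subgradient method with step sizes $h_{i}>0$, one simple member of this class, computes

$$x_{i}=x_{i-1}-h_{i}\,f^{\prime}(x_{i-1}),$$

so that every iterate $x_{i}$ lies in $x_{0}+\textrm{span}\{f^{\prime}(x_{0}),\ldots,f^{\prime}(x_{i-1})\}$, and (Span-Cond) is satisfied.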

We next present the main lower bound result, and then we prove it.


A lower bound result

For any $L,R>0$, any $N,d\in\mathbf{N}$ with $N\leq d-1$, and any starting point $x_{0}\in\mathbf{R}^{d}$, there exist (i) a function $f:\mathbf{R}^{d}\to\mathbf{R}$, which is convex and $L$-Lipschitz continuous on the ball $B(x_{\star};R)=\{x\in\mathbf{R}^{d}\mid\|x-x_{\star}\|_{2}^{2}\leq R^{2}\}$, where $x_{\star}$ is a minimizer of $f$, and (ii) a first-order oracle providing subgradient and function value information, such that we have the lower bound on the objective inaccuracy

$$f(x_{N})-f(x_{\star})\geq\frac{LR}{2(2+\sqrt{N+1})},$$

for any first-order method satisfying (Span-Cond).
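
In other words, if a method satisfying (Span-Cond) guarantees $f(x_{N})-f(x_{\star})\leq\epsilon$ for every function in this class, then necessarily $\frac{LR}{2(2+\sqrt{N+1})}\leq\epsilon$, which for $\epsilon\leq LR/4$ means

$$N\geq\left(\frac{LR}{2\epsilon}-2\right)^{2}-1,$$

i.e., on the order of $1/\epsilon^{2}$ oracle calls are unavoidable.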


Proof of the lower bound result

The proof proceeds by constructing a "worst-case" function, on which any first-order method satisfying (Span-Cond) will not be able to improve its initial objective value $f(x_{0})$ during the first $N$ computed iterations.

Define the worst-case function $f$

Define the function:

$$f(x) := \frac{\mu}{2}\|x\|_{2}^{2}+\gamma\max\{x[1],x[2],\ldots, x[N+1]\},$$

where $x[i]$ denotes the $i$-th component of the vector $x$, and $\gamma$ and $\mu$ are some positive numbers. For now, we do not specify what $\mu$ and $\gamma$ are; we will automatically find their values in terms of $L$ and $R$ down the line. Note that $f$ is convex by definition. Using subdifferential calculus, we can write down the closed-form expression of $f$'s subdifferential at a point $x\in\mathbf{R}^{d}$ (here, for any $i\in\{1,2,\ldots\}$, $e_{i}\in\mathbf{R}^{d}$ denotes the unit vector that has 1 in its $i$-th coordinate and 0 everywhere else):

$$\begin{aligned} \partial f(x) & =\mu x+\gamma\,\mathbf{conv}\{e_{k}\mid k\in I(x)\}\\ & =\mu x+\gamma\left\{ \sum_{k\in I(x)}\lambda_{k}e_{k}\mid\left(\forall k\in I(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I(x)}\lambda_{k}=1\right\}, \end{aligned}$$

where

$$I(x)=\{\ell\in[1:N+1]\mid x[\ell]=\max\{x[1],x[2],\ldots,x[N+1]\}\},$$

i.e., any element of $I(x)$ corresponds to an index of a maximal component of the vector $x=(x[1],\ldots,x[N+1],\ldots,x[d])$, where the maximum is searched over its first $N+1$ components.
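
As a concrete illustration (with made-up numbers), take $N+1=3$, $d=4$, and $x=(1,1,0,5)$. The maximum over the first $N+1=3$ components is $1$, attained at the first two coordinates, so

$$I(x)=\{1,2\},\qquad \partial f(x)=\mu x+\gamma\,\mathbf{conv}\{e_{1},e_{2}\}=\left\{\mu x+\gamma\left(\lambda e_{1}+(1-\lambda)e_{2}\right)\mid\lambda\in[0,1]\right\};$$

the component $x[4]=5$ plays no role, since only the first $N+1$ components enter the max.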

A handy inequality to bound the subgradient of $f$

Hence, if we consider a point $x\in B(x_{\star};R)$ and any subgradient of $f$ at $x$, denoted by $f^{\prime}(x)$, it can be expressed as:

$$f^{\prime}(x)=\mu x+\gamma\sum_{k\in I(x)}\lambda_{k}e_{k},\quad\textrm{where }\left(\forall k\in I(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I(x)}\lambda_{k}=1,$$

with the norm satisfying:

$$\begin{aligned} \|f^{\prime}(x)\|_{2} & =\Big\|\mu x+\gamma\sum_{k\in I(x)}\lambda_{k}e_{k}\Big\|_{2}\\ & \overset{a)}{\leq}\mu\|x\|_{2}+\gamma\Big\|\sum_{k\in I(x)}\lambda_{k}e_{k}\Big\|_{2}\\ & \overset{b)}{\leq}\mu\|x\|_{2}+\gamma\sum_{k\in I(x)}\lambda_{k}\underbrace{\|e_{k}\|_{2}}_{=1}\\ & =\mu\|x\|_{2}+\gamma\underbrace{\sum_{k\in I(x)}\lambda_{k}}_{=1}\\ & =\mu\|x\|_{2}+\gamma\\ & \overset{c)}{\leq}\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma,\qquad(1)\end{aligned}$$

where $a)$ and $b)$ use the triangle inequality and the fact that $\lambda_{k}\geq0$ for all $k\in I(x)$, and $c)$ uses the reverse triangle inequality:

$$\begin{aligned} & \|x\|_{2}-\|x_{\star}\|_{2}\leq\|x-x_{\star}\|_{2}\leq R\\ \Rightarrow\; & \|x\|_{2}\leq R+\|x_{\star}\|_{2}.\end{aligned}$$

Verifying that $f$ is Lipschitz continuous on $B(x_{\star};R)$

Let us now verify that $f$ is indeed Lipschitz continuous on the ball $B(x_{\star};R)=\{x\in\mathbf{R}^{d}\mid\|x-x_{\star}\|_{2}^{2}\leq R^{2}\}$. For any $x,y\in B(x_{\star};R)$ and any $f^{\prime}(x)\in\partial f(x)$, from the subgradient inequality we have

$$\begin{aligned} f(y) & \geq f(x)+\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ \Leftrightarrow\; f(x)-f(y) & \leq-\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ & =\left\langle f^{\prime}(x)\mid x-y\right\rangle \\ & \overset{a)}{\leq}\|f^{\prime}(x)\|_{2}\,\|x-y\|_{2}\\ & \overset{b)}{\leq}\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|x-y\|_{2},\end{aligned}$$

where $a)$ uses the Cauchy–Schwarz inequality and $b)$ uses (1). The last inequality is symmetric with respect to $x$ and $y$; hence, for any $x,y\in B(x_{\star};R)$, we also have:

$$f(y)-f(x)\leq\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|y-x\|_{2}.$$

Thus, combining the last two inequalities, for any $x,y\in B(x_{\star};R)$, we have

$$|f(y)-f(x)|=\max\{f(y)-f(x),-\left(f(y)-f(x)\right)\}\leq\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|y-x\|_{2},$$

i.e., $f$ is $\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)$-Lipschitz continuous on $B(x_{\star};R)$.

In other words, when we pick the values of $\mu$ and $\gamma$, we need to satisfy the equation:

$$L=\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma,\qquad\textrm{(ConstraintOnL)}$$

which we will do in the end.

Defining the oracle

Now let us define the oracle. When asked for first-order information about the function at a point $x$, the oracle returns the function value $f(x)$, and one subgradient of $f$ at $x$ given by

$$f^{\prime}(x)=\mu x+\gamma e_{i_{x}}\in\partial f(x),$$

where $i_{x}$ is the first coordinate index of $x$ that satisfies $x[i_{x}]=\max_{j\in[1:N+1]}x[j]$, i.e.,

$$i_{x}=\min I(x).$$

We will soon see that this oracle provides the worst possible subgradient for reducing the function value within $N$ iterations; this type of oracle is called a resisting oracle. A small code sketch of the construction is given below.
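
To make the construction concrete, here is a minimal Mathematica sketch of the worst-case function and the resisting oracle; the values of d, NN, mu, and gam below are placeholders chosen only for illustration and are not the values used in the proof.

(*Mathematica code: illustrative sketch only*)
d = 10; NN = 5; mu = 1; gam = 1;
f[x_] := mu/2 Norm[x]^2 + gam Max[Take[x, NN + 1]];
(*resisting oracle: returns the subgradient mu*x + gam*e_ix, where ix is the smallest index attaining the maximum among the first NN+1 coordinates*)
fPrime[x_] := Module[{head = Take[x, NN + 1], ix},
   ix = Position[head, Max[head]][[1, 1]];
   mu x + gam UnitVector[d, ix]];
(*example: at x0 = 0 the oracle returns f(x0) = 0 and the subgradient gam*e_1*)
f[ConstantArray[0, d]]
fPrime[ConstantArray[0, d]]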

Constructing a global minimum of $f$

Define the point $x_{\star}$ component-wise as:

$$x_{\star}[i]=\begin{cases} -\frac{\gamma}{\mu(N+1)}, & \textrm{if }i\in[1:N+1],\\ 0, & \textrm{else}, \end{cases}\qquad\textrm{(OptPnt)}$$

which has norm

$$\|x_{\star}\|_{2}=\frac{\gamma}{\mu\sqrt{N+1}}.\qquad\textrm{(NrmOptPnt)}$$
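
This is a one-line computation: only the first $N+1$ components of $x_{\star}$ are nonzero, each equal to $-\gamma/(\mu(N+1))$, so

$$\|x_{\star}\|_{2}=\sqrt{(N+1)\cdot\frac{\gamma^{2}}{\mu^{2}(N+1)^{2}}}=\frac{\gamma}{\mu\sqrt{N+1}}.$$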

Keep in mind that, in the end, we need to pick the values of $\gamma$ and $\mu$ in such a way that the condition

$$\|x_{0}-x_{\star}\|_{2}\leq R\qquad\textrm{(OptPntDist)}$$

is satisfied. We claim that $x_{\star}$ is a global minimum of $f$. First, observe that:

$$\begin{aligned} I(x_{\star}) & =\{\ell\in[1:N+1]\mid x_{\star}[\ell]=\max\{x_{\star}[1],x_{\star}[2],\ldots, x_{\star}[N+1]\}\}\\ & =\{1,2,\ldots,N+1\},\end{aligned}$$

as all the first $N+1$ components of $x_{\star}$ are the same. Next, note that at $x_{\star}$, $f$ has the subdifferential

$$\partial f(x_{\star})=\mu x_{\star}+\gamma\left\{ \sum_{k\in[1:N+1]}\lambda_{k}e_{k}\mid\left(\forall k\in[1:N+1]\right)\;\lambda_{k}\geq0,\;\sum_{k\in[1:N+1]}\lambda_{k}=1\right\}.$$

If we set $\lambda_{k}=1/(N+1)$ for all $k\in[1:N+1]$, then one particular element of $\partial f(x_{\star})$ would be:

$$\begin{aligned} \partial f(x_{\star}) & \ni\mu x_{\star}+\gamma\overbrace{\sum_{i\in[1:N+1]}\frac{1}{N+1}e_{i}}^{z\ (\textrm{say})}\\ & =\mu\left(\underbrace{-\frac{\gamma}{\mu(N+1)}}_{x_{\star}[1]},\underbrace{-\frac{\gamma}{\mu(N+1)}}_{x_{\star}[2]},\ldots,\underbrace{-\frac{\gamma}{\mu(N+1)}}_{x_{\star}[N+1]},0,\ldots,\underbrace{0}_{x_{\star}[d]}\right)\\ & \quad+\gamma\left(\underbrace{\frac{1}{N+1}}_{z[1]},\underbrace{\frac{1}{N+1}}_{z[2]},\ldots,\underbrace{\frac{1}{N+1}}_{z[N+1]},0,\ldots,\underbrace{0}_{z[d]}\right)\\ & =(0,0,\ldots,0),\end{aligned}$$

hence, via Fermat's rule, $x_{\star}$ is a global minimum of $f$. The optimal value of $f$ at $x_{\star}$ is:

$$\begin{aligned} f(x_{\star}) & =\frac{\mu}{2}\|x_{\star}\|_{2}^{2}+\gamma\max\{x_{\star}[1],x_{\star}[2],\ldots, x_{\star}[N+1]\}\\ & =\frac{\mu}{2}\cdot\frac{\gamma^{2}}{\mu^{2}(N+1)}+\gamma\left(-\frac{\gamma}{\mu(N+1)}\right)\\ & =\frac{\gamma^{2}}{2\mu(N+1)}-\frac{\gamma^{2}}{\mu(N+1)}\\ & =-\frac{\gamma^{2}}{2\mu(N+1)}.\end{aligned}$$

Performance of the oracle in reducing function value

Let us initialize our algorithm with the initial value $x_{0}=0$, for which we have

$$\begin{aligned} I(x_{0}) & =\{\ell\in[1:N+1]\mid x_{0}[\ell]=\max\{0,0,\ldots,0\}\}\\ & =\{1,2,\ldots,N+1\},\\ i_{x_{0}} & =\min I(x_{0})=1.\end{aligned}$$

We ask the resisting oracle for first-order information about the function at $x_{0}$. It returns

$$\begin{aligned} f(x_{0}) & =\frac{\mu}{2}\|x_{0}\|_{2}^{2}+\gamma\max\{x_{0}[1],x_{0}[2],\ldots, x_{0}[N+1]\}\\ & =0,\end{aligned}$$

and

$$\begin{aligned} f^{\prime}(x_{0}) & =\mu x_{0}+\gamma e_{i_{x_{0}}}\\ & =\gamma e_{1}.\end{aligned}$$

So $x_{1}$, the first iterate computed by our first-order method, is going to satisfy (Span-Cond):

$$\begin{aligned} x_{1} & \in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0})\right\} \\ & =0+\textrm{span}\{\gamma e_{1}\}\\ & =\textrm{span}\{e_{1}\}\\ \Rightarrow x_{1} & =\xi_{1}e_{1}\textrm{ for some }\xi_{1}\in\mathbf{R}.\end{aligned}$$

For the point $x_{1}$, we have

$$\begin{aligned} I(x_{1}) & =\{\ell\in[1:N+1]\mid x_{1}[\ell]=\max\{\xi_{1},0,\ldots,0\}\}\\ & =\begin{cases} \{1,2,\ldots,N+1\}, & \textrm{if }\xi_{1}=0,\\ \{2,3,\ldots,N+1\}, & \textrm{if }\xi_{1}<0,\\ \{1\}, & \textrm{if }\xi_{1}>0, \end{cases}\end{aligned}$$

and

$$\begin{aligned} i_{x_{1}} & =\min I(x_{1})\\ & =\begin{cases} 1, & \textrm{if }\xi_{1}=0,\\ 2, & \textrm{if }\xi_{1}<0,\\ 1, & \textrm{if }\xi_{1}>0.\end{cases}\end{aligned}$$

Now we ask the oracle for first-order information about the function at $x_{1}$. It returns

$$f(x_{1})=\frac{\mu}{2}\|x_{1}\|_{2}^{2}+\gamma\max\{x_{1}[1],x_{1}[2],\ldots,x_{1}[N+1]\}$$

and

$$\begin{aligned} f^{\prime}(x_{1}) & =\mu x_{1}+\gamma e_{i_{x_{1}}}\\ & =\mu\xi_{1}e_{1}+\gamma\,(e_{1}\textrm{ or }e_{2}).\end{aligned}$$

Hence, the second iterate $x_{2}$ is going to satisfy (Span-Cond):

$$\begin{aligned} x_{2} & \in x_{0}+\textrm{span}\{f^{\prime}(x_{0}),f^{\prime}(x_{1})\}\\ & =\textrm{span}\{\gamma e_{1},\,\mu\xi_{1}e_{1}+\gamma\,(e_{1}\textrm{ or }e_{2})\}\\ & \subseteq\textrm{span}\{e_{1},e_{2}\}.\end{aligned}$$

Continuing in this manner, we can show that for any $i\in[1:N]$, we have

$$x_{i}\in\textrm{span}\{e_{1},e_{2},\ldots,e_{i}\}.$$

(The inductive step: if $x_{i}\in\textrm{span}\{e_{1},\ldots,e_{i}\}$, then its components beyond the $i$-th are zero, so the smallest maximizing index satisfies $i_{x_{i}}\leq i+1$; hence the oracle's subgradient $\mu x_{i}+\gamma e_{i_{x_{i}}}$ lies in $\textrm{span}\{e_{1},\ldots,e_{i+1}\}$, and so does $x_{i+1}$ by (Span-Cond).) As a result, the components of $x_{i}$ corresponding to indices $N+1,\ldots,d$ will be zero for all $i\in[1:N]$, i.e.,

$$x_{i}[N+1]=\cdots=x_{i}[d]=0.$$

Hence, for all $i\in[1:N]$, the objective value at the iterate $x_{i}$ is going to satisfy

$$\begin{aligned} f(x_{i}) & =\frac{\mu}{2}\|x_{i}\|_{2}^{2}+\gamma\max\{x_{i}[1],x_{i}[2],\ldots,x_{i}[N+1]\}\\ & \geq\gamma\max\{x_{i}[1],x_{i}[2],\ldots,x_{i}[N+1]\}\\ & \geq\gamma\, x_{i}[N+1]\\ & =0,\qquad\textrm{(ObjNotImprv)}\end{aligned}$$

where in the second-to-last line we have used the simple fact that the maximum element of a list is greater than or equal to any element of the list. This is interesting: it says that no matter what we do, we cannot improve on the initial objective value $f(x_{0})=0$ during the iterates $x_{1},\ldots,x_{N}$! (A quick numerical check of this appears right after (ObjGap) below.) Finally, applying (ObjNotImprv) to the $N$-th iterate, we get

$$\begin{aligned} f(x_{N})-f(x_{\star}) & \geq0-f(x_{\star})\\ & =-\left(-\frac{\gamma^{2}}{2\mu(N+1)}\right)\\ & =\frac{\gamma^{2}}{2\mu(N+1)}.\qquad\textrm{(ObjGap)}\end{aligned}$$
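
As the promised sanity check of (ObjNotImprv), one can run a plain subgradient method, which satisfies (Span-Cond), against the resisting oracle sketched earlier. The code below assumes the definitions of d, NN, mu, gam, f, and fPrime from that sketch, and the step size h is an arbitrary illustrative choice.

(*Mathematica code: numerical check of (ObjNotImprv), reusing f and fPrime from the earlier sketch*)
h = 0.1; (*arbitrary step size for the subgradient method*)
x = ConstantArray[0., d];
vals = Table[x = x - h fPrime[x]; f[x], {NN}]
(*every entry of vals is >= 0 = f(x0): the first NN iterates never improve on x0*)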

Picking the values of $\gamma$ and $\mu$

Now we are left with picking the values of $\gamma$ and $\mu$ in terms of $L$ and $R$. For that, we need to be mindful of two constraints that $\gamma$ and $\mu$ need to satisfy:

  • From (ConstraintOnL):

$$\begin{aligned} L & =\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\\ & =\mu\left(R+\frac{\gamma}{\mu\sqrt{N+1}}\right)+\gamma,\end{aligned}$$

and

  • From (OptPntDist):

$$\|x_{0}-x_{\star}\|_{2}^{2}=\|0-x_{\star}\|_{2}^{2}=\frac{\gamma^{2}}{\mu^{2}(N+1)}\leq R^{2},$$

where we have used (NrmOptPnt).

To make our life easier, we impose equality in the last inequality (i.e., $\|x_{0}-x_{\star}\|_{2}=R$) and solve for $\gamma$ and $\mu$. We can do it by hand, but I am lazy, so I have just used Mathematica.

(*Mathematica code*)
In[1] := Solve[
   Reduce[{L == \[Mu] (R + \[Gamma]/(\[Mu] Sqrt[\[CapitalNu] + 1])) + \[Gamma],
     \[Gamma]/(\[Mu] Sqrt[\[CapitalNu] + 1]) == R,
     R > 0, \[CapitalNu] > -1, L > 0, \[Gamma] > 0, \[Mu] > 0},
    {\[Gamma], \[Mu]}, Reals],
   {\[Gamma], \[Mu]}, Reals] // Simplify

The solution is:

$$\begin{aligned} \gamma & =\frac{L\sqrt{N+1}}{2+\sqrt{N+1}},\\ \mu & =\frac{L}{2R+R\sqrt{N+1}}.\end{aligned}$$
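
For the record, the same values can be obtained by hand: the equality $\|x_{0}-x_{\star}\|_{2}=R$ gives $\gamma=\mu R\sqrt{N+1}$, and substituting this into (ConstraintOnL) yields

$$L=\mu R+\frac{\gamma}{\sqrt{N+1}}+\gamma=2\mu R+\gamma=\mu R\left(2+\sqrt{N+1}\right)\;\Longrightarrow\;\mu=\frac{L}{R\left(2+\sqrt{N+1}\right)},\quad\gamma=\mu R\sqrt{N+1}=\frac{L\sqrt{N+1}}{2+\sqrt{N+1}}.$$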

Putting the values of $\gamma$ and $\mu$ into (ObjGap),

(*Mathematica code*)
In[2] := \[Gamma] = (L Sqrt[1 + \[CapitalNu]])/(2 + Sqrt[1 + \[CapitalNu]]);
\[Mu] = L/(2 R + R Sqrt[1 + \[CapitalNu]]);
ObjGap = \[Gamma]^2/(2 \[Mu] (\[CapitalNu] + 1)) // FullSimplify

we find:

$$\begin{aligned} f(x_{N})-f(x_{\star}) & \geq\frac{\gamma^{2}}{2\mu(N+1)}\\ & =\frac{LR}{4+2\sqrt{N+1}},\end{aligned}$$

which equals $\frac{LR}{2(2+\sqrt{N+1})}$, the claimed bound, thus completing our proof.