An improved lower-complexity bound for the class of convex and Lipschitz continuous functions

Shuvomoy Das Gupta

November 11, 2021

In this blog post, we discuss how to derive a lower-complexity bound for the class of convex and Lipschitz continuous functions based on [1], which improves the lower bound I derived in a previous blog post.

Notation used in the blog. We use the notation $[1:N]=\{1,2,3,\ldots,N\}$. For a set $S\subseteq\mathbf{R}^d$, its convex hull is denoted by $\mathbf{conv}\,S$. For a $d$-dimensional vector

$$x = \{x[1],x[2],\ldots,x[d]\} \in \mathbf{R}^d,$$

define the index set (elements are sorted from smallest to largest):

$$I_{[1:N+1]}(x)=\{\ell\in[1:N+1]\mid x[\ell]=\max\{x[1],x[2],\ldots, x[N+1]\}\},$$

i.e., any element of $I_{[1:N+1]}(x)$ corresponds to an index of a maximal component of vector $x$ searched over its first $N+1$ components. The first element in $I_{[1:N+1]}(x)$ is defined by

$$i_{x}=\min I_{[1:N+1]}(x).$$

Finally, for any $i\in [1:d]$, $e_i \in \mathbf{R}^d$ denotes the unit vector that has $1$ in its $i$-th coordinate and $0$ everywhere else.
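To make this notation concrete, here is a tiny Mathematica sketch; the helper names maxIndexSet and firstMaxIndex are my own, not from [1]. It computes $I_{[1:N+1]}(x)$ and $i_{x}$ for a sample vector:

(*Mathematica code*)
(* Hypothetical helpers for the notation above; n stands for N. *)
maxIndexSet[x_List, n_] := Flatten@Position[Take[x, n + 1], Max[Take[x, n + 1]]];
firstMaxIndex[x_List, n_] := First@maxIndexSet[x, n];
(* Example with d = 6 and N = 2, so we search the first 3 components. *)
maxIndexSet[{0, 5, 5, -1, 7, 2}, 2]   (* {2, 3} *)
firstMaxIndex[{0, 5, 5, -1, 7, 2}, 2] (* 2 *)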


Large-scale convex optimization problem.

Consider a large-scale convex optimization problem of the form

$$\begin{array}{ll} \textrm{minimize} & f(x),\quad(P)\end{array}$$

where $x\in\mathbf{R}^{d}$ is the decision variable and $d$ is very large, as is common in today's machine learning problems. The function $f:\mathbf{R}^{d}\to\mathbf{R}$ is assumed to be convex with its subgradient norm bounded by some constant $L>0$, and $x_{\star}$ is an optimal solution to $(P)$.

The concept of oracle.

We assume that information on the objective function $f$ is gained through a first-order oracle. Given an input point $x\in\mathbf{R}^{d}$, this oracle returns

  • $f(x)$, which is the value of the objective function, and

  • $f^{\prime}(x)$, which is one element of its subdifferential set $\partial f(x)$.

Algorithm in consideration to solve (P).

Suppose we want to find a solution to $(P)$, so we run some iterative algorithm initialized at a point $x_{0}\in\mathbf{R}^{d}$ and then, using some algorithmic rules, compute $x_{1},x_{2},\ldots,x_{N}$, i.e., $N$ additional iterates. Because we have a large-scale setup, it is safe to assume that $N\leq d-1$ (if the decision variable is a vector with 10 billion components, it is not realistic to compute 10 billion iterates of any algorithm!). What type of algorithm should we consider?

To solve this optimization problem, we consider algorithms that satisfy the condition

$$\left(\forall i\in[1:N]\right)\quad x_{i}\in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0}),\ldots,f^{\prime}(x_{i-1})\right\}, \quad\textrm{(Span-Cond)}$$

where $f^{\prime}(x_{i})\in\partial f(x_{i})$ for each $i$. This condition holds for the majority of practical first-order methods.
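For example, the subgradient method $x_{i}=x_{i-1}-\gamma\, f^{\prime}(x_{i-1})$ satisfies (Span-Cond), since every step adds a multiple of a past subgradient. A minimal Mathematica sketch; the constant stepsize and the toy oracle are my own choices for illustration:

(*Mathematica code*)
(* Each iterate stays in x0 + span of the subgradients seen so far. *)
subgradientMethod[fprime_, x0_, gamma_, nSteps_] :=
  NestList[# - gamma fprime[#] &, x0, nSteps];
(* Toy oracle for f(x) = ||x||_1: Sign[x] is a valid subgradient. *)
subgradientMethod[Sign, {3., -2., 0.}, 0.5, 3]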

A lower bound result.

Theorem 1:

For any $L,R>0$, $d\in\mathbf{N}$ with $N\leq d-1$, and any starting point $x_{0}\in\mathbf{R}^{d}$, there exist (i) a function $f:\mathbf{R}^{d}\to\mathbf{R}$, which is convex and $L$-Lipschitz continuous, and (ii) a first-order oracle providing subgradient and function value information, such that we have the following lower bound on the objective inaccuracy:

$$f(x_{N})-f(x_{\star})\geq\frac{LR}{\sqrt{N+1}}$$

for any first-order method satisfying (Span-Cond), where $x_{\star}$ is a minimizer of $f$ with $\|x_{0}-x_{\star}\|_{2}=R$.

Proof of the lower bound result.

The proof proceeds by constructing a "worst-case" function, on which any first-order method satisfying (Span-Cond) will not be able to improve its initial objective value $f(x_{0})$ during the first $N$ computed iterations.

Define the worst-case function $f$.

First, we define two convex functions $g$ and $h$ as follows. Define the convex function

$$g(x)\coloneqq g_{N+1}(x)=\max\{x[1],x[2],\ldots, x[N+1]\},$$

where $x[i]$ denotes the $i$-th component of the vector $x$. Using subdifferential calculus [2], we can write down the closed-form expression of $g$'s subdifferential at a point $x\in\mathbf{R}^{d}$, given by

$$\begin{aligned} \partial g(x) & =\mathbf{conv}\{e_{k}\mid k\in I_{[1:N+1]}(x)\}\\ & =\left\{ \sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}e_{k}\mid\left(\forall k\in I_{[1:N+1]}(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}=1\right\},\end{aligned}$$

where (recall from notation)

$$I_{[1:N+1]}(x)=\{\ell\in[1:N+1]\mid x[\ell]=\max\{x[1],x[2],\ldots, x[N+1]\}\},$$

i.e., any element of $I_{[1:N+1]}(x)$ corresponds to an index of a maximal component of vector $x=\{x[1],\ldots,x[N+1],\ldots,x[d]\}$ searched over its first $N+1$ components. Next, define the function

$$h(x)\coloneqq h_{N+1}(x)=\|x\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right),$$

which has subdifferential [2]

$$\partial h(x)=\begin{cases} \left\{\frac{x}{\|x\|_{2}}\right\}, & \textrm{if }x\neq0,\\ \{y\mid\|y\|_{2}\leq1\}, & \textrm{if }x=0. \end{cases}$$

Finally, define a function $f$, which is going to be our worst-case function:

$$f(x)\coloneqq f_{N+1}(x)=L\max\left\{ g(x),h(x)\right\}.$$

Note that $f$ is a convex function because it is the point-wise maximum of two convex functions. Using subdifferential calculus, we can write down the closed-form expression of $f$'s subdifferential at a point $x\in\mathbf{R}^{d}$, given by [2]:

$$\partial f(x)= L \times \begin{cases} \partial g(x), & \textrm{if } g(x) > h(x),\\ \partial h(x), & \textrm{if } g(x) < h(x),\\ \mathbf{conv}\{g^{\prime}(x),h^{\prime}(x)\mid g^{\prime}(x)\in\partial g(x),h^{\prime}(x)\in\partial h(x)\}, & \textrm{if } g(x) = h(x). \end{cases}$$
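To keep things concrete, here is a small Mathematica sketch of $g$, $h$, and $f$ following the definitions above; the argument names (l for $L$, r for $R$, n for $N$) are mine:

(*Mathematica code*)
(* Worst-case function: f = L * max{g, h}, per the definitions above. *)
g[x_List, n_] := Max[Take[x, n + 1]];
h[x_List, r_, n_] := Norm[x] - r (1 + 1/Sqrt[n + 1]);
f[x_List, l_, r_, n_] := l Max[g[x, n], h[x, r, n]];
f[{0.2, -0.1, 0.3, 0., 0.}, 1, 1, 2] (* 0.3, here N = 2 and L = R = 1 *)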

A handy inequality to bound the subgradients of $f$.

Hence, if we consider a point $x$ and any subgradient of $g$ at $x$, denoted by $g^{\prime}(x)$, it can be expressed as

$$g^{\prime}(x)=\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}e_{k},\quad\textrm{where }\left(\forall k\in I_{[1:N+1]}(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}=1,$$

with the norm satisfying:

$$\begin{aligned} \|g^{\prime}(x)\|_{2} & =\Big\|\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}e_{k}\Big\|_{2}\\ & \overset{a)}{\leq}\sum_{k\in I_{[1:N+1]}(x)}\|\lambda_{k}e_{k}\|_{2}\\ & \overset{b)}{=}\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}\underbrace{\|e_{k}\|_{2}}_{=1}\\ & =\underbrace{\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}}_{=1}\\ & =1,\quad(1)\end{aligned}$$

where $a)$ and $b)$ use the triangle inequality and the fact that $\lambda_{k}\geq0$ for all $k\in I_{[1:N+1]}(x)$, respectively. Also, it is clear from the subdifferential of $h$ that any $h^{\prime}(x)\in\partial h(x)$ satisfies

$$\|h^{\prime}(x)\|_{2}\leq1.\quad(2)$$
Hence, from the subdifferential of $f$ above and (1), (2), we find that any $f^{\prime}(x)\in\partial f(x)$ satisfies:

$$\|f^{\prime}(x)\|_{2}\leq L.\quad(3)$$

Verifying that $f$ is $L$-Lipschitz continuous.

Let us now verify that $f$ is indeed Lipschitz continuous. For any $x,y$ and any $f^{\prime}(x)\in\partial f(x)$, from the subgradient inequality we have

$$\begin{aligned} f(y) & \geq f(x)+\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ \Leftrightarrow f(x)-f(y) & \leq-\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ & =\left\langle f^{\prime}(x)\mid x-y\right\rangle \\ & \overset{a)}{\leq}\|f^{\prime}(x)\|_{2}\|x-y\|_{2}\\ & \leq L\|x-y\|_{2},\end{aligned}$$

where $a)$ uses the Cauchy–Schwarz inequality and the last inequality uses (3). The roles of $x$ and $y$ here are symmetric, hence for any $x,y$, we also have

$$f(y)-f(x)\leq L\|y-x\|_{2}.$$

Thus, combining the last two inequalities, for any $x,y$, we have

$$|f(y)-f(x)|=\max\{f(y)-f(x),-\left(f(y)-f(x)\right)\}\leq L\|y-x\|_{2},$$

i.e., $f$ is $L$-Lipschitz continuous.
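As a quick sanity check, we can also confirm the Lipschitz bound numerically on random pairs of points, reusing the sketch of $f$ from above; the parameter values are arbitrary choices of mine:

(*Mathematica code*)
(* Spot-check |f(x) - f(y)| <= L ||x - y|| on 100 random pairs. *)
With[{l = 2., r = 1., n = 3, d = 8},
 And @@ Table[
   Module[{x = RandomReal[{-2, 2}, d], y = RandomReal[{-2, 2}, d]},
    Abs[f[x, l, r, n] - f[y, l, r, n]] <= l Norm[x - y]], {100}]] (* True *)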

Defining the oracle.

Now let us define the oracle. When asked for first-order information about the function at a point $x$, the oracle returns the function value $f(x)$ and one subgradient of $f$ at $x$ given by (note that it lies in $\partial f(x)$ defined above)

$$f^{\prime}(x)=L \times \begin{cases} e_{i_{x}}, & \textrm{if } g(x) \geq h(x),\\ x/\|x\|_{2}, & \textrm{if } g(x) < h(x), \end{cases}$$

where (recall from notation) $i_{x}$ is the smallest index of $x$ that satisfies $x[i_{x}]=\max_{j\in[1:N+1]}x[j]$, i.e.,

$$i_{x}=\min I_{[1:N+1]}(x).$$

We will soon see that this oracle provides us with the worst possible subgradient for reducing the function value in $N$ iterates; this type of oracle is called a resisting oracle.
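Here is a sketch of this resisting oracle's subgradient rule in Mathematica, reusing g, h, and firstMaxIndex from the earlier sketches:

(*Mathematica code*)
(* Resisting oracle: the subgradient choice defined above. *)
oracleSubgrad[x_List, l_, r_, n_] :=
  If[g[x, n] >= h[x, r, n],
   l UnitVector[Length[x], firstMaxIndex[x, n]],
   l x/Norm[x]];
oracleSubgrad[{0., 0., 0., 0.}, 1, 1, 2] (* {1, 0, 0, 0}, i.e., L e_1 *)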

Constructing a global minimum of $f$.

Define the point $x_{\star}$ componentwise as

$$x_{\star}[i]=\begin{cases} -\frac{R}{\sqrt{N+1}}, & \textrm{if }i\in[1:N+1],\\ 0, & \textrm{else}, \end{cases}\qquad\textrm{(OptPnt)}$$

which has norm

$$\|x_{\star}\|_{2}=\sqrt{(N+1)\times\frac{R^{2}}{N+1}}=R,$$
along with

$$g(x_{\star})=\max\{x_{\star}[1],\ldots,x_{\star}[N+1]\}=-\frac{R}{\sqrt{N+1}},$$

$$h(x_{\star})=\|x_{\star}\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right)=-\frac{R}{\sqrt{N+1}}.$$
As a result,

$$f(x_{\star})=L\max\{g(x_{\star}),h(x_{\star})\}=-\frac{LR}{\sqrt{N+1}}.$$
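These values are easy to confirm symbolically, for instance with $N=5$ and $d=8$; in this sketch the free symbol r stands for $R$:

(*Mathematica code*)
(* Check ||xstar|| = R and g(xstar) = h(xstar) = -R/Sqrt[N+1] for N = 5, d = 8. *)
With[{n = 5, d = 8},
 Module[{xstar = Join[ConstantArray[-r/Sqrt[n + 1], n + 1], ConstantArray[0, d - n - 1]]},
  Simplify[{Norm[xstar], Max[Take[xstar, n + 1]], Norm[xstar] - r (1 + 1/Sqrt[n + 1])},
   r > 0]]] (* {r, -r/Sqrt[6], -r/Sqrt[6]} *)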
We claim that $x_{\star}$ is a global minimum of $f$. First, observe that

$$\begin{aligned} I_{[1:N+1]}(x_{\star}) & =\{\ell\in[1:N+1]\mid x_{\star}[\ell]=\max\{x_{\star}[1],x_{\star}[2],\ldots, x_{\star}[N+1]\}\}\\ & =\{1,2,\ldots,N+1\},\end{aligned}$$

as all the first $N+1$ components of $x_{\star}$ are the same. Next note that, at $x_{\star}$, $f$ has the subdifferential

$$\begin{aligned} & \partial f(x_{\star})\\ & =L\times\mathbf{conv}\{g^{\prime}(x_{\star}),h^{\prime}(x_{\star})\mid g^{\prime}(x_{\star})\in\partial g(x_{\star}),h^{\prime}(x_{\star})\in\partial h(x_{\star})\}\\ & =L\times\{\alpha g^{\prime}(x_{\star})+(1-\alpha)h^{\prime}(x_{\star})\mid\alpha\in[0,1],g^{\prime}(x_{\star})\in\partial g(x_{\star}),h^{\prime}(x_{\star})\in\partial h(x_{\star})\}\\ & =L\times\left\{ \alpha\sum_{k\in I_{[1:N+1]}(x_{\star})}\lambda_{k}e_{k}+(1-\alpha)\frac{x_{\star}}{\|x_{\star}\|_{2}}\mid\alpha\in[0,1],\left(\forall k\in I_{[1:N+1]}(x_{\star})\right)\;\lambda_{k}\geq0,\sum_{k\in I_{[1:N+1]}(x_{\star})}\lambda_{k}=1\right\}.\end{aligned}$$

In the last equation, if we set $\alpha\coloneqq\left((1/\sqrt{N+1})+1\right)^{-1}$ and $\lambda_{k}\coloneqq\lambda=1/(N+1)$ for all $k\in[1:N+1]$, then we obtain one element of $\partial f(x_{\star})$ as follows:

$$\begin{aligned} \partial f(x_{\star}) & \ni L\times\left(\alpha\sum_{k\in I_{[1:N+1]}(x_{\star})}\lambda_{k}e_{k}+(1-\alpha)\frac{x_{\star}}{\|x_{\star}\|_{2}}\right)\\ & =L\alpha\lambda\left(\underbrace{1}_{\textrm{index}[1]},\underbrace{1}_{\textrm{index}[2]},\ldots,\underbrace{1}_{\textrm{index}[N+1]},0,\ldots,\underbrace{0}_{\textrm{index}[d]}\right)+\\ & \quad L(1-\alpha)\left(\underbrace{-\frac{1}{\sqrt{N+1}}}_{\textrm{index}[1]},\underbrace{-\frac{1}{\sqrt{N+1}}}_{\textrm{index}[2]},\ldots,\underbrace{-\frac{1}{\sqrt{N+1}}}_{\textrm{index}[N+1]},0,\ldots,\underbrace{0}_{\textrm{index}[d]}\right)\\ & =L\alpha\frac{1}{N+1}\left(\underbrace{1}_{\textrm{index}[1]},\underbrace{1}_{\textrm{index}[2]},\ldots,\underbrace{1}_{\textrm{index}[N+1]},0,\ldots,\underbrace{0}_{\textrm{index}[d]}\right)+\\ & \quad L(1-\alpha)\frac{1}{\sqrt{N+1}}\left(\underbrace{-1}_{\textrm{index}[1]},\underbrace{-1}_{\textrm{index}[2]},\ldots,\underbrace{-1}_{\textrm{index}[N+1]},0,\ldots,\underbrace{0}_{\textrm{index}[d]}\right)\\ & =(0,0,\ldots,0),\end{aligned}$$

where in the last line we have used the value of $\alpha$. Since $0\in\partial f(x_{\star})$, the point $x_{\star}$ is indeed a global minimum of $f$.

(*Mathematica code*)
(* Verify symbolically that with alpha = (1 + 1/Sqrt[N+1])^-1 and lambda = 1/(N+1),
   the common component alpha*lambda - (1 - alpha)/Sqrt[N+1] of the subgradient
   above vanishes. *)
alpha = (1 + 1/Sqrt[n + 1])^-1;
alpha/(n + 1) - (1 - alpha)/Sqrt[n + 1] // Simplify (* returns 0 *)

Performance of the oracle in reducing function value.

Computing iterate $x_{1}$. Let us initialize our algorithm with the initial point $x_{0}=0$, which satisfies $\|x_{\star}-x_{0}\|_{2}=R$. For this $x_{0}$, we have

$$g(x_{0})=\max\{0,\ldots,0\}=0\geq h(x_{0})=\|\mathbf{0}\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right)=-R\left(1+\frac{1}{\sqrt{N+1}}\right),$$

and

$$\begin{aligned} I_{[1:N+1]}(x_{0}) & =\{\ell\in[1:N+1]\mid x_{0}[\ell]=\max\{0,0,\ldots,0\}\}\\ & =\{1,2,\ldots,N+1\},\\ i_{x_{0}} & =\min I_{[1:N+1]}(x_{0})=1.\end{aligned}$$

We ask the resisting oracle for first-order information about the function at $x_{0}$. It returns

$$\begin{aligned} f(x_{0}) & =L\max\left\{ g(x_{0}),h(x_{0})\right\} \\ & =L\times0=0,\end{aligned}$$

and

$$f^{\prime}(x_{0})=L\times e_{i_{x_{0}}}=Le_{1}.$$

So $x_{1}$, the first iterate computed by our first-order method, is going to satisfy (Span-Cond):

$$\begin{aligned} x_{1} & \in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0})\right\} \\ & =0+\textrm{span}\{Le_{1}\}\\ & =\textrm{span}\{e_{1}\}\\ \Rightarrow x_{1} & =\xi_{1}e_{1}\textrm{ for some }\xi_{1}\in\mathbf{R}.\end{aligned}$$

For the point $x_{1}$, we have

$$g(x_{1})=\max\{\xi_{1},0,\ldots,0\}=\begin{cases} \xi_{1}, & \textrm{if }\xi_{1}\geq0,\\ 0, & \textrm{if }\xi_{1}<0, \end{cases}$$

and

$$h(x_{1})=\|\xi_{1}e_{1}\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right)=\vert\xi_{1}\vert-R\left(1+\frac{1}{\sqrt{N+1}}\right).$$
Computing iterate $x_{2}$. Consider the case $g(x_{1})\geq h(x_{1})$; then

$$\begin{aligned} I_{[1:N+1]}(x_{1}) & =\{\ell\in[1:N+1]\mid x_{1}[\ell]=\max\{\xi_{1},0,\ldots,0\}\}\\ & =\begin{cases} \{1,2,\ldots,N+1\}, & \textrm{if }\xi_{1}=0,\\ \{2,3,\ldots,N+1\}, & \textrm{if }\xi_{1}<0,\\ \{1\}, & \textrm{if }\xi_{1}>0, \end{cases}\end{aligned}$$


$$\begin{aligned} i_{x_{1}} & =\min I_{[1:N+1]}(x_{1})\\ & =\begin{cases} 1, & \textrm{if }\xi_{1}=0,\\ 2, & \textrm{if }\xi_{1}<0,\\ 1, & \textrm{if }\xi_{1}>0. \end{cases}\end{aligned}$$
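This case analysis is easy to check with the maxIndexSet helper from the notation sketch, e.g., with $N=3$ and $d=5$:

(*Mathematica code*)
(* Index set of x1 = xi1 * e1 for xi1 zero, negative, and positive. *)
Table[maxIndexSet[xi {1, 0, 0, 0, 0}, 3], {xi, {0, -1, 1}}]
(* {{1, 2, 3, 4}, {2, 3, 4}, {1}} *)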


Now we ask the oracle for first-order information about the function at $x_{1}$. It returns

$$\begin{aligned} f(x_{1}) & =L\times\max\{x_{1}[1],x_{1}[2],\ldots,x_{1}[N+1]\}\\ & \geq L\times x_{1}[N+1],\end{aligned}$$

and

$$\begin{aligned} f^{\prime}(x_{1}) & =L\times e_{i_{x_{1}}}\\ & =L(e_{1}\textrm{ or }e_{2}).\end{aligned}$$

Hence, for the case $g(x_{1})\geq h(x_{1})$, the second iterate $x_{2}$ is going to satisfy (Span-Cond):

$$\begin{aligned} x_{2} & \in x_{0}+\textrm{span}\{f^{\prime}(x_{0}),f^{\prime}(x_{1})\}\\ & =\textrm{span}\{Le_{1},L(e_{1}\textrm{ or }e_{2})\}\\ & \subseteq\textrm{span}\{e_{1},e_{2}\}.\end{aligned}$$

Now consider the case $g(x_{1})<h(x_{1})$; then

$$\begin{aligned} f(x_{1}) & =L\times\max\{g(x_{1}),h(x_{1})\}\textrm{ (by definition)}\\ & =L\times h(x_{1})=L\left(\vert\xi_{1}\vert-R\left(1+\frac{1}{\sqrt{N+1}}\right)\right)\\ & \geq L\times g(x_{1})\\ & =L\times\max\{x_{1}[1],x_{1}[2],\ldots,x_{1}[N+1]\}\\ & \geq L\times x_{1}[N+1],\end{aligned}$$

and

$$f^{\prime}(x_{1})=L\times\frac{x_{1}}{\|x_{1}\|_{2}}=L\times\frac{\xi_{1}}{\vert\xi_{1}\vert}e_{1}=\pm Le_{1}$$

(note that $g(x_{1})<h(x_{1})$ forces $\xi_{1}\neq0$, since at $\xi_{1}=0$ we have $g(x_{1})=0>h(x_{1})$).
So for the case $g(x_{1})<h(x_{1})$, we have

$$x_{2}\in x_{0}+\textrm{span}\{f^{\prime}(x_{0}),f^{\prime}(x_{1})\}=\textrm{span}\{Le_{1},\pm Le_{1}\}=\textrm{span}\{e_{1}\}\subseteq\textrm{span}\{e_{1},e_{2}\}.$$

So, irrespective of $g(x_{1})\geq h(x_{1})$ or $g(x_{1})<h(x_{1})$, we have

$$x_{2}\in\textrm{span}\{e_{1},e_{2}\}.$$
Generalization about the iterates. Continuing in this manner, we can show that for any $i\in[1:N]$, we have

$$x_{i}\in\textrm{span}\{e_{1},e_{2},\ldots,e_{i}\},$$
and as a result, the components of $x_{i}$ corresponding to indices $N+1,\ldots,d$ will be zero for all $i\in[1:N]$, i.e.,

$$x_{i}[N+1]=x_{i}[N+2]=\cdots=x_{i}[d]=0.$$
Hence, the objective values of the iterates $x_{1},\ldots,x_{N}$ satisfy, for all $i\in[1:N]$,

$$\begin{aligned} f(x_{i}) & =L\times\max\{g(x_{i}),h(x_{i})\}\textrm{ (by definition)}\\ & \geq L\times g(x_{i})\textrm{ (by definition)}\\ & =L\times\max\{x_{i}[1],x_{i}[2],\ldots,x_{i}[N+1]\}\\ & \geq L\times x_{i}[N+1]=0,\qquad\textrm{(ObjNotImprv)}\end{aligned}$$

where in the last line we have used $x_{i}[N+1]=0$ and the simple fact that the maximum element of a list is at least as large as any element of the list. This is interesting because it says that no matter what we do, we cannot improve on the initial objective value $f(x_{0})=0$ over the iterates $x_{1},\ldots,x_{N}$!
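We can watch this resistance in action by running a plain subgradient method against the oracle sketched earlier; the parameters below are my own toy choices:

(*Mathematica code*)
(* Toy run with N = 4, d = 10: the recorded objective values f(x0), ..., f(x4)
   never drop below f(x0) = 0, exactly as (ObjNotImprv) predicts. *)
With[{l = 1., r = 1., n = 4, d = 10, gamma = 0.1},
 Module[{x = ConstantArray[0., d], vals = {}},
  Do[AppendTo[vals, f[x, l, r, n]];
   x = x - gamma oracleSubgrad[x, l, r, n], {n + 1}];
  vals]] (* {0., 0., 0., 0., 0.} *)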

Final lower bound. Finally, applying (ObjNotImprv) to the $N$-th iterate, we get

$$\begin{aligned} f(x_{N})-f(x_{\star}) & \geq0-f(x_{\star})\\ & =-\left(-\frac{LR}{\sqrt{N+1}}\right)\\ & =\frac{LR}{\sqrt{N+1}},\qquad\textrm{(ObjGap)}\end{aligned}$$

thus completing our proof.


References

[1] Drori, Yoel, and Marc Teboulle. "An optimal variant of Kelley's cutting-plane method." Mathematical Programming 160.1 (2016): 321-351.

[2] Beck, Amir. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, 2017.