Consider a large-scale convex optimization problem of the form
$$\begin{array}{ll} \textrm{minimize} & f(x),\quad(P)\end{array}$$
where $x\in\mathbf{R}^{d}$ is the decision variable and $d$ is very large, as is common in today's machine learning problems. The function $f$ is assumed to be convex with its subgradient norm bounded by some constant $L>0$ over a large enough ball of radius $R$ around $x_{\star}$, where $x_{\star}$ is an optimal solution to $(P)$.
We assume that information on this objective function $f$ is gained through a first-order oracle. Given an input point $x\in\mathbf{R}^{d}$, this oracle returns the function value $f(x)$ and one subgradient $f^{\prime}(x)\in\partial f(x)$.
Suppose we want to find a solution to $(P)$, so we have to run some iterative algorithm initialized at some point $x_{0}\in\mathbf{R}^{d}$ and then, using some algorithmic rules, compute $x_{1},x_{2},\ldots,x_{N}$, i.e., $N$ additional iterates. Because we have a large-scale setup, it is safe to assume that $N\leq d-1$ (if the decision variable is a vector with 10 billion components, it is not realistic to compute 10 billion iterates of any algorithm!). What type of algorithm should we consider?
To solve this optimization problem, we consider algorithms that satisfy the following condition (we use the notation $[1:N]=\{1,2,\ldots,N\}$):
$$\left(\forall i\in[1:N]\right)\quad x_{i}\in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0}),\ldots,f^{\prime}(x_{i-1})\right\},\quad\textrm{(Span-Cond)}$$
where $f^{\prime}(x_{j})\in\partial f(x_{j})$ denotes the subgradient returned by the oracle at $x_{j}$. This condition holds for the majority of practical methods.
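For example, the plain subgradient method $x_{i+1}=x_{i}-t_{i}f^{\prime}(x_{i})$ satisfies (Span-Cond) by construction, since $x_{i+1}-x_{0}$ is a linear combination of the subgradients returned so far. A minimal Python sketch (my own addition, with an arbitrary toy objective and step sizes) makes this explicit:

# (Python code -- a generic subgradient method; each iterate stays in x_0 + span of past subgradients)
import numpy as np

def subgradient_method(x0, subgrad, steps):
    """Return the iterates x_1, ..., x_N of x_{i+1} = x_i - t_i * f'(x_i)."""
    x, iterates = x0.copy(), []
    for t in steps:
        x = x - t * subgrad(x)       # x_{i+1} - x_0 = -(t_0 f'(x_0) + ... + t_i f'(x_i))
        iterates.append(x.copy())
    return iterates

# toy example: f(x) = ||x||_1, for which sign(x) is a valid subgradient
iterates = subgradient_method(np.array([1.0, -2.0, 3.0]), np.sign, steps=[0.5, 0.25, 0.125])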
We next present the main lower bound result, and then we prove it.
For any $L,R>0$, any $N,d\in\mathbf{N}$ with $N\leq d-1$, and any starting point $x_{0}\in\mathbf{R}^{d}$, there exist (i) a function $f:\mathbf{R}^{d}\to\mathbf{R}$ that is convex and $L$-Lipschitz continuous on the ball $B(x_{\star};R)=\{x\in\mathbf{R}^{d}\mid\|x-x_{\star}\|_{2}^{2}\leq R^{2}\}$, and (ii) a first-order oracle providing subgradient and function-value information, such that we have the following lower bound on the objective inaccuracy:
$$f(x_{N})-f(x_{\star})\geq\frac{LR}{2(2+\sqrt{N+1})}$$
for any first-order method satisfying (Span-Cond).
The proof proceeds by constructing a "worst-case" function on which any first-order method satisfying (Span-Cond) cannot improve its initial objective value $f(x_{0})$ during the first $N$ computed iterations.
Define the function:
$$f(x) := \frac{\mu}{2}\|x\|_{2}^{2}+\gamma\max\{x[1],x[2],\ldots, x[N+1]\},$$
where $x[i]$ denotes the $i$-th component of the vector $x$, and $\gamma$ and $\mu$ are positive numbers. For now we do not specify $\mu$ and $\gamma$; their values in terms of $L$ and $R$ will emerge automatically down the line. Note that $f$ is convex, being the sum of a convex quadratic and a maximum of linear functions. Using subdifferential calculus, we can write down the closed-form expression of $f$'s subdifferential at a point $x\in\mathbf{R}^{d}$ (here, for any $i\in\{1,2,\ldots,d\}$, $e_{i}\in\mathbf{R}^{d}$ denotes the unit vector with 1 in its $i$-th coordinate and 0 everywhere else):
$$\begin{aligned} \partial f(x) & =\mu x+\gamma\,\mathbf{conv}\{e_{k}\mid k\in I(x)\}\\ & =\mu x+\gamma\left\{ \sum_{k\in I(x)}\lambda_{k}e_{k}\;\middle|\;\left(\forall k\in I(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I(x)}\lambda_{k}=1\right\}, \end{aligned}$$
where
$$I(x)=\{\ell\in[1:N+1]\mid x[\ell]=\max\{x[1],x[2],\ldots ,x[N+1]\}\},$$
i.e., any element of $I(x)$ is an index of a maximal component of the vector $x=\left(x[1],\ldots,x[N+1],\ldots,x[d]\right)$, the maximum being taken over its first $N+1$ components.
Hence, if we consider a point $x\in B(x_{\star};R)$ and any subgradient of $f$ at $x$, denoted by $f^{\prime}(x)$, it can be expressed as:
$$f^{\prime}(x)=\mu x+\gamma\sum_{k\in I(x)}\lambda_{k}e_{k},\quad\textrm{where}\;\left(\forall k\in I(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I(x)}\lambda_{k}=1,$$
with the norm satisfying:
$$\begin{aligned} \|f^{\prime}(x)\|_{2} & =\Big\|\mu x+\gamma\sum_{k\in I(x)}\lambda_{k}e_{k}\Big\|_{2}\\ & \overset{a)}{\leq}\mu\|x\|_{2}+\gamma\Big\|\sum_{k\in I(x)}\lambda_{k}e_{k}\Big\|_{2}\\ & \overset{b)}{\leq}\mu\|x\|_{2}+\gamma\sum_{k\in I(x)}\lambda_{k}\underbrace{\|e_{k}\|_{2}}_{=1}\\ & =\mu\|x\|_{2}+\gamma\underbrace{\sum_{k\in I(x)}\lambda_{k}}_{=1}\\ & =\mu\|x\|_{2}+\gamma\\ & \overset{c)}{\leq}\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma,\qquad(1)\end{aligned}$$
where $a)$ and $b)$ use the triangle inequality and the fact that $\lambda_{k}\geq0$ for all $k\in I(x)$, and $c)$ uses the reverse triangle inequality:
$$\|x\|_{2}-\|x_{\star}\|_{2}\leq\|x-x_{\star}\|_{2}\leq R\;\Rightarrow\;\|x\|_{2}\leq R+\|x_{\star}\|_{2}.$$
Let us now verify that $f$ is indeed Lipschitz continuous on the ball $B(x_{\star};R)=\{x\in\mathbf{R}^{d}\mid\|x-x_{\star}\|_{2}^{2}\leq R^{2}\}$. For any $x,y\in B(x_{\star};R)$ and any $f^{\prime}(x)\in\partial f(x)$, the subgradient inequality gives
$$\begin{aligned} f(y) & \geq f(x)+\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ \Leftrightarrow\; f(x)-f(y) & \leq-\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ & =\left\langle f^{\prime}(x)\mid x-y\right\rangle \\ & \overset{a)}{\leq}\|f^{\prime}(x)\|_{2}\,\|x-y\|_{2}\\ & \overset{b)}{\leq}\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|x-y\|_{2},\end{aligned}$$
where $a)$ uses the Cauchy–Schwarz inequality and $b)$ uses (1). The last inequality is symmetric with respect to $x$ and $y$, hence for any $x,y\in B(x_{\star};R)$ we also have:
$$f(y)-f(x)\leq\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|y-x\|_{2}.$$
Thus, combining the last two inequalities, for any $x,y\in B(x_{\star};R)$ we have
$$|f(y)-f(x)|=\max\{f(y)-f(x),-\left(f(y)-f(x)\right)\}\leq\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|y-x\|_{2},$$
i.e., $f$ is $\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)$-Lipschitz continuous on $B(x_{\star};R)$.
In other words, when we pick the values of $\mu$ and $\gamma$, we need them to satisfy the equation:
$$L=\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma,\qquad\textrm{(ConstraintOnL)}$$
which we will do in the end.
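As a quick numerical sanity check of this Lipschitz bound (my own addition, not part of the original argument; the dimension, $N$, $\mu$, $\gamma$, $R$, and the ball center below are arbitrary placeholder choices), one can sample random pairs of points in the ball and verify $|f(y)-f(x)|\leq\left(\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\right)\|x-y\|_{2}$:

# (Python code -- numerical spot-check of the Lipschitz bound; all constants are placeholders)
import numpy as np

rng = np.random.default_rng(0)
d, N, mu, gamma, R = 50, 10, 0.3, 0.7, 2.0
center = rng.normal(size=d)          # plays the role of x_star as the ball's center

def f(x):
    return 0.5 * mu * np.dot(x, x) + gamma * np.max(x[:N + 1])

lip = mu * (R + np.linalg.norm(center)) + gamma      # the constant from (1)
for _ in range(10_000):
    u, v = rng.normal(size=d), rng.normal(size=d)
    x = center + R * rng.uniform() * u / np.linalg.norm(u)   # a point in B(center; R)
    y = center + R * rng.uniform() * v / np.linalg.norm(v)
    assert abs(f(y) - f(x)) <= lip * np.linalg.norm(x - y) + 1e-12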
Now let us define the oracle. When asked for first-order information about the function at a point $x$, the oracle returns the function value $f(x)$ and one subgradient of $f$ at $x$, given by
$$f^{\prime}(x)=\mu x+\gamma e_{i_{x}}\in\partial f(x),$$
where $i_{x}$ is the first coordinate index of $x$ that satisfies $x[i_{x}]=\max_{j\in[1:N+1]}x[j]$, i.e.,
$$i_{x}=\min I(x).$$
We will soon see that this oracle provides the worst possible subgradient for reducing the function value in $N$ iterates; this type of oracle is called a resisting oracle.
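To make the construction concrete, here is a small NumPy sketch of the worst-case function and its resisting oracle (my own addition; note that np.argmax returns the smallest maximizing index, which is exactly $i_{x}=\min I(x)$):

# (Python code -- a sketch of the worst-case function and its resisting oracle)
import numpy as np

def make_worst_case(mu, gamma, N):
    def f(x):
        return 0.5 * mu * np.dot(x, x) + gamma * np.max(x[:N + 1])

    def oracle(x):
        i_x = int(np.argmax(x[:N + 1]))   # i_x = min I(x): argmax picks the first maximizer
        g = mu * x.copy()
        g[i_x] += gamma                   # f'(x) = mu * x + gamma * e_{i_x}
        return f(x), g

    return f, oracle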
Define the point $x_{\star}$ as:
$$x_{\star}[i]=\begin{cases} -\frac{\gamma}{\mu(N+1)}, & \textrm{if }i\in[1:N+1],\\ 0, & \textrm{otherwise}, \end{cases}\qquad\textrm{(OptPnt)}$$
which has norm
$$\|x_{\star}\|_{2}=\frac{\gamma}{\mu\sqrt{N+1}}.\qquad\textrm{(NrmOptPnt)}$$
Keep in mind that, in the end, we need to pick the values of $\gamma$ and $\mu$ in such a way that the condition
$$\|x_{0}-x_{\star}\|_{2}\leq R\qquad\textrm{(OptPntDist)}$$
is satisfied. We claim that $x_{\star}$ is a global minimum of $f$. First, observe that:
$$\begin{aligned} I(x_{\star}) & =\{\ell\in[1:N+1]\mid x_{\star}[\ell]=\max\{x_{\star}[1],x_{\star}[2],\ldots, x_{\star}[N+1]\}\}\\ & =\{1,2,\ldots,N+1\},\end{aligned}$$
as the first $N+1$ components of $x_{\star}$ are all equal. Next, note that at $x_{\star}$, $f$ has the subdifferential
$$\partial f(x_{\star})=\mu x_{\star}+\gamma\left\{ \sum_{k\in[1:N+1]}\lambda_{k}e_{k}\;\middle|\;\left(\forall k\in[1:N+1]\right)\;\lambda_{k}\geq0,\;\sum_{k\in[1:N+1]}\lambda_{k}=1\right\}.$$
If we set $\lambda_{k}=1/(N+1)$ for $k\in[1:N+1]$, then one particular element of $\partial f(x_{\star})$ is:
$$\begin{aligned} \partial f(x_{\star}) & \ni\mu x_{\star}+\gamma\overbrace{\sum_{i\in[1:N+1]}\frac{1}{N+1}e_{i}}^{z\;(\textrm{say})}\\ & =\mu\left(\underbrace{-\tfrac{\gamma}{\mu(N+1)}}_{x_{\star}[1]},\underbrace{-\tfrac{\gamma}{\mu(N+1)}}_{x_{\star}[2]},\ldots,\underbrace{-\tfrac{\gamma}{\mu(N+1)}}_{x_{\star}[N+1]},0,\ldots,\underbrace{0}_{x_{\star}[d]}\right)+\gamma\left(\underbrace{\tfrac{1}{N+1}}_{z[1]},\underbrace{\tfrac{1}{N+1}}_{z[2]},\ldots,\underbrace{\tfrac{1}{N+1}}_{z[N+1]},0,\ldots,\underbrace{0}_{z[d]}\right)\\ & =(0,0,\ldots,0),\end{aligned}$$
hence, via Fermat's rule, $x_{\star}$ is a global minimum of $f$. The optimal value of $f$ at $x_{\star}$ is:
$$\begin{aligned} f(x_{\star}) & =\frac{\mu}{2}\|x_{\star}\|_{2}^{2}+\gamma\max\{x_{\star}[1],x_{\star}[2],\ldots, x_{\star}[N+1]\}\\ & =\frac{\mu}{2}\frac{\gamma^{2}}{\mu^{2}(N+1)}+\gamma\left(-\frac{\gamma}{\mu(N+1)}\right)\\ & =\frac{\gamma^{2}}{2\mu(N+1)}-\frac{\gamma^{2}}{\mu(N+1)}\\ & =-\frac{\gamma^{2}}{2\mu(N+1)}.\end{aligned}$$
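Both facts are easy to confirm numerically; the snippet below (my own addition, with placeholder constants) checks that the averaged subgradient at $x_{\star}$ is the zero vector and that $f(x_{\star})=-\gamma^{2}/(2\mu(N+1))$:

# (Python code -- checking the optimality of x_star and its objective value; constants are placeholders)
import numpy as np

d, N, mu, gamma = 20, 7, 0.4, 1.3
x_star = np.zeros(d)
x_star[:N + 1] = -gamma / (mu * (N + 1))

f_star = 0.5 * mu * np.dot(x_star, x_star) + gamma * np.max(x_star[:N + 1])
assert np.isclose(f_star, -gamma**2 / (2 * mu * (N + 1)))

g = mu * x_star.copy()
g[:N + 1] += gamma / (N + 1)      # the element mu*x_star + gamma*z of the subdifferential
assert np.allclose(g, 0.0)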
Let us initialize our algorithm with the initial value $x_{0}=0$, for which we have
$$\begin{aligned} I(x_{0}) & =\{\ell\in[1:N+1]\mid x_{0}[\ell]=\max\{0,0,\ldots,0\}\}=\{1,2,\ldots,N+1\},\\ i_{x_{0}} & =\min I(x_{0})=1.\end{aligned}$$
We ask the resisting oracle for first-order information about the function at $x_{0}$. It returns
$$f(x_{0})=\frac{\mu}{2}\|x_{0}\|_{2}^{2}+\gamma\max\{x_{0}[1],x_{0}[2],\ldots, x_{0}[N+1]\}=0,$$
and
$$f^{\prime}(x_{0})=\mu x_{0}+\gamma e_{i_{x_{0}}}=\gamma e_{1}.$$
So $x_{1}$, the first iterate computed by our first-order method, is going to satisfy (Span-Cond):
$$\begin{aligned} x_{1} & \in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0})\right\} =0+\textrm{span}\{\gamma e_{1}\}=\textrm{span}\{e_{1}\}\\ \Rightarrow\; x_{1} & =\xi_{1}e_{1}\;\textrm{ for some }\xi_{1}\in\mathbf{R}.\end{aligned}$$
For the point $x_{1}$, we have
$$I(x_{1})=\{\ell\in[1:N+1]\mid x_{1}[\ell]=\max\{\xi_{1},0,\ldots,0\}\}=\begin{cases} \{1,2,\ldots,N+1\}, & \textrm{if }\xi_{1}=0,\\ \{2,3,\ldots,N+1\}, & \textrm{if }\xi_{1}<0,\\ \{1\}, & \textrm{if }\xi_{1}>0, \end{cases}$$
and
$$i_{x_{1}}=\min I(x_{1})=\begin{cases} 1, & \textrm{if }\xi_{1}=0,\\ 2, & \textrm{if }\xi_{1}<0,\\ 1, & \textrm{if }\xi_{1}>0. \end{cases}$$
Now we ask the oracle for first-order information about the function at $x_{1}$. It returns
$$f(x_{1})=\frac{\mu}{2}\|x_{1}\|_{2}^{2}+\gamma\max\{x_{1}[1],x_{1}[2],\ldots,x_{1}[N+1]\}$$
and
$$\begin{aligned} f^{\prime}(x_{1}) & =\mu x_{1}+\gamma e_{i_{x_{1}}}\\ & =\mu\xi_{1}e_{1}+\gamma\,(e_{1}\textrm{ or }e_{2}).\end{aligned}$$
Hence, the second iterate $x_{2}$ is going to satisfy (Span-Cond):
$$\begin{aligned} x_{2} & \in x_{0}+\textrm{span}\{f^{\prime}(x_{0}),f^{\prime}(x_{1})\}\\ & =\textrm{span}\{\gamma e_{1},\;\mu\xi_{1}e_{1}+\gamma\,(e_{1}\textrm{ or }e_{2})\}\\ & \subseteq\textrm{span}\{e_{1},e_{2}\}.\end{aligned}$$
Continuing in this manner, we can show by induction that for any $i\in[1:N]$ we have
$$x_{i}\in\textrm{span}\{e_{1},e_{2},\ldots,e_{i}\}$$
(indeed, whenever $x_{i}\in\textrm{span}\{e_{1},\ldots,e_{i}\}$, the components $x_{i}[i+1],\ldots,x_{i}[N+1]$ are zero, so $i_{x_{i}}\leq i+1$ and the returned subgradient $f^{\prime}(x_{i})=\mu x_{i}+\gamma e_{i_{x_{i}}}$ lies in $\textrm{span}\{e_{1},\ldots,e_{i+1}\}$). As a result, the components of $x_{i}$ corresponding to indices $N+1,\ldots,d$ will be zero for all $i\in[1:N]$, i.e.,
$$x_{i}[N+1]=\ldots=x_{i}[d]=0.$$
Hence, the objective value at the iterates $x_{1},\ldots,x_{N}$ satisfies, for all $i\in[1:N]$,
$$\begin{aligned} f(x_{i}) & =\frac{\mu}{2}\|x_{i}\|_{2}^{2}+\gamma\max\{x_{i}[1],x_{i}[2],\ldots,x_{i}[N+1]\}\\ & \geq\gamma\max\{x_{i}[1],x_{i}[2],\ldots,x_{i}[N+1]\}\\ & \geq\gamma\, x_{i}[N+1]\\ & =0,\qquad\textrm{(ObjNotImprv)}\end{aligned}$$
where the second-to-last line uses the simple fact that the maximum of a list is at least as large as any of its elements. This is interesting: it says that, no matter what we do, we cannot improve upon the initial objective value $f(x_{0})=0$ at the iterates $x_{1},\ldots,x_{N}$! The short numerical experiment below illustrates this.
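Here is a small NumPy demonstration of the resisting oracle in action (my own addition; the dimension, $N$, $L$, $R$, and the step sizes are arbitrary choices, and any other method satisfying (Span-Cond) would behave the same way):

# (Python code -- a span-respecting subgradient method never beats f(x_0) = 0 on the worst-case f)
import numpy as np

d, N, L, R = 50, 10, 1.0, 1.0
gamma = L * np.sqrt(N + 1) / (2 + np.sqrt(N + 1))    # the values derived later in the post
mu = L / (2 * R + R * np.sqrt(N + 1))

def f(x):
    return 0.5 * mu * np.dot(x, x) + gamma * np.max(x[:N + 1])

def oracle(x):
    i_x = int(np.argmax(x[:N + 1]))                  # i_x = min I(x)
    g = mu * x.copy()
    g[i_x] += gamma
    return g

x = np.zeros(d)                                      # x_0 = 0, so f(x_0) = 0
for i in range(1, N + 1):
    x = x - (R / (L * np.sqrt(i))) * oracle(x)       # a plain subgradient step; satisfies (Span-Cond)
    assert np.all(x[N:] == 0.0)                      # components N+1, ..., d (1-based) stay exactly zero
    assert f(x) >= 0.0                               # no improvement over f(x_0) = 0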
Finally, applying (ObjNotImprv) to the $N$-th iterate, we get
$$f(x_{N})-f(x_{\star})\geq0-f(x_{\star})=-\left(-\frac{\gamma^{2}}{2\mu(N+1)}\right)=\frac{\gamma^{2}}{2\mu(N+1)}.\qquad\textrm{(ObjGap)}$$
Now we are left with picking the values of $\gamma$ and $\mu$ in terms of $L$ and $R$. For that, we need to be mindful of the two constraints that $\gamma$ and $\mu$ must satisfy:
$$\begin{aligned} L & =\mu\left(R+\|x_{\star}\|_{2}\right)+\gamma\\ & =\mu\left(R+\frac{\gamma}{\mu\sqrt{N+1}}\right)+\gamma,\end{aligned}$$
and
$$\|x_{0}-x_{\star}\|_{2}^{2}=\|0-x_{\star}\|_{2}^{2}=\frac{\gamma^{2}}{\mu^{2}(N+1)}\leq R^{2},$$
where we have used (NrmOptPnt).
To make our life easier, we impose equality in the last constraint and solve for $\gamma$ and $\mu$. We could do it by hand, but I am lazy, so I have just used Mathematica.
(*Mathematica code*)
In[1]:= Solve[Reduce[{L == \[Mu] (R + \[Gamma]/(\[Mu] Sqrt[\[CapitalNu] + 1])) + \[Gamma],
    \[Gamma]/(\[Mu] Sqrt[\[CapitalNu] + 1]) == R, R > 0, \[CapitalNu] > -1,
    L > 0, \[Gamma] > 0, \[Mu] > 0}, {\[Gamma], \[Mu]}, Reals],
   {\[Gamma], \[Mu]}, Reals] // Simplify
The solution is:
$$\begin{aligned} \gamma & =\frac{L\sqrt{N+1}}{2+\sqrt{N+1}},\\ \mu & =\frac{L}{2R+R\sqrt{N+1}}.\end{aligned}$$
Putting these values of $\gamma$ and $\mu$ in (ObjGap),
(*Mathematica code*)
In[2]:= \[Gamma] = (L Sqrt[1 + \[CapitalNu]])/(2 + Sqrt[1 + \[CapitalNu]]);
\[Mu] = L/(2 R + R Sqrt[1 + \[CapitalNu]]);
ObjGap = \[Gamma]^2/(2 \[Mu] (\[CapitalNu] + 1)) // FullSimplify
we find:
$$\begin{aligned} f(x_{N})-f(x_{\star}) & \geq\frac{\gamma^{2}}{2\mu(N+1)}\\ & =\frac{LR}{4+2\sqrt{N+1}}=\frac{LR}{2\left(2+\sqrt{N+1}\right)},\end{aligned}$$
thus completing the proof.
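As a final sanity check (my own addition, with arbitrary numerical values for $L$, $R$, and $N$), the solved $\gamma$ and $\mu$ can be verified to satisfy both constraints and to reproduce the bound stated in the theorem:

# (Python code -- checking the solved gamma and mu against the two constraints and the final bound)
import math

L, R, N = 2.0, 3.0, 10
gamma = L * math.sqrt(N + 1) / (2 + math.sqrt(N + 1))
mu = L / (2 * R + R * math.sqrt(N + 1))

# (ConstraintOnL): L = mu (R + ||x_star||_2) + gamma, with ||x_star||_2 = gamma / (mu sqrt(N+1))
assert math.isclose(mu * (R + gamma / (mu * math.sqrt(N + 1))) + gamma, L)
# (OptPntDist) with equality: ||x_0 - x_star||_2 = gamma / (mu sqrt(N+1)) = R
assert math.isclose(gamma / (mu * math.sqrt(N + 1)), R)
# (ObjGap) equals the theorem's bound LR / (2 (2 + sqrt(N+1)))
assert math.isclose(gamma**2 / (2 * mu * (N + 1)), L * R / (2 * (2 + math.sqrt(N + 1))))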