Consider a large-scale convex optimization problem of the form
$$\textrm{minimize}\quad f(x),\qquad(P)$$
where $x\in\mathbf{R}^{d}$ is the decision variable and $d$ is a very large number, as is common in today's machine learning problems. The function $f:\mathbf{R}^{d}\to\mathbf{R}$ is assumed to be convex with its subgradient norm bounded by some constant $L>0$, and $x_{\star}$ denotes an optimal solution to $(P)$.
We assume that information about the objective function $f$ is gained through a first-order oracle. Given an input point $x\in\mathbf{R}^{d}$, this oracle returns the function value $f(x)$ and one subgradient $f^{\prime}(x)\in\partial f(x)$.
Suppose we want to find a solution to $(P)$. To that end, we run some iterative algorithm initialized at a point $x_{0}\in\mathbf{R}^{d}$ which then, using some algorithmic rules, computes $x_{1},x_{2},\ldots,x_{N}$, i.e., $N$ additional iterates. Because we are in a large-scale setup, it is safe to assume that $N\leq d-1$ (if the decision variable is a vector with 10 billion components, it is not realistic to compute 10 billion iterates of any algorithm!). What type of algorithm should we consider?
To solve this optimization problem, we consider algorithms that satisfy the condition
$$\left(\forall i\in[1:N]\right)\quad x_{i}\in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0}),\ldots,f^{\prime}(x_{i-1})\right\},\qquad\textrm{(Span-Cond)}$$
where $f^{\prime}(x_{j})\in\partial f(x_{j})$ denotes the subgradient returned by the oracle at $x_{j}$. This condition holds for the majority of practical first-order methods.
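As a quick sanity check, here is a minimal Python sketch (illustrative only, not part of the proof) showing that the plain subgradient method $x_{i+1}=x_{i}-\gamma_{i}f^{\prime}(x_{i})$ satisfies (Span-Cond): every iterate minus $x_{0}$ lies in the span of the previously returned subgradients. The test function, step sizes, and dimensions are arbitrary choices for the demonstration.

```python
import numpy as np

# Illustrative convex, 1-Lipschitz function f(x) = max_i x[i],
# with subgradient e_{argmax}; an arbitrary choice for this demo.
def subgradient(x):
    g = np.zeros_like(x)
    g[int(np.argmax(x))] = 1.0
    return g

d, N = 50, 10
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)

x, past_subgrads = x0.copy(), []
for i in range(N):
    g = subgradient(x)
    past_subgrads.append(g)
    x = x - 0.1 / (i + 1) * g               # subgradient step (arbitrary step size)
    # Check (Span-Cond): x - x0 must lie in span{f'(x_0), ..., f'(x_i)}.
    G = np.stack(past_subgrads, axis=1)     # d x (i+1) matrix of past subgradients
    coeff, *_ = np.linalg.lstsq(G, x - x0, rcond=None)
    assert np.allclose(G @ coeff, x - x0), "(Span-Cond) violated"
print("the subgradient method satisfies (Span-Cond) on this example")
```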
For any $L,R>0$, any $d,N\in\mathbf{N}$ with $N\leq d-1$, and any starting point $x_{0}\in\mathbf{R}^{d}$, there exist (i) a function $f:\mathbf{R}^{d}\to\mathbf{R}$, which is convex and $L$-Lipschitz continuous, with an optimal solution $x_{\star}$ satisfying $\|x_{0}-x_{\star}\|_{2}=R$, and (ii) a first-order oracle providing subgradient and function value information, such that we have the following lower bound on the objective inaccuracy:
$$f(x_{N})-f(x_{\star})\geq\frac{LR}{\sqrt{N+1}}$$
for any first-order method satisfying (Span-Cond).
The proof proceeds by constructing a "worst-case" function on which any first-order method satisfying (Span-Cond) will not be able to improve its initial objective value $f(x_{0})$ during the first $N$ computed iterations.
First, we define two convex functions $g$ and $h$ as follows. Define the convex function
$$g(x)\coloneqq g_{N+1}(x)=\max\{x[1],x[2],\ldots,x[N+1]\},$$
where $x[i]$ denotes the $i$-th component of the vector $x$. Using subdifferential calculus [2], we can write down the closed-form expression of $g$'s subdifferential at a point $x\in\mathbf{R}^{d}$, given by
$$\begin{aligned} \partial g(x) & =\mathbf{conv}\{e_{k}\mid k\in I_{[1:N+1]}(x)\}\\ & =\left\{ \sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}e_{k}\;\middle|\;\left(\forall k\in I_{[1:N+1]}(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}=1\right\},\end{aligned}$$
where (recall from the notation)
$$I_{[1:N+1]}(x)=\{\ell\in[1:N+1]\mid x[\ell]=\max\{x[1],x[2],\ldots,x[N+1]\}\},$$
i.e., any element of $I_{[1:N+1]}(x)$ corresponds to an index of a maximal component of the vector $x=(x[1],\ldots,x[N+1],\ldots,x[d])$ searched over its first $N+1$ components. Next, define the function
$$h(x)\coloneqq h_{N+1}(x)=\|x\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right),$$
which has subdifferential [2]
$$\partial h(x)=\begin{cases} \left\{ \frac{x}{\|x\|_{2}}\right\}, & \textrm{if }x\neq0,\\ \{y\mid\|y\|_{2}\leq1\}, & \textrm{if }x=0. \end{cases}$$
Finally, define the function $f$, which is going to be our worst-case function:
$$f(x)\coloneqq f_{N+1}(x)=L\max\left\{ g(x),h(x)\right\}.$$
Note that $f$ is a convex function because it is a pointwise maximum of two convex functions. Using subdifferential calculus, we can write down the closed-form expression of $f$'s subdifferential at a point $x\in\mathbf{R}^{d}$, given by [2]:
$$\partial f(x)=L\times\begin{cases} \partial g(x), & \textrm{if }g(x)>h(x),\\ \partial h(x), & \textrm{if }g(x)<h(x),\\ \mathbf{conv}\{g^{\prime}(x),h^{\prime}(x)\mid g^{\prime}(x)\in\partial g(x),\,h^{\prime}(x)\in\partial h(x)\}, & \textrm{if }g(x)=h(x). \end{cases}$$
Hence, if we consider a point $x$ and any subgradient of $g$ at $x$, denoted by $g^{\prime}(x)$, it can be expressed as
$$g^{\prime}(x)=\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}e_{k},\quad\textrm{where }\left(\forall k\in I_{[1:N+1]}(x)\right)\;\lambda_{k}\geq0,\;\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}=1,$$
with the norm satisfying:
$$\begin{aligned} \|g^{\prime}(x)\|_{2} & =\Big\|\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}e_{k}\Big\|_{2}\\ & \overset{a)}{\leq}\sum_{k\in I_{[1:N+1]}(x)}\|\lambda_{k}e_{k}\|_{2}\\ & \overset{b)}{=}\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}\underbrace{\|e_{k}\|_{2}}_{=1}\\ & =\underbrace{\sum_{k\in I_{[1:N+1]}(x)}\lambda_{k}}_{=1}\\ & =1,\qquad(1)\end{aligned}$$
where $a)$ uses the triangle inequality and $b)$ uses the fact that $\lambda_{k}\geq0$ for all $k\in I_{[1:N+1]}(x)$. Also, it is clear from the subdifferential of $h$ that any $h^{\prime}(x)\in\partial h(x)$ satisfies
$$\|h^{\prime}(x)\|_{2}\leq1.\qquad(2)$$
Hence, from the subdifferential of $f$ above and from (1), (2), we find that any $f^{\prime}(x)\in\partial f(x)$ satisfies
$$\|f^{\prime}(x)\|_{2}\leq L.\qquad(3)$$
Let us now verify that $f$ is indeed Lipschitz continuous. For any $x,y$ and any $f^{\prime}(x)\in\partial f(x)$, the subgradient inequality gives
$$\begin{aligned} f(y) & \geq f(x)+\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ \Leftrightarrow\quad f(x)-f(y) & \leq-\left\langle f^{\prime}(x)\mid y-x\right\rangle \\ & =\left\langle f^{\prime}(x)\mid x-y\right\rangle \\ & \overset{a)}{\leq}\|f^{\prime}(x)\|_{2}\,\|x-y\|_{2}\\ & \overset{b)}{\leq}L\|x-y\|_{2},\end{aligned}$$
where $a)$ uses the Cauchy–Schwarz inequality and $b)$ uses (3). The last inequality is symmetric with respect to $x$ and $y$, hence for any $x,y$ we also have
$$f(y)-f(x)\leq L\|y-x\|_{2}.$$
Thus, combining the last two inequalities, for any $x,y$ we have
$$|f(y)-f(x)|=\max\{f(y)-f(x),-\left(f(y)-f(x)\right)\}\leq L\|y-x\|_{2},$$
i.e., $f$ is $L$-Lipschitz continuous.
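To make the construction concrete, here is a minimal Python sketch of the worst-case function (the names `g`, `h`, `f` and the parameter values are illustrative choices, not part of the proof), together with a numerical spot check of the $L$-Lipschitz property on random pairs of points.

```python
import numpy as np

# Illustrative parameters (arbitrary choices for this demo).
L, R, N, d = 2.0, 3.0, 4, 10

def g(x):                  # maximum over the first N+1 coordinates
    return np.max(x[:N + 1])

def h(x):                  # shifted Euclidean norm
    return np.linalg.norm(x) - R * (1.0 + 1.0 / np.sqrt(N + 1))

def f(x):                  # worst-case function f = L * max{g, h}
    return L * max(g(x), h(x))

# Spot check of |f(y) - f(x)| <= L * ||y - x||_2 on random pairs.
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    assert abs(f(y) - f(x)) <= L * np.linalg.norm(y - x) + 1e-12
print("no violation of the L-Lipschitz bound found on 1000 random pairs")
```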
Now let us define the oracle. When asked for first-order information about the function at a point $x$, the oracle returns the function value $f(x)$ and one subgradient of $f$ at $x$ given by (note that it lies in $\partial f(x)$ defined above)
$$f^{\prime}(x)=L\times\begin{cases} e_{i_{x}}, & \textrm{if }g(x)\geq h(x),\\ x/\|x\|_{2}, & \textrm{if }g(x)<h(x), \end{cases}$$
where (recall from the notation) $i_{x}$ is the first coordinate of $x$ that satisfies $x[i_{x}]=\max_{j\in[1:N+1]}x[j]$, i.e.,
$$i_{x}=\min I_{[1:N+1]}(x).$$
We will soon see that this oracle provides the worst possible subgradient for reducing the function value in $N$ iterates; this type of oracle is called a resisting oracle.
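The oracle above translates directly into code. Here is a minimal Python sketch of this resisting oracle (the function name and parameter values are my own illustrative choices; it mirrors the construction from the previous snippet):

```python
import numpy as np

def resisting_oracle(x, L, R, N):
    """Return (f(x), f'(x)) for the worst-case function f = L * max{g, h}."""
    gx = np.max(x[:N + 1])
    hx = np.linalg.norm(x) - R * (1.0 + 1.0 / np.sqrt(N + 1))
    fx = L * max(gx, hx)
    if gx >= hx:
        # np.argmax returns the FIRST maximal index (0-based, vs. 1-based in the text),
        # so this picks exactly i_x = min I_{[1:N+1]}(x).
        grad = np.zeros_like(x)
        grad[int(np.argmax(x[:N + 1]))] = L
    else:
        grad = L * x / np.linalg.norm(x)
    return fx, grad

# usage: at x0 = 0 the oracle returns f(x0) = 0 and f'(x0) = L * e_1
L, R, N, d = 1.0, 1.0, 4, 10
print(resisting_oracle(np.zeros(d), L, R, N))
```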
Define the point $x_{\star}$ componentwise as:
$$x_{\star}[i]=\begin{cases} -\frac{R}{\sqrt{N+1}}, & \textrm{if }i\in[1:N+1],\\ 0, & \textrm{else}, \end{cases}\qquad\textrm{(OptPnt)}$$
which has norm
$$\|x_{\star}\|_{2}=R,\qquad\textrm{(NrmOptPnt)}$$
along with
$$g(x_{\star})=-\frac{R}{\sqrt{N+1}}$$
and
$$h(x_{\star})=\|x_{\star}\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right)=-\frac{R}{\sqrt{N+1}}=g(x_{\star}).$$
As a result,
$$f(x_{\star})=L\times\max\{g(x_{\star}),h(x_{\star})\}=-\frac{LR}{\sqrt{N+1}}.$$
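These values are easy to verify numerically; here is a short check (an illustrative sketch with arbitrary parameter values):

```python
import numpy as np

L, R, N, d = 2.0, 3.0, 4, 10
x_star = np.zeros(d)
x_star[:N + 1] = -R / np.sqrt(N + 1)                    # (OptPnt)

g_star = np.max(x_star[:N + 1])
h_star = np.linalg.norm(x_star) - R * (1.0 + 1.0 / np.sqrt(N + 1))
f_star = L * max(g_star, h_star)

print(np.isclose(np.linalg.norm(x_star), R))            # (NrmOptPnt): ||x_star||_2 = R
print(np.isclose(g_star, h_star))                       # g(x_star) = h(x_star)
print(np.isclose(f_star, -L * R / np.sqrt(N + 1)))      # f(x_star) = -LR/sqrt(N+1)
```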
We claim that $x_{\star}$ is a global minimum of $f$. First, observe that
$$\begin{aligned} I_{[1:N+1]}(x_{\star}) & =\{\ell\in[1:N+1]\mid x_{\star}[\ell]=\max\{x_{\star}[1],x_{\star}[2],\ldots,x_{\star}[N+1]\}\}\\ & =\{1,2,\ldots,N+1\},\end{aligned}$$
as the first $N+1$ components of $x_{\star}$ are all equal. Next, note that at $x_{\star}$, where $g(x_{\star})=h(x_{\star})$, $f$ has the subdifferential
$$\begin{aligned} & \partial f(x_{\star})\\ & =L\times\mathbf{conv}\{g^{\prime}(x_{\star}),h^{\prime}(x_{\star})\mid g^{\prime}(x_{\star})\in\partial g(x_{\star}),\,h^{\prime}(x_{\star})\in\partial h(x_{\star})\}\\ & =L\times\{\alpha g^{\prime}(x_{\star})+(1-\alpha)h^{\prime}(x_{\star})\mid\alpha\in[0,1],\,g^{\prime}(x_{\star})\in\partial g(x_{\star}),\,h^{\prime}(x_{\star})\in\partial h(x_{\star})\}\\ & =L\times\left\{ \alpha\sum_{k\in I_{[1:N+1]}(x_{\star})}\lambda_{k}e_{k}+(1-\alpha)\frac{x_{\star}}{\|x_{\star}\|_{2}}\;\middle|\;\alpha\in[0,1],\left(\forall k\in I_{[1:N+1]}(x_{\star})\right)\;\lambda_{k}\geq0,\sum_{k\in I_{[1:N+1]}(x_{\star})}\lambda_{k}=1\right\}.\end{aligned}$$
In the last equation, if we set $\alpha\coloneqq\left((1/\sqrt{N+1})+1\right)^{-1}$ and $\lambda_{k}\coloneqq\lambda=1/(N+1)$ for all $k\in[1:N+1]$, then we obtain one element of $\partial f(x_{\star})$ as follows:
$$\begin{aligned} \partial f(x_{\star}) & \ni L\times\Big(\alpha\sum_{k\in I_{[1:N+1]}(x_{\star})}\lambda_{k}e_{k}+(1-\alpha)\frac{x_{\star}}{\|x_{\star}\|_{2}}\Big)\\ & =L\alpha\lambda\,\big(\underbrace{1,\ldots,1}_{\textrm{indices }1,\ldots,N+1},\underbrace{0,\ldots,0}_{\textrm{indices }N+2,\ldots,d}\big)+L(1-\alpha)\,\big(\underbrace{-\tfrac{1}{\sqrt{N+1}},\ldots,-\tfrac{1}{\sqrt{N+1}}}_{\textrm{indices }1,\ldots,N+1},\underbrace{0,\ldots,0}_{\textrm{indices }N+2,\ldots,d}\big)\\ & =L\,\frac{\alpha}{N+1}\,\big(\underbrace{1,\ldots,1}_{\textrm{indices }1,\ldots,N+1},0,\ldots,0\big)+L\,\frac{1-\alpha}{\sqrt{N+1}}\,\big(\underbrace{-1,\ldots,-1}_{\textrm{indices }1,\ldots,N+1},0,\ldots,0\big)\\ & =(0,0,\ldots,0),\end{aligned}$$
where in the last line we have used the value of $\alpha$, which satisfies $\frac{\alpha}{N+1}=\frac{1-\alpha}{\sqrt{N+1}}$. Since $0\in\partial f(x_{\star})$, the point $x_{\star}$ is indeed a global minimum of $f$.
We can verify this cancellation symbolically in Mathematica:

alpha = (1 + 1/Sqrt[n + 1])^-1;
alpha/(n + 1) - (1 - alpha)/Sqrt[n + 1] // Simplify
(* the difference simplifies to 0 *)
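As an additional numerical sanity check (an illustrative sketch with arbitrary parameter values), we can build this particular convex combination explicitly and confirm that it is the zero vector:

```python
import numpy as np

L, R, N, d = 2.0, 3.0, 4, 10
x_star = np.zeros(d)
x_star[:N + 1] = -R / np.sqrt(N + 1)

alpha = 1.0 / (1.0 / np.sqrt(N + 1) + 1.0)     # alpha = ((1/sqrt(N+1)) + 1)^(-1)
lam = 1.0 / (N + 1)                            # lambda_k = 1/(N+1)

g_prime = np.zeros(d); g_prime[:N + 1] = lam   # sum_k lambda_k e_k
h_prime = x_star / np.linalg.norm(x_star)      # x_star / ||x_star||_2
subgrad = L * (alpha * g_prime + (1 - alpha) * h_prime)

print(np.allclose(subgrad, 0.0))               # True: the zero vector lies in ∂f(x_star)
```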
Computing iterate $x_{1}$. Let us initialize our algorithm with the initial point $x_{0}=0$, which satisfies $\|x_{\star}-x_{0}\|_{2}=R$ (for a general starting point, one can simply translate the whole construction by $x_{0}$). For this $x_{0}$, we have
$$g(x_{0})=\max\{0,\ldots,0\}=0\geq h(x_{0})=\|\mathbf{0}\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right)=-R\left(1+\frac{1}{\sqrt{N+1}}\right),$$
and
$$\begin{aligned} I_{[1:N+1]}(x_{0}) & =\{\ell\in[1:N+1]\mid x_{0}[\ell]=\max\{0,0,\ldots,0\}\}\\ & =\{1,2,\ldots,N+1\},\\ i_{x_{0}} & =\min I_{[1:N+1]}(x_{0})=1.\end{aligned}$$
We ask the resisting oracle for first-order information at $x_{0}$. It returns
$$\begin{aligned} f(x_{0}) & =L\max\left\{ g(x_{0}),h(x_{0})\right\} \\ & =L\times0=0\end{aligned}$$
and
$$f^{\prime}(x_{0})=L\times e_{i_{x_{0}}}=Le_{1}.$$
So $x_{1}$, the first iterate computed by our first-order method, is going to satisfy (Span-Cond):
$$\begin{aligned} x_{1} & \in x_{0}+\textrm{span}\left\{ f^{\prime}(x_{0})\right\} \\ & =0+\textrm{span}\{Le_{1}\}\\ & =\textrm{span}\{e_{1}\}\\ \Rightarrow x_{1} & =\xi_{1}e_{1}\textrm{ for some }\xi_{1}\in\mathbf{R}.\end{aligned}$$
For the point $x_{1}$, we have
$$g(x_{1})=\max\{\xi_{1},0,\ldots,0\}=\begin{cases} \xi_{1}, & \textrm{if }\xi_{1}\geq0,\\ 0, & \textrm{if }\xi_{1}<0, \end{cases}$$
and
$$h(x_{1})=\|\xi_{1}e_{1}\|_{2}-R\left(1+\frac{1}{\sqrt{N+1}}\right)=\vert\xi_{1}\vert-R\left(1+\frac{1}{\sqrt{N+1}}\right).$$
Computing iterate $x_{2}$. First, consider the case $g(x_{1})\geq h(x_{1})$; then
$$\begin{aligned} I_{[1:N+1]}(x_{1}) & =\{\ell\in[1:N+1]\mid x_{1}[\ell]=\max\{\xi_{1},0,\ldots,0\}\}\\ & =\begin{cases} \{1,2,\ldots,N+1\}, & \textrm{if }\xi_{1}=0,\\ \{2,3,\ldots,N+1\}, & \textrm{if }\xi_{1}<0,\\ \{1\}, & \textrm{if }\xi_{1}>0, \end{cases}\end{aligned}$$
and
$$i_{x_{1}}=\min I_{[1:N+1]}(x_{1})=\begin{cases} 1, & \textrm{if }\xi_{1}=0,\\ 2, & \textrm{if }\xi_{1}<0,\\ 1, & \textrm{if }\xi_{1}>0. \end{cases}$$
Now we ask the oracle for first-order information at $x_{1}$. It returns
$$\begin{aligned} f(x_{1}) & =L\times\max\{x_{1}[1],x_{1}[2],\ldots,x_{1}[N+1]\}\\ & \geq L\times x_{1}[N+1],\end{aligned}$$
and
$$f^{\prime}(x_{1})=L\times e_{i_{x_{1}}}=L\,(e_{1}\textrm{ or }e_{2}).$$
Hence, for the case $g(x_{1})\geq h(x_{1})$, the second iterate $x_{2}$ is going to satisfy (Span-Cond):
$$\begin{aligned} x_{2} & \in x_{0}+\textrm{span}\{f^{\prime}(x_{0}),f^{\prime}(x_{1})\}\\ & =\textrm{span}\{Le_{1},L(e_{1}\textrm{ or }e_{2})\}\\ & \subseteq\textrm{span}\{e_{1},e_{2}\}.\end{aligned}$$
Now consider the case $g(x_{1})<h(x_{1})$; then
$$\begin{aligned} f(x_{1}) & =L\times\max\{g(x_{1}),h(x_{1})\}\textrm{ (by definition)}\\ & =L\times h(x_{1})=L\left(\vert\xi_{1}\vert-R\left(1+\frac{1}{\sqrt{N+1}}\right)\right)\\ & \geq L\times g(x_{1})\\ & =L\times\max\{x_{1}[1],x_{1}[2],\ldots,x_{1}[N+1]\}\\ & \geq L\times x_{1}[N+1],\end{aligned}$$
and
$$f^{\prime}(x_{1})=Lx_{1}/\|x_{1}\|_{2}=L\xi_{1}e_{1}/\vert\xi_{1}\vert=(L\xi_{1}/\vert\xi_{1}\vert)\,e_{1}.$$
So, for the case $g(x_{1})<h(x_{1})$, we have
$$x_{2}\in\textrm{span}\{e_{1}\}.$$
So, irrespective of whether $g(x_{1})\geq h(x_{1})$ or $g(x_{1})<h(x_{1})$, we have
$$x_{2}\in\textrm{span}\{e_{1},e_{2}\}.$$
Generalization about the iterates. Continuing in this manner, we can show that for any $i\in[1:N]$ we have
$$x_{i}\in\textrm{span}\{e_{1},e_{2},\ldots,e_{i}\},$$
and, as a result, the components of $x_{i}$ corresponding to indices $N+1,\ldots,d$ will be zero for all $i\in[1:N]$, i.e.,
$$x_{i}[N+1]=\ldots=x_{i}[d]=0.$$
Hence, the objective value of the iterates $x_{1},\ldots,x_{N}$ satisfies, for all $i\in[1:N]$,
$$\begin{aligned} f(x_{i}) & =L\times\max\{g(x_{i}),h(x_{i})\}\textrm{ (by definition)}\\ & \geq L\times g(x_{i})\textrm{ (by definition)}\\ & =L\times\max\{x_{i}[1],x_{i}[2],\ldots,x_{i}[N+1]\}\\ & \geq L\times x_{i}[N+1]=0,\qquad\textrm{(ObjNotImprv)}\end{aligned}$$
where in the last line we have used the simple fact that the maximum element of a list is at least as large as any element of the list, together with $x_{i}[N+1]=0$. This is interesting because it says that, no matter what we do, we cannot improve on the initial objective value $f(x_{0})=0$ over the iterates $x_{1},\ldots,x_{N}$!
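To see this resisting behavior concretely, here is a minimal Python simulation (an illustrative sketch, not part of the proof): we run a plain subgradient method, which satisfies (Span-Cond), against the worst-case function via the resisting oracle, and observe that none of the iterates $x_{0},\ldots,x_{N}$ improves on $f(x_{0})=0$. The parameter values and the step-size rule are arbitrary choices.

```python
import numpy as np

L, R, N, d = 1.0, 1.0, 20, 50
f_star = -L * R / np.sqrt(N + 1)                       # optimal value f(x_star)

def oracle(x):
    """Resisting oracle for f = L * max{g, h}: returns (f(x), f'(x))."""
    gx = np.max(x[:N + 1])
    hx = np.linalg.norm(x) - R * (1.0 + 1.0 / np.sqrt(N + 1))
    fx = L * max(gx, hx)
    if gx >= hx:
        grad = np.zeros(d)
        grad[int(np.argmax(x[:N + 1]))] = L            # first maximal coordinate
    else:
        grad = L * x / np.linalg.norm(x)
    return fx, grad

x = np.zeros(d)                                        # x0 = 0, so ||x0 - x_star||_2 = R
best = np.inf
for i in range(N):
    fx, grad = oracle(x)
    best = min(best, fx)
    x = x - (R / (L * np.sqrt(N + 1))) * grad          # subgradient step; satisfies (Span-Cond)
f_N, _ = oracle(x)
best = min(best, f_N)

print("best objective value over x0, ..., xN:", best)  # stays at f(x0) = 0
print("lower bound LR/sqrt(N+1)             :", L * R / np.sqrt(N + 1))
print("gap f(xN) - f(x_star)                :", f_N - f_star)
```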
Final lower bound. Finally, applying (ObjNotImprv) to the $N$-th iterate, we get
$$\begin{aligned} f(x_{N})-f(x_{\star}) & \geq0-f(x_{\star})\\ & =-\left(-\frac{LR}{\sqrt{N+1}}\right)\\ & =\frac{LR}{\sqrt{N+1}},\qquad\textrm{(ObjGap)}\end{aligned}$$
thus completing our proof.
[1] Drori, Yoel, and Teboulle, Marc. An optimal variant of Kelley's cutting-plane method. Mathematical Programming 160.1 (2016): 321-351.
[2] Beck, Amir. First-order methods in optimization. Society for Industrial and Applied Mathematics, 2017.