The Jupyter notebook for this blog can be downloaded from this link and viewed here.
Before we implement the gradient descent method, let us first record some necessary background.
Given a differentiable convex function $f: \mathbb{R}^n \to \mathbb{R}$, our goal is to solve

$$\underset{x \in \mathbb{R}^n}{\operatorname{minimize}} \quad f(x),$$

where $x \in \mathbb{R}^n$ is the decision variable. To solve the problem above, we consider the gradient descent algorithm, which implements the following iteration scheme:

$$x_{n+1} = x_n - \gamma_n \nabla f(x_n), \tag{1}$$

where $\nabla f(x_n)$ denotes the gradient of $f$ evaluated at the iterate $x_n$, and $n$ is our iteration counter. As our step size rule, we pick a sequence $\{\gamma_n\}$ that is square-summable but not summable; e.g., $\gamma_n = 1/n$ will do the job.
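To fix ideas before building the full solver, here is a minimal sketch of the iteration scheme above on a toy one-dimensional objective; the objective $f(x) = (x-3)^2$, the starting point, and the iteration count are all illustrative choices, not part of the solver we build below.

```julia
# a minimal sketch of the gradient descent iteration on the illustrative
# objective f(x) = (x - 3)^2, whose gradient is ∇f(x) = 2(x - 3)
function toy_gradient_descent(x0; maxit = 100)
    x = x0
    for n in 1:maxit
        γ = 1/n             # step size γ_n = 1/n: square-summable but not summable
        x = x - γ*2*(x - 3) # x_{n+1} = x_n - γ_n ∇f(x_n)
    end
    return x
end

toy_gradient_descent(0.0) # ≈ 3.0, the minimizer of f
```

The full implementation below follows exactly this pattern, but packages the problem data, the solver settings, and the iterate information into separate types.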
We will go through the following steps:
Load the packages
Create the types
Write the functions
Let us load the necessary packages.
## Load the packages to be used
# -----------------------------
# comment out the following two lines if you already have ProximalOperators
using Pkg
Pkg.add("ProximalOperators")
using ProximalOperators, LinearAlgebra
Next, we define the few Julia types that we require to write an optimization solver in Julia.
GD_problem
This type contains information about the problem instance: it basically tells us what function $f$ we are trying to minimize, an initial point $x_0$, and the initial step size $\gamma$.
struct GD_problem{F <: ProximableFunction, A <: AbstractVecOrMat{<:Real}, R <: Real}
    # problem structure, contains information regarding the problem
    f::F # the objective function
    x0::A # the initial condition
    γ::R # the step size
end
Usage of GD_problem
For example, the user may wish to solve a simple least-squares problem using gradient descent; they can then create a problem instance as follows. A list of functions that we can use in this regard can be found in the documentation of ProximalOperators: https://kul-forbes.github.io/ProximalOperators.jl/latest.
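For instance, LeastSquares from ProximalOperators encodes $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, and the package's gradient function returns the pair of the gradient and the function value; a small sketch (the matrix and vectors here are illustrative):

```julia
using ProximalOperators

# f(x) = 0.5*||Ax - b||^2 with an illustrative A (identity) and b
f = LeastSquares([1.0 0.0; 0.0 1.0], [1.0, 2.0])

# gradient returns (∇f(x), f(x)); here ∇f(x) = A'*(A*x - b)
∇f_x, f_x = gradient(f, [0.0, 0.0])
# at x = 0 the gradient is -b = [-1.0, -2.0], and f(0) = 0.5*||b||^2 = 2.5
```

We will rely on this gradient function in the iteration routine below.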
# create a problem instance
# ------------------------
A = randn(6,5)
b = randn(6)
m, n = size(A)
# randomized initial point:
x0 = randn(n)
f = LeastSquares(A, b)
γ = 1.0
# create GD_problem
problem = GD_problem(f, x0, γ)
GD_problem(description : Least squares penalty
domain : n/a
expression : n/a
parameters : n/a, [1.0193204973421695, 1.168794122203917, 1.296856638205154, 0.5001687528217552, 0.1500452441023732], 1.0)
GD_setting
This type contains the different parameters required to implement our algorithm, such as:
the initial step size $\gamma$,
the maximum number of iterations maxit,
the tolerance tol (i.e., if $\|\nabla f(x_n)\| \leq \mathrm{tol}$, we take $x_n$ to be an optimal solution and terminate our algorithm),
whether to print out information about the iterates or not, controlled by the boolean variable verbose, and
how frequently to print out such information, controlled by the variable freq.
The user may specify what values these parameters should take, but if they do not specify anything, we should fall back on a default set of values. We can achieve this by creating a simple constructor function for GD_setting.
struct GD_setting
    # user settings to solve the problem using gradient descent
    γ::Float64 # the step size
    maxit::Int64 # maximum number of iterations
    tol::Float64 # tolerance, i.e., if ||∇f(x)|| ≤ tol, we take x to be an optimal solution
    verbose::Bool # whether to print information about the iterates
    freq::Int64 # how often to print information about the iterates
    # constructor for the structure, so if the user does not specify any particular values,
    # then we create a GD_setting object with default values
    function GD_setting(; γ = 1, maxit = 1000, tol = 1e-8, verbose = false, freq = 10)
        new(γ, maxit, tol, verbose, freq)
    end
end
Usage of GD_setting
For the previously described least-squares problem, we create the following setting instance.
setting = GD_setting(verbose = true, tol = 1e-2, maxit = 1000, freq = 100)
GD_setting(1, 1000, 0.01, true, 100)
GD_state
Now we define the type named GD_state that describes the state of our algorithm at iteration number $n$. The state is characterized by:
the current iterate $x_n$,
the gradient of $f$ at the current iterate, $\nabla f(x_n)$,
the step size at iteration $n$, $\gamma_n$, and
the iteration number $n$.
mutable struct GD_state{T <: AbstractVecOrMat{<: Real}, I <: Integer, R <: Real} # contains the information regarding one iteration of the sequence
    x::T # iterate x_n
    ∇f_x::T # the gradient ∇f(x_n)
    γ::R # step size γ_n
    n::I # iteration counter
end
Also, once the user has provided the problem information by creating a GD_problem instance, as we did earlier for the least-squares problem, we need a method to construct the initial value of the type GD_state. We create this initial state from the problem instance by writing a constructor function.
function GD_state(problem::GD_problem)
    # a constructor for the struct GD_state: it takes the problem data and creates one state containing
    # the iterate information, the current gradient, etc., so that we can start our gradient descent scheme
    # unpack information from problem, which is of type GD_problem
    x0 = copy(problem.x0) # copy, so we do not mutate the user's initial point
    f = problem.f
    γ = problem.γ
    ∇f_x, f_x = gradient(f, x0) # gradient from ProximalOperators returns (∇f(x0), f(x0))
    n = 1
    return GD_state(x0, ∇f_x, γ, n)
end
GD_state
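Usage of GD_state. For the previously created problem instance (assuming it is still in scope), we can construct the initial state as follows; this is also exactly what our solver will do internally.

```julia
# construct the initial state from the problem instance created earlier
state = GD_state(problem)
state.n # the iteration counter starts at 1
state.γ # the initial step size, copied from the problem
```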
Now that we are done defining the types, we can focus on writing the functions that will implement our gradient descent scheme.
GD_iteration!
First, we need a function that takes the problem information and the state of our algorithm at iteration number $n$, and then computes the next state for iteration number $n+1$ according to (1).
function GD_iteration!(problem::GD_problem, state::GD_state)
    # this is the main iteration function: it takes the problem information and the previous state,
    # and creates the new state using the gradient descent algorithm
    # unpack the current state information
    x_n = state.x
    ∇f_x_n = state.∇f_x
    γ_n = state.γ
    n = state.n
    # compute the next iterate
    x_n_plus_1 = x_n - γ_n*∇f_x_n
    # now load the computed values into the state
    state.x = x_n_plus_1
    state.∇f_x, f_x = gradient(problem.f, x_n_plus_1) # note that f_x is not used anywhere
    # gradient(f, x) is a function in the ProximalOperators package; see its documentation
    # if more information is required
    state.γ = 1/(n+1) # diminishing step size γ_{n+1} = 1/(n+1)
    state.n = n+1
    # done computing; return the new state
    return state
end
GD_iteration! (generic function with 1 method)
GD_solver
Now we are in a position to write the main solver function, named GD_solver, that will be used by the end user. Internally, this function will take the problem information and the problem setting, and then it will:
create the initial state,
keep updating the state using the GD_iteration! function until we reach the termination criterion or the maximum number of iterations,
print the state of the algorithm, if verbose is true, at the specified frequency, and
return the final state.
## The solver function
function GD_solver(problem::GD_problem, setting::GD_setting)
    # this is the function that the end user will use to solve a particular problem;
    # internally, it uses the previously defined types and functions to run the gradient descent scheme
    # create the initial state
    state = GD_state(problem)
    ## time to run the loop
    while (state.n < setting.maxit) && (norm(state.∇f_x, Inf) > setting.tol)
        # compute a new state
        state = GD_iteration!(problem, state)
        # print information if verbose == true
        if setting.verbose == true
            if mod(state.n, setting.freq) == 0
                @info "iteration = $(state.n) | obj val = $(problem.f(state.x)) | gradient norm = $(norm(state.∇f_x, Inf))"
            end
        end
    end
    # print information regarding the final state
    @info "final iteration = $(state.n) | final obj val = $(problem.f(state.x)) | final gradient norm = $(norm(state.∇f_x, Inf))"
    return state
end
GD_solver (generic function with 1 method)
Usage of GD_solver
For the previously created problem and setting, we run our GD_solver function as follows.
# The following call runs the entire gradient descent loop on our problem instance
final_state_GD = GD_solver(problem, setting)
┌ Info: iteration = 100 |
│ obj val = 0.10616975436389622 |
│ gradient norm = 0.02964822736954073
└ @ Main In[9]:16
┌ Info: iteration = 200 |
│ obj val = 0.10526904472309405 |
│ gradient norm = 0.019909231214547213
└ @ Main In[9]:16
┌ Info: iteration = 300 |
│ obj val = 0.10499388213419013 |
│ gradient norm = 0.01578344670011611
└ @ Main In[9]:16
┌ Info: iteration = 400 |
│ obj val = 0.1048632843417343 |
│ gradient norm = 0.013388371065530952
└ @ Main In[9]:16
┌ Info: iteration = 500 |
│ obj val = 0.10478781849295837 |
│ gradient norm = 0.011784780173541287
└ @ Main In[9]:16
┌ Info: iteration = 600 |
│ obj val = 0.10473897267694926 |
│ gradient norm = 0.010618620536011203
└ @ Main In[9]:16
┌ Info: final iteration = 667 |
│ final obj val = 0.10471495099365061 |
│ final gradient norm = 0.009995364642054971
└ @ Main In[9]:25
GD_state([-0.747429923239562, 0.7857409304268766, -1.9752613372498569, -0.8109189160513681, 0.6537705422543739], [-0.002114026011632561, -0.008995100759876307, 0.009995364642054971, 0.0024806200462658134, -0.004665182937515999], 0.0014992503748125937, 667)
println("objective value found by our gradient descent $(f(final_state_GD.x))")
println("real objective value $(f(pinv(A)*b)) ")
objective value found by our gradient descent 0.10471495099365061
real objective value 0.10452813292628083
So, we do a decent job of finding a good solution!