# Implementing a simple gradient descent algorithm in Julia

Shuvomoy Das Gupta

April 10, 2020

In this blog, we discuss how to implement a simple gradient descent scheme in Julia. To do this, we will use the Julia package ProximalOperators, an excellent package for computing proximal operators and gradients of common convex functions. I highly recommend the package to anyone interested in operator splitting algorithms. You can find more information about the package at: https://github.com/kul-forbes/ProximalOperators.jl.

Jupyter notebook for this blog. The Jupyter notebook for this blog can be downloaded from this link and viewed here.

Before implementing the gradient descent method, we first record some necessary background.

### Background

Given a differentiable convex function $f$, our goal is to solve $\textrm{minimize}\;f(x)$, where $x\in \mathbf{R}^n$ is the decision variable. To solve this problem, we consider the gradient descent algorithm, which implements the following iteration scheme:

$x_{n+1} = x_{n}-\gamma_{n}{\nabla f(x_{n})},\qquad (1)$

where ${\nabla f(x_{n})}$ denotes the gradient of $f$ evaluated at the iterate $x_{n}$, and $n$ is our iteration counter. As our step size rule, we pick a sequence that is square-summable but not summable; e.g., $\gamma_{n}=1/n$ will do the job.
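As a quick sanity check, iteration (1) with the diminishing step size $\gamma_n = 1/n$ can be sketched in a few lines of base Julia. The one-dimensional objective $f(x) = (x-3)^2$ and the function name below are purely illustrative, not part of the solver we build in the rest of this blog:

```julia
# Minimal sketch of iteration (1) with γ_n = 1/n on the toy objective
# f(x) = (x - 3)^2, whose gradient is ∇f(x) = 2(x - 3).
function toy_gradient_descent(x0; maxit = 100)
    x = x0
    for n in 1:maxit
        ∇f_x = 2 * (x - 3.0)    # gradient at the current iterate
        x = x - (1 / n) * ∇f_x  # x_{n+1} = x_n - γ_n ∇f(x_n)
    end
    return x
end

toy_gradient_descent(0.0)  # approaches the minimizer x* = 3
```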

We will go through the following steps:

1. Load the packages

2. Create the types

3. Write the functions

### Load the packages

Let us load the necessary packages that we are going to use.

```julia
## Load the packages to be used
# -----------------------------
# comment the first two lines if you already have ProximalOperators
using Pkg
Pkg.add("ProximalOperators")
using ProximalOperators, LinearAlgebra
```

### Create the types

Next, we define a few Julia types that we require to write an optimization solver in Julia.

### GD_problem

This type contains information about the problem instance: it basically tells us which function $f$ we are trying to minimize, the initial point $x_0$, and the initial step size $\gamma_0$.

```julia
struct GD_problem{F <: ProximableFunction, A <: AbstractVecOrMat{<:Real}, R <: Real}

    # problem structure, contains information regarding the problem

    f::F   # the objective function
    x0::A  # the initial condition
    γ::R   # the initial step size

end
```

Usage of GD_problem. For example, the user may wish to solve a simple least-squares problem using gradient descent. They can then create a problem instance. A list of functions that we can use in this regard can be found in the documentation of ProximalOperators: https://kul-forbes.github.io/ProximalOperators.jl/latest.

```julia
# create a problem instance
# -------------------------

A = randn(6, 5)
b = randn(6)
m, n = size(A)

# randomized initial point:
x0 = randn(n)

f = LeastSquares(A, b)
γ = 1.0

# create GD_problem
problem = GD_problem(f, x0, γ)
```
GD_problem(description : Least squares penalty
domain      : n/a
expression  : n/a
parameters  : n/a, [1.0193204973421695, 1.168794122203917, 1.296856638205154, 0.5001687528217552, 0.1500452441023732], 1.0)

### GD_setting

This type contains different parameters required to implement our algorithm, such as,

• the initial step size $\gamma$,

• maximum number of iterations $\textrm{maxit}$,

• what should be the tolerance $\textrm{tol}$ (i.e., if $\| \nabla{f(x)} \| \leq \textrm{tol}$, we take that $x$ to be an optimal solution and terminate our algorithm),

• whether to print out information about the iterates, controlled by a boolean variable $\textrm{verbose}$, and

• how frequently to print out such information controlled by the variable $\textrm{freq}$.

The user may specify values for these parameters. But if they do not specify anything, we should fall back on a sensible set of default values. We can achieve this by creating a simple constructor function for GD_setting.

```julia
struct GD_setting

    # user settings to solve the problem using gradient descent

    γ::Float64     # the step size
    maxit::Int64   # maximum number of iterations
    tol::Float64   # tolerance, i.e., if ||∇f(x)|| ≤ tol, we take x to be an optimal solution
    verbose::Bool  # whether to print information about the iterates
    freq::Int64    # how often to print information about the iterates

    # constructor for the structure, so if the user does not specify any particular values,
    # then we create a GD_setting object with default values
    function GD_setting(; γ = 1, maxit = 1000, tol = 1e-8, verbose = false, freq = 10)
        new(γ, maxit, tol, verbose, freq)
    end

end
```

Usage of GD_setting. For the previously described least squares problem, we create the following setting instance.

```julia
setting = GD_setting(verbose = true, tol = 1e-2, maxit = 1000, freq = 100)
```
GD_setting(1, 1000, 0.01, true, 100)

### GD_state

Now we define the type named GD_state that describes the state of our algorithm at iteration number $n$. The state consists of

• the current iterate $x_n$,

• the gradient of $f$ at the current iterate: ${\nabla{f}(x_n)}$,

• the stepsize at iteration $n$: $\gamma_n$, and

• iteration number: $n$.

```julia
mutable struct GD_state{T <: AbstractVecOrMat{<:Real}, I <: Integer, R <: Real}

    # contains information regarding one iteration of the sequence

    x::T     # iterate x_n
    ∇f_x::T  # gradient ∇f(x_n)
    γ::R     # stepsize γ_n
    n::I     # iteration counter

end
```

Also, once the user has given the problem information by creating a GD_problem instance, we need a method to construct the initial value of the type GD_state, as we did earlier for the least-squares problem. We create the initial state from the problem instance by writing a constructor function.

```julia
function GD_state(problem::GD_problem)

    # a constructor for the struct GD_state: it takes the problem data and creates the initial
    # state containing the iterate, the gradient, etc., so that we can start our gradient descent scheme

    # unpack information from the problem, which is of type GD_problem
    x0 = copy(problem.x0) # copy to be safe
    f = problem.f
    γ = problem.γ
    ∇f_x, f_x = gradient(f, x0)
    n = 1

    return GD_state(x0, ∇f_x, γ, n)

end
```
GD_state

### Write the functions

Now that we are done defining the types, we can focus on writing the functions that implement our gradient descent scheme.

### GD_iteration!

First, we need a function that will take the problem information and the state of our algorithm at iteration number $n$, and then compute the next state for iteration number $n+1$ according to (1).

```julia
function GD_iteration!(problem::GD_problem, state::GD_state)

    # this is the main iteration function: it takes the problem information and the previous state,
    # and creates the new state using the gradient descent algorithm

    # unpack the current state information
    x_n = state.x
    ∇f_x_n = state.∇f_x
    γ_n = state.γ
    n = state.n

    # compute the next state
    x_n_plus_1 = x_n - γ_n * ∇f_x_n

    # now load the computed values into the state
    state.x = x_n_plus_1
    state.∇f_x, f_x = gradient(problem.f, x_n_plus_1) # note that f_x is not used anywhere
    # gradient(f, x) is a function in the ProximalOperators package; see its documentation
    state.γ = 1 / (n + 1)
    state.n = n + 1

    # done computing; return the new state
    return state

end
```
GD_iteration! (generic function with 1 method)

### GD_solver

Now we are in a position to write the main solver function named GD_solver that will be used by the end user. Internally, this function will take the problem information and the problem setting, and then it will

• create the initial state,

• keep updating the state using GD_iteration! function until we reach the termination criterion or the maximum number of iterations,

• print the state of the algorithm at the specified frequency if verbose is true, and

• return the final state.

```julia
## The solver function

function GD_solver(problem::GD_problem, setting::GD_setting)

    # this is the function that the end user will use to solve a particular problem; internally
    # it uses the previously defined types and functions to run the gradient descent scheme

    # create the initial state
    state = GD_state(problem)

    ## time to run the loop
    while (state.n < setting.maxit) && (norm(state.∇f_x, Inf) > setting.tol)
        # compute a new state
        state = GD_iteration!(problem, state)
        # print information if verbose == true
        if setting.verbose == true
            if mod(state.n, setting.freq) == 0
                @info "iteration = $(state.n) | obj val = $(problem.f(state.x)) | gradient norm = $(norm(state.∇f_x, Inf))"
            end
        end
    end

    # print information regarding the final state
    @info "final iteration = $(state.n) | final obj val = $(problem.f(state.x)) | final gradient norm = $(norm(state.∇f_x, Inf))"

    return state

end
```
GD_solver (generic function with 1 method)

Usage of GD_solver. For the previously created problem and setting, we run our GD_solver function as follows.

```julia
# The following call runs the entire gradient descent loop on our problem instance
final_state_GD = GD_solver(problem, setting)
```
┌ Info: iteration = 100 |
│                 obj val = 0.10616975436389622 |
│                 gradient norm = 0.02964822736954073
└ @ Main In:16
┌ Info: iteration = 200 |
│                 obj val = 0.10526904472309405 |
│                 gradient norm = 0.019909231214547213
└ @ Main In:16
┌ Info: iteration = 300 |
│                 obj val = 0.10499388213419013 |
│                 gradient norm = 0.01578344670011611
└ @ Main In:16
┌ Info: iteration = 400 |
│                 obj val = 0.1048632843417343 |
│                 gradient norm = 0.013388371065530952
└ @ Main In:16
┌ Info: iteration = 500 |
│                 obj val = 0.10478781849295837 |
│                 gradient norm = 0.011784780173541287
└ @ Main In:16
┌ Info: iteration = 600 |
│                 obj val = 0.10473897267694926 |
│                 gradient norm = 0.010618620536011203
└ @ Main In:16
┌ Info: final iteration = 667 |
│     final obj val = 0.10471495099365061 |
│     final gradient norm = 0.009995364642054971
└ @ Main In:25

GD_state([-0.747429923239562, 0.7857409304268766, -1.9752613372498569, -0.8109189160513681, 0.6537705422543739], [-0.002114026011632561, -0.008995100759876307, 0.009995364642054971, 0.0024806200462658134, -0.004665182937515999], 0.0014992503748125937, 667)
```julia
println("objective value found by our gradient descent $(f(final_state_GD.x))")
println("real objective value $(f(pinv(A)*b))")
```
objective value found by our gradient descent 0.10471495099365061
real objective value 0.10452813292628083

So, we do a decent job of finding a good solution!
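As a further sanity check, the same diminishing-step scheme can be run in base Julia alone (no ProximalOperators) on a small hand-picked least-squares instance and compared against the closed-form solution `A \ b`. Here we use the fact that $f(x)=\tfrac{1}{2}\|Ax-b\|^2$ has gradient $A^\top(Ax-b)$; the function name and data below are illustrative, not the randomized instance from the blog:

```julia
using LinearAlgebra

# Sketch: gradient descent with γ_n = 1/n on f(x) = (1/2)‖Ax - b‖²,
# whose gradient is A'(Ax - b); compare against the exact solution A \ b.
function least_squares_gd(A, b; maxit = 10_000)
    x = zeros(size(A, 2))
    for n in 1:maxit
        x = x - (1 / n) * (A' * (A * x - b))  # iteration (1)
    end
    return x
end

A = [1.0 0.0; 0.0 1.0; 0.5 -0.5]  # small, well-conditioned instance
b = [1.0, 2.0, 0.5]

x_gd = least_squares_gd(A, b)
x_star = A \ b                    # exact least-squares solution
norm(x_gd - x_star)               # should be small
```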