# Implementing a simple gradient descent algorithm in Julia

**Shuvomoy Das Gupta**

*April 10, 2020*

In this blog, we discuss how to implement a simple gradient descent scheme in `Julia`

. To do this, we will use the Julia package `ProximalOperators`

, which is an excellent package to compute proximal operators and gradient of common convex functions. I highly recommend the package for anyone interested in operator splitting algorithms. You can find more information about the package at: https://github.com/kul-forbes/ProximalOperators.jl.

**Table of contents**

**Jupyter notebook for this blog.** The jupyter notebook for this blog can be downloaded from this link and viewed here.

Before we implement gradient descent method, we first record some necessary background.

### Background.

Given a differentiable convex function $f$, our goal is to solve $\textrm{minimize}\;f(x)$, where $x\in \mathbf{R}^n$ is the decision variable. To solve the problem above, we consider gradient descent algorithm. The gradient descent implements the following iteration scheme:

$x_{n+1} = x_{n}-\gamma_{n}{\nabla f(x_{n})},\qquad (1)$where ${\nabla f(x_{n})}$ denotes a gradient of $f$ evaluated at the iterate $x_{n}$, and $n$ is our iteration counter. As our step size rule, we pick a sequence that is square-summable but not summable, e.g., $\gamma_{n}=1/n$, will do the job.

We will go through the following steps:

Load the packages

Create the types

Write the functions

### Load the packages

Let us load the necessary packages that we are going to use.

```
## Load the packages to be used
# -----------------------------
# comment the first two lines if you already have ProximalOperators
using Pkg
Pkg.add("ProximalOperators")
using ProximalOperators, LinearAlgebra
```

### Create the types

Next, we define a few Julia types, that we require to write an optimization solver in `Julia`

.

`GD_problem`

This type contains information about the problem instance, this bascially tells us what function $f$ we are trying to optimize over, one initial point $x_0$, and what should be the beginning step size $\gamma_0$.

```
struct GD_problem{F <: ProximableFunction,A <: AbstractVecOrMat{<:Real}, R <: Real}
# problem structure, contains information regarding the problem
f::F # the objective function
x0::A # the intial condition
γ::R # the stepsize
end
```

**Usage of GD_problem.** For example, the user may wish to solve a simple least-squares problem using gradient descent. Then he can create a problem instance. A list of functions that we can use in this regard can be found in the documentation of

`ProximalOperators`

: https://kul-forbes.github.io/ProximalOperators.jl/latest.```
# create a problem instance
# ------------------------
A = randn(6,5)
b = randn(6)
m, n = size(A)
# randomized intial point:
x0 = randn(n)
f = LeastSquares(A, b)
γ = 1.0
# create GD_problem
problem = GD_problem(f, x0, γ)
```

```
GD_problem(description : Least squares penalty
domain : n/a
expression : n/a
parameters : n/a, [1.0193204973421695, 1.168794122203917, 1.296856638205154, 0.5001687528217552, 0.1500452441023732], 1.0)
```

`GD_setting`

This type contains different parameters required to implement our algorithm, such as,

the initial step size $\gamma$,

maximum number of iterations $\textrm{maxit}$,

what should be the tolerance $\textrm{tol}$ (i.e., if $\| \nabla{f(x)} \| \leq \textrm{tol}$, we take that $x$ to be an optimal solution and terminate our algorithm),

whether to print out information about the iterates or not controlled by a boolean variable $\textrm{verbose}$, and

how frequently to print out such information controlled by the variable $\textrm{freq}$.

The user may specify what values for these parameters above should be used. But if he does not specify anything, we should be able to have a default set of values to be used. We can achieve this by creating a simple constructor function for `GD_setting`

.

```
struct GD_setting
# user settings to solve the problem using Gradient Descent
γ::Float64 # the step size
maxit::Int64 # maximum number of iteration
tol::Float64 # tolerance, i.e., if ||∇f(x)|| ≤ tol, we take x to be an optimal solution
verbose::Boolean # whether to print information about the iterates
freq::Int64 # how often print information about the iterates
# constructor for the structure, so if user does not specify any particular values,
# then we create a GD_setting object with default values
function GD_setting(; γ = 1, maxit = 1000, tol = 1e-8, verbose = false, freq = 10)
new(γ, maxit, tol, verbose, freq)
end
end
```

**Usage of GD_setting.** For the previously described least squares problem, we create the following setting instance.

`setting = GD_setting(verbose = true, tol = 1e-2, maxit = 1000, freq = 100)`

`GD_setting(1, 1000, 0.01, true, 100)`

`GD_state`

Now we define the type named `GD_state`

that describes the state our algorithm at iteration number $n$. The state is controlled by

current iterte $x_n$,

the gradient of $f$ at the current iterate: ${\nabla{f}(x_n)}$,

the stepsize at iteration $n$: $\gamma_n$, and

iteration number: $n$.

```
mutable struct GD_state{T <: AbstractVecOrMat{<: Real}, I <: Integer, R <: Real} # contains information regarding one iterattion sequence
x::T # iterate x_n
∇f_x::T # one gradient ∇f(x_n)
γ::R # stepsize
n::I # iteration counter
end
```

Also, once the user has given the problem information by creating a problem instance `GD_problem`

, we need a method to construct the initial value of the type `GD_state`

, as we did earlier for the least-squares problem. We create the initial state from the problem instance by writing a constructor function.

```
function GD_state(problem::GD_problem)
# a constructor for the struct GD_state, it will take the problem data and create one state containing all
# the iterate information, current state of the gradient etc so that we can start our gradient descent scheme
# unpack information from iter which is GD_iterable type
x0 = copy(problem.x0) # to be safe
f = problem.f
γ = problem.γ
∇f_x, f_x = gradient(f, x0)
n = 1
return GD_state(x0, ∇f_x, γ, n)
end
```

`GD_state`

## Write the functions

Now that we are done defining the types, we can now focus on writing the functions that will implement our gradient descent scheme.

`GD_iteration!`

First, we need a function that will take the problem information and the state of our algorithm at iteration number $n$, and then compute the next state for iteration number $n+1$ according to (1).

```
function GD_iteration!(problem::GD_problem, state::GD_state)
# this is the main iteration function, that takes the problem information, and the previous state,
# and create the new state using Gradient Descent algorithm
# unpack the current state information
x_n = state.x
∇f_x_n = state.∇f_x
γ_n = state.γ
n = state.n
# compute the next state
x_n_plus_1 = x_n - γ_n*∇f_x_n
# now load the computed values in the state
state.x = x_n_plus_1
state.∇f_x, f_x = gradient(problem.f, x_n_plus_1) # note that f_x is not used anywhere
# gradient(f,x) is a function in the ProximalOperators package, see its documentation
# if more information is required
state.γ = 1/(n+1)
state.n = n+1
# done computing return the new state
return state
end
```

`GD_iteration! (generic function with 1 method)`

`GD_solver`

Now we are in a position to write the main solver function named `GD_solver`

that will be used by the end user. Internally, this function will take the problem information and the problem setting, and then it will

create the initial state,

keep updating the state using

`GD_iteration!`

function until we reach the termination criterion or the maximum number of iterations,print state of the algorithm if

`verbose`

is`true`

at the specified frequency, andreturn the final state.

```
## The solver function
function GD_solver(problem::GD_problem, setting::GD_setting)
# this is the function that the end user will use to solve a particular problem, internally it is using the previously defined types and functions to run Gradient Descent Scheme
# create the intial state
state = GD_state(problem::GD_problem)
## time to run the loop
while (state.n < setting.maxit) & (norm(state.∇f_x, Inf) > setting.tol)
# compute a new state
state = GD_iteration!(problem, state)
# print information if verbose = true
if setting.verbose == true
if mod(state.n, setting.freq) == 0
@info "iteration = $(state.n) | obj val = $(problem.f(state.x)) | gradient norm = $(norm(state.∇f_x, Inf))"
end
end
end
# print information regarding the final state
@info "final iteration = $(state.n) | final obj val = $(problem.f(state.x)) | final gradient norm = $(norm(state.∇f_x, Inf))"
return state
end
```

`GD_solver (generic function with 1 method)`

**Usage of GD_solver.** For the previously created

`problem`

and `setting`

, we run our `GD_solver`

function as follows.`# The following function will run the entire loop over the struct GradientDescent`

`final_state_GD = GD_solver(problem, setting)`

```
┌ Info: iteration = 100 |
│ obj val = 0.10616975436389622 |
│ gradient norm = 0.02964822736954073
└ @ Main In[9]:16
┌ Info: iteration = 200 |
│ obj val = 0.10526904472309405 |
│ gradient norm = 0.019909231214547213
└ @ Main In[9]:16
┌ Info: iteration = 300 |
│ obj val = 0.10499388213419013 |
│ gradient norm = 0.01578344670011611
└ @ Main In[9]:16
┌ Info: iteration = 400 |
│ obj val = 0.1048632843417343 |
│ gradient norm = 0.013388371065530952
└ @ Main In[9]:16
┌ Info: iteration = 500 |
│ obj val = 0.10478781849295837 |
│ gradient norm = 0.011784780173541287
└ @ Main In[9]:16
┌ Info: iteration = 600 |
│ obj val = 0.10473897267694926 |
│ gradient norm = 0.010618620536011203
└ @ Main In[9]:16
┌ Info: final iteration = 667 |
│ final obj val = 0.10471495099365061 |
│ final gradient norm = 0.009995364642054971
└ @ Main In[9]:25
GD_state([-0.747429923239562, 0.7857409304268766, -1.9752613372498569, -0.8109189160513681, 0.6537705422543739], [-0.002114026011632561, -0.008995100759876307, 0.009995364642054971, 0.0024806200462658134, -0.004665182937515999], 0.0014992503748125937, 667)
```

```
println("objective value found by our gradient descent $(f(final_state_GD.x))")
println("real objective value $(f(pinv(A)*b)) ")
```

```
objective value found by our gradient descent 0.10471495099365061
real objective value 0.10452813292628083
```

So, we do decent in terms of finding a good solution!