Running parallel Julia code on MIT Supercloud

Shuvomoy Das Gupta

December 1, 2020

In this blog, we will discuss how to run parallel Julia code on MIT Supercloud. For the basics of running Julia code on MIT Supercloud, please see my previous blog post here.


As an example of running parallel code, we will consider solving the sparse regression problem.

Sparse regression problem.

The sparse regression problem (also known as the regressor selection problem) is concerned with approximating a vector $b\in\mathbf{R}^{m}$ with a linear combination of at most $k$ columns of a matrix $A\in\mathbf{R}^{m\times d}$ with bounded coefficients. The problem can be written as the following optimization problem

\begin{equation}
\begin{array}{ll}
\textrm{minimize} & \|Ax-b\|_{2}^{2}+\frac{\beta}{2}\|x\|^{2}\\
\textrm{subject to} & \mathbf{card}(x)\leq k\\
 & \|x\|_{\infty}\leq M,
\end{array}
\end{equation}

where $x\in\mathbf{R}^{d}$ is the decision variable, and $A\in\mathbf{R}^{m\times d}$, $b\in\mathbf{R}^{m}$, and $M>0$ are problem data. The term $\mathbf{card}(x)$ denotes the number of nonzero components of the vector $x$.

Setup. We have a nonconvex optimization problem (the sparse regression problem) and an algorithm, NExOS, that computes a locally optimal solution of this problem under certain regularity conditions. Depending on the initialization, NExOS will return different locally optimal solutions. So, naturally, we can initialize the algorithm at many different random points, observe the solution NExOS returns for each initialization, and then pick the locally optimal solution with the smallest objective value. Because these runs are independent of each other, the task is well suited to the parallelization tools provided in Julia; in this blog, we will use pmap.
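Before getting to the full example, here is a minimal sketch of the idea using only the Distributed standard library. The function toy_local_solve below is a toy stand-in for a local solver such as NExOS (it is not part of NExOS): pmap distributes the random initial points over the workers, and we keep the result with the smallest objective value.

using Distributed

addprocs(4) # add 4 local worker processes; on Supercloud this number comes from SLURM_NTASKS

@everywhere begin
    # toy stand-in for a local solver: returns the initial point together with its objective value
    toy_objective(z) = sum(abs2, z .- 1)
    toy_local_solve(z0) = (solution = z0, objective = toy_objective(z0))
end

initial_points = [randn(5) for _ in 1:20] # 20 random initial points

results = pmap(toy_local_solve, initial_points) # each call is dispatched to a free worker

best = argmin([r.objective for r in results]) # index of the smallest objective value

println("best objective value found: ", results[best].objective)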

Julia Code. The code for the Julia file is given below; please save it in a text file and name it pmap_julia.jl.

using ClusterManagers, Distributed

# Add in the cores allocated by the scheduler as workers

addprocs(SlurmManager(parse(Int,ENV["SLURM_NTASKS"])-1))

print("Added workers: ")

println(nworkers())
    
using Random, NExOS, ProximalOperators, Distributions # load the packages on the master process

@everywhere using Random, NExOS, ProximalOperators, Distributions # load the packages on every worker process

# DATA GENERATION: The following code will be run only on one core
# ----------------------------------------------------------------

# create the array of random initial points

n = 20

M = 1

function random_z(n, M)
    uniDst = Uniform(-M, M)
    z = rand(uniDst, n)
    return z
end

## create an array of random initial points z

function create_array_z_randomized(n, N_random_points, M)
    # this array has its first column all zeros and the rest are uniformly distributed over [-M,M]
    array_z_randomized = zeros(n, N_random_points)
    for i in 2:N_random_points
        array_z_randomized[:,i] = random_z(n, M)
    end
    return array_z_randomized
end

## necessary info to generate the data

bigM = 1e99 # the big-M constant; a global variable that we are not going to change

N_random_points = 100

array_z_randomized = create_array_z_randomized(n, N_random_points, M)

array_z_randomized_formatted = [array_z_randomized[:,i] for i in 1:N_random_points]

# DISTRIBUTED CODE: The following part of the code will be run in parallel on different cores
# -----------------------------------------------------------------------------------------

# the distributed part: the following block is loaded on every process we created

@everywhere begin

    ## create data (seed the RNG so that every process generates the same A and b)
    Random.seed!(1234)
    m = 10
    n = 20
    A = randn(m,n)
    b = randn(m)
    M = 1
    k = convert(Int64, round(m/3))
    beta = 1e-10 # regularization parameter; written as 1e-10 because 10^-10 raises a DomainError for integer arguments in Julia


    # Create the set, the function, and the solver settings, then wrap the solver call in a function that we will use later for the parallel implementation.

    C = NExOS.SparseSet(M, k) # Create the set
    f = ProximalOperators.LeastSquares(A, b, iterative = true) # Create the function
    settings = NExOS.Settings(μ_max = 2, μ_min = 1e-8, μ_mult_fact = 0.5, verbose = false, freq = 250, γ_updt_rule = :adaptive, β = beta) # settings

    #function sparse_reg_NExOS(z0, C, f, settings) # z0 is the initial point
    function sparse_reg_NExOS(z0)
        
        problem = NExOS.Problem(f, C, settings.β, z0) # problem instance
        state_final = NExOS.solve!(problem, settings)
        return state_final

    end

end # end the begin block

# serial implementation

output_map =  map(sparse_reg_NExOS, array_z_randomized_formatted)

# parallel implementation

output_pmap = pmap(sparse_reg_NExOS, array_z_randomized_formatted)

# BENCHMARKING:
# ------------

# benchmarking the results; note that we already ran map and pmap once above, so that
# compilation time is not counted in the benchmarks below

using BenchmarkTools 

BenchmarkTools.DEFAULT_PARAMETERS.seconds = 25 # The number of seconds budgeted for the benchmarking process. The trial will terminate if this time is exceeded (regardless of samples), but at least one sample will always be taken.

# benchmark serial implementation
b1 = @benchmark map(sparse_reg_NExOS, array_z_randomized_formatted)

println("benchmark for serial code")
println("*************************")
io = IOBuffer()
show(io, "text/plain", b1)
s = String(take!(io))
println(s)

# benchmark parallel implementation
b2 = @benchmark pmap(sparse_reg_NExOS, array_z_randomized_formatted)

println("benchmark for parallel code")
println("***************************")
io = IOBuffer()
show(io, "text/plain", b2)
s = String(take!(io))
println(s)
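The setup section mentioned picking the locally optimal solution with the smallest objective value; the script above does not show that step explicitly, so here is a minimal sketch. It assumes that each element of output_pmap exposes the final iterate in a field named x (please check the NExOS documentation for the exact field name of the returned state) and evaluates the original objective directly.

# OPTIONAL: pick the best locally optimal solution
# (assumes the state returned by NExOS.solve! has a field x holding the final iterate)

using LinearAlgebra

objective_value(x) = norm(A*x - b)^2 + (beta/2)*norm(x)^2 # objective of the sparse regression problem

objective_values = [objective_value(state.x) for state in output_pmap]

best_index = argmin(objective_values) # index of the smallest objective value

println("smallest objective value found: ", objective_values[best_index])

best_solution = output_pmap[best_index].x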

Shell script to submit the job

Now we are going to create the shell script that will be used to submit the job. The code for the shell script is below; please save it in a text file and name it run_pmap_julia.sh. In the script, #SBATCH -o pmap_julia.log-%j sets the name of the file where the output is written (%j is replaced by the job ID), and #SBATCH -n 14 sets the number of cores (CPUs) allocated to the job. With 14 tasks, the Julia script above adds 14 - 1 = 13 worker processes.

#!/bin/bash

# Slurm sbatch options
#SBATCH -o pmap_julia.log-%j
#SBATCH -n 14

# Initialize the module command first
source /etc/profile

# Load Julia Module
module load julia/1.5.2
 
# Load Gurobi Module
# module load gurobi/gurobi-811            
 
# Call your script as you would from the command line
julia pmap_julia.jl

Submitting the job

Now log in to MIT Supercloud, copy the files created above to your working directory, and run the following command. Make sure that the pwd command returns that working directory; otherwise, run cd directory_that_contains_the_code to change to it.

LLsub run_pmap_julia.sh
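For concreteness, a typical session might look like the sketch below; the login hostname and the directory name are placeholders, so substitute your own values.

# on your local machine: copy the files to Supercloud (hostname and path are placeholders)
scp pmap_julia.jl run_pmap_julia.sh USERNAME@supercloud-login-node:~/parallel_julia_demo/

# on the Supercloud login node: move to the working directory and submit the job
cd ~/parallel_julia_demo
LLsub run_pmap_julia.sh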

That's it! Once the computation is done, we see from the output log file below that parallelization has decreased the computation time significantly: the median run time drops from about 14.09 seconds for the serial map to about 1.37 seconds for pmap, roughly a 10x speedup on 13 worker processes.

benchmark for serial code
*************************
BenchmarkTools.Trial: 
  memory estimate:  26.20 GiB
  allocs estimate:  210157358
  --------------
  minimum time:     14.033 s (7.09% GC)
  median time:      14.093 s (7.41% GC)
  mean time:        14.093 s (7.41% GC)
  maximum time:     14.154 s (7.74% GC)
  --------------
  samples:          2
  evals/sample:     1
benchmark for parallel code
***************************
BenchmarkTools.Trial: 
  memory estimate:  449.78 KiB
  allocs estimate:  10213
  --------------
  minimum time:     1.337 s (0.00% GC)
  median time:      1.369 s (0.00% GC)
  mean time:        1.377 s (0.00% GC)
  maximum time:     1.436 s (0.00% GC)
  --------------
  samples:          19
  evals/sample:     1