Working with dataframes in Julia

Shuvomoy Das Gupta

May 5, 2020

In this blog, we will discuss how to work with dataframes using the DataFrames package in Julia. As an illustrative example, we will work with the MovieLens dataset. This is an introductory blog, and for learning how to use the DataFrames package in greater details, a great set of tutorials is available with jupyter notebooks at this link.


Table of contents


Jupyter notebook for this blog. The jupyter notebook for this blog can be downloaded from this link and the data files in zip format are available here. Please unzip them in the folder, where your ipynb or julia file is.

What is the MovieLens dataset? The MovieLens dataset is one of the most common datasets people use for testing recommendation system algorithm. The version I will work with contains 1,000,209 ratings for approximately 3,900 movies; these recommendations were made by 6,040 MovieLens users.

Let us start with the necessary packages.

using CSV, DataFrames

First, let us check if the files are loaded correctly. The files are available here: Provide the link.

# First, let us check if the files are loaded correctly
cd("C:\\Users\\shuvo\\Desktop\\MovieLensDataSet_Experiment")
isfile.(["users.csv" "movies.csv" "ratings.csv"])

1×3 BitArray{2}: 1 1 1

Putting the data into dataframes

Let us put the data into dataframes.

# put the data in dataframes
df_users = CSV.read("users.csv");
df_movies = CSV.read("movies.csv");
df_ratings = CSV.read("ratings.csv");

The df_users dataframe. Let us briefly explore the contents of the df_users dataframe.

size(df_users)

(6040, 8)

So, df_users has 6040 rows and 8 columns. It is too large, but let us peek into the first and last few of the rows rows.

first(df_users,3)
Column1user_idgenderageoccupationzipcodeage_descocc_desc
Int64Int64StringInt64Int64StringStringString

3 rows × 8 columns

101F11048067Under 18K-12 student
212M56167007256+self-employed
323M25155511725-34scientist
last(df_users, 4)
Column1user_idgenderageoccupationzipcodeage_descocc_desc
Int64Int64StringInt64Int64StringStringString

4 rows × 8 columns

160366037F4517600645-49academic/educator
260376038F5611470656+academic/educator
360386039F4500106045-49other or not specified
460396040M2561110625-34doctor/health care

Now let us take a look at the basic summary statistics of data in df_users.

# Let us take a look at a quick summary about df_users for a few users
describe(df_users, cols = 1:3)
# if we were interested about all the columns, then we could run the following command:
# describe(df_users)
variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…NothingDataType

3 rows × 8 columns

1Column13019.503019.56039Int64
2user_id3020.513020.56040Int64
3genderFM2String

The get the name of all the columns of df_users, we can use the function names.

names(df_users)

8-element Array{Symbol,1}: :Column1 :user_id :gender :age :occupation :zipcode :age_desc :occ_desc

If we want to know the element types of each column, we can run the following.

eltype.(eachcol(df_users))

8-element Array{DataType,1}: Int64 Int64 String Int64 Int64 String String String

The df_movies dataframe. Let us repeat the previous commands for the df_movies dataframe.

size(df_movies);
first(df_movies,5);
last(df_movies, 3);
describe(df_movies, cols = 1:3);
names(df_movies)

4-element Array{Symbol,1}: :Column1 :movie_id :title :genres

eltype.(eachcol(df_movies));

The df_ratings dataframe. Let us see what is going on with the dataframe df_ratings.

size(df_ratings); # quite large with 1000209 columns and 7 rows
first(df_ratings,3)
Column1user_idmovie_idratingtimestampuser_emb_idmovie_emb_id
Int64Int64Int64Int64Int64Int64Int64

3 rows × 7 columns

1011193597830076001192
21166139783021090660
32191439783019680913
last(df_ratings,3);
describe(df_ratings, cols = 1:3);
names(df_ratings);
eltype.(eachcol(df_ratings));
## Let us create the rating matrix
# first let us create the number of unique users and matrices
n_users = length(unique(df_ratings.user_id))
n_movies = length(unique(df_ratings.movie_id))
println("Number of users = $(n_users) | Number of movies = $(n_movies)")

Number of users = 6040 | Number of movies = 3706

Converting a dataframe into wide format

Now we construct a wide format dataframe for the df_ratings dataframe using the unstack function. Our goal is creating a dataframe, which will have the users as the rows, the movies as the columns, and the rating as the value. The unstack function requires specifying which columns would act as id variable, column variable name, and column values.

df_ratings_unstacked = unstack(df_ratings, :user_id, :movie_id, :rating);
size(df_ratings_unstacked)

(6040, 3707)

first(df_ratings_unstacked, 6)
user_id12345678
Int64Int64⍰Int64⍰Int64⍰Int64⍰Int64⍰Int64⍰Int64⍰Int64⍰

6 rows × 3,707 columns (omitted printing of 3698 columns)

115missingmissingmissingmissingmissingmissingmissing
22missingmissingmissingmissingmissingmissingmissingmissing
33missingmissingmissingmissingmissingmissingmissingmissing
44missingmissingmissingmissingmissingmissingmissingmissing
55missingmissingmissingmissingmissing2missingmissing
664missingmissingmissingmissingmissingmissingmissing

Working with missing values

We have lot of missing values, which we can see the let us replace the missing values with NaN.

# replace the missing values in the data frame
column_names_ru = names(df_ratings_unstacked);
for i in 1:length(column_names_ru)
   df_ratings_unstacked[!, column_names_ru[i]] = replace(df_ratings_unstacked[!, column_names_ru[i]], missing => NaN) # If we want to replace the missing values with any other number, then we use: missing => NaN
end

Let us see if the values have indeed changed.

first(df_ratings_unstacked, 6)
user_id12345678
Float64Float64Float64Float64Float64Float64Float64Float64Float64

6 rows × 3,707 columns (omitted printing of 3698 columns)

11.05.0NaNNaNNaNNaNNaNNaNNaN
22.0NaNNaNNaNNaNNaNNaNNaNNaN
33.0NaNNaNNaNNaNNaNNaNNaNNaN
44.0NaNNaNNaNNaNNaNNaNNaNNaN
55.0NaNNaNNaNNaNNaN2.0NaNNaN
66.04.0NaNNaNNaNNaNNaNNaNNaN

What if the missing values are represented as some other symbol in the original dataframe? In the previous example, the missing values were convenienetly recognized by Julia as missing type. Now, in some data files the missing values are represented by some other symbols, such as "?". In such case the column that contains "?" may be diagnosed as an array of strings by Julia. In that case, we could run the following command to replace the "?"s with missing using the following command.

type_of_columns = eltype.(eachcol(df)) # say the name of the dataframe is df

m, n = size(df)

for i in 3:n # let the important columns containing missing values start from column number 3

  if type_of_columns[i] == String # due to presence of "?" in the column, it may get diagnosed by an array of strings by Julia
  
          df[!, i]  = map(x->
                         begin
                         val = tryparse(Float64, x)
                         ifelse(typeof(val) == Nothing, missing, val)                         
                         end, df[!,i])
                         
  end
end

Saving and loading dataframe

Let us see how to load and safe a dataframe. We can save df_ratings_unstacked using the write function provided by the CSV package.

CSV.write("df_ratings_unstacked.csv", df_ratings_unstacked)

"df_ratings_unstacked.csv"

To load back the dataframe that we just saved into the csv file, we can use read command provided by CSV package. The command use_mmap = false is used, so that the file can be deleted in the same session; this is required only in Windows.

df_loaded = CSV.read("df_ratings_unstacked.csv", use_mmap=false);
first(df_loaded, 5)
user_id12345678
Float64Float64Float64Float64Float64Float64Float64Float64Float64

5 rows × 3,707 columns (omitted printing of 3698 columns)

11.05.0NaNNaNNaNNaNNaNNaNNaN
22.0NaNNaNNaNNaNNaNNaNNaNNaN
33.0NaNNaNNaNNaNNaNNaNNaNNaN
44.0NaNNaNNaNNaNNaNNaNNaNNaN
55.0NaNNaNNaNNaNNaN2.0NaNNaN

Converting a dataframe into a matrix

Now let us convert the df_ratings_unstacked dataframe into a matrix.

ratings_matrix = df_ratings_unstacked[:,  2:length(column_names_ru)];
rating_matrix = convert(Matrix, ratings_matrix)
6040×3706 Array{Float64,2}:
       5.0  NaN    NaN    NaN    NaNNaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN    NaN    NaN    NaNNaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN    NaN      3.0  NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       5.0  NaN    NaN    NaN    NaN       NaN  NaN    3.0  NaN  NaN  NaN  NaN
       5.0    5.0  NaN    NaN    NaN       NaN  NaN    4.0  NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaNNaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN      3.0  NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       ⋮                                ⋱         ⋮                          ⋮
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN      4.0  NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaNNaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN      1.0    2.0    1.0     NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN      2.0  NaNNaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       3.0  NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN