Working with dataframes in Julia

Shuvomoy Das Gupta

May 5, 2020

In this blog, we will discuss how to work with dataframes using the DataFrames package in Julia. As an illustrative example, we will work with the MovieLens dataset. This is an introductory blog, and for learning how to use the DataFrames package in greater details, a great set of tutorials is available with jupyter notebooks at this link.

Table of contents

Putting the data into dataframes
Converting a dataframe into wide format
Working with missing values
Saving and loading dataframe
Converting a dataframe into a matrix

Jupyter notebook for this blog. The jupyter notebook for this blog can be downloaded from this link and the data files in zip format are available here. Please unzip them in the folder, where your ipynb or julia file is.

What is the MovieLens dataset? The MovieLens dataset is one of the most common datasets people use for testing recommendation system algorithm. The version I will work with contains 1,000,209 ratings for approximately 3,900 movies; these recommendations were made by 6,040 MovieLens users.

Let us start with the necessary packages.

using CSV, DataFrames

First, let us check if the files are loaded correctly. The files are available here: Provide the link.

# First, let us check if the files are loaded correctly
cd("C:\\Users\\shuvo\\Desktop\\MovieLensDataSet_Experiment")
isfile.(["users.csv" "movies.csv" "ratings.csv"])

1×3 BitArray{2}: 1 1 1

Putting the data into dataframes

Let us put the data into dataframes.

# put the data in dataframes
df_users = CSV.read("users.csv");
df_movies = CSV.read("movies.csv");
df_ratings = CSV.read("ratings.csv");

The df_users dataframe. Let us briefly explore the contents of the df_users dataframe.

size(df_users)

(6040, 8)

So, df_users has 6040 rows and 8 columns. It is too large, but let us peek into the first and last few of the rows rows.

first(df_users,3)

	Column1	user_id	gender	age	occupation	zipcode	age_desc	occ_desc
	Int64	Int64	String	Int64	Int64	String	String	String 3 rows × 8 columns
1	0	1	F	1	10	48067	Under 18	K-12 student
2	1	2	M	56	16	70072	56+	self-employed
3	2	3	M	25	15	55117	25-34	scientist

last(df_users, 4)

	Column1	user_id	gender	age	occupation	zipcode	age_desc	occ_desc
	Int64	Int64	String	Int64	Int64	String	String	String 4 rows × 8 columns
1	6036	6037	F	45	1	76006	45-49	academic/educator
2	6037	6038	F	56	1	14706	56+	academic/educator
3	6038	6039	F	45	0	01060	45-49	other or not specified
4	6039	6040	M	25	6	11106	25-34	doctor/health care

Now let us take a look at the basic summary statistics of data in df_users.

# Let us take a look at a quick summary about df_users for a few users
describe(df_users, cols = 1:3)
# if we were interested about all the columns, then we could run the following command:
# describe(df_users)

	variable	mean	min	median	max	nunique	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Union…	Nothing	DataType 3 rows × 8 columns
1	Column1	3019.5	0	3019.5	6039			Int64
2	user_id	3020.5	1	3020.5	6040			Int64
3	gender		F		M	2		String

The get the name of all the columns of df_users, we can use the function names.

names(df_users)

8-element Array{Symbol,1}: :Column1 :user_id :gender :age :occupation :zipcode :age_desc :occ_desc

If we want to know the element types of each column, we can run the following.

eltype.(eachcol(df_users))

8-element Array{DataType,1}: Int64 Int64 String Int64 Int64 String String String

The df_movies dataframe. Let us repeat the previous commands for the df_movies dataframe.

size(df_movies);

first(df_movies,5);

last(df_movies, 3);

describe(df_movies, cols = 1:3);

names(df_movies)

4-element Array{Symbol,1}: :Column1 :movie_id :title :genres

eltype.(eachcol(df_movies));

The df_ratings dataframe. Let us see what is going on with the dataframe df_ratings.

size(df_ratings); # quite large with 1000209 columns and 7 rows

first(df_ratings,3)

	Column1	user_id	movie_id	rating	timestamp	user_emb_id	movie_emb_id
	Int64	Int64	Int64	Int64	Int64	Int64	Int64 3 rows × 7 columns
1	0	1	1193	5	978300760	0	1192
2	1	1	661	3	978302109	0	660
3	2	1	914	3	978301968	0	913

last(df_ratings,3);

describe(df_ratings, cols = 1:3);

names(df_ratings);

eltype.(eachcol(df_ratings));

## Let us create the rating matrix
# first let us create the number of unique users and matrices
n_users = length(unique(df_ratings.user_id))
n_movies = length(unique(df_ratings.movie_id))
println("Number of users = $(n_users) | Number of movies = $(n_movies)")

Number of users = 6040 | Number of movies = 3706

Converting a dataframe into wide format

Now we construct a wide format dataframe for the df_ratings dataframe using the unstack function. Our goal is creating a dataframe, which will have the users as the rows, the movies as the columns, and the rating as the value. The unstack function requires specifying which columns would act as id variable, column variable name, and column values.

df_ratings_unstacked = unstack(df_ratings, :user_id, :movie_id, :rating);

size(df_ratings_unstacked)

(6040, 3707)

first(df_ratings_unstacked, 6)

	user_id	1	2	3	4	5	6	7	8
	Int64	Int64⍰	Int64⍰	Int64⍰	Int64⍰	Int64⍰	Int64⍰	Int64⍰	Int64⍰ 6 rows × 3,707 columns (omitted printing of 3698 columns)
1	1	5	missing	missing	missing	missing	missing	missing	missing
2	2	missing	missing	missing	missing	missing	missing	missing	missing
3	3	missing	missing	missing	missing	missing	missing	missing	missing
4	4	missing	missing	missing	missing	missing	missing	missing	missing
5	5	missing	missing	missing	missing	missing	2	missing	missing
6	6	4	missing	missing	missing	missing	missing	missing	missing

Working with missing values

We have lot of missing values, which we can see the let us replace the missing values with NaN.

# replace the missing values in the data frame
column_names_ru = names(df_ratings_unstacked);
for i in 1:length(column_names_ru)
   df_ratings_unstacked[!, column_names_ru[i]] = replace(df_ratings_unstacked[!, column_names_ru[i]], missing => NaN) # If we want to replace the missing values with any other number, then we use: missing => NaN
end

Let us see if the values have indeed changed.

first(df_ratings_unstacked, 6)

	user_id	1	2	3	4	5	6	7	8
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64 6 rows × 3,707 columns (omitted printing of 3698 columns)
1	1.0	5.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	3.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	5.0	NaN	NaN	NaN	NaN	NaN	2.0	NaN	NaN
6	6.0	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

What if the missing values are represented as some other symbol in the original dataframe? In the previous example, the missing values were convenienetly recognized by Julia as missing type. Now, in some data files the missing values are represented by some other symbols, such as "?". In such case the column that contains "?" may be diagnosed as an array of strings by Julia. In that case, we could run the following command to replace the "?"s with missing using the following command.

type_of_columns = eltype.(eachcol(df)) # say the name of the dataframe is df

m, n = size(df)

for i in 3:n # let the important columns containing missing values start from column number 3

  if type_of_columns[i] == String # due to presence of "?" in the column, it may get diagnosed by an array of strings by Julia
  
          df[!, i]  = map(x->
                         begin
                         val = tryparse(Float64, x)
                         ifelse(typeof(val) == Nothing, missing, val)                         
                         end, df[!,i])
                         
  end
end

Saving and loading dataframe

Let us see how to load and safe a dataframe. We can save df_ratings_unstacked using the write function provided by the CSV package.

CSV.write("df_ratings_unstacked.csv", df_ratings_unstacked)

"df_ratings_unstacked.csv"

To load back the dataframe that we just saved into the csv file, we can use read command provided by CSV package. The command use_mmap = false is used, so that the file can be deleted in the same session; this is required only in Windows.

df_loaded = CSV.read("df_ratings_unstacked.csv", use_mmap=false);
first(df_loaded, 5)

	user_id	1	2	3	4	5	6	7	8
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64 5 rows × 3,707 columns (omitted printing of 3698 columns)
1	1.0	5.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	3.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	5.0	NaN	NaN	NaN	NaN	NaN	2.0	NaN	NaN

Converting a dataframe into a matrix

Now let us convert the df_ratings_unstacked dataframe into a matrix.

ratings_matrix = df_ratings_unstacked[:,  2:length(column_names_ru)];
rating_matrix = convert(Matrix, ratings_matrix)

6040×3706 Array{Float64,2}:
       5.0  NaN    NaN    NaN    NaN    …  NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN    NaN    NaN    NaN    …  NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN    NaN      3.0  NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       5.0  NaN    NaN    NaN    NaN       NaN  NaN    3.0  NaN  NaN  NaN  NaN
       5.0    5.0  NaN    NaN    NaN       NaN  NaN    4.0  NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN    …  NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN      3.0  NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       ⋮                                ⋱         ⋮                          ⋮
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN      4.0  NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN    …  NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       4.0  NaN      1.0    2.0    1.0     NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN      2.0  NaN    …  NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
     NaN    NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN
       3.0  NaN    NaN    NaN    NaN       NaN  NaN  NaN    NaN  NaN  NaN  NaN