Working with dataframes in Julia
Shuvomoy Das Gupta
May 5, 2020
In this blog, we will discuss how to work with dataframes using the DataFrames package in Julia. As an illustrative example, we will work with the MovieLens dataset. This is an introductory blog, and for learning how to use the DataFrames package in greater details, a great set of tutorials is available with jupyter notebooks at this link.
Table of contents
Jupyter notebook for this blog. The jupyter notebook for this blog can be downloaded from this link and the data files in zip format are available here. Please unzip them in the folder, where your ipynb or julia file is.
What is the MovieLens dataset? The MovieLens dataset is one of the most common datasets people use for testing recommendation system algorithm. The version I will work with contains 1,000,209 ratings for approximately 3,900 movies; these recommendations were made by 6,040 MovieLens users.
Let us start with the necessary packages.
using CSV, DataFrames
First, let us check if the files are loaded correctly. The files are available here: Provide the link.
# First, let us check if the files are loaded correctly
cd("C:\\Users\\shuvo\\Desktop\\MovieLensDataSet_Experiment")
isfile.(["users.csv" "movies.csv" "ratings.csv"])
1×3 BitArray{2}: 1 1 1
Putting the data into dataframes
Let us put the data into dataframes.
# put the data in dataframes
df_users = CSV.read("users.csv");
df_movies = CSV.read("movies.csv");
df_ratings = CSV.read("ratings.csv");
The df_users dataframe. Let us briefly explore the contents of the df_users dataframe.
size(df_users)
(6040, 8)
So, df_users has 6040 rows and 8 columns. It is too large, but let us peek into the first and last few of the rows rows.
first(df_users,3)
| Column1 | user_id | gender | age | occupation | zipcode | age_desc | occ_desc | |
|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | String | Int64 | Int64 | String | String | String 3 rows × 8 columns | |
| 1 | 0 | 1 | F | 1 | 10 | 48067 | Under 18 | K-12 student |
| 2 | 1 | 2 | M | 56 | 16 | 70072 | 56+ | self-employed |
| 3 | 2 | 3 | M | 25 | 15 | 55117 | 25-34 | scientist |
last(df_users, 4)
| Column1 | user_id | gender | age | occupation | zipcode | age_desc | occ_desc | |
|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | String | Int64 | Int64 | String | String | String 4 rows × 8 columns | |
| 1 | 6036 | 6037 | F | 45 | 1 | 76006 | 45-49 | academic/educator |
| 2 | 6037 | 6038 | F | 56 | 1 | 14706 | 56+ | academic/educator |
| 3 | 6038 | 6039 | F | 45 | 0 | 01060 | 45-49 | other or not specified |
| 4 | 6039 | 6040 | M | 25 | 6 | 11106 | 25-34 | doctor/health care |
Now let us take a look at the basic summary statistics of data in df_users.
# Let us take a look at a quick summary about df_users for a few users
describe(df_users, cols = 1:3)
# if we were interested about all the columns, then we could run the following command:
# describe(df_users)
| variable | mean | min | median | max | nunique | nmissing | eltype | |
|---|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType 3 rows × 8 columns | |
| 1 | Column1 | 3019.5 | 0 | 3019.5 | 6039 | Int64 | ||
| 2 | user_id | 3020.5 | 1 | 3020.5 | 6040 | Int64 | ||
| 3 | gender | F | M | 2 | String |
The get the name of all the columns of df_users, we can use the function names.
names(df_users)
8-element Array{Symbol,1}: :Column1 :user_id :gender :age :occupation :zipcode :age_desc :occ_desc
If we want to know the element types of each column, we can run the following.
eltype.(eachcol(df_users))
8-element Array{DataType,1}: Int64 Int64 String Int64 Int64 String String String
The df_movies dataframe. Let us repeat the previous commands for the df_movies dataframe.
size(df_movies);
first(df_movies,5);
last(df_movies, 3);
describe(df_movies, cols = 1:3);
names(df_movies)
4-element Array{Symbol,1}: :Column1 :movie_id :title :genres
eltype.(eachcol(df_movies));
The df_ratings dataframe. Let us see what is going on with the dataframe df_ratings.
size(df_ratings); # quite large with 1000209 columns and 7 rows
first(df_ratings,3)
| Column1 | user_id | movie_id | rating | timestamp | user_emb_id | movie_emb_id | |
|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 3 rows × 7 columns | |
| 1 | 0 | 1 | 1193 | 5 | 978300760 | 0 | 1192 |
| 2 | 1 | 1 | 661 | 3 | 978302109 | 0 | 660 |
| 3 | 2 | 1 | 914 | 3 | 978301968 | 0 | 913 |
last(df_ratings,3);
describe(df_ratings, cols = 1:3);
names(df_ratings);
eltype.(eachcol(df_ratings));
## Let us create the rating matrix
# first let us create the number of unique users and matrices
n_users = length(unique(df_ratings.user_id))
n_movies = length(unique(df_ratings.movie_id))
println("Number of users = $(n_users) | Number of movies = $(n_movies)")
Number of users = 6040 | Number of movies = 3706
Converting a dataframe into wide format
Now we construct a wide format dataframe for the df_ratings dataframe using the unstack function. Our goal is creating a dataframe, which will have the users as the rows, the movies as the columns, and the rating as the value. The unstack function requires specifying which columns would act as id variable, column variable name, and column values.
df_ratings_unstacked = unstack(df_ratings, :user_id, :movie_id, :rating);
size(df_ratings_unstacked)
(6040, 3707)
first(df_ratings_unstacked, 6)
| user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ 6 rows × 3,707 columns (omitted printing of 3698 columns) | |
| 1 | 1 | 5 | missing | missing | missing | missing | missing | missing | missing |
| 2 | 2 | missing | missing | missing | missing | missing | missing | missing | missing |
| 3 | 3 | missing | missing | missing | missing | missing | missing | missing | missing |
| 4 | 4 | missing | missing | missing | missing | missing | missing | missing | missing |
| 5 | 5 | missing | missing | missing | missing | missing | 2 | missing | missing |
| 6 | 6 | 4 | missing | missing | missing | missing | missing | missing | missing |
Working with missing values
We have lot of missing values, which we can see the let us replace the missing values with NaN.
# replace the missing values in the data frame
column_names_ru = names(df_ratings_unstacked);
for i in 1:length(column_names_ru)
df_ratings_unstacked[!, column_names_ru[i]] = replace(df_ratings_unstacked[!, column_names_ru[i]], missing => NaN) # If we want to replace the missing values with any other number, then we use: missing => NaN
end
Let us see if the values have indeed changed.
first(df_ratings_unstacked, 6)
| user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 6 rows × 3,707 columns (omitted printing of 3698 columns) | |
| 1 | 1.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
| 6 | 6.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
What if the missing values are represented as some other symbol in the original dataframe? In the previous example, the missing values were convenienetly recognized by Julia as missing type. Now, in some data files the missing values are represented by some other symbols, such as "?". In such case the column that contains "?" may be diagnosed as an array of strings by Julia. In that case, we could run the following command to replace the "?"s with missing using the following command.
type_of_columns = eltype.(eachcol(df)) # say the name of the dataframe is df
m, n = size(df)
for i in 3:n # let the important columns containing missing values start from column number 3
if type_of_columns[i] == String # due to presence of "?" in the column, it may get diagnosed by an array of strings by Julia
df[!, i] = map(x->
begin
val = tryparse(Float64, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df[!,i])
end
end
Saving and loading dataframe
Let us see how to load and safe a dataframe. We can save df_ratings_unstacked using the write function provided by the CSV package.
CSV.write("df_ratings_unstacked.csv", df_ratings_unstacked)
"df_ratings_unstacked.csv"
To load back the dataframe that we just saved into the csv file, we can use read command provided by CSV package. The command use_mmap = false is used, so that the file can be deleted in the same session; this is required only in Windows.
df_loaded = CSV.read("df_ratings_unstacked.csv", use_mmap=false);
first(df_loaded, 5)
| user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 5 rows × 3,707 columns (omitted printing of 3698 columns) | |
| 1 | 1.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
Converting a dataframe into a matrix
Now let us convert the df_ratings_unstacked dataframe into a matrix.
ratings_matrix = df_ratings_unstacked[:, 2:length(column_names_ru)];
rating_matrix = convert(Matrix, ratings_matrix)
6040×3706 Array{Float64,2}:
5.0 NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4.0 NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4.0 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
5.0 NaN NaN NaN NaN NaN NaN 3.0 NaN NaN NaN NaN
5.0 5.0 NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN
NaN NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
⋮ ⋱ ⋮ ⋮
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4.0 NaN 1.0 2.0 1.0 NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN 2.0 NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN