Table of contents
Jupyter notebook for this blog. The jupyter notebook for this blog can be downloaded from this link and the data files in zip format are available here. Please unzip them in the folder, where your ipynb or julia file is.
What is the MovieLens dataset? The MovieLens
dataset is one of the most common datasets people use for testing recommendation system algorithm. The version I will work with contains 1,000,209 ratings for approximately 3,900 movies; these recommendations were made by 6,040 MovieLens users.
Let us start with the necessary packages.
using CSV, DataFrames
First, let us check if the files are loaded correctly. The files are available here: Provide the link.
# First, let us check if the files are loaded correctly
cd("C:\\Users\\shuvo\\Desktop\\MovieLensDataSet_Experiment")
isfile.(["users.csv" "movies.csv" "ratings.csv"])
1×3 BitArray{2}: 1 1 1
Let us put the data into dataframes.
# put the data in dataframes
df_users = CSV.read("users.csv");
df_movies = CSV.read("movies.csv");
df_ratings = CSV.read("ratings.csv");
The df_users
dataframe. Let us briefly explore the contents of the df_users
dataframe.
size(df_users)
(6040, 8)
So, df_users
has 6040 rows and 8 columns. It is too large, but let us peek into the first and last few of the rows rows.
first(df_users,3)
Column1 | user_id | gender | age | occupation | zipcode | age_desc | occ_desc | |
---|---|---|---|---|---|---|---|---|
Int64 | Int64 | String | Int64 | Int64 | String | String | String 3 rows × 8 columns | |
1 | 0 | 1 | F | 1 | 10 | 48067 | Under 18 | K-12 student |
2 | 1 | 2 | M | 56 | 16 | 70072 | 56+ | self-employed |
3 | 2 | 3 | M | 25 | 15 | 55117 | 25-34 | scientist |
last(df_users, 4)
Column1 | user_id | gender | age | occupation | zipcode | age_desc | occ_desc | |
---|---|---|---|---|---|---|---|---|
Int64 | Int64 | String | Int64 | Int64 | String | String | String 4 rows × 8 columns | |
1 | 6036 | 6037 | F | 45 | 1 | 76006 | 45-49 | academic/educator |
2 | 6037 | 6038 | F | 56 | 1 | 14706 | 56+ | academic/educator |
3 | 6038 | 6039 | F | 45 | 0 | 01060 | 45-49 | other or not specified |
4 | 6039 | 6040 | M | 25 | 6 | 11106 | 25-34 | doctor/health care |
Now let us take a look at the basic summary statistics of data in df_users.
# Let us take a look at a quick summary about df_users for a few users
describe(df_users, cols = 1:3)
# if we were interested about all the columns, then we could run the following command:
# describe(df_users)
variable | mean | min | median | max | nunique | nmissing | eltype | |
---|---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType 3 rows × 8 columns | |
1 | Column1 | 3019.5 | 0 | 3019.5 | 6039 | Int64 | ||
2 | user_id | 3020.5 | 1 | 3020.5 | 6040 | Int64 | ||
3 | gender | F | M | 2 | String |
The get the name of all the columns of df_users
, we can use the function names
.
names(df_users)
8-element Array{Symbol,1}: :Column1 :user_id :gender :age :occupation :zipcode :age_desc :occ_desc
If we want to know the element types of each column, we can run the following.
eltype.(eachcol(df_users))
8-element Array{DataType,1}: Int64 Int64 String Int64 Int64 String String String
The df_movies
dataframe. Let us repeat the previous commands for the df_movies
dataframe.
size(df_movies);
first(df_movies,5);
last(df_movies, 3);
describe(df_movies, cols = 1:3);
names(df_movies)
4-element Array{Symbol,1}: :Column1 :movie_id :title :genres
eltype.(eachcol(df_movies));
The df_ratings
dataframe. Let us see what is going on with the dataframe df_ratings
.
size(df_ratings); # quite large with 1000209 columns and 7 rows
first(df_ratings,3)
Column1 | user_id | movie_id | rating | timestamp | user_emb_id | movie_emb_id | |
---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 3 rows × 7 columns | |
1 | 0 | 1 | 1193 | 5 | 978300760 | 0 | 1192 |
2 | 1 | 1 | 661 | 3 | 978302109 | 0 | 660 |
3 | 2 | 1 | 914 | 3 | 978301968 | 0 | 913 |
last(df_ratings,3);
describe(df_ratings, cols = 1:3);
names(df_ratings);
eltype.(eachcol(df_ratings));
## Let us create the rating matrix
# first let us create the number of unique users and matrices
n_users = length(unique(df_ratings.user_id))
n_movies = length(unique(df_ratings.movie_id))
println("Number of users = $(n_users) | Number of movies = $(n_movies)")
Number of users = 6040 | Number of movies = 3706
Now we construct a wide format dataframe for the df_ratings
dataframe using the unstack
function. Our goal is creating a dataframe, which will have the users as the rows, the movies as the columns, and the rating as the value. The unstack
function requires specifying which columns would act as id variable, column variable name, and column values.
df_ratings_unstacked = unstack(df_ratings, :user_id, :movie_id, :rating);
size(df_ratings_unstacked)
(6040, 3707)
first(df_ratings_unstacked, 6)
user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
Int64 | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ 6 rows × 3,707 columns (omitted printing of 3698 columns) | |
1 | 1 | 5 | missing | missing | missing | missing | missing | missing | missing |
2 | 2 | missing | missing | missing | missing | missing | missing | missing | missing |
3 | 3 | missing | missing | missing | missing | missing | missing | missing | missing |
4 | 4 | missing | missing | missing | missing | missing | missing | missing | missing |
5 | 5 | missing | missing | missing | missing | missing | 2 | missing | missing |
6 | 6 | 4 | missing | missing | missing | missing | missing | missing | missing |
We have lot of missing values, which we can see the let us replace the missing values with NaN
.
# replace the missing values in the data frame
column_names_ru = names(df_ratings_unstacked);
for i in 1:length(column_names_ru)
df_ratings_unstacked[!, column_names_ru[i]] = replace(df_ratings_unstacked[!, column_names_ru[i]], missing => NaN) # If we want to replace the missing values with any other number, then we use: missing => NaN
end
Let us see if the values have indeed changed.
first(df_ratings_unstacked, 6)
user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 6 rows × 3,707 columns (omitted printing of 3698 columns) | |
1 | 1.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 5.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
6 | 6.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
What if the missing values are represented as some other symbol in the original dataframe? In the previous example, the missing values were convenienetly recognized by Julia as missing
type. Now, in some data files the missing values are represented by some other symbols, such as "?". In such case the column that contains "?" may be diagnosed as an array of strings by Julia. In that case, we could run the following command to replace the "?"s with missing
using the following command.
type_of_columns = eltype.(eachcol(df)) # say the name of the dataframe is df
m, n = size(df)
for i in 3:n # let the important columns containing missing values start from column number 3
if type_of_columns[i] == String # due to presence of "?" in the column, it may get diagnosed by an array of strings by Julia
df[!, i] = map(x->
begin
val = tryparse(Float64, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df[!,i])
end
end
Let us see how to load and safe a dataframe. We can save df_ratings_unstacked
using the write
function provided by the CSV
package.
CSV.write("df_ratings_unstacked.csv", df_ratings_unstacked)
"df_ratings_unstacked.csv"
To load back the dataframe that we just saved into the csv file, we can use read
command provided by CSV
package. The command use_mmap = false
is used, so that the file can be deleted in the same session; this is required only in Windows.
df_loaded = CSV.read("df_ratings_unstacked.csv", use_mmap=false);
first(df_loaded, 5)
user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 5 rows × 3,707 columns (omitted printing of 3698 columns) | |
1 | 1.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 5.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
Now let us convert the df_ratings_unstacked
dataframe into a matrix.
ratings_matrix = df_ratings_unstacked[:, 2:length(column_names_ru)];
rating_matrix = convert(Matrix, ratings_matrix)
6040×3706 Array{Float64,2}:
5.0 NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4.0 NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4.0 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
5.0 NaN NaN NaN NaN NaN NaN 3.0 NaN NaN NaN NaN
5.0 5.0 NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN
NaN NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
⋮ ⋱ ⋮ ⋮
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN
4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4.0 NaN 1.0 2.0 1.0 NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN 2.0 NaN … NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN