
Recommender Systems using Deep Learning in PyTorch from scratch

Photo by Susan Yin on Unsplash

Recommender systems (RS) have been around for a long time, and recent advances in deep learning have made them even more exciting. Matrix factorization algorithms have been the workhorse of RS. In this article, I would assume that you are vaguely familiar with collaborative filtering based methods and have basic knowledge about training a neural network in PyTorch.

In this post, my goal is to show you how to implement a RS in PyTorch from scratch. The theory and model presented in this article were made available in this paper . Here is the GitHub repository for this article.

Problem Definition

Given a past record of movies seen by a user, we will build a recommender system that helps the user discover movies of their interest.

Specifically, given <userID, itemID> occurrence pairs, we need to generate a ranked list of movies for each user.

We model the problem as a binary classification problem , where we learn a function to predict whether a particular user will like a particular movie or not.


Figure 1: Our model will learn this mapping

Dataset

We use the MovieLens 100K dataset, which has 100,000 ratings from 1000 users on 1700 movies. The dataset can be downloaded from here .

The ratings are given to us in the form of <userID, itemID, rating, timestamp> tuples. Each user has a minimum of 20 ratings.

Training

We drop the exact value of the rating (1, 2, 3, 4, 5) and instead convert the data to an implicit scenario, i.e. any positive interaction is given a value of 1. All other interactions are given a value of zero by default.
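This conversion can be sketched in a few lines. A minimal sketch, assuming the ratings arrive as plain tuples (the function name `to_implicit` is illustrative, not from the article's repository):

```python
# Convert explicit ratings to implicit feedback: any observed
# interaction becomes a positive label of 1, regardless of rating value.
def to_implicit(ratings):
    """ratings: iterable of (userID, itemID, rating, timestamp) tuples."""
    return [(user, item, 1) for (user, item, rating, ts) in ratings]

sample = [(1, 50, 4, 874965758), (1, 172, 5, 874965478)]
print(to_implicit(sample))  # [(1, 50, 1), (1, 172, 1)]
```

Unobserved user-item pairs are not materialized here; they are treated as zeros implicitly when negatives are sampled.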

Since we are training a classifier, we need both positive and negative samples. The records present in the dataset are counted as positive samples. We assume that all entries missing from the user-item interaction matrix are negative samples (a strong assumption, but easy to implement).

For every item a user has interacted with, we randomly sample 4 items that the user has not interacted with. This way, a user with 20 positive interactions will have 80 negative interactions. These negative samples never include a positive interaction by the user, though they may not all be unique due to random sampling.
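The sampling procedure can be sketched as follows. This is a minimal illustration, not the repository's exact implementation; the function name and signature are assumptions:

```python
import random

def sample_negatives(user_positives, num_items, ratio=4, seed=0):
    """For each positive item of a user, draw `ratio` items the user never
    interacted with. Duplicates may occur (sampling is with replacement),
    but a negative is never one of the user's positives."""
    rng = random.Random(seed)
    positives = set(user_positives)
    negatives = []
    for _ in user_positives:
        for _ in range(ratio):
            item = rng.randrange(num_items)
            while item in positives:          # reject positives, resample
                item = rng.randrange(num_items)
            negatives.append(item)
    return negatives

negs = sample_negatives([3, 7, 11], num_items=100)
print(len(negs))  # 12: four negatives per positive
```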

Evaluation

For each user, we randomly sample 100 items that the user has not interacted with and rank the held-out test item among these 100 items. The same strategy is used in the paper that inspired this post (referenced below). We truncate the ranked list at 10.
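Building the per-user candidate list might look like this. A sketch under the stated protocol; the helper name `build_eval_list` is an assumption:

```python
import random

def build_eval_list(test_item, user_positives, num_items, n=100, seed=0):
    """Pair the held-out test item with `n` distinct items the user never
    interacted with; the model then ranks these n + 1 candidates."""
    rng = random.Random(seed)
    positives = set(user_positives) | {test_item}
    candidates = []
    while len(candidates) < n:
        item = rng.randrange(num_items)
        if item not in positives and item not in candidates:
            candidates.append(item)
    return [test_item] + candidates

eval_list = build_eval_list(test_item=5, user_positives=[1, 2, 3], num_items=1700)
print(len(eval_list))  # 101 candidates to rank
```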

Ranking all items for every user would be too time-consuming: we would have to score 1000 × 1700 ≈ 1.7 × 10⁶ user-item pairs. With this strategy, we only need 1000 × 100 = 10⁵ values, an order of magnitude less.

For each user, we hold out the latest rating (according to timestamp) as the test set and use the rest for training. This evaluation methodology, known as the leave-one-out strategy, is the same as in the reference paper.
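The leave-one-out split can be sketched like this (a minimal illustration, assuming the tuple layout described above; not the repository's exact code):

```python
from collections import defaultdict

def leave_one_out(ratings):
    """Hold out each user's most recent interaction (by timestamp) as the
    test item; everything else goes to the training set."""
    by_user = defaultdict(list)
    for user, item, rating, ts in ratings:
        by_user[user].append((ts, item, rating))
    train, test = [], {}
    for user, rows in by_user.items():
        rows.sort()                          # oldest -> newest
        *history, (ts, item, rating) = rows  # last row is the newest
        test[user] = item                    # single held-out test item
        train.extend((user, i, r) for (t, i, r) in history)
    return train, test

ratings = [(1, 10, 5, 100), (1, 20, 4, 200), (2, 30, 3, 50)]
train, test = leave_one_out(ratings)
print(test)  # {1: 20, 2: 30}
```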

Metrics

We use Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate the performance of our RS.

Our model gives a confidence score between 0 and 1 for each item in a user's test set. The items are sorted in decreasing order of score, and the top 10 are returned as recommendations. If the held-out test item (exactly one per user) is present in this list, HR is one for that user; otherwise it is zero. The final HR is reported after averaging over all users. A similar calculation is done for NDCG.
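With a single relevant item per user, both metrics reduce to simple formulas. A minimal sketch (NDCG simplifies to 1/log₂(rank + 2) for a 0-based rank, since the ideal DCG is 1):

```python
import math

def hit_ratio(ranked_items, test_item, k=10):
    """1.0 if the held-out item appears in the top-k, else 0.0."""
    return 1.0 if test_item in ranked_items[:k] else 0.0

def ndcg(ranked_items, test_item, k=10):
    """With one relevant item, NDCG@k = 1 / log2(position + 2),
    where position is the 0-based rank of the test item."""
    if test_item in ranked_items[:k]:
        position = ranked_items.index(test_item)
        return 1.0 / math.log2(position + 2)
    return 0.0

print(hit_ratio([5, 3, 9], 3))  # 1.0: item 3 is in the top 10
print(ndcg([5, 3, 9], 5))       # 1.0: test item ranked first
```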

While training, we minimize the cross-entropy loss, the standard loss function for a classification problem. The real strength of an RS, however, lies in producing a ranked list of the top-k items a user is most likely to interact with. Think about why you mostly click on Google search results on the first page and rarely go further. Metrics like NDCG and HR capture this phenomenon by measuring the quality of our ranked lists. Here is a good introduction on evaluating recommender systems .

Baseline: Item Popularity model

A baseline model is one we use to provide a first-cut, easy, non-sophisticated solution to the problem. In many use cases for recommender systems, recommending the same list of the most popular items to all users gives a tough-to-beat baseline.

In the GitHub repository, you will also find the code for implementing item popularity model from scratch. Below are the results for the baseline model.
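The baseline fits in a few lines. A minimal sketch, not the repository's implementation; the function name is an assumption:

```python
from collections import Counter

def popularity_recommender(train_pairs, k=10):
    """Rank items by interaction count in the training data and
    recommend the same top-k list to every user."""
    counts = Counter(item for (user, item) in train_pairs)
    return [item for item, _ in counts.most_common(k)]

train_pairs = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (1, 30)]
print(popularity_recommender(train_pairs, k=2))  # [10, 20]
```

Despite needing no learning at all, this model already achieves the scores reported in the Results section below.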

Deep Learning based model

With all the fancy architecture and jargon of neural networks, we aim to beat this item popularity model.

Our next model is a deep multi-layer perceptron (MLP). The input to the model is userID and itemID, which is fed into an embedding layer. Thus, each user and item is given an embedding. There are multiple dense layers afterward, followed by a single neuron with a sigmoid activation. The exact model definition can be found in the file MLP.py .
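A condensed sketch of such a model follows. The layer widths here are illustrative; the exact architecture is in MLP.py in the repository:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """User and item IDs are embedded, concatenated, and passed through
    dense layers down to a single sigmoid output."""
    def __init__(self, num_users, num_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.layers = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),  # probability of interaction
        )

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.layers(x).squeeze(-1)

model = MLP(num_users=1000, num_items=1700)
scores = model(torch.tensor([0, 1]), torch.tensor([10, 20]))
print(scores.shape)  # one score per (user, item) pair
```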

The output of the sigmoid neuron can be interpreted as the probability the user is likely to interact with an item. It is interesting to observe that we end up training a classifier for the task of recommendation.


Figure 2: The architecture for Neural Collaborative Filtering

Our loss function is binary cross-entropy. We use Adam for gradient descent and the L2 norm for regularization.
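One optimization step, put together, looks roughly like this. A sketch with a stand-in model (L2 regularization is applied via Adam's `weight_decay` parameter; the learning rate and batch here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for the real user/item model: any module that outputs
# sigmoid probabilities works with BCELoss.
model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
criterion = nn.BCELoss()

features = torch.randn(16, 8)                 # a batch of inputs
labels = torch.randint(0, 2, (16,)).float()   # 1 = positive, 0 = sampled negative

optimizer.zero_grad()
preds = model(features).squeeze(-1)
loss = criterion(preds, labels)               # binary cross-entropy
loss.backward()
optimizer.step()
print(loss.item())
```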

Results

For the popularity based model, which takes less than 5 seconds to train, these are the scores:

HR = 0.4221 | NDCG = 0.2269

For the deep learning model, we obtain these results after nearly 30 epochs of training (~3 minutes on CPU):

HR = 0.6013 | NDCG = 0.3294

The results are exciting. There is a huge jump in metrics we care about. We observe a 30% reduction in error according to HR, which is huge. These numbers are obtained from a very coarse hyper-parameter tuning. It might still be possible to extract more juice by hyper-parameter optimization.

Conclusion

State of the art algorithms for matrix factorization, and much more, can be easily replicated using neural networks. For a non-neural perspective, read this excellent post about matrix factorization for recommender systems .

In this post, we saw how neural networks offer a straightforward way of building recommender systems. The trick is to think of the recommendation problem as a classification problem.
