"One very common problem to solve is when you have a number of users, and a number of products, you then want to recommend which products are most likely to be useful for which users. There are many variations of this, for example, recommending movies (such as on Netflix), figuring out what to highlight for a user on a homepage, deciding what stories to show in a social media feed, and so forth. There is a general solution to this problem, called *collaborative filtering*, which works like this: have a look at what products the current user has used or liked, find other users that have used or liked similar products, and then recommend the products that those other users have used or liked.\n",
"\n",
"For example, on Netflix you may have watched lots of movies that are science-fiction, full of action, and were made in the 1970s. Netflix may not know these particular properties of the films you have watched, but it would be able to see that other people that have watched the same movies that you watched also tended to watch other movies that are science-fiction, full of action, and were made in the 1970s. In other words, to use this approach we don't necessarily need to know anything about the movies, except who like to watch them.\n",
"\n",
"There is actually a more general class of problems that this approach can solve; not necessarily just things involving users and products. Indeed, for collaborative filtering we more commonly refer to *items*, rather than *products*. Items could be links that you click on, diagnoses that are selected for patients, and so forth.\n",
"The key foundational idea is that of *latent factors*. In the above Netflix example, we started with the assumption that you like old action sci-fi movies. But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to their movies table saying which movies are of these types. But there must be some underlying concept of sci-fi, action, and movie age. And these concepts must be relevant for at least some people's movie watching decisions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's get some data suitable for a collaboratie filtering model."
"For this chapter we are going to work on this movie review problem. We do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset that we can use, called MovieLens. This dataset contains tens of millions of movie rankings (that is a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you're interested, it would be a great learning project to try and replicate this approach on the full 25 million recommendation dataset you can get from their website."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset is available through the usual fastai function:"
"According to the `README`, the main table is in the file `u.data`. It is tab-separated and the columns are respectively user, movie, rating and timestamp. Since those names are not encoded, we need to indicate them when reading the file with pandas. Here is a way to open this table and take a look:"
"Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. <<movie_xtab>> shows the same data cross tabulated into a human friendly table."
"<img alt=\"Crosstab of movies and users\" width=\"632\" caption=\"Crosstab of movies and users\" id=\"movie_xtab\" src=\"images/att_00040.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have selected just a few of the most popular movies, and users who watch the most movies, for this crosstab example. The empty cells in this table are the things that we would like our model to learn to fill in. Those other places where a user has not reviewed the movie yet, presumably because they have not watched. So for each user, we would like to figure out which of those movies they might be most likely to enjoy.\n",
"\n",
"If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in this table would be to multiply this information together for each movie and use a combination. For instance, assuming these factors range between -1 and positive one, and positive means high match and negative means low match, and the categories are science-fiction, action, and old movies, then we could represent the movie The Last Skywalker as:"
"Here, for instance, we are scoring *very science-fiction* as 0.98, and *very not old* as -0.9. We could represent a user who likes modern sci-fi action movies as:"
"When we multiply two vectors together and add up the results, this is known as the *dot product*. It is used a lot in machine learning, and forms the basis of matrix modification. We will be looking a lot more at matrix modification and dot products in <<chapter_foundations>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: dot product: the mathematical operation of multiplying the elements of two vectors together, and then summing up the result."
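 jargon: dot product">
As a minimal sketch of this definition in plain Python, the snippet below scores one movie against one user. The factor order (science-fiction, action, old) follows the text, and the 0.98 and -0.9 values come from the example above; the remaining numbers and the names `dot`, `last_skywalker` and `user1` are assumptions made for illustration.

```python
def dot(a, b):
    """Multiply matching elements of two equal-length vectors and sum the results."""
    assert len(a) == len(b), "vectors must have the same length"
    return sum(x * y for x, y in zip(a, b))

# Factor order: science-fiction, action, old (each between -1 and +1).
last_skywalker = [0.98, 0.9, -0.9]  # 0.98 and -0.9 are from the text; 0.9 is assumed
user1 = [0.9, 0.8, -0.6]            # assumed: a fan of modern sci-fi action movies

# Every pair of factors has the same sign, so every term adds to the score.
match = dot(last_skywalker, user1)
print(match)
```

A movie whose factors have the opposite signs to the user's would give a negative score, which is how a mismatch shows up.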
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other hand, we might represent the movie Casablanca as:"
"There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialise values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
"Step two of this approach is to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. If for instance the first latent user factor represents how much they like action movies, and the first latent movie factor represents if the movie has a lot of action or not, when the product of those will be particularly high if either the user likes action movie and the movie has a lot of action in it or if the user doesn't like action movie and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't, or the user doesn't like action movies and it is one), the product will be very low.\n",
"Step three is to calculate our loss. We can use any loss function that we wish; that's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n",
"That's all we need. With this in place, we can optimise our parameters (that is, the latent factors) using stochastic gradient descent, such as to minimise the loss. At each step, the stochastic gradient descent optimiser will calculate the match between each movie and each user using the dot product, and will compare it to the actual rating that each user gave to each movie, and it will then calculate the derivative of this value, and will step the weights by multiplying this by the learning rate. After doing this lots of times, the loss will get better and better, and the recommendations will also get better and better."
"We can then build a `DataLoaders` object from this table. By default, it takes the first column for user, the second column for the item (here our movies) and the third column for the ratings. We need to change the value of `item_name` in our case, to use the titles instead of the ids:"
"In order to represent collaborative filtering in PyTorch we can't just use the crosstab representation directly, especially if we want it to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:"
"To calculate the result for a particular movie and use a combination we have two look up the index of the movie in our movie latent factors matrix, and the index of the user in our user latent factors matrix, and then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation which our deep learning models know how to do. They know how to do matrix products, and activation functions.\n",
"\n",
"It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. He is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:"
"If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had of done a matrix multiplication with a one hot encoded vector. This is called an *embedding*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: embedding layer: multiplying by a one hot encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. It is quite a fancy word for a very simple concept. The thing that you multiply the one hot encoded matrix by (or, using the computational shortcut, index into directly) is called the _embedding matrix_."
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n",
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one use might particularly like. \n",
"\n",
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n",
"This is what embeddings are. We will attribute to each of our users and each of our movie a random vector of a certain length (here `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rule of SGD (or another optimizer).\n",
"At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from independent cinema, action movies from romance...\n",
"We are now in a position that we can create our whole model from scratch."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Collaborative filtering from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we can write a model in PyTorch, we first need to learn the basics of object-oriented programming and Python. If you haven't done any object oriented programming before, we will give you a quick introduction here, but we would recommend looking up a tutorial and doing some practice before moving on.\n",
"\n",
"The key idea in object-oriented programming is the *class*. We have been using classes throughout this book, such as DataLoader, string, and Learner. Python makes it easy for us to create new classes. Here is an example of a simple class:"
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method is parameters. Note that the first parameter to any methods defined inside a class is `self`, so you can use this to set and get any attributes that you will need."
"Also note that creating a new PyTorch module requires inheriting from Module. *Inheritance* is an important object-oriented concept which we will not discuss in detail here — in short, it means that we can add additional behaviour to an existing class. PyTorch already provides a Module class, which provides some basic foundations that we want to build on. So, we add the name of this *super class* after the name of the class that we are defining, as you see above.\n",
"\n",
"The final thing that you need to know to create a new PyTorch module, is that when your module is called, PyTorch will call a method in your class called `forward`, and will pass along to that any parameters that are included in the call. Here is our dot product model:"
"If you haven't seen object-oriented programming before, then don't worry, you won't need to use it much in this book. We are just mentioning this approach here, because most online tutorials and documentation will use the object-oriented syntax.\n",
"\n",
"Note that the input of the model is a tensor of shape `batch_size x 2`, where the first columns (`x[:, 0]`) contains the user ids and the second column (`x[:, 1]`) contains the movie ids. As explained before, we use the *embedding* layers to represent our matrices of user and movie latent factors."
"Now that we have defined our architecture, and created our parameter matrices, we need to create a `Learner` to optimize our model. In the past we have used special functions, such as `cnn_learner`, which set up everything for us for a particular application. Since we are doing things from scratch here, we will use the plain `Learner` class:"
"The first thing we can do to make this model a little bit better is to force those predictions between 0 and 5. For this, we just need to use `sigmoid_range`, like in the previous chapter. One thing we discovered empirically is that it's better to have the range go a little bit over 5, so we use `(0, 5.5)`."
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations and others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n",
"\n",
"That's because at this point we only have weights; we do not have biases. If we have a single number for each user which we add to our scores, and ditto for each movie, then this will handle this missing piece very nicely. So first of all, let's adjust our model architecture:"
"Instead of being better, it ends up being worse (at least at the end of training). Why is that? If we look at both trainings carefully, we can see the validation loss stopped improving in the middle and started to get worse. As we've seen, this is a clear indication of overfitting. In this case, there is no way to use data augmentation, so we will have to use another regularisation technique. One approach that can be helpful is *weight decay*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Weight decay"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n",
"\n",
"Why would it prevent overfitting? The idea is that the larger the coefficient are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is."
"for a,y in zip(a_s,ys): ax.plot(x,y, label=f'a={a}')\n",
"ax.set_ylim([0,5])\n",
"ax.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So by letting our model learn high parameters, it might fit all the data points in the training set with an over-complex function that has very sharp changes, which will lead to overfitting.\n",
"\n",
"Limiting our weights from growing to much is going to hinder the training of the model, but it will yield to a state where it generalizes better. Going back to the theory a little bit, weight decay (or just `wd`) is a parameter that controls that sum of squares we add to our loss (assuming `parameters` is a tensor of all parameters):\n",
"\n",
"``` python\n",
"loss_with_wd = loss + wd * (parameters**2).sum()\n",
"```\n",
"\n",
"In practice though, it would be very inefficient (and maybe numerically unstable) to compute that big sum and add it to the loss. If you remember a little bit of high schoool math, you might recall that the derivative of `p**2` with respect to `p` is `2*p`, so adding that big sum to our loss is exactly the same as doing:\n",
"In practice, since `wd` is a parameter that we choose, we can just make it twice as big, so we don't even need the `*2` in the above equation. To use weight decay in fastai, just pass `wd` in your call to fit:"
"So far, we've used `Embedding` without thinking about how it really works. Let's recreate DotProductBias *without* using this class. We'll need a randomly initialized weight matrix for each of the embeddings. We have to be careful, however. Recall from <<chapter_mnist_basics>> that optimizers require that they can get all the parameters of a module from a module's `parameters()` method. However, this does not happen fully automatically. If we just add a tensor as an attribute to a `Module`, it will not be included in `parameters`:"
"To tell `Module` that we want to treat a tensor as parameters, we have to wrap it in the `nn.Parameter` class. This class doesn't actually add any functionality (other than automatically calling `requires_grad_()` for us). It's only used as a \"marker\" to show what to include in `parameters()`:"
"Our model is already useful, in that it can provide us with recommendations for movies for our users — but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:"
"Have a think about what this means. What this is saying is, that for these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth) they still generally don't like it. We could have simply sorted movies directly by the average rating, but looking at their learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people type like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:"
"It is not quite so easy to directly interpret the embedding matrices. There is just too many factors for a human to look at. But there is a technique which can pull out the most important underlying *directions* in such a matrix, called *principal component analysis* (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course, Computational Linear Algebra for Coders. <<img_pca_movie>> shows what our movies look like based on two of the strongest PCA components."
"We can see here that the model seems to have discovered a concept of *classic* versus *pop culture* movies, or perhaps it is *critically acclaimed* that is represented here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
 j: no matter how many">
"> j: no matter how many models I train, I never stop getting moved and surprised by how these randomly initialised bunches of numbers, trained with such simple mechanics, manage to discover things about my data all by themselves. It almost seems like cheating, that I can create code which does useful things, without ever actually telling it how to do those things!"
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that $x$ and $y$ are the distances between the coordinates on each axis). For a 50-dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
"If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Dial M for Murder (1954)'"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_factors = learn.model.i_weight.weight\n",
"idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']\n",
"## Boot strapping a collaborative filtering model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The biggest challenge with using collaborative filtering models in practice is the *bootstrapping problem*. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What product do you recommend to your very first user?\n",
"\n",
"But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of the form *use your common sense*. You can start your new users such that they have the mean of all of the embedding vectors of your other users — although this has the problem that that particular combination of latent factors may be not at all common (for instance the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n",
"\n",
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix that they tend to ask you a few questions about what genres of movie or music that you like; this is how they come up with your initial collaborative filtering recommendations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing to be careful of is that a small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. This is a very common problem, for instance, in movie recommendation systems. People that watch anime tend to watch a whole lot of it, and don't watch very much else, and spend a lot of time putting their ratings into websites. As a result, a lot of *best ever movies* lists tend to be heavily overrepresented with anime. In this particular case, it can be fairly obvious that you have a problem of representation bias, but if the bias is occurring in the latent factors then it may not be obvious at all.\n",
"\n",
"Such a problem can change the entire make up of your user base, and the behaviour of your system. This is particularly true because of positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, then they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias. This is a natural tendency to be amplified exponentially. You may have seen examples of company executives expressing surprise at how their online platforms rapidly deteriorate in such a way that they express values that are at odds with the values of the founders. In the presence of these kinds of feedback loops, it is easy to see how such a divergence can happen both quickly, and in a way that is hidden until it is too late.\n",
"\n",
"In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify upfront how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It's all about ensuring that there are humans in the loop, that there is careful monitoring, and gradual and thoughtful rollout."
"Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorisation* (PMF). Another approach, which generally works similarly well given the same data, is deep learning."
"To turn our architecture into a deep learning model the first step is to take the results of the embedding look up, and concatenating those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n",
"\n",
"Since we'll be concatenating the embedding matrices, rather than taking their dot product, that means that the two embedding matrices can have different sizes (i.e. different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:"
"`CollabNN` creates our `Embedding` layers in the same way as previous classes in this chapter, except that we now use the `embs` sizes. Then `self.layers` is identical to the mini neural net we created in <<chapter_mnist_basics>> for MNIST. Then, in `forward`, we apply the embeddings, concatenate the results, and pass it through the mini neural net. Finally, we apply `sigmoid_range` as we have in previous models.\n",
"Fastai provides this model in fastai.collab, if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), plus lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:"
"Wow that's not a lot of code! This class *inherits* from `TabularModel`, which is where it gets all its functionality from. In `__init__` is calls the same method in `TabularModel`, passing `n_cont=0` and `out_sz=1`; other than that, it only passes along whatever arguments it received."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sidebar: kwargs and delegates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`EmbeddingNN` includes `**kwargs` as a parameter to `__init__`. In python `**kwargs` in a parameter like means \"put any additional keyword arguments into a dict called `kwarg`. And `**kwargs` in an argument list means \"insert all key/value pairs in the `kwargs` dict as named arguments here\". This approach is used in many popular libraries, such as `matplotlib`, in which the main `plot` function simply has the signature `plot(*args, **kwargs)`. The [plot documentation](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) says \"*The `kwargs` are Line2D properties*\" and then lists those properties.\n",
"\n",
"We're using `**kwargs` in `EmbeddingNN` to avoid having to write all the arguments to `TabularModel` a second time, and keep them in sync. However, this makes our API quite difficult to work with, because now Jupyter Notebook doesn't know what parameters are available, so things like tab-completion of parameter names and popup lists of signatures won't work.\n",
"\n",
"Fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End sidebar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the results of `EmbeddingNN` are a bit worse than the dot product approach (which shows the power of carefully using an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, time, and other information that may be relevant to the recommendation. That's exactly what `TabularModel` does. In fact, we've now seen that `EmbeddingNN` is just a `TabularModel`, with `n_cont=0` and `out_sz=1`. So we better spend some time learning about `TabularModel`, and how to use it to get great results!"
"1. What problem does collaborative filtering solve?\n",
"1. How does it solve it?\n",
"1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?\n",
"1. What does a crosstab representation of collaborative filtering data look like?\n",
"1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)\n",
"1. What is a latent factor? Why is it \"latent\"?\n",
"1. What is a dot product? Calculate a dot product manually using pure python with lists.\n",
"1. What does `pandas.DataFrame.merge` do?\n",
"1. What is an embedding matrix?\n",
"1. What is the relationship between an embedding and a matrix of one-hot encoded vectors?\n",
"1. Why do we need `Embedding` if we could use one-hot encoded vectors for the same thing?\n",
"1. What does an embedding contain before we start training (assuming we're not using a prertained model)?\n",
"1. Create a class (without peeking, if possible!) and use it.\n",
"1. What does `x[:,0]` return?\n",
"1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it\n",
"1. What is a good loss function to use for MovieLens? Why? \n",
"1. What would happen if we used `CrossEntropy` loss with MovieLens? How would we need to change the model?\n",
"1. What is the use of bias in a dot product model?\n",
"1. What is another name for weight decay?\n",
"1. Write the equation for weight decay (without peeking!)\n",
"1. Write the equation for the gradient of weight decay. Why does it help reduce weights?\n",
"1. Why does reducing weights lead to better generalization?\n",
"1. What does `argsort` do in PyTorch?\n",
"1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?\n",
"1. How do you print the names and details of the layers in a model?\n",
"1. What is the \"bootstrapping problem\" in collaborative filtering?\n",
"1. How could you deal with the bootstrapping problem for new users? For new movies?\n",
"1. How can feedback loops impact collaborative filtering systems?\n",
"1. When using a neural network in collaborative filtering, why can we have different number of factors for movie and user?\n",
"1. Why is there a `nn.Sequential` in the `CollabNN` model?\n",
"1. What kind of model should be use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research\n",
"\n",
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change, to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
"1. Find three other areas where collaborative filtering is being used, and find out what pros and cons of this approach in those areas.\n",
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book website and forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas)\n",
"1. Create a model for MovieLens with works with CrossEntropy loss, and compare it to the model in this chapter."