diff --git a/08_collab.ipynb b/08_collab.ipynb index d291666..1922832 100644 --- a/08_collab.ipynb +++ b/08_collab.ipynb @@ -28,20 +28,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "One very common problem to solve is when you have a number of users, and a number of products, you then want to recommend which products are most likely to be useful for which users. There are many variations of this, for example, recommending movies (such as on Netflix), figuring out what to highlight for a user on a homepage, deciding what stories to show in a social media feed, and so forth. There is a general solution to this problem, called *collaborative filtering*, which works like this: have a look at what products the current user has used or liked, find other users that have used or liked similar products, and then recommend the products that those other users have used or liked.\n", + "One very common problem to solve is when you have a number of users and a number of products, and you want to recommend which products are most likely to be useful for which users. There are many variations of this: for example, recommending movies (such as on Netflix), figuring out what to highlight for a user on a home page, deciding what stories to show in a social media feed, and so forth. There is a general solution to this problem, called *collaborative filtering*, which works like this: look at what products the current user has used or liked, find other users that have used or liked similar products, and then recommend other products that those users have used or liked.\n", "\n", - "For example, on Netflix you may have watched lots of movies that are science-fiction, full of action, and were made in the 1970s. Netflix may not know these particular properties of the films you have watched, but it would be able to see that other people that have watched the same movies that you watched also tended to watch other movies that are science-fiction, full of action, and were made in the 1970s. In other words, to use this approach we don't necessarily need to know anything about the movies, except who like to watch them.\n", + "For example, on Netflix you may have watched lots of movies that are science fiction, full of action, and were made in the 1970s. Netflix may not know these particular properties of the films you have watched, but it will be able to see that other people that have watched the same movies that you watched also tended to watch other movies that are science fiction, full of action, and were made in the 1970s. In other words, to use this approach we don't necessarily need to know anything about the movies, except who likes to watch them.\n", "\n", - "There is actually a more general class of problems that this approach can solve; not necessarily just things involving users and products. Indeed, for collaborative filtering we more commonly refer to *items*, rather than *products*. Items could be links that you click on, diagnoses that are selected for patients, and so forth.\n", + "There is actually a more general class of problems that this approach can solve, not necessarily involving users and products. Indeed, for collaborative filtering we more commonly refer to *items*, rather than *products*. Items could be links that people click on, diagnoses that are selected for patients, and so forth.\n", "\n", - "The key foundational idea is that of *latent factors*. In the above Netflix example, we started with the assumption that you like old action sci-fi movies. 
But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to their movies table saying which movies are of these types. But there must be some underlying concept of sci-fi, action, and movie age. And these concepts must be relevant for at least some people's movie watching decisions." + "The key foundational idea is that of *latent factors*. In the Netflix example, we started with the assumption that you like old, action-packed sci-fi movies. But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to its movies table saying which movies are of these types. Still, there must be some underlying concept of sci-fi, action, and movie age, and these concepts must be relevant for at least some people's movie watching decisions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's get some data suitable for a collaborative filtering model." + "For this chapter we are going to work on this movie recommendation problem. We'll start by getting some data suitable for a collaborative filtering model." ] }, { @@ -55,7 +55,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For this chapter we are going to work on this movie review problem. We do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset that we can use, called MovieLens. This dataset contains tens of millions of movie rankings (that is a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you're interested, it would be a great learning project to try and replicate this approach on the full 25 million recommendation dataset you can get from their website." + "We do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset that we can use, called [MovieLens](https://grouplens.org/datasets/movielens/). This dataset contains tens of millions of movie rankings (a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you're interested, it would be a great learning project to try and replicate this approach on the full 25-million recommendation dataset, which you can get from their website." ] }, { @@ -80,7 +80,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "According to the `README`, the main table is in the file `u.data`. It is tab-separated and the columns are respectively user, movie, rating and timestamp. Since those names are not encoded, we need to indicate them when reading the file with pandas. Here is a way to open this table and take a look:" + "According to the *README*, the main table is in the file *u.data*. It is tab-separated and the columns are, respectively user, movie, rating, and timestamp. Since those names are not encoded, we need to indicate them when reading the file with Pandas. Here is a way to open this table and take a look:" ] }, { @@ -179,7 +179,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. <> shows the same data cross tabulated into a human friendly table." + "Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. <> shows the same data cross-tabulated into a human-friendly table." 
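Here is a minimal sketch of how such a cross-tabulation can be built with Pandas. It assumes `path` points at the extracted ml-100k files (as set up earlier in the chapter); the cutoff of 15 users and 15 movies is just an illustrative choice:

```python
import pandas as pd

# Load the ratings table; u.data has no header row, so we supply the column names.
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])

# Keep only the most active users and the most rated movies, then cross-tabulate.
top_users  = ratings['user'].value_counts().index[:15]
top_movies = ratings['movie'].value_counts().index[:15]
dense = ratings[ratings['user'].isin(top_users) & ratings['movie'].isin(top_movies)]

# Rows are users, columns are movies, cells are ratings; the missing cells are
# what the model will learn to fill in.
crosstab = dense.pivot_table(index='user', columns='movie', values='rating')
```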
] }, { @@ -193,9 +193,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We have selected just a few of the most popular movies, and users who watch the most movies, for this crosstab example. The empty cells in this table are the things that we would like our model to learn to fill in. Those other places where a user has not reviewed the movie yet, presumably because they have not watched. So for each user, we would like to figure out which of those movies they might be most likely to enjoy.\n", + "We have selected just a few of the most popular movies, and users who watch the most movies, for this crosstab example. The empty cells in this table are the things that we would like our model to learn to fill in. Those are the places where a user has not reviewed the movie yet, presumably because they have not watched it. For each user, we would like to figure out which of those movies they might be most likely to enjoy.\n", "\n", - "If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in this table would be to multiply this information together for each movie and use a combination. For instance, assuming these factors range between -1 and positive one, and positive means high match and negative means low match, and the categories are science-fiction, action, and old movies, then we could represent the movie The Last Skywalker as:" + "If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in this table would be to multiply this information together for each movie and use a combination. For instance, assuming these factors range between -1 and +1, with positive numbers indicating stronger matches and negative numbers weaker ones, and the categories are science-fiction, action, and old movies, then we could represent the movie *The Last Skywalker* as:" ] }, { @@ -227,7 +227,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "…and we can now calculate the match between this combination:" + "and we can now calculate the match between this combination:" ] }, { @@ -261,14 +261,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> jargon: dot product: the mathematical operation of multiplying the elements of two vectors together, and then summing up the result." + "> jargon: dot product: The mathematical operation of multiplying the elements of two vectors together, and then summing up the result." 
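To make the jargon box concrete, here is a small, purely illustrative calculation in the spirit of the surrounding cells; the factor values are made up:

```python
import numpy as np

# Hypothetical latent factors, each in the range -1 to +1:
# (science-fiction, action, old movie)
last_skywalker = np.array([0.98, 0.9, -0.9])   # very sci-fi, very action, not old
user1          = np.array([0.9, 0.8, -0.6])    # likes sci-fi and action, prefers newer films

# Dot product: multiply element-wise, then sum the results.
match = (user1 * last_skywalker).sum()
print(match)  # about 2.14 -- a high value, so this looks like a good match
```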
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "On the other hand, we might represent the movie Casablanca as:" + "On the other hand, we might represent the movie *Casablanca* as:" ] }, { @@ -284,7 +284,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "…and the match between this combination is:" + "The match between this combination is:" ] }, { @@ -325,9 +325,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n", + "There is surprisingly little difference between specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n", "\n", - "Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialised values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example." + "Step 1 of this approach is to randomly initialize some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors and each movie will have a set of these factors, we can show these randomly initialized values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example." ] }, { @@ -341,18 +341,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Step two of this approach is to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. If for instance the first latent user factor represents how much they like action movies, and the first latent movie factor represents if the movie has a lot of action or not, when the product of those will be particularly high if either the user likes action movie and the movie has a lot of action in it or if the user doesn't like action movie and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't, or the user doesn't like action movies and it is one), the product will be very low.\n", + "Step 2 of this approach is to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. 
If, for instance, the first latent user factor represents how much the user likes action movies and the first latent movie factor represents if the movie has a lot of action or not, the product of those will be particularly high if either the user likes action movies and the movie has a lot of action in it or the user doesn't like action movies and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't an action film, or the user doesn't like action movies and it is one), the product will be very low.\n", "\n", - "Step three is to calculate our loss. We can use any loss function that we wish; that's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n", + "Step 3 is to calculate our loss. We can use any loss function that we wish; let's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n", "\n", - "That's all we need. With this in place, we can optimise our parameters (that is, the latent factors) using stochastic gradient descent, such as to minimise the loss. At each step, the stochastic gradient descent optimiser will calculate the match between each movie and each user using the dot product, and will compare it to the actual rating that each user gave to each movie, and it will then calculate the derivative of this value, and will step the weights by multiplying this by the learning rate. After doing this lots of times, the loss will get better and better, and the recommendations will also get better and better." + "That's all we need. With this in place, we can optimize our parameters (that is, the latent factors) using stochastic gradient descent, such as to minimize the loss. At each step, the stochastic gradient descent optimizer will calculate the match between each movie and each user using the dot product, and will compare it to the actual rating that each user gave to each movie. It will then calculate the derivative of this value and will step the weights by multiplying this by the learning rate. After doing this lots of times, the loss will get better and better, and the recommendations will also get better and better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "To use the usual `Learner` fit function, we will need to get our data into `DataLoaders`, so let's focus on that now." + "To use the usual `Learner.fit` function we will need to get our data into a `DataLoaders`, so let's focus on that now." ] }, { @@ -366,7 +366,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When showing the data we would rather see movie titles than their ids. The table `u.item` contains the correspondence id to title:" + "When showing the data, we would rather see movie titles than their IDs. The table `u.item` contains the correspondence of IDs to titles:" ] }, { @@ -453,7 +453,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next we merge it to our ratings to get the titles." + "We can merge this with our `ratings` table to get the user ratings by title:" ] }, { @@ -557,7 +557,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can then build a `DataLoaders` object from this table. By default, it takes the first column for user, the second column for the item (here our movies) and the third column for the ratings. 
We need to change the value of `item_name` in our case, to use the titles instead of the ids:" + "We can then build a `DataLoaders` object from this table. By default, it takes the first column for the user, the second column for the item (here our movies), and the third column for the ratings. We need to change the value of `item_name` in our case to use the titles instead of the IDs:" ] }, { @@ -658,7 +658,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In order to represent collaborative filtering in PyTorch we can't just use the crosstab representation directly, especially if we want it to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:" + "To represent collaborative filtering in PyTorch we can't just use the crosstab representation directly, especially if we want it to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:" ] }, { @@ -700,9 +700,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To calculate the result for a particular movie and use a combination we have two look up the index of the movie in our movie latent factors matrix, and the index of the user in our user latent factors matrix, and then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation which our deep learning models know how to do. They know how to do matrix products, and activation functions.\n", + "To calculate the result for a particular movie and user combination, we have to look up the index of the movie in our movie latent factor matrix and the index of the user in our user latent factor matrix; then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation our deep learning models know how to do. They know how to do matrix products, and activation functions.\n", "\n", - "It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. Here is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:" + "Fortunately, it turns out that we can represent *look up in an index* as a matrix product. The trick is to replace our indices with one-hot-encoded vectors. Here is an example of what happens if we multiply a vector by a one-hot-encoded vector representing the index 3:" ] }, { @@ -714,26 +714,6 @@ "one_hot_3 = one_hot(3, n_users).float()" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "torch.Size([944, 5])" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "user_factors.shape" - ] - }, { "cell_type": "code", "execution_count": null, @@ -785,29 +765,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. 
Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one hot encoded vector. This is called an *embedding*." + "If we do that for a few indices at once, we will have a matrix of one-hot-encoded vectors, and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one-hot-encoded vector, or to search through it to find the occurrence of the number one—we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer that does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one-hot-encoded vector. This is called an *embedding*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> jargon: embedding layer: multiplying by a one hot encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. It is quite a fancy word for a very simple concept. The thing that you multiply the one hot encoded matrix by (or, using the computational shortcut, index into directly) is called the _embedding matrix_." + "> jargon: Embedding: Multiplying by a one-hot-encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. This is quite a fancy word for a very simple concept. The thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly) is called the _embedding matrix_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n", + "In computer vision, we have a very easy way to get all the information of a pixel through its RGB values: each pixel in a colored image is represented by three numbers. Those three numbers give us the redness, the greenness, and the blueness, which is enough to get our model to work afterward.\n", "\n", - "For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one user might particularly like. \n", + "For the problem at hand, we don't have the same easy way to characterize a user or a movie. There are probably relations with genres: if a given user likes romance, they are likely to give higher scores to romance movies. Other factors might be whether the movie is more action-oriented versus heavy on dialogue, or the presence of a specific actor that a user might particularly like. \n", "\n", - "How do we determine numbers to characterize those? The answer is, we don't. 
We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n", + "How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, our model can figure out itself the features that seem important or not.\n", "\n", - "This is what embeddings are. We will attribute to each of our users and each of our movie a random vector of a certain length (here `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rule of SGD (or another optimizer).\n", + "This is what embeddings are. We will attribute to each of our users and each of our movies a random vector of a certain length (here, `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rules of SGD (or another optimizer).\n", "\n", - "At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from independent cinema, action movies from romance...\n", + "At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data about the relations between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from independent cinema, action movies from romance, and so on.\n", "\n", "We are now in a position that we can create our whole model from scratch." ] @@ -823,9 +803,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Before we can write a model in PyTorch, we first need to learn the basics of object-oriented programming and Python. If you haven't done any object oriented programming before, we will give you a quick introduction here, but we would recommend looking up a tutorial and doing some practice before moving on.\n", + "Before we can write a model in PyTorch, we first need to learn the basics of object-oriented programming and Python. If you haven't done any object-oriented programming before, we will give you a quick introduction here, but we would recommend looking up a tutorial and getting some practice before moving on.\n", "\n", - "The key idea in object-oriented programming is the *class*. We have been using classes throughout this book, such as DataLoader, string, and Learner. Python makes it easy for us to create new classes. Here is an example of a simple class:" + "The key idea in object-oriented programming is the *class*. We have been using classes throughout this book, such as `DataLoader`, `string`, and `Learner`. Python also makes it easy for us to create new classes. 
Here is an example of a simple class:" ] }, { @@ -843,7 +823,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method as parameters. Note that the first parameter to any method defined inside a class is `self`, so you can use this to set and get any attributes that you will need." + "The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behavior associated with this method name. In the case of `__init__`, this is the method Python will call when your new object is created. So, this is where you can set up any state that needs to be initialized upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method as parameters. Note that the first parameter to any method defined inside a class is `self`, so you can use this to set and get any attributes that you will need:" ] }, { @@ -871,9 +851,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Also note that creating a new PyTorch module requires inheriting from Module. *Inheritance* is an important object-oriented concept which we will not discuss in detail here — in short, it means that we can add additional behaviour to an existing class. PyTorch already provides a Module class, which provides some basic foundations that we want to build on. So, we add the name of this *super class* after the name of the class that we are defining, as you see above.\n", + "Also note that creating a new PyTorch module requires inheriting from `Module`. *Inheritance* is an important object-oriented concept that we will not discuss in detail here—in short, it means that we can add additional behavior to an existing class. PyTorch already provides a `Module` class, which provides some basic foundations that we want to build on. So, we add the name of this *superclass* after the name of the class that we are defining, as shown in the following example.\n", "\n", - "The final thing that you need to know to create a new PyTorch module, is that when your module is called, PyTorch will call a method in your class called `forward`, and will pass along to that any parameters that are included in the call. Here is our dot product model:" + "The final thing that you need to know to create a new PyTorch module is that when your module is called, PyTorch will call a method in your class called `forward`, and will pass along to that any parameters that are included in the call. Here is the class defining our dot product model:" ] }, { @@ -899,7 +879,7 @@ "source": [ "If you haven't seen object-oriented programming before, then don't worry, you won't need to use it much in this book. 
We are just mentioning this approach here, because most online tutorials and documentation will use the object-oriented syntax.\n", "\n", - "Note that the input of the model is a tensor of shape `batch_size x 2`, where the first columns (`x[:, 0]`) contains the user ids and the second column (`x[:, 1]`) contains the movie ids. As explained before, we use the *embedding* layers to represent our matrices of user and movie latent factors." + "Note that the input of the model is a tensor of shape `batch_size x 2`, where the first column (`x[:, 0]`) contains the user IDs and the second column (`x[:, 1]`) contains the movie IDs. As explained before, we use the *embedding* layers to represent our matrices of user and movie latent factors:" ] }, { @@ -1014,7 +994,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The first thing we can do to make this model a little bit better is to force those predictions between 0 and 5. For this, we just need to use `sigmoid_range`, like in the previous chapter. One thing we discovered empirically is that it's better to have the range go a little bit over 5, so we use `(0, 5.5)`." + "The first thing we can do to make this model a little bit better is to force those predictions to be between 0 and 5. For this, we just need to use `sigmoid_range`, like in <>. One thing we discovered empirically is that it's better to have the range go a little bit over 5, so we use `(0, 5.5)`:" ] }, { @@ -1104,9 +1084,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n", + "This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say about a movie is, for instance, that it is very sci-fi, very action-oriented, and very not old, then you don't really have any way to say whether most people like it. \n", "\n", - "That's because at this point we only have weights; we do not have biases. If we have a single number for each user which we add to our scores, and ditto for each movie, then this will handle this missing piece very nicely. So first of all, let's adjust our model architecture:" + "That's because at this point we only have weights; we do not have biases. If we have a single number for each user that we can add to our scores, and ditto for each movie, that will handle this missing piece very nicely. So first of all, let's adjust our model architecture:" ] }, { @@ -1207,7 +1187,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Instead of being better, it ends up being worse (at least at the end of training). Why is that? If we look at both trainings carefully, we can see the validation loss stopped improving in the middle and started to get worse. As we've seen, this is a clear indication of overfitting. 
In this case, there is no way to use data augmentation, so we will have to use another regularisation technique. One approach that can be helpful is *weight decay*." + "Instead of being better, it ends up being worse (at least at the end of training). Why is that? If we look at both trainings carefully, we can see the validation loss stopped improving in the middle and started to get worse. As we've seen, this is a clear indication of overfitting. In this case, there is no way to use data augmentation, so we will have to use another regularization technique. One approach that can be helpful is *weight decay*." ] }, { @@ -1221,9 +1201,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n", + "Weight decay, or *L2 regularization*, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n", "\n", - "Why would it prevent overfitting? The idea is that the larger the coefficients are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is." + "Why would it prevent overfitting? The idea is that the larger the coefficients are, the sharper canyons we will have in the loss function. If we take the basic example of a parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is (<>)." ] }, { @@ -1248,6 +1228,7 @@ ], "source": [ "#hide_input\n", + "#id parabolas\n", "x = np.linspace(-2,2,100)\n", "a_s = [1,2,5,10,50] \n", "ys = [a * x**2 for a in a_s]\n", @@ -1261,21 +1242,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So by letting our model learn high parameters, it might fit all the data points in the training set with an over-complex function that has very sharp changes, which will lead to overfitting.\n", + "So, letting our model learn high parameters might cause it to fit all the data points in the training set with an overcomplex function that has very sharp changes, which will lead to overfitting.\n", "\n", - "Limiting our weights from growing to much is going to hinder the training of the model, but it will yield to a state where it generalizes better. Going back to the theory a little bit, weight decay (or just `wd`) is a parameter that controls that sum of squares we add to our loss (assuming `parameters` is a tensor of all parameters):\n", + "Limiting our weights from growing too much is going to hinder the training of the model, but it will yield a state where it generalizes better. Going back to the theory briefly, weight decay (or just `wd`) is a parameter that controls that sum of squares we add to our loss (assuming `parameters` is a tensor of all parameters):\n", "\n", "``` python\n", "loss_with_wd = loss + wd * (parameters**2).sum()\n", "```\n", "\n", - "In practice though, it would be very inefficient (and maybe numerically unstable) to compute that big sum and add it to the loss. 
If you remember a little bit of high schoool math, you might recall that the derivative of `p**2` with respect to `p` is `2*p`, so adding that big sum to our loss is exactly the same as doing:\n", + "In practice, though, it would be very inefficient (and maybe numerically unstable) to compute that big sum and add it to the loss. If you remember a little bit of high schoool math, you might recall that the derivative of `p**2` with respect to `p` is `2*p`, so adding that big sum to our loss is exactly the same as doing:\n", "\n", "``` python\n", "parameters.grad += wd * 2 * parameters\n", "```\n", "\n", - "In practice, since `wd` is a parameter that we choose, we can just make it twice as big, so we don't even need the `*2` in the above equation. To use weight decay in fastai, just pass `wd` in your call to fit:" + "In practice, since `wd` is a parameter that we choose, we can just make it twice as big, so we don't even need the `*2` in this equation. To use weight decay in fastai, just pass `wd` in your call to `fit` or `fit_one_cycle`:" ] }, { @@ -1361,7 +1342,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So far, we've used `Embedding` without thinking about how it really works. Let's recreate DotProductBias *without* using this class. We'll need a randomly initialized weight matrix for each of the embeddings. We have to be careful, however. Recall from <> that optimizers require that they can get all the parameters of a module from a module's `parameters()` method. However, this does not happen fully automatically. If we just add a tensor as an attribute to a `Module`, it will not be included in `parameters`:" + "So far, we've used `Embedding` without thinking about how it really works. Let's re-create `DotProductBias` *without* using this class. We'll need a randomly initialized weight matrix for each of the embeddings. We have to be careful, however. Recall from <> that optimizers require that they can get all the parameters of a module from the module's `parameters` method. However, this does not happen fully automatically. If we just add a tensor as an attribute to a `Module`, it will not be included in `parameters`:" ] }, { @@ -1391,7 +1372,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To tell `Module` that we want to treat a tensor as parameters, we have to wrap it in the `nn.Parameter` class. This class doesn't actually add any functionality (other than automatically calling `requires_grad_()` for us). It's only used as a \"marker\" to show what to include in `parameters()`:" + "To tell `Module` that we want to treat a tensor as a parameter, we have to wrap it in the `nn.Parameter` class. This class doesn't actually add any functionality (other than automatically calling `requires_grad_` for us). It's only used as a \"marker\" to show what to include in `parameters`:" ] }, { @@ -1522,7 +1503,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's train it again to check it's around the same results we saw in the previous section:" + "Then let's train it again to check we get around the same results we saw in the previous section:" ] }, { @@ -1594,7 +1575,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's have a look at what our model has learned." + "Now, let's take a look at what our model has learned." 
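A quick aside before we look at the learned parameters: the weight-decay shortcut described earlier in this section is easy to verify numerically. This is a standalone sketch (it assumes plain SGD with no momentum, so only the gradient contribution matters), not part of the notebook's training code:

```python
import torch

# Toy check: folding wd * (p**2).sum() into the loss changes the gradient by
# exactly wd * 2 * p, which is what adding that term directly to p.grad does.
wd = 0.1
p = torch.randn(5, requires_grad=True)

# Version 1: weight decay added to the loss ((p * 3).sum() stands in for a model loss).
((p * 3).sum() + wd * (p**2).sum()).backward()
grad_via_loss = p.grad.clone()

# Version 2: weight decay added straight to the gradient.
p.grad = None
(p * 3).sum().backward()
with torch.no_grad():
    p.grad += wd * 2 * p

print(torch.allclose(grad_via_loss, p.grad))  # True
```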
] }, { @@ -1608,7 +1589,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Our model is already useful, in that it can provide us with recommendations for movies for our users — but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:" + "Our model is already useful, in that it can provide us with movie recommendations for our users—but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:" ] }, { @@ -1641,7 +1622,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Have a think about what this means. What this is saying is, that for these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth) they still generally don't like it. We could have simply sorted movies directly by the average rating, but looking at their learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people type do not like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:" + "Think about what this means. What it's saying is that for each of these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth), they still generally don't like it. We could have simply sorted the movies directly by their average rating, but looking at the learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people tend not to like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:" ] }, { @@ -1673,9 +1654,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So, for instance, even if you don't normally enjoy detective movies, you might enjoy LA Confidential!\n", + "So, for instance, even if you don't normally enjoy detective movies, you might enjoy *LA Confidential*!\n", "\n", - "It is not quite so easy to directly interpret the embedding matrices. There is just too many factors for a human to look at. But there is a technique which can pull out the most important underlying *directions* in such a matrix, called *principal component analysis* (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course, Computational Linear Algebra for Coders. <> shows what our movies look like based on two of the strongest PCA components." + "It is not quite so easy to directly interpret the embedding matrices. There are just too many factors for a human to look at. But there is a technique that can pull out the most important underlying *directions* in such a matrix, called *principal component analysis* (PCA). 
We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course [Computational Linear Algebra for Coders](https://github.com/fastai/numerical-linear-algebra). <> shows what our movies look like based on two of the strongest PCA components." ] }, { @@ -1701,8 +1682,8 @@ "source": [ "#hide_input\n", "#id img_pca_movie\n", - "#caption Representation of movies on two strongest PCA components\n", - "#alt Representation of movies on two strongest PCA components\n", + "#caption Representation of movies based on two strongest PCA components\n", + "#alt Representation of movies based on two strongest PCA components\n", "g = ratings.groupby('title')['rating'].count()\n", "top_movies = g.sort_values(ascending=False).index.values[:1000]\n", "top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])\n", @@ -1731,14 +1712,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> j: no matter how many models I train, I never stop getting moved and surprised by how these randomly initialised bunches of numbers, trained with such simple mechanics, managed to discover things about my data all by themselves. It almost seems like cheating, that I can create code which does useful things, without ever actually telling it how to do those things!" + "> j: No matter how many models I train, I never stop getting moved and surprised by how these randomly initialized bunches of numbers, trained with such simple mechanics, manage to discover things about my data all by themselves. It almost seems like cheating, that I can create code that does useful things without ever actually telling it how to do those things!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We defined our model from scratch to teach you what is inside, but you can directly use the fastai library to build it." + "We defined our model from scratch to teach you what is inside, but you can directly use the fastai library to build it. We'll look at how to do that next." ] }, { @@ -1752,7 +1733,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "fastai can create and train a collaborative filtering model using the exact structure shown above by using `collab_learner`:" + "We can create and train a collaborative filtering model using the exact structure shown earlier by using fastai's `collab_learner`:" ] }, { @@ -1831,7 +1812,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The names of the layers can be seen by printing the model" + "The names of the layers can be seen by printing the model:" ] }, { @@ -1863,7 +1844,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can use these to replicate any of the analyses we did in the previous section, for instance:" + "We can use these to replicate any of the analyses we did in the previous section--for instance:" ] }, { @@ -1896,7 +1877,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "An other interesting thing we can do with these learned embeddings is to look at _distance_." + "Another interesting thing we can do with these learned embeddings is to look at _distance_." ] }, { @@ -1910,7 +1891,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). 
For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n", + "On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that *x* and *y* are the distances between the coordinates on each axis). For a 50-dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n", "\n", "If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:" ] @@ -1943,7 +1924,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now that we have succesfully trained a model, let's see how to deal when we have no data for a new user, to be able to make recommendations to them." + "Now that we have succesfully trained a model, let's see how to deal with the situation where we have no data for a user. How can we make recommendations to new users?" ] }, { @@ -1957,29 +1938,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The biggest challenge with using collaborative filtering models in practice is the *bootstrapping problem*. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What product do you recommend to your very first user?\n", + "The biggest challenge with using collaborative filtering models in practice is the *bootstrapping problem*. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What products do you recommend to your very first user?\n", "\n", - "But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of the form *use your common sense*. You can start your new users such that they have the mean of all of the embedding vectors of your other users — although this has the problem that that particular combination of latent factors may be not at all common (for instance the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n", + "But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of *use your common sense*. 
You could assign new users the mean of all of the embedding vectors of your other users, but this has the problem that that particular combination of latent factors may be not at all common (for instance, the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n", "\n", - "Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations." + "Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them that could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup metadata. We will see in the next section how to create these kinds of tabular models. (You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "One thing to be careful of is that a small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. This is a very common problem, for instance, in movie recommendation systems. People that watch anime tend to watch a whole lot of it, and don't watch very much else, and spend a lot of time putting their ratings into websites. As a result, a lot of *best ever movies* lists tend to be heavily overrepresented with anime. In this particular case, it can be fairly obvious that you have a problem of representation bias, but if the bias is occurring in the latent factors then it may not be obvious at all.\n", + "One thing to be careful of is that a small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. This is a very common problem, for instance, in movie recommendation systems. People that watch anime tend to watch a whole lot of it, and don't watch very much else, and spend a lot of time putting their ratings on websites. As a result, anime tends to be heavily overrepresented in a lot of *best ever movies* lists. In this particular case, it can be fairly obvious that you have a problem of representation bias, but if the bias is occurring in the latent factors then it may not be obvious at all.\n", "\n", - "Such a problem can change the entire make up of your user base, and the behaviour of your system. 
This is particularly true because of positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, then they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias. This is a natural tendency to be amplified exponentially. You may have seen examples of company executives expressing surprise at how their online platforms rapidly deteriorate in such a way that they express values that are at odds with the values of the founders. In the presence of these kinds of feedback loops, it is easy to see how such a divergence can happen both quickly, and in a way that is hidden until it is too late.\n", + "Such a problem can change the entire makeup of your user base, and the behavior of your system. This is particularly true because of positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, then they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias. This type of bias has a natural tendency to be amplified exponentially. You may have seen examples of company executives expressing surprise at how their online platforms rapidly deteriorated in such a way that they expressed values at odds with the values of the founders. In the presence of these kinds of feedback loops, it is easy to see how such a divergence can happen both quickly and in a way that is hidden until it is too late.\n", "\n", - "In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify upfront how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It's all about ensuring that there are humans in the loop, that there is careful monitoring, and gradual and thoughtful rollout." + "In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify up front how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It's all about ensuring that there are humans in the loop; that there is careful monitoring, and a gradual and thoughtful rollout." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorisation* (PMF). Another approach, which generally works similarly well given the same data, is deep learning." + "Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorization* (PMF). 
Another approach, which generally works similarly well given the same data, is deep learning." ] }, { @@ -1993,9 +1974,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To turn our architecture into a deep learning model the first step is to take the results of the embedding look up, and concatenating those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n", + "To turn our architecture into a deep learning model, the first step is to take the results of the embedding lookup and concatenate those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n", "\n", - "Since we'll be concatenating the embedding matrices, rather than taking their dot product, that means that the two embedding matrices can have different sizes (i.e. different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:" + "Since we'll be concatenating the embedding matrices, rather than taking their dot product, the two embedding matrices can have different sizes (i.e., different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:" ] }, { @@ -2052,7 +2033,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "...and use it to create a model:" + "And use it to create a model:" ] }, { @@ -2068,7 +2049,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`CollabNN` creates our `Embedding` layers in the same way as previous classes in this chapter, except that we now use the `embs` sizes. Then `self.layers` is identical to the mini neural net we created in <> for MNIST. Then, in `forward`, we apply the embeddings, concatenate the results, and pass it through the mini neural net. Finally, we apply `sigmoid_range` as we have in previous models.\n", + "`CollabNN` creates our `Embedding` layers in the same way as previous classes in this chapter, except that we now use the `embs` sizes. `self.layers` is identical to the mini-neural net we created in <> for MNIST. Then, in `forward`, we apply the embeddings, concatenate the results, and pass this through the mini-neural net. Finally, we apply `sigmoid_range` as we have in previous models.\n", "\n", "Let's see if it trains:" ] @@ -2141,7 +2122,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Fastai provides this model in fastai.collab, if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), plus lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:" + "Fastai provides this model in `fastai.collab` if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), and it lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:" ] }, { @@ -2231,25 +2212,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Wow that's not a lot of code! This class *inherits* from `TabularModel`, which is where it gets all its functionality from. 
In `__init__` is calls the same method in `TabularModel`, passing `n_cont=0` and `out_sz=1`; other than that, it only passes along whatever arguments it received." + "Wow, that's not a lot of code! This class *inherits* from `TabularModel`, which is where it gets all its functionality from. In `__init__` it calls the same method in `TabularModel`, passing `n_cont=0` and `out_sz=1`; other than that, it only passes along whatever arguments it received." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Sidebar: Kwargs and Delegates" + "### Sidebar: kwargs and Delegates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "`EmbeddingNN` includes `**kwargs` as a parameter to `__init__`. In python `**kwargs` in a parameter like means \"put any additional keyword arguments into a dict called `kwarg`. And `**kwargs` in an argument list means \"insert all key/value pairs in the `kwargs` dict as named arguments here\". This approach is used in many popular libraries, such as `matplotlib`, in which the main `plot` function simply has the signature `plot(*args, **kwargs)`. The [plot documentation](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) says \"*The `kwargs` are Line2D properties*\" and then lists those properties.\n", + "`EmbeddingNN` includes `**kwargs` as a parameter to `__init__`. In Python `**kwargs` in a parameter list means \"put any additional keyword arguments into a dict called `kwargs`. And `**kwargs` in an argument list means \"insert all key/value pairs in the `kwargs` dict as named arguments here\". This approach is used in many popular libraries, such as `matplotlib`, in which the main `plot` function simply has the signature `plot(*args, **kwargs)`. The [`plot` documentation](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) says \"The `kwargs` are `Line2D` properties\" and then lists those properties.\n", "\n", - "We're using `**kwargs` in `EmbeddingNN` to avoid having to write all the arguments to `TabularModel` a second time, and keep them in sync. However, this makes our API quite difficult to work with, because now Jupyter Notebook doesn't know what parameters are available, so things like tab-completion of parameter names and popup lists of signatures won't work.\n", + "We're using `**kwargs` in `EmbeddingNN` to avoid having to write all the arguments to `TabularModel` a second time, and keep them in sync. However, this makes our API quite difficult to work with, because now Jupyter Notebook doesn't know what parameters are available. Consequently things like tab completion of parameter names and pop-up lists of signatures won't work.\n", "\n", - "Fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature" + "Fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature." ] }, { @@ -2263,7 +2244,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Although the results of `EmbeddingNN` are a bit worse than the dot product approach (which shows the power of carefully using an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, time, and other information that may be relevant to the recommendation. 
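To make the sidebar's point concrete, here is a tiny, self-contained illustration of the `**kwargs`/`@delegates` pattern; the plotting functions are invented for demonstration, and `delegates` is imported from fastcore, the library fastai builds on:

```python
from fastcore.meta import delegates

def base_plot(x, y, color='blue', linewidth=1, alpha=1.0):
    "Hypothetical plotting function with several keyword arguments."
    print(f"{len(x)} points: color={color}, linewidth={linewidth}, alpha={alpha}")

@delegates(base_plot)
def fancy_plot(x, y, title='', **kwargs):
    "Adds a title; everything else is passed straight through via `kwargs`."
    print(f"title: {title}")
    base_plot(x, y, **kwargs)

fancy_plot([1, 2, 3], [4, 5, 6], title='demo', color='red', alpha=0.5)
# Thanks to @delegates, the signature of fancy_plot now lists color,
# linewidth, and alpha explicitly, so tab completion and pop-up help work.
```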
That's exactly what `TabularModel` does. In fact, we've now seen that `EmbeddingNN` is just a `TabularModel`, with `n_cont=0` and `out_sz=1`. So we better spend some time learning about `TabularModel`, and how to use it to get great results!" + "Although the results of `EmbeddingNN` are a bit worse than the dot product approach (which shows the power of carefully constructing an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, date and time information, or any other information that may be relevant to the recommendation. That's exactly what `TabularModel` does. In fact, we've now seen that `EmbeddingNN` is just a `TabularModel`, with `n_cont=0` and `out_sz=1`. So, we'd better spend some time learning about `TabularModel`, and how to use it to get great results! We'll do that in the next chapter." ] }, { @@ -2277,7 +2258,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For our first non computer vision application, we looked at recommendation systems and saw how gradient descent can learn intrinsic factors or bias about items from a history of ratings. Those can then give us information about the data. \n", + "For our first non-computer vision application, we looked at recommendation systems and saw how gradient descent can learn intrinsic factors or biases about items from a history of ratings. Those can then give us information about the data. \n", "\n", "We also built our first model in PyTorch. We will do a lot more of this in the next section of the book, but first, let's finish our dive into the other general applications of deep learning, continuing with tabular data." ] @@ -2297,33 +2278,33 @@ "1. How does it solve it?\n", "1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?\n", "1. What does a crosstab representation of collaborative filtering data look like?\n", - "1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)\n", + "1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).\n", "1. What is a latent factor? Why is it \"latent\"?\n", - "1. What is a dot product? Calculate a dot product manually using pure python with lists.\n", + "1. What is a dot product? Calculate a dot product manually using pure Python with lists.\n", "1. What does `pandas.DataFrame.merge` do?\n", "1. What is an embedding matrix?\n", - "1. What is the relationship between an embedding and a matrix of one-hot encoded vectors?\n", - "1. Why do we need `Embedding` if we could use one-hot encoded vectors for the same thing?\n", - "1. What does an embedding contain before we start training (assuming we're not using a prertained model)?\n", + "1. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?\n", + "1. Why do we need `Embedding` if we could use one-hot-encoded vectors for the same thing?\n", + "1. What does an embedding contain before we start training (assuming we're not using a pretained model)?\n", "1. Create a class (without peeking, if possible!) and use it.\n", "1. What does `x[:,0]` return?\n", - "1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it\n", + "1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it.\n", "1. What is a good loss function to use for MovieLens? Why? \n", - "1. 
What would happen if we used `CrossEntropy` loss with MovieLens? How would we need to change the model?\n", + "1. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?\n", "1. What is the use of bias in a dot product model?\n", "1. What is another name for weight decay?\n", - "1. Write the equation for weight decay (without peeking!)\n", + "1. Write the equation for weight decay (without peeking!).\n", "1. Write the equation for the gradient of weight decay. Why does it help reduce weights?\n", "1. Why does reducing weights lead to better generalization?\n", "1. What does `argsort` do in PyTorch?\n", - "1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?\n", + "1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?\n", "1. How do you print the names and details of the layers in a model?\n", "1. What is the \"bootstrapping problem\" in collaborative filtering?\n", "1. How could you deal with the bootstrapping problem for new users? For new movies?\n", "1. How can feedback loops impact collaborative filtering systems?\n", - "1. When using a neural network in collaborative filtering, why can we have different number of factors for movie and user?\n", - "1. Why is there a `nn.Sequential` in the `CollabNN` model?\n", - "1. What kind of model should be use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model?" + "1. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?\n", + "1. Why is there an `nn.Sequential` in the `CollabNN` model?\n", + "1. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?" ] }, { @@ -2332,10 +2313,10 @@ "source": [ "### Further Research\n", "\n", - "1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change, to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n", - "1. Find three other areas where collaborative filtering is being used, and find out what pros and cons of this approach in those areas.\n", - "1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book website and forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas)\n", - "1. Create a model for MovieLens with works with CrossEntropy loss, and compare it to the model in this chapter." + "1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n", + "1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.\n", + "1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. 
Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas).\n", + "1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter." ] }, { diff --git a/09_tabular.ipynb b/09_tabular.ipynb index 994aea5..8c84969 100644 --- a/09_tabular.ipynb +++ b/09_tabular.ipynb @@ -48,9 +48,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Tabular modelling takes data in the form of a table (like a spreadsheet or CSV--comma separated values). The objective is to predict the value in one column, based on the values in the other columns. In this chapter we will not only look at deep learning but also more general machine learning techniques like random forests, as they can give better results depending on your problem.\n", + "Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value in one column based on the values in the other columns. In this chapter we will not only look at deep learning but also more general machine learning techniques like random forests, as they can give better results depending on your problem.\n", "\n", - "We will look at how we should preprocess and clean the data, how to interpret the result of our models after training, but first, we will see how we can feed columns that contain categories into a model that espects numbers by using embeddings." + "We will look at how we should preprocess and clean the data as well as how to interpret the result of our models after training, but first, we will see how we can feed columns that contain categories into a model that expects numbers by using embeddings." ] }, { @@ -64,30 +64,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In tabular data some columns contain numerical data, like \"age\", others contain string values, like \"sex\". The numerical data can be directly fed to the model (with some optional preprocessing), but other columns need to be converted to numbers. Since the values in those correspond to different categories, we often call these type of variables *categorical variables*. The first type are called *continuous variables*." + "In tabular data some columns may contain numerical data, like \"age,\" while others contain string values, like \"sex.\" The numerical data can be directly fed to the model (with some optional preprocessing), but the other columns need to be converted to numbers. Since the values in those correspond to different categories, we often call this type of variables *categorical variables*. The first type are called *continuous variables*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> jargon: Continuous and categorical variables: \"Continuous variables\" are numerical data, such as \"age\" can be directly fed to the model, since you can add and multiply them directly. \"Categorical variables\" contain a number of discrete levels, such as \"movie id\", for which addition and multiplication don't have meaning (even if they're stored as numbers)." + "> jargon: Continuous and Categorical Variables: Continuous variables are numerical data, such as \"age,\" that can be directly fed to the model, since you can add and multiply them directly. Categorical variables contain a number of discrete levels, such as \"movie ID,\" for which addition and multiplication don't have meaning (even if they're stored as numbers)." 
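As a small, hypothetical illustration of the distinction (the columns here are invented for the example), a continuous column can be used as-is, while a categorical column is first mapped to integer codes:

```python
import pandas as pd

df = pd.DataFrame({'age':   [25, 32, 47],                    # continuous
                   'movie': ['Alien', 'Brazil', 'Alien']})   # categorical

# Continuous columns can be fed to a model directly (perhaps after normalization).
# Categorical columns are first mapped to discrete integer codes:
df['movie'] = df['movie'].astype('category')
print(df['movie'].cat.codes.tolist())                 # [0, 1, 0]
print(dict(enumerate(df['movie'].cat.categories)))    # {0: 'Alien', 1: 'Brazil'}
```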
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "At the end of 2015, the [Rossmann sales competition](https://www.kaggle.com/c/rossmann-store-sales) ran on Kaggle. Competitors were given a wide range of information about various stores in Germany, and were tasked with trying to predict sales on a number of days. The goal was to help them to manage stock properly and to be able to properly satisfy the demand without holding unnecessary inventory. The official training set provided a lot of information about the stores. It was also permitted for competitors to use additional data, as long as that data was made public and available to all participants.\n", + "At the end of 2015, the [Rossmann sales competition](https://www.kaggle.com/c/rossmann-store-sales) ran on Kaggle. Competitors were given a wide range of information about various stores in Germany, and were tasked with trying to predict sales on a number of days. The goal was to help the company to manage stock properly and be able to satisfy demand without holding unnecessary inventory. The official training set provided a lot of information about the stores. It was also permitted for competitors to use additional data, as long as that data was made public and available to all participants.\n", "\n", - "One of the gold medalists used deep learning, in one of the earliest known examples of a state of the art deep learning tabular model. Their method involved far less feature engineering, based on domain knowledge, than the other gold medalists. They wrote a paper, [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737), about their approach. In an online-only chapter on the book website we show how to replicate their approach from scratch and attain the same accuracy shown in the paper. In the abstract of the paper they say:" + "One of the gold medalists used deep learning, in one of the earliest known examples of a state-of-the-art deep learning tabular model. Their method involved far less feature engineering, based on domain knowledge, than those of the other gold medalists. The paper, [\"Entity Embeddings of Categorical Variables\"](https://arxiv.org/abs/1604.06737), describes their approach. In an online-only chapter on the [book's website](https://book.fast.ai/) we show how to replicate it from scratch and attain the same accuracy shown in the paper. In the abstract of the paper the authors (Cheng Guo and Felix Berkhahn) say:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> : Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables... it is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit... As entity embedding defines a distance measure for categorical variables it can be used for visualizing categorical data and for data clustering" + "> : Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables... [It] is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit... 
As entity embedding defines a distance measure for categorical variables it can be used for visualizing categorical data and for data clustering." ] }, { @@ -96,14 +96,14 @@ "source": [ "We have already noticed all of these points when we built our collaborative filtering model. We can clearly see that these insights go far beyond just collaborative filtering, however.\n", "\n", - "The paper also points out that (as we discussed in the last chapter) that an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot encoded input layer. They used the diagram in <> to show this equivalence. Note that \"dense layer\" is another term with the same meaning as \"linear layer\", the one-hot encoding layers represent inputs." + "The paper also points out that (as we discussed in the last chapter) an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot-encoded input layer. The authors used the diagram in <> to show this equivalence. Note that \"dense layer\" is a term with the same meaning as \"linear layer,\" and the one-hot encoding layers represent inputs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\"Entity" + "\"Entity" ] }, { @@ -112,59 +112,59 @@ "source": [ "The insight is important because we already know how to train linear layers, so this shows that from the point of view of the architecture and our training algorithm the embedding layer is just another layer. We also saw this in practice in the last chapter, when we built a collaborative filtering neural network that looks exactly like this diagram.\n", "\n", - "Where we analyzed the embedding weights for movie reviews, the authors of the entity embeddings paper analyzed the embedding weights for their sales prediction model. What they found was quite amazing, and illustrates their second key insight. This is that the embedding makes the categorical variables into something which is both continuous and also meaningful.\n", + "Where we analyzed the embedding weights for movie reviews, the authors of the entity embeddings paper analyzed the embedding weights for their sales prediction model. What they found was quite amazing, and illustrates their second key insight. This is that the embedding transforms the categorical variables into inputs that are both continuous and meaningful.\n", "\n", - "The images in <> below illustrate these ideas. They are based on the approaches used in the paper, along with some analysis we have added." + "The images in <> illustrate these ideas. They are based on the approaches used in the paper, along with some analysis we have added." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\"State" + "\"State" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "On the left in the image below is a plot of the embedding matrix for the possible values of the `State` category. For a categorical variable we call the possible values of the variable its \"levels\" (or \"categories\" or \"classes\"), so here one level is \"Berlin,\" another is \"Hamburg,\" etc.. On the right is a map of Germany. The actual physical locations of the German states were not part of the provided data; yet, the model itself learned where they must be, based only on the behavior of store sales!\n", + "On the left is a plot of the embedding matrix for the possible values of the `State` category. 
For a categorical variable we call the possible values of the variable its \"levels\" (or \"categories\" or \"classes\"), so here one level is \"Berlin,\" another is \"Hamburg,\" etc. On the right is a map of Germany. The actual physical locations of the German states were not part of the provided data, yet the model itself learned where they must be, based only on the behavior of store sales!\n", "\n", - "Do you remember how we talked about *distance* between embeddings? The authors of the paper plotted the distance between embeddings between stores against the actual geographic distance between the stores in practice (see <>). They found that they matched very closely!" + "Do you remember how we talked about *distance* between embeddings? The authors of the paper plotted the distance between store embeddings against the actual geographic distance between the stores (see <>). They found that they matched very closely!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\"Store" + "\"Store" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We've even tried plotted the embeddings for days of the week and months of the year, and found that days and months that are near each other on the calendar ended up close as embeddings too, as shown in <>." + "We've even tried plotting the embeddings for days of the week and months of the year, and found that days and months that are near each other on the calendar ended up close as embeddings too, as shown in <>." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\"Date" + "\"Date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "What stands out in these two examples is that we provide the model fundamentally categorical data about discrete entities (German states or days of the week), and then the model learns an embedding for these entities which defines a continuous notion of distance between them. Because the embedding distance was learned based on real patterns in the data, that distance tends to match up with our intuitions.\n", + "What stands out in these two examples is that we provide the model fundamentally categorical data about discrete entities (e.g., German states or days of the week), and then the model learns an embedding for these entities that defines a continuous notion of distance between them. Because the embedding distance was learned based on real patterns in the data, that distance tends to match up with our intuitions.\n", "\n", - "In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights and continuous activation values, which are updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n", + "In addition, it is valuable in its own right that embeddings are continuous, because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights and continuous activation values, which are updated via gradient descent (a learning algorithm for finding the minimums of continuous functions).\n", "\n", - "It is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. 
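Both points--that an embedding lookup is just a matrix product with a one-hot vector, and that its activations can simply be concatenated with continuous inputs--can be checked in a few lines of PyTorch (the sizes below are arbitrary, for illustration only):

```python
import torch
import torch.nn.functional as F

n_levels, n_factors = 7, 3                 # e.g. 7 category levels, 3 latent factors
emb = torch.nn.Embedding(n_levels, n_factors)

idx = torch.tensor([2])                    # one categorical value, stored as an index
one_hot = F.one_hot(idx, n_levels).float()

# The embedding lookup equals the one-hot vector times the embedding matrix:
assert torch.allclose(emb(idx), one_hot @ emb.weight)

# And its activations can be concatenated with truly continuous inputs:
cont = torch.tensor([[0.5, -1.2]])         # two continuous columns for the same row
x = torch.cat([emb(idx), cont], dim=1)     # shape (1, n_factors + 2), ready for a dense layer
print(x.shape)
```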
In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n", + "Another benefit is that we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer before it interacts with the raw continuous input data. This is how fastai and Guo and Berkham handle tabular models containing continuous and categorical variables.\n", "\n", - "An example using this concatenation approach is how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in this figure from their paper:" + "An example using this concatenation approach is how Google does it recommendations on Google Play, as explained in the paper [\"Wide & Deep Learning for Recommender Systems\"](https://arxiv.org/abs/1606.07792). <> illustrates." ] }, { @@ -178,9 +178,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Interestingly, Google are actually combining both the two approaches we saw in the previous chapter: the *dot product* (which Google call *Cross Product*) and neural network approach.\n", + "Interestingly, the Google team actually combined both approaches we saw in the previous chapter: the dot product (which they call *cross product*) and neural network approaches.\n", "\n", - "But let's pause for a moment. So far, the solution to all of our modelling problems has been: *train a deep learning model*. And indeed, that is a pretty good rule of thumb for complex unstructured data like images, sounds, natural language text, and so forth. Deep learning also works very well for collaborative filtering. But it is not always the best starting point for analysing tabular data." + "Let's pause for a moment. So far, the solution to all of our modeling problems has been: *train a deep learning model*. And indeed, that is a pretty good rule of thumb for complex unstructured data like images, sounds, natural language text, and so forth. Deep learning also works very well for collaborative filtering. But it is not always the best starting point for analyzing tabular data." ] }, { @@ -198,39 +198,39 @@ "\n", "The good news is that modern machine learning can be distilled down to a couple of key techniques that are widely applicable. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:\n", "\n", - "1. Ensembles of decision trees (i.e. Random Forests and Gradient Boosting Machines), mainly for structured data (such as you might find in a database table at most companies)\n", - "1. Multi-layered neural networks learnt with SGD (i.e. shallow and/or deep learning), mainly for unstructured data (such as audio, vision, and natural language)" + "1. Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data (such as you might find in a database table at most companies)\n", + "1. 
Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Although deep learning is nearly always clearly superior for unstructured data, these two approaches tend to give quite similar results for many kinds of structured data. But ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning. They have been popular for quite a lot longer than deep learning, so there is a more mature ecosystem for tooling and documentation around them.\n", + "Although deep learning is nearly always clearly superior for unstructured data, these two approaches tend to give quite similar results for many kinds of structured data. But ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning. They have also been popular for quite a lot longer than deep learning, so there is a more mature ecosystem of tooling and documentation around them.\n", "\n", - "Most importantly, the critical step of interpreting a model of tabular data is significantly easier for decision tree ensembles. There are tools and methods for answering the pertinent questions. For instance, which columns in the dataset were the most important for your predictions? How are they related to the dependent variable? How do they interact with each other? And which particular features were most important for some particular observation?\n", + "Most importantly, the critical step of interpreting a model of tabular data is significantly easier for decision tree ensembles. There are tools and methods for answering the pertinent questions, like: Which columns in the dataset were the most important for your predictions? How are they related to the dependent variable? How do they interact with each other? And which particular features were most important for some particular observation?\n", "\n", - "Therefore, ensembles of decision trees are our first approach for analysing a new tabular dataset.\n", + "Therefore, ensembles of decision trees are our first approach for analyzing a new tabular dataset.\n", "\n", "The exception to this guideline is when the dataset meets one of these conditions:\n", "\n", - "- There are some high cardinality categorical variables that are very important (\"cardinality\" refers to the number of discrete levels representing categories, so a high cardinality categorical variable is something like a ZIP Code, which can take on thousands of possible levels)\n", - "- There are some columns which contain data which would be best understood with a neural network, such as plaintext data.\n", + "- There are some high-cardinality categorical variables that are very important (\"cardinality\" refers to the number of discrete levels representing categories, so a high-cardinality categorical variable is something like a zip code, which can take on thousands of possible levels).\n", + "- There are some columns that contain data that would be best understood with a neural network, such as plain text data.\n", "\n", - "In practice, when we deal with datasets which meet these exceptional conditions, we would always try both decision tree ensembles and deep learning to see which works best. 
It is likely that deep learning would be a useful approach in our example of collaborative filtering, as you have at least two high cardinality categorical variables: the users and the movies. But in practice things tend to be less cut and dried, and there will often be a mixture of high and low cardinality categorical variables and continuous variables.\n", + "In practice, when we deal with datasets that meet these exceptional conditions, we always try both decision tree ensembles and deep learning to see which works best. It is likely that deep learning will be a useful approach in our example of collaborative filtering, as we have at least two high-cardinality categorical variables: the users and the movies. But in practice things tend to be less cut-and-dried, and there will often be a mixture of high- and low-cardinality categorical variables and continuous variables.\n", "\n", - "Either way, it's clear that we are going to need to add decision tree ensembles to our modelling toolbox!" + "Either way, it's clear that we are going to need to add decision tree ensembles to our modeling toolbox!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Up to now we've used PyTorch and fastai for pretty much all of our heavy lifting. But these libraries are mainly designed for algorithms that do lots of matrix multiplication and derivatives (that is, stuff like deep learning!) Decision trees don't depend on these operations at all, so PyTorch isn't much use.\n", + "Up to now we've used PyTorch and fastai for pretty much all of our heavy lifting. But these libraries are mainly designed for algorithms that do lots of matrix multiplication and derivatives (that is, stuff like deep learning!). Decision trees don't depend on these operations at all, so PyTorch isn't much use.\n", "\n", - "Instead, we will be largely relying on a library called scikit-learn (also known as *sklearn*). Scikit-learn is a popular library for creating machine learning models, using approaches that are not covered by deep learning. In addition, we'll need to do some tabular data processing and querying, so we'll want to use the Pandas library. Finally, we'll also need numpy, since that's the main numeric programming library that both sklearn and Pandas rely on.\n", + "Instead, we will be largely relying on a library called scikit-learn (also known as `sklearn`). Scikit-learn is a popular library for creating machine learning models, using approaches that are not covered by deep learning. In addition, we'll need to do some tabular data processing and querying, so we'll want to use the Pandas library. Finally, we'll also need NumPy, since that's the main numeric programming library that both sklearn and Pandas rely on.\n", "\n", - "We don't have time to do a deep dive on all these libraries in this book, so we'll just be touching on some of the main parts of each. For a far more in depth discussion, we strongly suggest Wes McKinney's [Python for Data Analysis, 2nd ed](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=asap_bc?ie=UTF8). Wes is the creator of Pandas, so you can be sure that the information is accurate!\n", + "We don't have time to do a deep dive into all these libraries in this book, so we'll just be touching on some of the main parts of each. For a far more in depth discussion, we strongly suggest Wes McKinney's [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) (O'Reilly). 
Wes is the creator of Pandas, so you can be sure that the information is accurate!\n", "\n", "First, let's gather the data we will use." ] @@ -246,9 +246,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For our dataset, we will be looking at the Blue Book for Bulldozers Kaggle Competition: \"The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.\"\n", + "The dataser we use in this chapter is from the Blue Book for Bulldozers Kaggle competition, which has the following description: \"The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.\"\n", "\n", - "This is a very common type of dataset and prediction problem, and similar to what you may see in your project or workplace. It's available for download on Kaggle, a website that hosts data science competitions." + "This is a very common type of dataset and prediction problem, similar to what you may see in your project or workplace. The dataset is available for download on Kaggle, a website that hosts data science competitions." ] }, { @@ -266,18 +266,18 @@ "\n", "Kaggle provides:\n", "\n", - "1. Interesting datasets\n", - "2. Feedback on how you're doing\n", - "3. A leader board to see what's good, what's possible, and what's state-of-art.\n", - "4. Blog posts by winning contestants sharing useful tips and techniques.\n", + "- Interesting datasets\n", + "- Feedback on how you're doing\n", + "- A leaderboard to see what's good, what's possible, and what's state-of-the-art\n", + "- Blog posts by winning contestants sharing useful tips and techniques\n", "\n", - "Until now all our datasets have been available to download through fastai's integrated dataset system. However, the dataset we will be using in this chapter is only available from Kaggle. Therefore, you will need to sign up to Kaggle, then you need to go to the [page for the competition](https://www.kaggle.com/c/bluebook-for-bulldozers). On that page click on \"rules\", and then \"I understand and accept\". (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data).\n", + "Until now all our datasets have been available to download through fastai's integrated dataset system. However, the dataset we will be using in this chapter is only available from Kaggle. Therefore, you will need to register on the site, then go to the [page for the competition](https://www.kaggle.com/c/bluebook-for-bulldozers). On that page click \"Rules,\" then \"I Understand and Accept.\" (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data.)\n", "\n", - "The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using pip by running this in a notebook cell:\n", + "The easiest way to download Kaggle datasets is to use the Kaggle API. 
You can install this using `pip` by running this in a notebook cell:\n", "\n", " !pip install kaggle\n", "\n", - "You need an API key to use the Kaggle API; to get one, go to \"my account\" on the Kaggle website, and click \"create new API token\". This will save a file called `kaggle.json` to your PC. We need to create this on your GPU server. To do so, open the file you downloaded, copy the contents, and paste them inside `''` below, e.g.: `creds = '{\"username\":\"xxx\",\"key\":\"xxx\"}'`:" + "You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called *kaggle.json* to your PC. You need to copy this key on your GPU server. To do so, open the file you downloaded, copy the contents, and paste them in the following cell in the notebook associated with this chapter (e.g., `creds = '{\"username\":\"xxx\",\"key\":\"xxx\"}'`):" ] }, { @@ -293,7 +293,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "...then execute this cell (only needs to be run once):" + "Then execute this cell (this only needs to be run once):" ] }, { @@ -313,7 +313,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now you can download datasets from Kaggle! We'll pick a path to download the dataset to:" + "Now you can download datasets from Kaggle! Pick a path to download the dataset to:" ] }, { @@ -351,7 +351,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "...and use the Kaggle API to download the dataset to that path, and extract it:" + "And use the Kaggle API to download the dataset to that path, and extract it:" ] }, { @@ -383,7 +383,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now that we have downloaded our dataset, let's have a look at it!" + "Now that we have downloaded our dataset, let's take a look at it!" ] }, { @@ -397,14 +397,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Kaggle provides information about some of the fields of our dataset; on the [Kaggle Data info](https://www.kaggle.com/c/bluebook-for-bulldozers/data) page they say that the key fields in `train.csv` are:\n", + "Kaggle provides information about some of the fields of our dataset. The [Data](https://www.kaggle.com/c/bluebook-for-bulldozers/data) explains that the key fields in *train.csv* are:\n", "\n", - "- SalesID: the unique identifier of the sale\n", - "- MachineID: the unique identifier of a machine. A machine can be sold multiple times\n", - "- saleprice: what the machine sold for at auction (only provided in train.csv)\n", - "- saledate: the date of the sale\n", + "- `SalesID`:: The unique identifier of the sale.\n", + "- `MachineID`:: The unique identifier of a machine. A machine can be sold multiple times.\n", + "- `saleprice`:: What the machine sold for at auction (only provided in *train.csv*).\n", + "- `saledate`:: The date of the sale.\n", "\n", - "In any sort of data science work, it's **important to look at your data directly** to make sure you understand the format, how it's stored, what type of values it holds, etc.. Even if you've read descriptions about your data, the actual data may not be what you expect. We'll start by reading the training set into a Pandas DataFrame; note that we have to tell Pandas which columns contain dates. Generally it's a good idea to also specify `low_memory=False` unless Pandas actually runs out of memory and returns an error. 
The `low_memory` parameter, which is `True` by default, tells Pandas to only look at a few rows of data at a time to figure out what type of data is in each column. This means that Pandas can actually end up using a different data type for different rows, which generally leads to data processing errors or model training problems later." + "In any sort of data science work, it's important to *look at your data directly* to make sure you understand the format, how it's stored, what types of values it holds, etc. Even if you've read a description of the data, the actual data may not be what you expect. We'll start by reading the training set into a Pandas DataFrame. Generally it's a good idea to specify `low_memory=False` unless Pandas actually runs out of memory and returns an error. The `low_memory` parameter, which is `True` by default, tells Pandas to only look at a few rows of data at a time to figure out what type of data is in each column. This means that Pandas can actually end up using a different data type for different rows, which generally leads to data processing errors or model training problems later.\n", "\n", "Let's load our data and have a look at the columns:" ] }, { @@ -508,9 +510,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The most important data column is the dependent variable, that is the one we want to predict. Recall that a models metric is a function that reflects how good the predictions are. It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metric, actually measures the notion of model quality which matters to you. If no variable represents that metric, you should see if you can build the metric from the variables which are available.\n", + "The most important data column is the dependent variable--that is, the one we want to predict. Recall that a model's metric is a function that reflects how good the predictions are. It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you. If no variable represents that metric, you should see if you can build the metric from the variables that are available.\n", "\n", - "However, in this case Kaggle tells us what metric to use: RMSLE (root mean squared log error) between the actual and predicted auction prices. Here we need do only a small amount of processing to use this: we take the log of the prices, so that m_rmse of that value will give us what we ultimately need." + "However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need to do only a small amount of processing to use this: we take the log of the prices, so that `rmse` of that value will give us what we ultimately need:" ] }, { @@ -535,7 +537,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We are now ready to have a look at our first machine learning algorithm for tabular data: decision trees."
+ "We are now ready to explore our first machine learning algorithm for tabular data: decision trees." ] }, { @@ -549,7 +551,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data. After each question the data at that part of the tree is split between a \"yes\" and a \"no\" branch as shown in <>. After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required." + "Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data. After each question the data at that part of the tree is split between a \"yes\" and a \"no\" branch, as shown in <>. After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required." ] }, { @@ -563,28 +565,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the group of all the other training data items which yielded the same set of answers to the questions. But what good is this? the goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value of this is that we can now assign a prediction value for each of these groups--for regression, we take the target mean of the items in the group.\n", + "This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the same group as all the other training data items that yielded the same set of answers to the questions. But what good is this? The goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value is that we can now assign a prediction value for each of these groups--for regression, we take the target mean of the items in the group.\n", "\n", - "Let's consider how we find the right questions to ask. Of course, we wouldn't want to have to create all these questions ourselves — that's what computers are for! The basic steps to train a decision tree can be written down very easily:\n", + "Let's consider how we find the right questions to ask. Of course, we wouldn't want to have to create all these questions ourselves—that's what computers are for! The basic steps to train a decision tree can be written down very easily:\n", "\n", - "1. Loop through each column of the dataset in turn\n", - "1. For each column, loop through each possible level of that column in turn\n", - "1. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable)\n", - "1. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. 
That is, treat this as a very simple \"model\" where our predictions are simply the average sale price of the item's group\n", - "1. After looping through all of the columns and possible levels for each, pick the split point which gave the best predictions using our very simple model\n", - "1. We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each, by going back to step one for each group\n", - "1. Continue this process recursively, and until you have reached some stopping criterion for each group — for instance, stop splitting a group further when it has only 20 items in it.\n", + "1. Loop through each column of the dataset in turn.\n", + "1. For each column, loop through each possible level of that column in turn.\n", + "1. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).\n", + "1. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple \"model\" where our predictions are simply the average sale price of the item's group.\n", + "1. After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.\n", + "1. We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each by going back to step 1 for each group.\n", + "1. Continue this process recursively, until you have reached some stopping criterion for each group—for instance, stop splitting a group further when it has only 20 items in it.\n", "\n", - "Although this is an easy enough algorithm to implement yourself (and it is a good exercise to do so) we can save some time by using the implementation built into sklearn.\n", + "Although this is an easy enough algorithm to implement yourself (and it is a good exercise to do so), we can save some time by using the implementation built into sklearn.\n", "\n", - "But even before using sklearn, we have to prepare our data somewhat before we can use it." + "First, however, we need to do a little data preparation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> A: Here's a productive question to ponder. If you consider that the procedure for defining decision tree essentially chooses one _sequence of splitting questions about variables_, you might ask yourself, how do we know this procedure chooses the _correct sequence_? The rule is to choose the splitting question which produces the best split, and then to apply the same rule to groups that split produces, and so on (this is known in computer science as a \"greedy\" approach). Can you imagine a scenario in which asking a “less powerful” splitting question would enable a better split down the road (or should I say down the trunk!) and lead to a better result overall?" + "> A: Here's a productive question to ponder. If you consider that the procedure for defining a decision tree essentially chooses one _sequence of splitting questions about variables_, you might ask yourself, how do we know this procedure chooses the _correct sequence_? 
The rule is to choose the splitting question that produces the best split (i.e., that most accurately separates the items into two distinct categories), and then to apply the same rule to the groups that split produces, and so on. This is known in computer science as a \"greedy\" approach. Can you imagine a scenario in which asking a “less powerful” splitting question would enable a better split down the road (or should I say down the trunk!) and lead to a better result overall?" ] }, { @@ -598,13 +600,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree which we just described is bisection -- dividing up a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshhold, and we look at the categorical variables and divided up the dataset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.\n", + "The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree that we just described is *bisection*--dividing a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshold, and we look at the categorical variables and divide up the dataset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.\n", "\n", - "How does this apply to a common data type, the date? You might want to treat a date as an ordinal value, because it is meaningful to say that one date is greater than another. However, dates are a bit different from most ordinal values in that some dates are qualitatively different from others in a way that that is often relevant to the systems we are modelling.\n", + "But how does this apply to a common data type, the date? You might want to treat a date as an ordinal value, because it is meaningful to say that one date is greater than another. However, dates are a bit different from most ordinal values in that some dates are qualitatively different from others in a way that is often relevant to the systems we are modeling.\n", "\n", - "So in order to help the above algorithm handle dates intelligently, we'd like our model to know more than whether a date is more recent or less recent. We might want our model to make decisions based on that date's day of week, on whether a day is a holiday, on what month it is in, and so forth. To do this, we replace every date column with a set of date metadata columns, such as holiday, day of week, and month. These columns provide categorical data that we suspect will be useful.\n", + "In order to help our algorithm handle dates intelligently, we'd like our model to know more than whether a date is more recent or less recent than another. We might want our model to make decisions based on that date's day of the week, on whether a day is a holiday, on what month it is in, and so forth. To do this, we replace every date column with a set of date metadata columns, such as holiday, day of week, and month. 
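In plain Pandas, that kind of expansion might look roughly like the following sketch (the column names are illustrative; fastai's `add_datepart` produces a richer set of such columns in a single call):

```python
import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-03', '2012-05-18'])})

# Replace the single date column with several categorical/ordinal columns
dt = df['saledate'].dt
df['saleYear']           = dt.year
df['saleMonth']          = dt.month
df['saleDayofweek']      = dt.dayofweek
df['saleIs_quarter_end'] = dt.is_quarter_end
print(df.drop(columns='saledate'))
```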
These columns provide categorical data that we suspect will be useful.\n", "\n", - "Fastai comes with a function that will do this for us — we just have to pass a column name which contains dates:" + "Fastai comes with a function that will do this for us—we just have to pass a column name that contains dates:" ] }, { @@ -678,12 +680,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A second piece of preparatory processing is to be sure we can handle strings and missing data. Out of the box, sklearn cannot do either. Instead we will use fastai's class `TabularPandas`, which wraps a Pandas data frame and provides a few conveniences. To populate a `TabularPandas`, we will use two `TabularProc`s, `Categorify` and `FillMissing`. A `TabularProc` is like a regular `Transform`, except that:\n", + "A second piece of preparatory processing is to be sure we can handle strings and missing data. Out of the box, sklearn cannot do either. Instead we will use fastai's class `TabularPandas`, which wraps a Pandas DataFrame and provides a few conveniences. To populate a `TabularPandas`, we will use two `TabularProc`s, `Categorify` and `FillMissing`. A `TabularProc` is like a regular `Transform`, except that:\n", "\n", - "- It returns the exact same object that's passed to it, after modifying the object *in-place*, and\n", + "- It returns the exact same object that's passed to it, after modifying the object in place.\n", "- It runs the transform once, when data is first passed in, rather than lazily as the data is accessed.\n", "\n", - "`Categorify` is a `TabularProc` which replaces a column with a numeric categorical column. `FillMissing` is a `TabularProc` which replaces missing values with the median of the column, and creates a new boolean column that is set to True for any row where the value was missing. These two transforms are needed for nearly every tabular dataset you will use, so it's a good starting point for your data processing." + "`Categorify` is a `TabularProc` that replaces a column with a numeric categorical column. `FillMissing` is a `TabularProc` that replaces missing values with the median of the column, and creates a new Boolean column that is set to `True` for any row where the value was missing. These two transforms are needed for nearly every tabular dataset you will use, so this is a good starting point for your data processing:" ] }, { @@ -699,19 +701,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`TabularPandas` will also handle splitting into training vs validation datasets for us. \n", + "`TabularPandas` will also handle splitting the dataset into training and validation sets for us. However we need to be very careful about our validation set. We want to design it so that it is like the *test set* Kaggle will use to judge the contest.\n", "\n", - "We need to be very careful about our validation set here. In particular we want to design it so that it is like the *test set* which Kaggle will use to judge the contest.\n", - "\n", - "Recall the distinction between a validation set and a test set, as discussed in <>. A validation set is data which we hold back from training in order to ensure that the training process does not overfit on the training data. A test set is data which is held back even more deeply, from us ourselves, in order to ensure that *we* don't overfit on the validation data, as we explore various model architectures and hyperparameters.\n", + "Recall the distinction between a validation set and a test set, as discussed in <>. 
A validation set is data we hold back from training in order to ensure that the training process does not overfit on the training data. A test set is data that is held back even more deeply, from us ourselves, in order to ensure that *we* don't overfit on the validation data, as we explore various model architectures and hyperparameters.\n", "\n", "We don't get to see the test set. But we do want to define our validation data so that it has the same sort of relationship to the training data as the test set will have.\n", "\n", "In some cases, just randomly choosing a subset of your data points will do that. This is not one of those cases, because it is a time series.\n", "\n", - "If you look at the date range represented in the test set, you will discover that it covers a six-month period from May 2012, which is later in time than any date in the training set. This is a good design, because the competition sponsor will want to ensure that a model is able to predict the future. But it means that if we are going to have a useful validation set, we also want the validation set to be later in time. The Kaggle training data ends in April 2012. So we will define a narrower training dataset which consists only of the Kaggle training data from before November 2011, and we define a validation set which is from after November 2011.\n", + "If you look at the date range represented in the test set, you will discover that it covers a six-month period from May 2012, which is later in time than any date in the training set. This is a good design, because the competition sponsor will want to ensure that a model is able to predict the future. But it means that if we are going to have a useful validation set, we also want the validation set to be later in time than the training set. The Kaggle training data ends in April 2012, so we will define a narrower training dataset which consists only of the Kaggle training data from before November 2011, and we'll define a validation set consisting of data from after November 2011.\n", "\n", - "To do this we use `np.where`, a useful function which returns (as the first element of a tuple) the indices of all `True` values:" + "To do this we use `np.where`, a useful function that returns (as the first element of a tuple) the indices of all `True` values:" ] }, { @@ -731,7 +731,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`TabularPandas` needs to be told which columns are continuous, and which are categorical. We can handle that automatically using the helper function `cont_cat_split`." + "`TabularPandas` needs to be told which columns are continuous and which are categorical. We can handle that automatically using the helper function `cont_cat_split`:" ] }, { @@ -756,7 +756,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A `TabularPandas` behaves a lot like a fastai `Datasets` object, including `train` and `valid` attributes." + "A `TabularPandas` behaves a lot like a fastai `Datasets` object, including providing `train` and `valid` attributes:" ] }, { @@ -783,7 +783,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that the data still is displayed as strings for categories (we only show a few columns because the fulltable is too big to fit on a page)..." 
+ "We can see that the data is still displayed as strings for categories (we only show a few columns here because the full table is too big to fit on a page):" ] }, { @@ -1159,7 +1159,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "...but the underlying items are all numeric:" + "However, the underlying items are all numeric:" ] }, { @@ -1344,7 +1344,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The conversion of categorical columns to numbers is done by simply replacing each unique level with a number. The numbers associated with the levels are chosen consecutively as they are seen in a column. So there's no particular meaning to the numbers in categorical columns after conversion. The exception is if you first convert a column to a pandas ordered category (as we did for `ProductSize` above), in which case the ordering you chose is used. We can see the mapping by looking at the `classes` attribute:" + "The conversion of categorical columns to numbers is done by simply replacing each unique level with a number. The numbers associated with the levels are chosen consecutively as they are seen in a column, so there's no particular meaning to the numbers in categorical columns after conversion. The exception is if you first convert a column to a Pandas ordered category (as we did for `ProductSize` earlier), in which case the ordering you chose is used. We can see the mapping by looking at the `classes` attribute:" ] }, { @@ -1371,7 +1371,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Since it takes a minute or so to process the data to get to this point, we should save it - that way in the future we can continue our work from here without rerunning the previous steps. fastai provides a `save` method that uses Python's *pickle* system to save nearly any Python object." + "Since it takes a minute or so to process the data to get to this point, we should save it--that way in the future we can continue our work from here without rerunning the previous steps. fastai provides a `save` method that uses Python's *pickle* system to save nearly any Python object:" ] }, { @@ -1412,7 +1412,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can read in our data (only needed if you're coming back to this notebook after a break, and don't want to recreate the `TabularPandas` object), and define our independent and dependent variables." + "To begin, we define our independent and dependent variables:" ] }, { @@ -1421,6 +1421,7 @@ "metadata": {}, "outputs": [], "source": [ + "#hide\n", "to = (path/'to.pkl').load()" ] }, @@ -1455,7 +1456,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To see what it's learned, we can display the tree. To keep it simple, we've told sklearn to just create four *leaf nodes*." + "To keep it simple, we've told sklearn to just create four *leaf nodes*. To see what it's learned, we can display the tree:" ] }, { @@ -1594,15 +1595,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Understanding this picture is one of the best ways to understand decision trees. So we will start at the top, and explain each part step-by-step.\n", + "Understanding this picture is one of the best ways to understand decision trees, so we will start at the top and explain each part step by step.\n", "\n", - "The top node represents the *initial model* before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions. 
It always predict the value to be the average value of the whole dataset. In this case, we can see it predicts a value of 10.10 for the logarithm of the sales price. It gives a mean squared error of 0.48. The square root of this is 0.69. Remember that unless you see *m_rmse* or a *root mean squared error* then the value you are looking at is before taking the square root, so it is just the average of the square of the differences. We can also see that there are 404,710 auction records in this group — that is the total size of our training set. The final piece of information shown here is the decision criterion for the very first split that was found, which is to split based on the `coupler_system` column.\n", + "The top node represents the *initial model* before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions and will always predict the value to be the average value of the whole dataset. In this case, we can see it predicts a value of 10.10 for the logarithm of the sales price. It gives a mean squared error of 0.48. The square root of this is 0.69. (Remember that unless you see `m_rmse`, or a *root mean squared error*, then the value you are looking at is before taking the square root, so it is just the average of the square of the differences.) We can also see that there are 404,710 auction records in this group—that is the total size of our training set. The final piece of information shown here is the decision criterion for the best split that was found, which is to split based on the `coupler_system` column.\n", "\n", - "Moving down and to the left, this node shows us that there were 360,847 auction records for equipment where `coupler_system` was less than 0.5. The average value of our dependent variable in this group is 10.21. But moving down and to the right from the initial model would take us to the records where `coupler_system` was greater than 0.5.\n", + "Moving down and to the left, this node shows us that there were 360,847 auction records for equipment where `coupler_system` was less than 0.5. The average value of our dependent variable in this group is 10.21. Moving down and to the right from the initial model takes us to the records where `coupler_system` was greater than 0.5.\n", "\n", - "The bottom row contains our *leaf nodes*, the nodes with no answers coming out of them, because there are no more questions to be answered. At the far right of this row is the node for `coupler_system` greater than 0.5, and we can see that the average value is 9.21. So we can see the decision tree algorithm did find a single binary decision which separated high value from low value auction results. Asking only about `coupler_system` predicts an average value of 9.21 vs 10.1. That's if we ask only one question.\n", + "The bottom row contains our *leaf nodes*: the nodes with no answers coming out of them, because there are no more questions to be answered. At the far right of this row is the node containing records where `coupler_system` was greater than 0.5. The average value here is 9.21, so we can see the decision tree algorithm did find a single binary decision that separated high-value from low-value auction results. 
Asking only about `coupler_system` predicts an average value of 9.21 versus 10.1.\n",
"\n",
- "Returning back to the top node after the first decision point, we can see that a second binary decision split has been made, based on asking whether `YearMade` is less than or equal to 1991.5. For the group where this is true (remember, this is now following two binary decisions, both `coupler_system`, and `YearMade`) the average value is 9.97, and there are 155,724 auction records in this group. For the group of auctions where this decision is false, the average value is 10.4, and there are 205,123 records. So again, we can see that the decision tree algorithm has successfully split our more expensive auction records into two more groups which differ in value significantly."
+ "Returning to the top node after the first decision point, we can see that a second binary decision split has been made, based on asking whether `YearMade` is less than or equal to 1991.5. For the group where this is true (remember, this is now following two binary decisions, based on `coupler_system` and `YearMade`), the average value is 9.97, and there are 155,724 auction records in this group. For the group of auctions where this decision is false, the average value is 10.4, and there are 205,123 records. So again, we can see that the decision tree algorithm has successfully split our more expensive auction records into two more groups which differ in value significantly."
] }, { @@ -4401,7 +4402,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "This shows a chart of the distribution of the data for each split point. We can clearly see that there's a problem with our `YearMade` data: there are bulldozers made in the year 1000, apparently! Presumably this is actually just a missing value code. In a decision tree, we can set any value that doesn't otherwise appear in the data as a missing value code. So for modelling purposes, '1000' is fine; but it makes visualization a bit hard to see, as shown above. So let's replace it with '1950':"
+ "This shows a chart of the distribution of the data for each split point. We can clearly see that there's a problem with our `YearMade` data: there are bulldozers made in the year 1000, apparently! Presumably this is actually just a missing value code (a value that doesn't otherwise appear in the data and that is used as a placeholder in cases where a value is missing). For modeling purposes, 1000 is fine, but as you can see this outlier makes it more difficult to visualize the values we are interested in. So, let's replace it with 1950:"
] }, { @@ -4418,7 +4419,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "After that change, the split is much clearer in the tree visualization, even although it doesn't actually change the result of the model in any significant way. This is a great example of how resilient decision trees are to data issues!"
+ "That change makes the split much clearer in the tree visualization, even though it doesn't actually change the result of the model in any significant way. This is a great example of how resilient decision trees are to data issues!"
] }, { @@ -7236,7 +7237,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "Let's now have the decision tree algorithm build a bigger tree (i.e we are not passing in any stopping criteria such as `max_leaf_nodes`):"
+ "Let's now have the decision tree algorithm build a bigger tree. 
Here, we are not passing in any stopping criteria such as `max_leaf_nodes`:" ] }, { @@ -7253,7 +7254,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We'll create a little function to check the root mean squared error (m_rmse) of our model, since that's how the competition was judged:" + "We'll create a little function to check the root mean squared error of our model (`m_rmse`), since that's how the competition was judged:" ] }, { @@ -7317,7 +7318,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Oops... it looks like we might be overfitting pretty badly. Here's why:" + "Oops--it looks like we might be overfitting pretty badly. Here's why:" ] }, { @@ -7344,14 +7345,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We've got nearly as many leaf nodes as data points! That seems a little... over-enthusiastic. Indeed, sklearn's default settings allow it to continue splitting nodes until there is only one item in a leaf node. Let's change the stopping rule to tell sklearn to ensure every leaf node has at least 25 auctions:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> A: Here's my intuition for an overfitting decision tree with more leaf nodes than data items: the childhood game of Twenty Questions. In that game, the chooser secretly imagines an object (like, \"our television set\"), and the guesser gets to pose twenty yes-or-no questions to try to guess the object (like \"is it bigger than a breadbox?\"). The guesser is not trying to predict a numerical value but just to identify a particular object out of the set of all imaginable objects. When your decision tree has more leafs than there are possible objects in your domain, then it is essentially a well-trained guesser. It has learned the sequence of questions needed to identify a particular data item in the training set, and it is \"predicting\" only by describing that item's value. This is a way of memorizing the training set, i.e., of overfitting." + "We've got nearly as many leaf nodes as data points! That seems a little over-enthusiastic. Indeed, sklearn's default settings allow it to continue splitting nodes until there is only one item in each leaf node. Let's change the stopping rule to tell sklearn to ensure every leaf node contains at least 25 auction records:" ] }, { @@ -7407,11 +7401,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Much more reasonable!\n", + "Much more reasonable!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> A: Here's my intuition for an overfitting decision tree with more leaf nodes than data items. Consider the game Twenty Questions. In that game, the chooser secretly imagines an object (like, \"our television set\"), and the guesser gets to pose 20 yes or no questions to try to guess what the object is (like \"Is it bigger than a breadbox?\"). The guesser is not trying to predict a numerical value, but just to identify a particular object out of the set of all imaginable objects. When your decision tree has more leaves than there are possible objects in your domain, then it is essentially a well-trained guesser. It has learned the sequence of questions needed to identify a particular data item in the training set, and it is \"predicting\" only by describing that item's value. This is a way of memorizing the training set--i.e., of overfitting." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Building a decision tree is a good way to create a model of our data. 
It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalizes (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).\n",
"\n",
- "Building a decision tree is a good way to create a model of our data. It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalises (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).\n",
- "\n",
- "But, how do we get the best of both worlds? We'll show you right after we handle an important missing detail: how to handle categorical variables."
+ "So how do we get the best of both worlds? We'll show you right after we handle an important missing detail: how to handle categorical variables."
] }, { @@ -7425,29 +7431,27 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "This is unlike the situation with deep learning networks, where we one-hot encoded the variables and then fed them to an embedding layer. There, the embedding layer helped to discover the meaning of these variable levels, since each level of a categorical variable does not have a meaning on its own (unless we manually specified an ordering using pandas). So how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?\n",
+ "In the previous chapter, when working with deep learning networks, we dealt with categorical variables by one-hot encoding them and feeding them to an embedding layer. The embedding layer helped the model to discover the meaning of the different levels of these variables (the levels of a categorical variable do not have an intrinsic meaning, unless we manually specify an ordering using Pandas). In a decision tree, we don't have embedding layers--so how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?\n",
"\n",
- "The short answer is: it just works! Think about a situation where there is one product code that is far more expensive at auction than any other one. In that case, any binary split will result in that one product code being in some group, and that group will be more expensive than the other group. Therefore, our simple decision tree building algorithm will choose that split. Later during training, the algorithm will be able to further split the subgroup which now contains the expensive product code. Over time, the tree will home in on that one expensive product.\n",
+ "The short answer is: it just works! Think about a situation where there is one product code that is far more expensive at auction than any other one. In that case, any binary split will result in that one product code being in some group, and that group will be more expensive than the other group. Therefore, our simple decision tree building algorithm will choose that split. 
Later during training the algorithm will be able to further split the subgroup that contains the expensive product code, and over time, the tree will home in on that one expensive product.\n",
"\n",
- "It is also possible to use one hot encoding to replace a single categorical variable with multiple one hot encoded columns, where column represents a possible level of the variable. Pandas has a `get_dummies` method which does just that.\n",
+ "It is also possible to use one-hot encoding to replace a single categorical variable with multiple one-hot-encoded columns, where each column represents a possible level of the variable. Pandas has a `get_dummies` method that does just that (there's a quick example at the end of this section).\n",
"\n",
- "However, there is not really any evidence that such an approach improves the end result. So, we generally avoid it where possible, because it does end up making your dataset harder to work with. In 2019 this issue was explored in the paper [Splitting on categorical predictors in random forests](https://peerj.com/articles/6339/), which said:"
+ "However, there is not really any evidence that such an approach improves the end result. So, we generally avoid it where possible, because it does end up making your dataset harder to work with. In 2019 this issue was explored in the paper [\"Splitting on Categorical Predictors in Random Forests\"](https://peerj.com/articles/6339/) by Marvin Wright and Inke König, which said:"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
- "> : \"The standard approach for nominal predictors is to consider all (2^(k − 1) − 1) 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories.\""
+ "> : The standard approach for nominal predictors is to consider all $2^{k-1} − 1$ 2-partitions of the *k* predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only *k* − 1 splits have to be considered for a nominal predictor with *k* categories."
] }, { "cell_type": "markdown", "metadata": {}, "source": [
- "Building a decision tree is a good way to create a model of our data. It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalises (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).\n",
"\n",
- "But, how do we get the best of both worlds? A solution is to use random forests."
+ "Now that you understand how decision trees work, it's time for the best-of-both-worlds solution: random forests.\n",
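+ "\n",
+ "Before we do, here is the quick example of `pd.get_dummies` promised above, in case you want to experiment with one-hot encoding yourself. It uses a tiny made-up DataFrame just to show the shape of the output; it is not something we will use for our model:\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "df_toy = pd.DataFrame({'ProductSize': ['Large', 'Small', 'Medium', 'Small']})\n",
+ "pd.get_dummies(df_toy, columns=['ProductSize'])\n",
+ "# -> three indicator columns: ProductSize_Large, ProductSize_Medium, ProductSize_Small,\n",
+ "#    with one \"true\" entry per row marking that row's original level\n",
+ "```"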
] }, { @@ -7461,22 +7465,22 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "In 1994 Berkeley professor Leo Breiman, one year after his retirement, published a small technical report called [*Bagging Predictors*](https://www.stat.berkeley.edu/~breiman/bagging.pdf), which turned out to be one of the most influential ideas in modern machine learning. The report began:\n",
+ "In 1994 Berkeley professor Leo Breiman, one year after his retirement, published a small technical report called [\"Bagging Predictors\"](https://www.stat.berkeley.edu/~breiman/bagging.pdf), which turned out to be one of the most influential ideas in modern machine learning. The report began:\n",
"\n",
- "> : \"Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions... The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests… show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.\"\n",
+ "> : Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions... The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests… show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.\n",
"\n",
"Here is the procedure that Breiman is proposing:\n",
"\n",
- "1. Randomly choose a subset of the rows of your data (i.e., \"bootstrap replicates of your learning set\")\n",
- "1. Train a model using this subset\n",
- "1. Save that model, and then return to step one a few times\n",
+ "1. Randomly choose a subset of the rows of your data (i.e., \"bootstrap replicates of your learning set\").\n",
+ "1. Train a model using this subset.\n",
+ "1. Save that model, and then return to step 1 a few times.\n",
"1. This will give you a number of trained models. To make a prediction, predict using all of the models, and then take the average of each of those model's predictions.\n",
"\n",
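+ "Here is what that procedure might look like in code--a minimal sketch only, using sklearn decision trees on a hypothetical DataFrame `xs` and NumPy array `y` (the random forest class we use below handles all of this, and much more, for us):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from sklearn.tree import DecisionTreeRegressor\n",
+ "\n",
+ "def bagged_trees(xs, y, n_models=10):\n",
+ "    # Train `n_models` trees, each on a different bootstrap replicate of the rows\n",
+ "    models = []\n",
+ "    for _ in range(n_models):\n",
+ "        idx = np.random.choice(len(xs), len(xs), replace=True)\n",
+ "        models.append(DecisionTreeRegressor(min_samples_leaf=25).fit(xs.iloc[idx], y[idx]))\n",
+ "    return models\n",
+ "\n",
+ "def bagged_predict(models, xs):\n",
+ "    # Average the predictions of all the models\n",
+ "    return np.mean([m.predict(xs) for m in models], axis=0)\n",
+ "```\n",
+ "\n",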
- "This procedure is known as \"bagging\". It is based on a deep and important insight: although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. Different models will make different errors. The average of those errors, therefore, is: zero! So if we take the average of all of the models' predictions, then we should end up with a prediction which gets closer and closer to the correct answer, the more models we have. This is an extraordinary result — it means that we can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of data, and average its predictions.\n",
+ "This procedure is known as \"bagging.\" It is based on a deep and important insight: although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. Different models will make different errors. The average of those errors, therefore, is: zero! So if we take the average of all of the models' predictions, then we should end up with a prediction that gets closer and closer to the correct answer, the more models we have. This is an extraordinary result—it means that we can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of the data, and averaging its predictions.\n",
"\n",
"In 2001 Leo Breiman went on to demonstrate that this approach to building models, when applied to decision tree building algorithms, was particularly powerful. He went even further than just randomly choosing rows for each model's training, but also randomly selected from a subset of columns when choosing each split in each decision tree. He called this method the *random forest*. Today it is, perhaps, the most widely used and practically important machine learning method.\n",
"\n",
- "In essence a random forest is a model that averages the predictions of large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters. \"Bagging\" is a particular approach to \"ensembling\", which refers to any approach that combines the results of multiple models together. Let's get started on creating our own random forest!"
+ "In essence a random forest is a model that averages the predictions of a large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters. Bagging is a particular approach to \"ensembling,\" or combining the results of multiple models together. To see how it works in practice, let's get started on creating our own random forest!"
] }, { @@ -7500,9 +7504,9 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "We can create a random forest just like we created a decision tree. Except now, we are also specifying parameters which indicate how many trees should be in the forest, how we should subset the data items (the rows), and how we should subset the fields (the columns).\n",
+ "We can create a random forest just like we created a decision tree, except now we are also specifying parameters that indicate how many trees should be in the forest, how we should subset the data items (the rows), and how we should subset the fields (the columns).\n",
"\n",
- "In the function definition below, `n_estimators` defines the number of trees we want, and `max_samples` defines how many rows to sample for training each tree, while `max_features` defines how many columns to sample at each split point (where `0.5` means \"take half the total number of columns\"). We can also pass parameters for choosing when to stop splitting the tree nodes, effectively limiting the depth of tree, by including the same `min_samples_leaf` parameter we used in the last section. Finally, we pass `n_jobs=-1` to tell sklearn to use all our CPUs to build the trees in parallel. By creating a little function for this, we can more quickly try different variations in the rest of this chapter."
+ "In the following function definition, `n_estimators` defines the number of trees we want, `max_samples` defines how many rows to sample for training each tree, and `max_features` defines how many columns to sample at each split point (where `0.5` means \"take half the total number of columns\"). 
We can also specify when to stop splitting the tree nodes, effectively limiting the depth of the tree, by including the same `min_samples_leaf` parameter we used in the last section. Finally, we pass `n_jobs=-1` to tell sklearn to use all our CPUs to build the trees in parallel. By creating a little function for this, we can more quickly try different variations in the rest of this chapter:"
] }, { @@ -7531,7 +7535,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "Our validation RMSE is now much improved over our last result produced by the `DecisionTreeRegressor`, which made just one tree using all available data:"
+ "Our validation RMSE is now much improved over our last result produced by the `DecisionTreeRegressor`, which made just one tree using all the available data:"
] }, { @@ -7558,9 +7562,9 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "One of the most important properties of random forests is that they aren't very sensitive to the hyperparameter choices, such as `max_features`. You can set `n_estimators` to as high a number as you have time to train -- the more trees, the more accurate they will be. `max_samples` can often be left at its default, unless you have over 200,000 data points, in which case setting it to 200,000 will make it train faster, with little impact on accuracy. `max_features=0.5`, and `min_samples_leaf=4` both tend to work well, although sklearn's defaults work well too.\n",
+ "One of the most important properties of random forests is that they aren't very sensitive to the hyperparameter choices, such as `max_features`. You can set `n_estimators` to as high a number as you have time to train--the more trees you have, the more accurate the model will be. `max_samples` can often be left at its default, unless you have over 200,000 data points, in which case setting it to 200,000 will make it train faster with little impact on accuracy. `max_features=0.5` and `min_samples_leaf=4` both tend to work well, although sklearn's defaults work well too.\n",
"\n",
- "The sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of different `max_features` choices, with increasing numbers of trees. In the plot, the blue plot line uses the fewest features and the green line uses the most, since it uses all the features. As you can see in <>, the models with the lowest error result from using a subset of features but with a larger number of trees."
+ "The sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of the effects of different `max_features` choices, with increasing numbers of trees. In the plot, the blue plot line uses the fewest features and the green line uses the most (it uses all the features). As you can see in <>, the models with the lowest error result from using a subset of features but with a larger number of trees."
] }, { @@ -7569,7 +7573,7 @@ "hide_input": true }, "source": [
- "\"sklearn"
+ "\"sklearn"
] }, { @@ -7619,14 +7623,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "> question: Why does this give the same result as our random forest?"
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [
- "Here's how RMSE improves as we add more and more trees. As you can see, the improvement levels off quite a bit after around 30 trees:"
+ "Let's see what happens to the RMSE as we add more and more trees. 
As you can see, the improvement levels off quite a bit after around 30 trees:" ] }, { @@ -7655,7 +7652,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we've shown, we can't tell. However, random forests have a very clever trick called *out-of-bag (OOB) error* which can handle this (and more!)" + "The performance on our validation set is worse than on our training set. But is that because we're overfitting, or because the validation set covers a different time period, or a bit of both? With the existing information we've seen, we can't tell. However, random forests have a very clever trick called *out-of-bag* (OOB) error that can help us with this (and more!)." ] }, { @@ -7669,11 +7666,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The idea is to calculate error on the training set, but only include the trees in the calculation of a row's error where that row was *not* included in training that tree. This allows us to see whether the model is over-fitting, without needing a separate validation set.\n", + "Recall that in a random forest, each tree is trained on a different subset of the training data. The OOB error is a way of measuring prediction error on the training set by only including in the calculation of a row's error trees where that row was *not* included in training. This allows us to see whether the model is overfitting, without needing a separate validation set.\n", "\n", - "This also has the benefit of allowing us to see whether our model generalizes, even if we have such a small amount of data that we want to avoid removing items to create a validation set. The OOB predictions are available in the `oob_prediction_` attribute. Note that we compare to *training* labels, since this is being calculated on the OOB trees on the training set.\n", + "> A: My intuition for this is that, since every tree was trained with a different randomly selected subset of rows, out-of-bag error is a little like imagining that every tree therefore also has its own validation set. That validation set is simply the rows that were not selected for that tree's training.\n", "\n", - "> A: My intuition for this is that, since every tree was trained with a different randomly selected subset of rows, out-of-bag error is a little like imagining that every tree therefore also has its own validation set. That validation set is simply the rows that were notselected for that tree's training." + "This is particularly beneficial in cases where we have only a small amount of training data, as it allows us to see whether our model generalizes without removing items to create a validation set. The OOB predictions are available in the `oob_prediction_` attribute. Note that we compare them to the training labels, since this is being calculated on trees using the training set." ] }, { @@ -7700,21 +7697,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that our OOB error is much lower than our validation set error. This means that something else is causing our error, in *addition* to normal generalization error. We'll discuss the reasons for this later in this chapter." + "We can see that our OOB error is much lower than our validation set error. This means that something else is causing that error, in *addition* to normal generalization error. 
We'll discuss the reasons for this later in this chapter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> question: Make a list of reasons why this model's validation set error on this dataset might be worse than the OOB error. How could you test your hypotheses?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is one way to interpret our model predictions, let's focus on more of those now." + "This is one way to interpret our model's predictions--let's focus on more of those now." ] }, { @@ -7750,7 +7740,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We saw how the model averages the individual tree predictions to get an overall prediction, that is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the *relative* confidence of predictions. That is, for rows where trees give very different results, you would want to be more cautious of using those results, compared to cases where they are more consistent.\n", + "We saw how the model averages the individual tree's predictions to get an overall prediction--that is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the *relative* confidence of predictions. In general, we would want to be more cautious of using the results for rows where trees give very different results (higher standard deviations), compared to cases where they are more consistent (lower standard deviations).\n", "\n", "In the earlier section on creating a random forest, we saw how to get predictions over the validation set, using a Python list comprehension to do this for each tree in the forest:" ] @@ -7788,7 +7778,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now we have a prediction for every tree and every auction, for 160 trees and 7988 auctions in the validation set.\n", + "Now we have a prediction for every tree and every auction (40 trees and 7,988 auctions) in the validation set.\n", "\n", "Using this we can get the standard deviation of the predictions over all the trees, for each auction:" ] @@ -7806,7 +7796,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here are the prediction standard deviations for the first 5 auctions, that is, the first 5 rows of the validation set:" + "Here are the standard deviations for the predictions for the first five auctions--that is, the first five rows of the validation set:" ] }, { @@ -7833,7 +7823,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As you can see, the confidence of the predictions varies widely. For some auctions, there is a low standard deviation because the trees agree. For others, it's higher, as the trees don't agree. This is information that would be useful to use in a production setting; for instance, if you were using this model to decide what items to bid on at auction, a low-confidence prediction may cause you to look more carefully into an item before you made a bid." + "As you can see, the confidence in the predictions varies widely. For some auctions, there is a low standard deviation because the trees agree. For others it's higher, as the trees don't agree. 
This is information that would be useful in a production setting; for instance, if you were using this model to decide what items to bid on at auction, a low-confidence prediction might cause you to look more carefully at an item before you made a bid."
] }, { @@ -7847,7 +7837,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "It's not normally enough to just to know that a model can make accurate predictions -- we also want to know *how* it's making predictions. The most important way to see this is with *feature importance*. We can get these directly from sklearn's random forest, by looking in the `feature_importances_` attribute. Here's a simple function we can use to pop them into a DataFrame and sort them:"
+ "It's not normally enough just to know that a model can make accurate predictions--we also want to know *how* it's making predictions. *Feature importance* gives us insight into this. We can get these directly from sklearn's random forest by looking in the `feature_importances_` attribute. Here's a simple function we can use to pop them into a DataFrame and sort them:"
] }, { @@ -7865,7 +7855,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "The feature importances for our model show that the first few most important columns have a much higher importance score than the rest, with (not surprisingly) `YearMade` and `ProductSize` being at the top of the list:"
+ "The feature importances for our model show that the first few most important columns have much higher importance scores than the rest, with (not surprisingly) `YearMade` and `ProductSize` being at the top of the list:"
] }, { @@ -8013,7 +8003,7 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "The way these importances are calculated is quite simple yet elegant. The feature importance algorithm loops through each tree, and then recursively explores each branch. At each branch, it looks to see what feature was used for that split, and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is added across all branches of all trees, and finally the scores are normalized such that they add to 1.0."
+ "The way these importances are calculated is quite simple yet elegant. The feature importance algorithm loops through each tree, and then recursively explores each branch. At each branch, it looks to see what feature was used for that split, and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is summed across all branches of all trees, and finally the scores are normalized such that they add to 1.\n",
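+ "\n",
+ "As a rough illustration of that calculation (sklearn's `feature_importances_` attribute already does all of this for us, so the following is only to show the idea, and it relies on sklearn's internal tree arrays), here is how the importances could be accumulated for a single fitted tree:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "def tree_importances(fitted_tree, n_features):\n",
+ "    t = fitted_tree.tree_                  # sklearn's low-level arrays for the tree\n",
+ "    imp = np.zeros(n_features)\n",
+ "    for node in range(t.node_count):\n",
+ "        left, right = t.children_left[node], t.children_right[node]\n",
+ "        if left == -1: continue            # leaf node: no split was made here\n",
+ "        # improvement from this split, weighted by the number of rows it covers\n",
+ "        imp[t.feature[node]] += (t.weighted_n_node_samples[node] * t.impurity[node]\n",
+ "                                 - t.weighted_n_node_samples[left] * t.impurity[left]\n",
+ "                                 - t.weighted_n_node_samples[right] * t.impurity[right])\n",
+ "    return imp / imp.sum()                 # normalize so the scores add to 1\n",
+ "```\n",
+ "\n",
+ "For a random forest you would compute this for every tree and average the results."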
+ "We can retrain our model using just this subset of the columns:" ] }, { @@ -8081,7 +8071,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "...and here's the result:" + "And here's the result:" ] }, { @@ -8108,7 +8098,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Our accuracy is about the same, but we have far less columns to study:" + "Our accuracy is about the same, but we have far fewer columns to study:" ] }, { @@ -8135,9 +8125,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We've found that generally the first step to improving a model is simplifying it. 78 columns are too many for us to study them all in depth! Furthermore, in practice often a simpler, more interpretable model is easier to roll out and maintain.\n", + "We've found that generally the first step to improving a model is simplifying it--78 columns was too many for us to study them all in depth! Furthermore, in practice often a simpler, more interpretable model is easier to roll out and maintain.\n", "\n", - "It also makes our feature importance plot easier to interpret. Let's look at it again:" + "This also makes our feature importance plot easier to interpret. Let's look at it again:" ] }, { @@ -8166,7 +8156,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings, for example `ProductGroup` and `ProductGroupDesc`. Let's try to remove redundent features. " + "One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings: for example, `ProductGroup` and `ProductGroupDesc`. Let's try to remove any redundent features. " ] }, { @@ -8176,6 +8166,13 @@ "### Removing Redundant Features" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's start with:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -8202,16 +8199,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> note: The most similar pairs are found by calculating the *rank correlation*, which means that all the values are replaced with their *rank* (i.e. first, second, third, etc within the column), and then the *correlation* is calculated. (Feel free to skip over this minor detail though, since it's not going to come up again in the book!)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the chart above, you can see pairs of columns that were extremely similar as the ones that were merged together early, far away from the \"root\" of the tree at the left. Unsurprisingly, the fields `ProductGroup` and `ProductGroupDesc` were merged quite early, as were `saleYear` and `saleElapsed`, and as were `fiModelDesc` and `fiBaseModel`. These might be so closely correlated they are practically synonyms for each other.\n", + "In this chart, the pairs of columns that are most similar are the ones that were merged together early, far from the \"root\" of the tree at the left. Unsurprisingly, the fields `ProductGroup` and `ProductGroupDesc` were merged quite early, as were `saleYear` and `saleElapsed` and `fiModelDesc` and `fiBaseModel`. These might be so closely correlated they are practically synonyms for each other.\n", "\n", - "Let's try removing some of these closely related features to see if the model can be simplified without impacting the accuracy. 
First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and higher `min_samples_leaf` . The *score* is a number returned by sklearn that is 1.0 for a perfect model, and 0.0 for a random model. (In statistics it's called *R^2*, although the details aren't important for this explanation). We don't need it to be very accurate--we're just going to use it to compare different models, based on removing some of the possibly redundant columns." + "> note: Determining Similarity: The most similar pairs are found by calculating the _rank correlation_, which means that all the values are replaced with their _rank_ (i.e., first, second, third, etc. within the column), and then the _correlation_ is calculated. (Feel free to skip over this minor detail though, since it's not going to come up again in the book!)\n", + "\n", + "Let's try removing some of these closely related features to see if the model can be simplified without impacting the accuracy. First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and higher `min_samples_leaf`. The OOB score is a number returned by sklearn that ranges between 1.0 for a perfect model and 0.0 for a random model. (In statistics it's called *R^2*, although the details aren't important for this explanation.) We don't need it to be very accurate--we're just going to use it to compare different models, based on removing some of the possibly redundant columns:" ] }, { @@ -8231,7 +8223,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here's our baseline." + "Here's our baseline:" ] }, { @@ -8258,7 +8250,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now we try removing each variable one at a time." + "Now we try removing each of our potentially redundant variables, one at a time:" ] }, { @@ -8296,7 +8288,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's try dropping multiple variables. We'll drop one from each of the tightly aligned pairs we noticed above. Let's see what that does." + "Now let's try dropping multiple variables. We'll drop one from each of the tightly aligned pairs we noticed earlier. Let's see what that does:" ] }, { @@ -8368,7 +8360,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now we can check our RMSE again, to confirm it is still a similar accuracy." + "Now we can check our RMSE again, to confirm that the accuracy hasn't substantially changed." ] }, { @@ -8396,7 +8388,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now that we know which variable influence the most our predictions, we can have a look at how they affect the results using partial dependence plots." + "By focusing on the most important variables, and removing some redundant ones, we've greatly simplified our model. Now, let's see how those variables affect our predictions using partial dependence plots." ] }, { @@ -8410,7 +8402,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The two most important predictors are `ProductSize` and `YearMade`. We'd like to understand the relationship between these predictors and sale price. It's a good idea to first check the count of values per category (provided by the Pandas `value_counts` method), to see how common each category is:" + "As we've seen, the two most important predictors are `ProductSize` and `YearMade`. We'd like to understand the relationship between these predictors and sale price. 
It's a good idea to first check the count of values per category (provided by the Pandas `value_counts` method), to see how common each category is:"
] }, { @@ -8443,7 +8435,7 @@ "source": [
"The largest group is `#na#`, which is the label fastai applies to missing values.\n",
"\n",
- "Let's do the same thing for `YearMade`. However, since this is a numeric feature, we'll need to draw a histogram, which groups the year values into a few discrete bins:"
+ "Let's do the same thing for `YearMade`. Since this is a numeric feature, we'll need to draw a histogram, which groups the year values into a few discrete bins:"
] }, { @@ -8472,19 +8464,19 @@ "cell_type": "markdown", "metadata": {}, "source": [
- "Other than the special value 1950 which we used for coding missing year values, most of the data is after 1990.\n",
+ "Other than the special value 1950, which we used for coding missing year values, most of the data is from after 1990.\n",
"\n",
"Now we're ready to look at *partial dependence plots*. Partial dependence plots try to answer the question: if a row varied on nothing other than the feature in question, how would it impact the dependent variable?\n",
"\n",
"For instance, how does `YearMade` impact sale price, all other things being equal?\n",
"\n",
- "To answer this question, we can't just take the average sale price for each `YearMade`. The problem with that approach is that many other things vary from year to year as well, such as which products are sold, how many products have air-conditioning, inflation, and so forth. So merely averaging over all the auctions that have the same `YearMade` would also capture the effect of how every other field also changed along with `YearMade` and how that overall change affected price.\n",
+ "To answer this question, we can't just take the average sale price for each `YearMade`. The problem with that approach is that many other things vary from year to year as well, such as which products are sold, how many products have air-conditioning, inflation, and so forth. So, merely averaging over all the auctions that have the same `YearMade` would also capture the effect of how every other field also changed along with `YearMade` and how that overall change affected price.\n",
"\n",
"Instead, what we do is replace every single value in the `YearMade` column with 1950, and then calculate the predicted sale price for every auction, and take the average over all auctions. Then we do the same for 1951, 1952, and so forth until our final year of 2011. This isolates the effect of only `YearMade` (even if it does so by averaging over some imagined records where we assign a `YearMade` value that might never actually exist alongside some other values). \n",
"\n",
- "> A: If you are philosophically minded it is somewhat dizzying to contemplate the different kinds of hypotheticality that we are juggling to make this calculation. First, there's the fact that *every* prediction is hypothetical, because we are not noting empirical data. Second, there's the point that we're *not* merely interested in asking how would sale price change if we changed `YearMade` and everything else along with it. Rather, we're very specifically asking, how would sale price change in a hypothetical world where only `YearMade` changed. Phew! It is impressive that we can ask such questions. 
I recommend Judea Pearl's recent book on causality, *The Book of Why*, if you're interested in more deeply exploring formalisms for analyzing these subtleties.\n", + "> A: If you are philosophically minded it is somewhat dizzying to contemplate the different kinds of hypotheticality that we are juggling to make this calculation. First, there's the fact that _every_ prediction is hypothetical, because we are not noting empirical data. Second, there's the point that we're _not_ merely interested in asking how sale price would change if we changed `YearMade` and everything else along with it. Rather, we're very specifically asking, how sale price would change in a hypothetical world where only `YearMade` changed. Phew! It is impressive that we can ask such questions. I recommend Judea Pearl and Dana Mackenzie's recent book on causality, _The Book of Why_ (Basic Books), if you're interested in more deeply exploring formalisms for analyzing these subtleties.\n", "\n", - "With these averages, we can then plot each of these years on the x-axis, versus each of the predictions on the Y axis. This, finally, is a partial dependence plot. Let's take a look:" + "With these averages, we can then plot each of these years on the x-axis, and each of the predictions on the y-axis. This, finally, is a partial dependence plot. Let's take a look:" ] }, { @@ -8517,9 +8509,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Looking first of all at the YearMade plot, and specifically at the section covering after 1990 (since as we noted this is where we have most of the data), we can see a nearly linear relationship between year and price. Remember that our dependent variable is after taking the logarithm, so this means that in practice there is a exponential increase in price. This is what we would expect: depreciation is generally recognised as being a multiplicative factor over time. So, for a given sale date, varying year made ought to show an exponential relationship with sale price.\n", + "Looking first of all at the `YearMade` plot, and specifically at the section covering the years after 1990 (since as we noted this is where we have the most data), we can see a nearly linear relationship between year and price. Remember that our dependent variable is after taking the logarithm, so this means that in practice there is an exponential increase in price. This is what we would expect: depreciation is generally recognized as being a multiplicative factor over time, so, for a given sale date, varying year made ought to show an exponential relationship with sale price.\n", "\n", - "The `ProductSize` partial plot is a bit concerning. It shows that the final group, which we saw before is for missing values, has the lowest price. To use this insight in practice, we would want to find out *why* it's missing so often, and what that *means*. Missing values can sometimes be useful predictors--it entirely depends on what causes them to be missing. Sometimes, however, it can show *data leakage*." + "The `ProductSize` partial plot is a bit concerning. It shows that the final group, which we saw is for missing values, has the lowest price. To use this insight in practice, we would want to find out *why* it's missing so often, and what that *means*. Missing values can sometimes be useful predictors--it entirely depends on what causes them to be missing. Sometimes, however, they can indicate *data leakage*." 
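To make the averaging procedure described above concrete, here is a minimal sketch of computing the `YearMade` partial dependence by hand. It assumes a fitted sklearn `RandomForestRegressor` called `m` and a feature DataFrame called `valid_xs` (both names are assumptions here, chosen to match the style of the surrounding code); sklearn's `plot_partial_dependence` does the same job for you.

    import numpy as np

    def manual_partial_dependence(model, df, column, values):
        "Average prediction after forcing `column` to each candidate value."
        averages = []
        for v in values:
            tmp = df.copy()
            tmp[column] = v                              # overwrite the column everywhere
            averages.append(model.predict(tmp).mean())   # average prediction for this value
        return np.array(averages)

    # `m` and `valid_xs` are assumed names for the fitted forest and its features
    years = np.arange(1950, 2012)
    pdp_yearmade = manual_partial_dependence(m, valid_xs, 'YearMade', years)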
] }, { @@ -8533,37 +8525,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the paper [Leakage in Data Mining: Formulation, Detection, and Avoidance](https://dl.acm.org/doi/10.1145/2020408.2020496) the authors introduce leakage as \n", + "In the paper [\"Leakage in Data Mining: Formulation, Detection, and Avoidance\"](https://dl.acm.org/doi/10.1145/2020408.2020496), Shachar Kaufman, Saharon Rosset, and Claudia Perlich describe leakage as: \n", "\n", - "> : \"the introduction of information about the target of a data mining problem, which should not be legitimately available to mine from. A trivial example of leakage would be a model that uses the target itself as an input, thus concluding for example that 'it rains on rainy days'. In practice, the introduction of this illegitimate information is unintentional, and facilitated by the data collection, aggregation and preparation process.\"\n", + "> : The introduction of information about the target of a data mining problem, which should not be legitimately available to mine from. A trivial example of leakage would be a model that uses the target itself as an input, thus concluding for example that 'it rains on rainy days'. In practice, the introduction of this illegitimate information is unintentional, and facilitated by the data collection, aggregation and preparation process.\n", "\n", - "They give as an example\n", + "They give as an example:\n", "\n", - "> : \"a real-life business intelligence project at IBM where potential customers for certain products were identified, among other things, based on keywords found on their websites. This turned out to be leakage since the website content used for training had been sampled at the point in time where the potential customer has already become a customer, and where the website contained traces of the IBM products purchased, such as the word 'Websphere' (e.g. in a press release about the purchase or a specific product feature the client uses).\"\n", + "> : A real-life business intelligence project at IBM where potential customers for certain products were identified, among other things, based on keywords found on their websites. This turned out to be leakage since the website content used for training had been sampled at the point in time where the potential customer has already become a customer, and where the website contained traces of the IBM products purchased, such as the word 'Websphere' (e.g., in a press release about the purchase or a specific product feature the client uses).\n", "\n", "Data leakage is subtle and can take many forms. In particular, missing values often represent data leakage.\n", "\n", - "For instance, Jeremy competed in a Kaggle competition designed to predict which researchers would end up receiving research grants. The information was provided by a university, and included thousands of examples of research projects, along with information about the researchers involved, along with whether or not the grant was eventually accepted. The University hoped that they would be able to use models developed in this competition to help them rank which grant applications were most likely to succeed, so that they could prioritise their processing.\n", + "For instance, Jeremy competed in a Kaggle competition designed to predict which researchers would end up receiving research grants. 
The information was provided by a university and included thousands of examples of research projects, along with information about the researchers involved and data on whether or not each grant was eventually accepted. The university hoped to be able to use the models developed in this competition to rank which grant applications were most likely to succeed, so it could prioritize its processing.\n", "\n", "Jeremy used a random forest to model the data, and then used feature importance to find out which features were most predictive. He noticed three surprising things:\n", "\n", - "- The model was able to correctly predict who would receive grants over 95% of the time\n", - "- Apparently meaningless identifier columns were the most important predictors\n", - "- The columns day of week and day of year were also highly predictive; for instance, the vast majority of grant applications dated on a Sunday were accepted, and many accepted grant applications were dated on January 1.\n", + "- The model was able to correctly predict who would receive grants over 95% of the time.\n", + "- Apparently meaningless identifier columns were the most important predictors.\n", + "- The day of week and day of year columns were also highly predictive; for instance, the vast majority of grant applications dated on a Sunday were accepted, and many accepted grant applications were dated on January 1.\n", "\n", - "For the identifier columns, a partial dependence plots showed that when the information was missing the grant was almost always rejected. It turned out that in practice, the University only filled out much of this information *after* a grant application was accepted. Often, for applications that were not accepted, it was just left blank. Therefore, this information was not something that was actually available at the time that the application was received, and would therefor not be available for a predictive model — it was data leakage.\n", + "For the identifier columns, one partial dependence plot per column showed that when the information was missing the application was almost always rejected. It turned out that in practice, the university only filled out much of this information *after* a grant application was accepted. Often, for applications that were not accepted, it was just left blank. Therefore, this information was not something that was actually available at the time that the application was received, and it would not be available for a predictive model—it was data leakage.\n", "\n", "In the same way, the final processing of successful applications was often done automatically as a batch at the end of the week, or the end of the year. 
It was this final processing date which ended up in the data, so again, this information, while predictive, was not actually available at the time that the application was received.\n", "\n", - "This example shows the most practical and simple approaches to identifying data leakage, which are to build a model, and then:\n", + "This example showcases the most practical and simple approaches to identifying data leakage, which are to build a model and then:\n", "\n", - "- Check whether the accuracy of the model is *too good to be true*\n", - "- Look for important predictors which don't make sense in practice\n", - "- Look for partial dependence plot results which don't make sense in practice.\n", + "- Check whether the accuracy of the model is *too good to be true*.\n", + "- Look for important predictors that don't make sense in practice.\n", + "- Look for partial dependence plot results that don't make sense in practice.\n", "\n", - "Thinking back to our bear detector, this mirrors the advice that we also provided there — it is often a good idea to build a model first, and then do your data cleaning, rather than vice versa. The model can help you identify potentially problematic data issues.\n", + "Thinking back to our bear detector, this mirrors the advice that we provided in <>--it is often a good idea to build a model first and then do your data cleaning, rather than vice versa. The model can help you identify potentially problematic data issues.\n", "\n", - "It can also help you interpret which factors influences specific predictions, with tree interpreters." + "It can also help you identify which factors influence specific predictions, with tree interpreters." ] }, { @@ -8593,13 +8585,13 @@ "source": [ "At the start of this section, we said that we wanted to be able to answer five questions:\n", "\n", - "- How confident are we in our projections using a particular row of data?\n", + "- How confident are we in our predictions using a particular row of data?\n", "- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n", "- Which columns are the strongest predictors?\n", "- Which columns are effectively redundant with each other, for purposes of prediction?\n", "- How do predictions vary, as we vary these columns?\n", "\n", - "We've handled four of these already--so just one to go, which is: \"For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\" To answer this question, we need to use the `treeinterpreter` library. We'll also use the `waterfallcharts` library to draw the chart of the results.\n", + "We've handled four of these already; only the second question remains. To answer this question, we need to use the `treeinterpreter` library. We'll also use the `waterfallcharts` library to draw the chart of the results.\n", "\n", " !pip install treeinterpreter\n", " !pip install waterfallcharts" ] }, { @@ -8609,9 +8601,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We have already seen how to compute feature importances across the entire random forest. The basic idea was to look at the contribution of each variable towards improving the model, at each branch of every tree, and then to add up all of these contributions per variable.\n", + "We have already seen how to compute feature importances across the entire random forest. 
The basic idea was to look at the contribution of each variable to improving the model, at each branch of every tree, and then add up all of these contributions per variable.\n", "\n", - "We can do exactly the same thing, but for just a single row of data. For instance, let's say we are looking at some particular item at auction. Our model might predict that this item will be very expensive, and we want to know why. So we take that one row of data, and put it through the first decision tree, looking to see what split is used at each point throughout the tree. For each split, we see what the increase or decrease in the addiction is, compared to the parent node of the tree. We do this for every tree, and add up the total change in importance by split variable.\n", + "We can do exactly the same thing, but for just a single row of data. For instance, let's say we are looking at some particular item at auction. Our model might predict that this item will be very expensive, and we want to know why. So, we take that one row of data and put it through the first decision tree, looking to see what split is used at each point throughout the tree. For each split, we see what the increase or decrease in the prediction is, compared to the parent node of the tree. We do this for every tree, and add up the total change in importance by split variable.\n", "\n", "For instance, let's pick the first few rows of our validation set:" ] }, { @@ -8629,7 +8621,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can pass these to `treeinterpreter`:" + "We can then pass these to `treeinterpreter`:" ] }, { @@ -8645,7 +8637,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`prediction` is simply the prediction that the random forest makes. `bias` is the prediction based on simply taking the mean of the dependent variable (i.e. the *model* that is the root of every tree). `contributions` is the most interesting bit--it tells us the total change in predicition due to each of the independent variables. Therefore, the sum of `contributions` plus `bias` must equal the `prediction`, for each row. Let's look just at the first row:" + "`prediction` is simply the prediction that the random forest makes. `bias` is the prediction based on taking the mean of the dependent variable (i.e., the *model* that is the root of every tree). `contributions` is the most interesting bit--it tells us the total change in prediction due to each of the independent variables. Therefore, the sum of `contributions` plus `bias` must equal the `prediction`, for each row. Let's look just at the first row:" ] }, { @@ -8672,7 +8664,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The clearest way to display the contributions is with a *waterfall plot*. This shows how each positive and negative contribution from all the independent variables sum up to create the final prediction, which is the right-hand column labeled \"net\" here:" + "The clearest way to display the contributions is with a *waterfall plot*. This shows how the positive and negative contributions from all the independent variables sum up to create the final prediction, which is the righthand column labeled \"net\" here:" ] }, { @@ -8709,7 +8701,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now that we covered some classic machine learning to solve this problem, let's see how deep learning can help!" + "Now that we've covered some classic machine learning techniques to solve this problem, let's see how deep learning can help!"
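Before moving on, here is a quick sanity check of the relationship just described: for every row, `prediction` equals `bias` plus the sum of that row's `contributions`. This is only a sketch; `m` and `valid_xs` are assumed to be the fitted sklearn random forest and the validation features from earlier.

    import numpy as np
    from treeinterpreter import treeinterpreter

    row = valid_xs.iloc[:1]                    # a single auction record (assumed name)
    prediction, bias, contributions = treeinterpreter.predict(m, row.values)
    # prediction == bias + sum of this row's per-feature contributions
    assert np.allclose(prediction[0], bias[0] + contributions[0].sum())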
] }, { @@ -8723,7 +8715,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A problem with random forests, like all machine learning or deep learning algorithms, is that they don't always generalize well to new data. Random forests can help us identify out-of-domain data, and we will see in which situations neural network generalize better, but first, let's look at the extrapolation problem that random forests have." + "A problem with random forests, like all machine learning or deep learning algorithms, is that they don't always generalize well to new data. We will see in which situations neural networks generalize better, but first, let's look at the extrapolation problem that random forests have." ] }, { @@ -8849,7 +8841,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "...and we test the model on the full dataset. The blue dots are the training data, and the red dots are the predictions." + "Then we'll test the model on the full dataset. The blue dots are the training data, and the red dots are the predictions:" ] }, { @@ -8879,27 +8871,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We have a big problem! Our predictions outside of the domain that our training data covered are all too low. Have a think about why this is…\n", + "We have a big problem! Our predictions outside of the domain that our training data covered are all too low. Why do you suppose this is?\n", "\n", - "Remember, a random forest is just the average of the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time.. Your predictions will be systematically too low.\n", + "Remember, a random forest just averages the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time. Your predictions will be systematically too low.\n", "\n", - "But the problem is actually more general than just time variables. Random forests are not able to extrapolate outside of the types of data you have seen, in a more general sense. That's why we need to make sure our validation set does not contain out of domain data." + "But the problem extends beyond time variables. Random forests are not able to extrapolate outside of the types of data they have seen, in a more general sense. That's why we need to make sure our validation set does not contain out-of-domain data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Finding out of Domain Data" + "### Finding Out-of-Domain Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Sometimes it is hard to even know whether your test set is distributed in the same way as your training data or, if it is different, then what columns reflect that difference. There's actually a nice easy way to figure this out, which is to use a random forest!\n", + "Sometimes it is hard to know whether your test set is distributed in the same way as your training data, or, if it is different, what columns reflect that difference. 
There's actually an easy way to figure this out, which is to use a random forest!\n", "\n", - "But in this case we don't use a random forest to predict our actual dependent variable. Instead we try to predict whether a row is in the validation set, or the training set. To see this in action, let's combine our training and validation sets together, create a dependent variable which represents which dataset each row comes from, build a random forest using that data, and get its feature importance:" + "But in this case we don't use the random forest to predict our actual dependent variable. Instead, we try to predict whether a row is in the validation set or the training set. To see this in action, let's combine our training and validation sets together, create a dependent variable that represents which dataset each row comes from, build a random forest using that data, and get its feature importance:" ] }, { @@ -8994,9 +8986,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This shows that there are three columns that are very different between training and validation set: `saleElapsed`, `SalesID`, and `MachineID`. `saleElapsed` is fairly obvious, since it's the number of days between the start of the dataset and each row, so it directly encodes the date. `SalesID` suggests that identifiers for auction sales might increment over time. `MachineID` suggests something similar might be happening for individual items sold in those auctions.\n", + "This shows that there are three columns that differ significantly between the training and validation sets: `saleElapsed`, `SalesID`, and `MachineID`. It's fairly obvious why this is the case for `saleElapsed`: it's the number of days between the start of the dataset and each row, so it directly encodes the date. The difference in `SalesID` suggests that identifiers for auction sales might increment over time. `MachineID` suggests something similar might be happening for individual items sold in those auctions.\n", "\n", - "We'll try training the original RF model, removing each of these in turn, and also checking the baseline model RMSE:" + "Let's get a baseline of the original random forest model's RMSE, then see what the effect is of removing each of these columns in turn:" ] }, { @@ -9028,7 +9020,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "It looks like we should be able to remove `SalesID` and `MachineID` without losing any accuracy; let's check:" + "It looks like we should be able to remove `SalesID` and `MachineID` without losing any accuracy. Let's check:" ] }, { @@ -9060,7 +9052,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Removing these variables has slightly improved the model's accuracy; but more importantly, it should make it more resilient over time, and easier to maintain and understand. We recommend that for all datasets you try building a model where your dependent variable is `is_valid`, like the above. It can often uncover subtle *domain shift* issues that you may otherwise miss.\n", + "Removing these variables has slightly improved the model's accuracy; but more importantly, it should make it more resilient over time, and easier to maintain and understand. We recommend that for all datasets you try building a model where your dependent variable is `is_valid`, like we did here. It can often uncover subtle *domain shift* issues that you may otherwise miss.\n", "\n", "One thing that might help in our case is to simply avoid using old data. 
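For reference, here is a minimal sketch of the `is_valid` trick just described, using sklearn directly; `xs` and `valid_xs` are assumed to be the training and validation feature DataFrames used earlier in the chapter.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df_dom = pd.concat([xs, valid_xs])                      # stack training and validation rows
    is_valid = np.array([0]*len(xs) + [1]*len(valid_xs))    # the new dependent variable

    m_dom = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
    m_dom.fit(df_dom, is_valid)

    # The columns the classifier relies on most are the ones that differ between the sets.
    fi = pd.DataFrame({'cols': df_dom.columns, 'imp': m_dom.feature_importances_})
    fi.sort_values('imp', ascending=False).head(6)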
Often, old data shows relationships that just aren't valid any more. Let's try just using the most recent few years of the data:" ] @@ -9087,6 +9079,13 @@ "xs['saleYear'].hist();" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's the result of training on this subset:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -9098,13 +9097,6 @@ "y_filt = y[filt]" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here's the result of training on this subset:" - ] - }, { "cell_type": "code", "execution_count": null, @@ -9166,7 +9158,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can leverage the work we did to trim unwanted column in the random forest, by using the same set of columns for our neural network." + "We can leverage the work we did to trim unwanted columns in the random forest by using the same set of columns for our neural network:" ] }, { @@ -9182,7 +9174,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Categorical columns are handled very differently in neural networks, compared to decision tree approaches. As we have seen in <>, a great way to handle categorical variables is by using embeddings. In order to create embeddings, fastai needs to know which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable (this is known as the *cardinality* of the variable) to the `max_card` parameter. Anything lower than this is going to be treated as a categorical variable by fastai. Embedding sizes larger than 10,000 should generally only be used after you've tested whether there are better ways to group the variable, so we'll use 9000 as our `max_card`." + "Categorical columns are handled very differently in neural networks, compared to decision tree approaches. As we saw in <>, in a neural net a great way to handle categorical variables is by using embeddings. To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the `max_card` parameter. If it's lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you've tested whether there are better ways to group the variable, so we'll use 9,000 as our `max_card`:" ] }, { @@ -9198,7 +9190,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "However, one variable that we absolutely do not want to treat as categorical is the saleElapsed variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen. But we want to be able to predict auction sale prices in the future. Therefore, we need to make this a continuous variable:" + "In this case, however, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. 
Therefore, we need to make this a continuous variable:" ] }, { @@ -9215,7 +9207,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's take a look at the cardinality of each of our categorical variables that we have chosen so far:" + "Let's take a look at the cardinality of each of the categorical variables that we have chosen so far:" ] }, { @@ -9256,7 +9248,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The fact that there are two variables pertaining to the \"model\" of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily see this in the dendrogram, since that relies on similar variables being sorted in the same order (that is, they need to have similarly named levels). Having a column with 5000 levels means needing a number 5000 columns in our embedding matrix, so this would be nice to avoid if possible. Let's see what the impact of removing one of these model columns has on the random forest:" + "The fact that there are two variables pertaining to the \"model\" of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily see this when analyzing redundant features, since that relies on similar variables being sorted in the same order (that is, they need to have similarly named levels). Having a column with 5,000 levels means needing 5,000 columns in our embedding matrix, which would be nice to avoid if possible. Let's see what the impact of removing one of these model columns has on the random forest:" ] }, { @@ -9286,7 +9278,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There's minimal impact, so we will remove it as a predictor for our neural network." + "There's minimal impact, so we will remove it as a predictor for our neural network:" ] }, { @@ -9302,7 +9294,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalisation. A random forest does not need any normalisation--the tree building procedure cares only about the order of values in a variable, not at all about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the `Normalize` processor when we build our `TabularPandas` object." + "We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalization. A random forest does not need any normalization--the tree building procedure cares only about the order of values in a variable, not at all about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the `Normalize` processor when we build our `TabularPandas` object:" ] }, { @@ -9320,7 +9312,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Tabular models and data don't generally require much GPU RAM, so we can use larger batch sizes." + "Tabular models and data don't generally require much GPU RAM, so we can use larger batch sizes:" ] }, { @@ -9366,7 +9358,7 @@ "source": [ "We can now create the `Learner` to create this tabular model. As usual, we use the application-specific learner function, to take advantage of its application-customized defaults. 
We set the loss function to MSE, since that's what this competition uses.\n", "\n", - "By default, for tabular data fastai creates a neural network with two hidden layers, with 200 and 100 activations each, respectively. This works quite well for small datasets, but here we've got quite a large dataset, so we increase the layer sizes to 500 and 250." + "By default, for tabular data fastai creates a neural network with two hidden layers, with 200 and 100 activations, respectively. This works quite well for small datasets, but here we've got quite a large dataset, so we increase the layer sizes to 500 and 250:" ] }, { @@ -9434,7 +9426,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There's no need to use `fine_tune`, so we'll train with 1-cycle for a few epochs and see how it looks..." + "There's no need to use `fine_tune`, so we'll train with `fit_one_cycle` for a few epochs and see how it looks:" ] }, { @@ -9504,7 +9496,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can use our `r_mse` function to compare to the random forest result we got earlier." + "We can use our `r_mse` function to compare the result to the random forest result we got earlier:" ] }, { @@ -9542,9 +9534,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "It's quite a bit better than the random forest (although it took longer to train, and it's more fussy about hyperparameter tuning).\n", + "It's quite a bit better than the random forest (although it took longer to train, and it's fussier about hyperparameter tuning).\n", "\n", - "Before we move on, let's save our model in case we want to come back to it again later." + "Before we move on, let's save our model in case we want to come back to it again later:" ] }, { @@ -9567,11 +9559,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In fastai, a tabular model is simply a model which takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding, and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.\n", + "In fastai, a tabular model is simply a model that takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding, and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.\n", "\n", - "The model created in `tabular_learner` is an object of class `TabularModel`. Take a look at the source for `tabular_learner` now (remember, that's `tabular_learner??` in Jupyter). You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (which you can override by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel`, and passes that to `TabularLearner` (and note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).\n", + "The model created in `tabular_learner` is an object of class `TabularModel`. Take a look at the source for `tabular_learner` now (remember, that's `tabular_learner??` in Jupyter). 
You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (you can override these by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel`, and passes that to `TabularLearner` (note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).\n", "\n", - "That means that really all the work is happening in `TabularModel`, so take a look at the source for that now. With the exception of the `BatchNorm1d` and `Dropout` layers (which we'll be learning about shortly) you now have the knowledge required to understand this whole class. Take a look at the discussion of `EmbeddingNN` at the end of the last chapter. Recall that it passed `n_cont=0` to `TabularModel`. We now can see why that was: because there are zero continuous variables (in fastai the `n_` prefix means \"number of\", and `cont` is an abbreviation for \"continuous\")." + "That means that really all the work is happening in `TabularModel`, so take a look at the source for that now. With the exception of the `BatchNorm1d` and `Dropout` layers (which we'll be learning about shortly), you now have the knowledge required to understand this whole class. Take a look at the discussion of `EmbeddingNN` at the end of the last chapter. Recall that it passed `n_cont=0` to `TabularModel`. We now can see why that was: because there are zero continuous variables (in fastai the `n_` prefix means \"number of,\" and `cont` is an abbreviation for \"continuous\")." ] }, { @@ -9585,14 +9577,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another thing that can help with generalization is to use several models and average their predictions, a technique known as ensembling." + "Another thing that can help with generalization is to use several models and average their predictions--a technique, as mentioned earlier, known as *ensembling*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Ensembling" + "## Ensembling" ] }, { @@ -9603,9 +9595,9 @@ "\n", "In our case, we have two very different models, trained using very different algorithms: a random forest, and a neural network. It would be reasonable to expect that the kinds of errors that each one makes would be quite different. Therefore, we might expect that the average of their predictions would be better than either one's individual predictions.\n", "\n", - "As we mentioned earlier in this chapter, the approach of combining multiple models' predictions together is called *ensembling*. A random forest is itself an ensemble. But we can then include a random forest in *another* ensemble--an ensemble of the random forest and the neural network! Whilst it is not going to make the difference between a successful and unsuccessful modelling process, it can certainly add a nice little boost to any models that you have built.\n", + "As we saw earlier, a random forest is itself an ensemble. But we can then include a random forest in *another* ensemble--an ensemble of the random forest and the neural network! 
While ensembling won't make the difference between a successful and an unsuccessful modeling process, it can certainly add a nice little boost to any models that you have built.\n", "\n", - "One minor issue we have to be aware of is that our PyTorch model and our sklearn model create data of different types--PyTorch gives us a rank 2 tensor (i.e a column matrix), whereas numpy gives us a rank 1 array (a vector). `squeeze()` removes any unit axes from a tensor, and `to_np` converts it into a numpy array." + "One minor issue we have to be aware of is that our PyTorch model and our sklearn model create data of different types: PyTorch gives us a rank-2 tensor (i.e., a column matrix), whereas NumPy gives us a rank-1 array (a vector). `squeeze` removes any unit axes from a tensor, and `to_np` converts it into a NumPy array:" ] }, { @@ -9649,14 +9641,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In fact, this result is better than any score shown on the Kaggle leaderboard. This is not directly comparable, however, because the Kaggle leaderboard uses a separate dataset that we do not have access to. Kaggle does not allow us to submit to this old competition, to find out how we would have gone, so we have no way to directly compare. But our results certainly look very encouraging!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There is another important approach to ensembling, called *boosting*, where we add models, instead of averaging them. " + "In fact, this result is better than any score shown on the Kaggle leaderboard. It's not directly comparable, however, because the Kaggle leaderboard uses a separate dataset that we do not have access to. Kaggle does not allow us to submit to this old competition to find out how we would have done, but our results certainly look very encouraging!" ] }, { @@ -9670,30 +9655,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So far our approach to ensembling has been to use *bagging*, which involves combining many models together by averaging them, where each model is trained on a different data subset. When this is applied to decision trees, this is called a *random forest*.\n", + "So far our approach to ensembling has been to use *bagging*, which involves combining many models (each trained on a different data subset) together by averaging them. As we saw, when this is applied to decision trees, this is called a *random forest*.\n", "\n", - "Here is how boosting works:\n", + "There is another important approach to ensembling, called *boosting*, where we add models instead of averaging them. 
Here is how boosting works:\n", "\n", - "- Train a small model which under fits your dataset\n", - "- Calculate the predictions in the training set for this model\n", - "- Subtract the predictions from the targets; these are called the \"residuals\", and represent the error for each point in the training set\n", - "- Go back to step one, but instead of using the original targets, use the residuals as the target for the training\n", + "- Train a small model that underfits your dataset.\n", + "- Calculate the predictions in the training set for this model.\n", + "- Subtract the predictions from the targets; these are called the \"residuals\" and represent the error for each point in the training set.\n", + "- Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.\n", "- Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.\n", "\n", "Using this approach, each new tree will be attempting to fit the error of all of the previous trees combined. Because we are continually creating new residuals, by subtracting the predictions of each new tree from the residuals from the previous tree, the residuals will get smaller and smaller.\n", "\n", - "To make predictions with an ensemble of boosted trees, we calculate the predictions from each tree, and then add them all together. There are many models following this basic approach, and many names for the same models! *Gradient boosting machines* (GBMs) and *gradient boosted decision trees* (GBDTs) are the terms you're most likely to come across, or you may see the names of specific libraries implementing these; at the time of writing, *XGBoost* is the most popular.\n", + "To make predictions with an ensemble of boosted trees, we calculate the predictions from each tree, and then add them all together. There are many models following this basic approach, and many names for the same models. *Gradient boosting machines* (GBMs) and *gradient boosted decision trees* (GBDTs) are the terms you're most likely to come across, or you may see the names of specific libraries implementing these; at the time of writing, *XGBoost* is the most popular.\n", "\n", - "Note that, unlike random forests, there is nothing to stop us from overfitting. Using more trees in a random forest does not lead to overfitting, because each tree is independent of the others. But in a boosted ensemble, the more trees you have, the better the training error becomes, and eventually you will see overfitting on the validation set.\n", + "Note that, unlike with random forests, with this approach there is nothing to stop us from overfitting. Using more trees in a random forest does not lead to overfitting, because each tree is independent of the others. But in a boosted ensemble, the more trees you have, the better the training error becomes, and eventually you will see overfitting on the validation set.\n", "\n", - "We are not going to go into details as to how to train a gradient boosted tree ensemble here, because the field is moving rapidly, and any guidance we give will almost certainly be outdated by the time you read this! As we write this, sklearn has just added a `HistGradientBoostingRegressor` class, which provides excellent performance. There are many hyperparameters to tweak for this class, and for all gradient boosted tree methods we have seen. 
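Sketched in code, the boosting loop described in the steps above might look something like the following toy example. It is only a sketch: real gradient boosting libraries add a learning rate and many other refinements, and `xs_final`, `valid_xs_final`, and `y` are assumed names for the training/validation features and targets from earlier in the chapter.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost(xs, y, n_trees=20, max_depth=4):
        trees, residuals = [], y.copy()
        for _ in range(n_trees):
            tree = DecisionTreeRegressor(max_depth=max_depth)  # a small tree that underfits
            tree.fit(xs, residuals)
            residuals = residuals - tree.predict(xs)           # the next tree fits what's left
            trees.append(tree)
        return trees

    def boost_predict(trees, xs):
        return np.sum([t.predict(xs) for t in trees], axis=0)  # add the predictions, don't average

    trees = boost(xs_final, y)                   # assumed names from earlier in the chapter
    preds = boost_predict(trees, valid_xs_final)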
Unlike random forests, gradient boosted trees are extremely sensitive to the choices of these hyperparameters. So in practice, most people will use a loop which tries a range of different hyperparameters, to find which works best." + "We are not going to go into detail on how to train a gradient boosted tree ensemble here, because the field is moving rapidly, and any guidance we give will almost certainly be outdated by the time you read this. As we write this, sklearn has just added a `HistGradientBoostingRegressor` class that provides excellent performance. There are many hyperparameters to tweak for this class, and for all gradient boosted tree methods we have seen. Unlike random forests, gradient boosted trees are extremely sensitive to the choices of these hyperparameters; in practice, most people use a loop that tries a range of different hyperparameters to find the ones that work best." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "A last technique that has gotten great results is to use embeddings learned by a neural net in a machine learning model." + "One more technique that has gotten great results is to use embeddings learned by a neural net in a machine learning model." ] }, { @@ -9707,25 +9692,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The abstract of the entity embedding paper we mentioned at the start of this chapter states: \"*the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead*\". It includes this very interesting table:" + "The abstract of the entity embedding paper we mentioned at the start of this chapter states: \"the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead\". It includes the very interesting table in <>." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "hide_input": false + }, + "source": [ + "\"Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\"Embeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is showing the mean average percent error (MAPE) compared amongst four different modelling techniques, three of which we have already seen, along with \"KNN\" (K nearest neighbours), which is a very simple baseline method. The first numeric column contains the results using just the methods on the data provided in the competition; the second column shows what happens if you first train a neural network with categorical embeddings, and then use those categorical embeddings instead of the raw categorical columns in the model. As you see, in every case, the models are dramatically improved by using the embeddings, instead of the raw category.\n", + "This is showing the mean average percent error (MAPE) compared among four different modeling techniques, three of which we have already seen, along with *k*-nearest neighbors (KNN), which is a very simple baseline method. The first numeric column contains the results of using the methods on the data provided in the competition; the second column shows what happens if you first train a neural network with categorical embeddings, and then use those categorical embeddings instead of the raw categorical columns in the model. 
As you see, in every case, the models are dramatically improved by using the embeddings instead of the raw categories.\n", "\n", "This is a really important result, because it shows that you can get much of the performance improvement of a neural network without actually having to use a neural network at inference time. You could just use an embedding, which is literally just an array lookup, along with a small decision tree ensemble.\n", "\n", "These embeddings need not even be necessarily learned separately for each model or task in an organization. Instead, once a set of embeddings are learned for some column for some task, they could be stored in a central place, and reused across multiple models. In fact, we know from private communication with other practitioners at large companies that this is already happening in many places." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We have dicussed two approaches to tabular modelling: decision tree ensembles, and neural networks. And we have mentioned two different decision tree ensembles: random forests, and gradient boosting machines. Each is very effective, but each also has compromises:\n", + "We have discussed two approaches to tabular modeling: decision tree ensembles and neural networks. We've also mentioned two different decision tree ensembles: random forests, and gradient boosting machines. Each is very effective, but each also has compromises:\n", "\n", - "**Random forests** are the easiest to train, because they are extremely resilient to hyperparameter choices, and require very little preprocessing. They are very fast to train, and should not overfit, if you have enough trees. But, they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods\n", + "- *Random forests* are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.\n", "\n", - "**Gradient boosting machines** in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit. But they are often a little bit more accurate than random forests.\n", + "- *Gradient boosting machines* in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. 
They can overfit, but they are often a little more accurate than random forests.\n", "\n", - "**Neural networks** take the longest time to train, and require extra preprocessing such as normalisation; this normalisation needs to be used at inference time as well. They can provide great results, and extrapolate well, but only if you are careful with your hyperparameters, and are careful to avoid overfitting.\n", + "- *Neural networks* take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.\n", "\n", "We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it's a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.\n", "\n", @@ -9765,12 +9752,12 @@ "source": [ "1. What is a continuous variable?\n", "1. What is a categorical variable?\n", - "1. Provide 2 of the words that are used for the possible values of a categorical variable.\n", + "1. Provide two of the words that are used for the possible values of a categorical variable.\n", "1. What is a \"dense layer\"?\n", "1. How do entity embeddings reduce memory usage and speed up neural networks?\n", - "1. What kind of datasets are entity embeddings especially useful for?\n", + "1. What kinds of datasets are entity embeddings especially useful for?\n", "1. What are the two main families of machine learning algorithms?\n", - "1. Why do some categorical columns need a special ordering in their classes? How do you do this in pandas?\n", + "1. Why do some categorical columns need a special ordering in their classes? How do you do this in Pandas?\n", "1. Summarize what a decision tree algorithm does.\n", "1. Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?\n", "1. Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?\n", @@ -9781,19 +9768,20 @@ "1. What is bagging?\n", "1. What is the difference between `max_samples` and `max_features` when creating a random forest?\n", "1. If you increase `n_estimators` to a very high value, can that lead to overfitting? Why or why not?\n", - "1. What is *out of bag error*?\n", + "1. In the section \"Creating a Random Forest\", just after <>, why did `preds.mean(0)` give the same result as our random forest?\n", + "1. What is \"out-of-bag error\"?\n", "1. Make a list of reasons why a model's validation set error might be worse than the OOB error. How could you test your hypotheses?\n", - "1. How can you answer each of these things with a random forest? How do they work?:\n", - " - How confident are we in our projections using a particular row of data?\n", + "1. Explain why random forests are well suited to answering each of the following questions:\n", + " - How confident are we in our predictions using a particular row of data?\n", " - For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n", " - Which columns are the strongest predictors?\n", - " - How do predictions vary, as we vary these columns?\n", + " - How do predictions vary as we vary these columns?\n", "1. 
What's the purpose of removing unimportant variables?\n", "1. What's a good type of plot for showing tree interpreter results?\n", - "1. What is the *extrapolation problem*?\n", - "1. How can you tell if your test or validation set is distributed in a different way to your training set?\n", - "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9000 distinct values?\n", - "1. What is boosting?\n", + "1. What is the \"extrapolation problem\"?\n", + "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n", + "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9,000 distinct values?\n", + "1. What is \"boosting\"?\n", "1. How could we use embeddings with a random forest? Would we expect this to help?\n", "1. Why might we not always use a neural net for tabular modeling?" ] }, { @@ -9809,8 +9797,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare yourself to the private leaderboard.\n", - "1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on this dataset.\n", + "1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare your results to the private leaderboard.\n", + "1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.\n", "1. Use the embeddings from the neural net in this chapter in a random forest, and see if you can improve on the random forest results we saw.\n", "1. Explain what each line of the source of `TabularModel` does (with the exception of the `BatchNorm1d` and `Dropout` layers)." ] diff --git a/clean/08_collab.ipynb b/clean/08_collab.ipynb index 9a54e88..76ef6e3 100644 --- a/clean/08_collab.ipynb +++ b/clean/08_collab.ipynb @@ -523,26 +523,6 @@ "one_hot_3 = one_hot(3, n_users).float()" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "torch.Size([944, 5])" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "user_factors.shape" - ] - }, { "cell_type": "code", "execution_count": null, @@ -1670,7 +1650,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Sidebar: Kwargs and Delegates" + "### Sidebar: kwargs and Delegates" ] }, { @@ -1702,33 +1682,33 @@ "1. How does it solve it?\n", "1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?\n", "1. What does a crosstab representation of collaborative filtering data look like?\n", - "1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)\n", + "1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).\n", "1. What is a latent factor? Why is it \"latent\"?\n", - "1. What is a dot product? Calculate a dot product manually using pure python with lists.\n", + "1. What is a dot product? Calculate a dot product manually using pure Python with lists.\n", "1. What does `pandas.DataFrame.merge` do?\n", "1. What is an embedding matrix?\n", - "1. 
What is the relationship between an embedding and a matrix of one-hot encoded vectors?\n", - "1. Why do we need `Embedding` if we could use one-hot encoded vectors for the same thing?\n", - "1. What does an embedding contain before we start training (assuming we're not using a prertained model)?\n", + "1. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?\n", + "1. Why do we need `Embedding` if we could use one-hot-encoded vectors for the same thing?\n", + "1. What does an embedding contain before we start training (assuming we're not using a pretained model)?\n", "1. Create a class (without peeking, if possible!) and use it.\n", "1. What does `x[:,0]` return?\n", - "1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it\n", + "1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it.\n", "1. What is a good loss function to use for MovieLens? Why? \n", - "1. What would happen if we used `CrossEntropy` loss with MovieLens? How would we need to change the model?\n", + "1. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?\n", "1. What is the use of bias in a dot product model?\n", "1. What is another name for weight decay?\n", - "1. Write the equation for weight decay (without peeking!)\n", + "1. Write the equation for weight decay (without peeking!).\n", "1. Write the equation for the gradient of weight decay. Why does it help reduce weights?\n", "1. Why does reducing weights lead to better generalization?\n", "1. What does `argsort` do in PyTorch?\n", - "1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?\n", + "1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?\n", "1. How do you print the names and details of the layers in a model?\n", "1. What is the \"bootstrapping problem\" in collaborative filtering?\n", "1. How could you deal with the bootstrapping problem for new users? For new movies?\n", "1. How can feedback loops impact collaborative filtering systems?\n", - "1. When using a neural network in collaborative filtering, why can we have different number of factors for movie and user?\n", - "1. Why is there a `nn.Sequential` in the `CollabNN` model?\n", - "1. What kind of model should be use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model?" + "1. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?\n", + "1. Why is there an `nn.Sequential` in the `CollabNN` model?\n", + "1. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?" ] }, { @@ -1737,10 +1717,10 @@ "source": [ "### Further Research\n", "\n", - "1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change, to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n", - "1. Find three other areas where collaborative filtering is being used, and find out what pros and cons of this approach in those areas.\n", - "1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. 
- "1. Create a model for MovieLens with works with CrossEntropy loss, and compare it to the model in this chapter."
+ "1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
+ "1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.\n",
+ "1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas).\n",
+ "1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter."
 ]
 },
 {
diff --git a/clean/09_tabular.ipynb b/clean/09_tabular.ipynb
index 2a76d5d..d016a27 100644
--- a/clean/09_tabular.ipynb
+++ b/clean/09_tabular.ipynb
@@ -954,6 +954,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
+ "#hide\n",
 "to = (path/'to.pkl').load()"
 ]
 },
@@ -7779,7 +7780,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### Finding out of Domain Data"
+ "### Finding Out-of-Domain Data"
 ]
 },
 {
@@ -8311,7 +8312,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### Ensembling"
+ "## Ensembling"
 ]
 },
 {
@@ -8378,12 +8379,12 @@
 "source": [
 "1. What is a continuous variable?\n",
 "1. What is a categorical variable?\n",
- "1. Provide 2 of the words that are used for the possible values of a categorical variable.\n",
+ "1. Provide two of the words that are used for the possible values of a categorical variable.\n",
 "1. What is a \"dense layer\"?\n",
 "1. How do entity embeddings reduce memory usage and speed up neural networks?\n",
- "1. What kind of datasets are entity embeddings especially useful for?\n",
+ "1. What kinds of datasets are entity embeddings especially useful for?\n",
 "1. What are the two main families of machine learning algorithms?\n",
- "1. Why do some categorical columns need a special ordering in their classes? How do you do this in pandas?\n",
+ "1. Why do some categorical columns need a special ordering in their classes? How do you do this in Pandas?\n",
 "1. Summarize what a decision tree algorithm does.\n",
 "1. Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?\n",
 "1. Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?\n",
@@ -8394,19 +8395,20 @@
 "1. What is bagging?\n",
 "1. What is the difference between `max_samples` and `max_features` when creating a random forest?\n",
 "1. If you increase `n_estimators` to a very high value, can that lead to overfitting? Why or why not?\n",
- "1. What is *out of bag error*?\n",
+ "1. In the section \"Creating a Random Forest\", just after <>, why did `preds.mean(0)` give the same result as our random forest?\n",
+ "1. What is \"out-of-bag error\"?\n",
 "1. Make a list of reasons why a model's validation set error might be worse than the OOB error. How could you test your hypotheses?\n",
- "1. How can you answer each of these things with a random forest? How do they work?:\n",
- " - How confident are we in our projections using a particular row of data?\n",
+ "1. Explain why random forests are well suited to answering each of the following questions:\n",
+ " - How confident are we in our predictions using a particular row of data?\n",
 " - For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n",
 " - Which columns are the strongest predictors?\n",
- " - How do predictions vary, as we vary these columns?\n",
+ " - How do predictions vary as we vary these columns?\n",
 "1. What's the purpose of removing unimportant variables?\n",
 "1. What's a good type of plot for showing tree interpreter results?\n",
- "1. What is the *extrapolation problem*?\n",
- "1. How can you tell if your test or validation set is distributed in a different way to your training set?\n",
- "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9000 distinct values?\n",
- "1. What is boosting?\n",
+ "1. What is the \"extrapolation problem\"?\n",
+ "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
+ "1. Why do we make `saleElapsed` a continuous variable, even though it has fewer than 9,000 distinct values?\n",
+ "1. What is \"boosting\"?\n",
 "1. How could we use embeddings with a random forest? Would we expect this to help?\n",
 "1. Why might we not always use a neural net for tabular modeling?"
 ]
@@ -8422,8 +8424,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare yourself to the private leaderboard.\n",
- "1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on this dataset.\n",
+ "1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare your results to the private leaderboard.\n",
+ "1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.\n",
 "1. Use the embeddings from the neural net in this chapter in a random forest, and see if you can improve on the random forest results we saw.\n",
 "1. Explain what each line of the source of `TabularModel` does (with the exception of the `BatchNorm1d` and `Dropout` layers)."
 ]
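The new chapter 9 questionnaire items above lean on one fact worth spelling out for reviewers: a random forest's prediction for a row is simply the mean of its individual trees' predictions, and the spread of those per-tree predictions gives a rough per-row confidence measure. Below is a minimal sketch of that idea using scikit-learn on synthetic data; it is not the chapter's bulldozers pipeline, and the dataset and variable names here are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular dataset (not the bulldozers data used in the chapter).
X, y = make_regression(n_samples=2000, n_features=10, noise=0.3, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=5,
                          n_jobs=-1, random_state=42)
m.fit(X_train, y_train)

# One prediction per tree, per validation row: shape (n_trees, n_valid_rows).
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

# Averaging over the tree axis reproduces the forest's own prediction,
# which is the point of the questionnaire item about `preds.mean(0)`.
assert np.allclose(preds.mean(0), m.predict(X_valid))

# The standard deviation over the same axis is a per-row confidence measure:
# rows where the trees disagree most are the rows the model is least sure about.
print(preds.std(0)[:5])
```

scikit-learn exposes the averaged version directly as `m.predict`, but keeping the per-tree matrix around is what makes the confidence and feature-variation questions in the questionnaire easy to explore empirically.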