Sylvain Gugger 2020-05-18 18:18:45 -07:00
parent d8d39c560a
commit 4b1345a068
20 changed files with 276 additions and 287 deletions

@ -247,9 +247,9 @@
"source": [
"The hardest part of deep learning is artisanal: how do you know if you've got enough data, whether it is in the right format, if your model is training properly, and, if it's not, what you should do about it? That is why we believe in learning by doing. As with basic data science skills, with deep learning you only get better through practical experience. Trying to spend too much time on the theory can be counterproductive. The key is to just code and try to solve problems: the theory can come later, when you have context and motivation.\n",
"\n",
"There will be times when the journey will feel hard. Times where you feel stuck. Don't give up! Rewind through the book to find the last bit where you definitely weren't stuck, and then read slowly through from there to find the first thing that isn't clear. Then try some code experiments yourself, and Google around for more tutorials on whatever the issue you're stuck with is--often you'll find some different angle on the material might help it to click. Also, it's expected and normal to not understand everything (especially the code) on first reading. Trying to understand the material serially before proceeding can sometimes be hard. Sometimes things click into place after you get more context from parts down the road, from having a bigger picture. So if you do get stuck on a section, try moving on anyway and make a note to come back to it later.\n",
"There will be times when the journey will feel hard. Times where you feel stuck. Don't give up! Rewind through the book to find the last bit where you definitely weren't stuck, and then read slowly through from there to find the first thing that isn't clear. Then try some code experiments yourself, and Google around for more tutorials on whatever the issue you're stuck with is—often you'll find some different angle on the material might help it to click. Also, it's expected and normal to not understand everything (especially the code) on first reading. Trying to understand the material serially before proceeding can sometimes be hard. Sometimes things click into place after you get more context from parts down the road, from having a bigger picture. So if you do get stuck on a section, try moving on anyway and make a note to come back to it later.\n",
"\n",
"Remember, you don't need any particular academic background to succeed at deep learning. Many important breakthroughs are made in research and industry by folks without a PhD, such as [\"Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks\"](https://arxiv.org/abs/1511.06434)--one of the most influential papers of the last decade--with over 5,000 citations, which was written by Alec Radford when he was an undergraduate. Even at Tesla, where they're trying to solve the extremely tough challenge of making a self-driving car, CEO [Elon Musk says](https://twitter.com/elonmusk/status/1224089444963311616):\n",
"Remember, you don't need any particular academic background to succeed at deep learning. Many important breakthroughs are made in research and industry by folks without a PhD, such as [\"Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks\"](https://arxiv.org/abs/1511.06434)—one of the most influential papers of the last decade—with over 5,000 citations, which was written by Alec Radford when he was an undergraduate. Even at Tesla, where they're trying to solve the extremely tough challenge of making a self-driving car, CEO [Elon Musk says](https://twitter.com/elonmusk/status/1224089444963311616):\n",
"\n",
"> : A PhD is definitely not required. All that matters is a deep understanding of AI & ability to implement NNs in a way that is actually useful (latter point is whats truly hard). Dont care if you even graduated high school."
]
@ -663,7 +663,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, how do we know if this model is any good? In the last column of the table you can see the error rate, which is the proportion of images that were incorrectly identified. The error rate serves as our metric--our measure of model quality, chosen to be intuitive and comprehensible. As you can see, the model is nearly perfect, even though the training time was only a few seconds (not including the one-time downloading of the dataset and the pretrained model). In fact, the accuracy you've achieved already is far better than anybody had ever achieved just 10 years ago!\n",
"So, how do we know if this model is any good? In the last column of the table you can see the error rate, which is the proportion of images that were incorrectly identified. The error rate serves as our metric—our measure of model quality, chosen to be intuitive and comprehensible. As you can see, the model is nearly perfect, even though the training time was only a few seconds (not including the one-time downloading of the dataset and the pretrained model). In fact, the accuracy you've achieved already is far better than anybody had ever achieved just 10 years ago!\n",
"\n",
"Finally, let's check that this model actually works. Go and get a photo of a dog, or a cat; if you don't have one handy, just search Google Images and download an image that you find there. Now execute the cell with `uploader` defined. It will output a button you can click, so you can select the image you want to classify:"
]
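The error rate described in this hunk can be sketched in a few lines of plain Python. This is only an illustration of the definition (the proportion of items incorrectly identified); `simple_error_rate` and the sample labels are hypothetical names and data, not fastai's own metric implementation.

```python
def simple_error_rate(predictions, targets):
    """Proportion of predictions that do not match their target label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one item to compute an error rate")
    wrong = sum(p != t for p, t in zip(predictions, targets))
    return wrong / len(targets)


# Hypothetical labels for five images; exactly one prediction is wrong.
preds = ["cat", "dog", "dog", "cat", "dog"]
targets = ["cat", "dog", "cat", "cat", "dog"]

rate = simple_error_rate(preds, targets)
print(f"error rate: {rate:.2f}, accuracy: {1 - rate:.2f}")
```

Accuracy is simply one minus the error rate, which is why the text can move between the two terms freely.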
@ -901,7 +901,7 @@
"\n",
"Let us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.\n",
"\n",
"Weights are just variables, and a weight assignment is a particular choice of values for those variables. The program's inputs are values that it processes in order to produce its results--for instance, taking image pixels as inputs, and returning the classification \"dog\" as a result. The program's weight assignments are other values that define how the program will operate.\n",
"Weights are just variables, and a weight assignment is a particular choice of values for those variables. The program's inputs are values that it processes in order to produce its results—for instance, taking image pixels as inputs, and returning the classification \"dog\" as a result. The program's weight assignments are other values that define how the program will operate.\n",
"\n",
"Since they will affect the program they are in a sense another kind of input, so we will update our basic picture in <<basic_program>> and replace it with <<weight_assignment>> in order to take this into account."
]
@ -1004,7 +1004,7 @@
"\n",
"Finally, he says we need *a mechanism for altering the weight assignment so as to maximize the performance*. For instance, we could look at the difference in weights between the winning model and the losing model, and adjust the weights a little further in the winning direction.\n",
"\n",
"We can now see why he said that such a procedure *could be made entirely automatic and... a machine so programmed would \"learn\" from its experience*. Learning would become entirely automatic when the adjustment of the weights was also automatic--when instead of us improving a model by adjusting its weights manually, we relied on an automated mechanism that produced adjustments based on performance.\n",
"We can now see why he said that such a procedure *could be made entirely automatic and... a machine so programmed would \"learn\" from its experience*. Learning would become entirely automatic when the adjustment of the weights was also automatic—when instead of us improving a model by adjusting its weights manually, we relied on an automated mechanism that produced adjustments based on performance.\n",
"\n",
"<<training_loop>> shows the full picture of Samuel's idea of training a machine learning model."
]
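Samuel's idea of adjusting weights "a little further in the winning direction" can be sketched with a toy one-weight model. Everything here is an assumption made for illustration: the `model`, `performance`, and `train` functions, the data, and the step size are hypothetical, and this compare-and-keep-the-winner loop is a stand-in, not the update rule used by fastai.

```python
def model(x, weight):
    """A toy model whose behaviour is defined entirely by one weight."""
    return weight * x


def performance(weight, data):
    """Negative mean squared error, so that higher means better."""
    return -sum((model(x, weight) - y) ** 2 for x, y in data) / len(data)


def train(data, weight=0.0, step=0.1, rounds=200):
    """Repeatedly compare nearby weight assignments and keep the winner."""
    for _ in range(rounds):
        candidates = [weight - step, weight, weight + step]
        weight = max(candidates, key=lambda w: performance(w, data))
    return weight


# Hypothetical data generated by y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
learned = train(data)
print(f"learned weight: {learned:.2f}")
```

No person adjusts the weight at any point: the loop measures performance and moves the weight itself, which is the sense in which the procedure is "entirely automatic".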
@ -1123,7 +1123,7 @@
"source": [
"Notice the distinction between the model's *results* (e.g., the moves in a checkers game) and its *performance* (e.g., whether it wins the game, or how quickly it wins). \n",
"\n",
"Also note that once the model is trained--that is, once we've chosen our final, best, favorite weight assignment--then we can think of the weights as being *part of the model*, since we're not varying them any more.\n",
"Also note that once the model is trained—that is, once we've chosen our final, best, favorite weight assignment—then we can think of the weights as being *part of the model*, since we're not varying them any more.\n",
"\n",
"Therefore, actually *using* a model after it's trained looks like <<using_model>>."
]
@ -1229,7 +1229,7 @@
"source": [
"It's not too hard to imagine what the model might look like for a checkers program. There might be a range of checkers strategies encoded, and some kind of search mechanism, and then the weights could vary how strategies are selected, what parts of the board are focused on during a search, and so forth. But it's not at all obvious what the model might look like for an image recognition program, or for understanding text, or for many other interesting problems we might imagine.\n",
"\n",
"What we would like is some kind of function that is so flexible that it could be used to solve any given problem, just by varying its weights. Amazingly enough, this function actually exists! It's the neural network, which we already discussed. That is, if you regard a neural network as a mathematical function, it turns out to be a function which is extremely flexible depending on its weights. A mathematical proof called the *universal approximation theorem* shows that this function can solve any problem to any level of accuracy, in theory. The fact that neural networks are so flexible means that, in practice, they are often a suitable kind of model, and you can focus your effort on the process of training them--that is, of finding good weight assignments.\n",
"What we would like is some kind of function that is so flexible that it could be used to solve any given problem, just by varying its weights. Amazingly enough, this function actually exists! It's the neural network, which we already discussed. That is, if you regard a neural network as a mathematical function, it turns out to be a function which is extremely flexible depending on its weights. A mathematical proof called the *universal approximation theorem* shows that this function can solve any problem to any level of accuracy, in theory. The fact that neural networks are so flexible means that, in practice, they are often a suitable kind of model, and you can focus your effort on the process of training them—that is, of finding good weight assignments.\n",
"\n",
"But what about that process? One could imagine that you might need to find a new \"mechanism\" for automatically updating weight for every problem. This would be laborious. What we'd like here as well is a completely general way to update the weights of a neural network, to make it improve at any given task. Conveniently, this also exists!\n",
"\n",
@ -1271,7 +1271,7 @@
"source": [
"Samuel was working in the 1960s, and since then terminology has changed. Here is the modern deep learning terminology for all the pieces we have discussed:\n",
"\n",
"- The functional form of the *model* is called its *architecture* (but be careful--sometimes people use *model* as a synonym of *architecture*, so this can get confusing).\n",
"- The functional form of the *model* is called its *architecture* (but be careful—sometimes people use *model* as a synonym of *architecture*, so this can get confusing).\n",
"- The *weights* are called *parameters*.\n",
"- The *predictions* are calculated from the *independent variable*, which is the *data* not including the *labels*.\n",
"- The *results* of the model are called *predictions*.\n",
@ -1509,9 +1509,9 @@
" label_func=is_cat, item_tfms=Resize(224))\n",
"```\n",
"\n",
"There are various different classes for different kinds of deep learning datasets and problems--here we're using `ImageDataLoaders`. The first part of the class name will generally be the type of data you have, such as image, or text.\n",
"There are various different classes for different kinds of deep learning datasets and problems—here we're using `ImageDataLoaders`. The first part of the class name will generally be the type of data you have, such as image, or text.\n",
"\n",
"The other important piece of information that we have to tell fastai is how to get the labels from the dataset. Computer vision datasets are normally structured in such a way that the label for an image is part of the filename, or path--most commonly the parent folder name. Fastai comes with a number of standardized labeling methods, and ways to write your own. Here we're telling fastai to use the `is_cat` function we just defined.\n",
"The other important piece of information that we have to tell fastai is how to get the labels from the dataset. Computer vision datasets are normally structured in such a way that the label for an image is part of the filename, or path—most commonly the parent folder name. fastai comes with a number of standardized labeling methods, and ways to write your own. Here we're telling fastai to use the `is_cat` function we just defined.\n",
"\n",
"Finally, we define the `Transform`s that we need. A `Transform` contains code that is applied automatically during training; fastai includes many predefined `Transform`s, and adding new ones is as simple as creating a Python function. There are two kinds: `item_tfms` are applied to each item (in this case, each item is resized to a 224-pixel square), while `batch_tfms` are applied to a *batch* of items at a time using the GPU, so they're particularly fast (we'll see many examples of these throughout this book).\n",
"\n",
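The two labeling conventions mentioned in this hunk (label in the filename, or label as the parent folder name) can be sketched as plain functions. This assumes `is_cat` is the one-line uppercase check the surrounding text describes; `parent_label` and the example paths other than *great_pyrenees_173.jpg* are hypothetical.

```python
from pathlib import Path


def is_cat(filename):
    """Pet-dataset convention: cat filenames start with an uppercase letter."""
    return filename[0].isupper()


def parent_label(path):
    """Alternative convention: the label is the name of the parent folder."""
    return Path(path).parent.name


print(is_cat("great_pyrenees_173.jpg"))         # a dog breed, lowercase first letter
print(is_cat("Birman_12.jpg"))                  # illustrative cat filename
print(parent_label("pets/train/dog/0001.jpg"))  # illustrative folder layout
```

Either function could be handed to a data loader as the labeling step; which one applies depends only on how the dataset happens to be laid out on disk.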
@ -1531,7 +1531,7 @@
"source": [
"The Pet dataset contains 7,390 pictures of dogs and cats, consisting of 37 different breeds. Each image is labeled using its filename: for instance the file *great\\_pyrenees\\_173.jpg* is the 173rd example of an image of a Great Pyrenees breed dog in the dataset. The filenames start with an uppercase letter if the image is a cat, and a lowercase letter otherwise. We have to tell fastai how to get labels from the filenames, which we do by calling `from_name_func` (which means that filenames can be extracted using a function applied to the filename), and passing `x[0].isupper()`, which evaluates to `True` if the first letter is uppercase (i.e., it's a cat).\n",
"\n",
"The most important parameter to mention here is `valid_pct=0.2`. This tells fastai to hold out 20% of the data and *not use it for training the model at all*. This 20% of the data is called the *validation set*; the remaining 80% is called the *training set*. The validation set is used to measure the accuracy of the model. By default, the 20% that is held out is selected randomly. The parameter `seed=42` sets the *random seed* to the same value every time we run this code, which means we get the same validation set every time we run it--this way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not due to having a different random validation set.\n",
"The most important parameter to mention here is `valid_pct=0.2`. This tells fastai to hold out 20% of the data and *not use it for training the model at all*. This 20% of the data is called the *validation set*; the remaining 80% is called the *training set*. The validation set is used to measure the accuracy of the model. By default, the 20% that is held out is selected randomly. The parameter `seed=42` sets the *random seed* to the same value every time we run this code, which means we get the same validation set every time we run it—this way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not due to having a different random validation set.\n",
"\n",
"fastai will *always* show you your model's accuracy using *only* the validation set, *never* the training set. This is absolutely critical, because if you train a large enough model for a long enough time, it will eventually memorize the label of every item in your dataset! The result will not actually be a useful model, because what we care about is how well our model works on *previously unseen images*. That is always our goal when creating a model: for it to be useful on data that the model only sees in the future, after it has been trained.\n",
"\n",
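The effect of `valid_pct=0.2` and `seed=42` can be illustrated with a small standalone sketch. `random_split` is a hypothetical helper written with the standard library; it is not how fastai implements its splitter, and the filenames are made up.

```python
import random


def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the first valid_pct of items."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]


files = [f"img_{i}.jpg" for i in range(10)]  # hypothetical filenames
train, valid = random_split(files)
train_again, valid_again = random_split(files)

print(len(train), len(valid))   # sizes of the two sets
print(valid == valid_again)     # same seed, same validation set
```

Because the seed is fixed, rerunning the split yields the same held-out items, so a change in validation accuracy between two runs reflects a change in the model and not a change in the data.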
@ -1551,7 +1551,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Overfitting is the single most important and challenging issue** when training for all machine learning practitioners, and all algorithms. As you will see, it is very easy to create a model that does a great job at making predictions on the exact data it has been trained on, but it is much harder to make accurate predictions on data the model has never seen before. And of course, this is the data that will actually matter in practice. For instance, if you create a handwritten digit classifier (as we will very soon!) and use it to recognize numbers written on checks, then you are never going to see any of the numbers that the model was trained on--check will have slightly different variations of writing to deal with. You will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occurring (i.e., you have actually observed the validation accuracy getting worse during training). We often see practitioners using over-fitting avoidance techniques even when they have enough data that they didn't need to do so, ending up with a model that may be less accurate than what they could have achieved."
"**Overfitting is the single most important and challenging issue** when training for all machine learning practitioners, and all algorithms. As you will see, it is very easy to create a model that does a great job at making predictions on the exact data it has been trained on, but it is much harder to make accurate predictions on data the model has never seen before. And of course, this is the data that will actually matter in practice. For instance, if you create a handwritten digit classifier (as we will very soon!) and use it to recognize numbers written on checks, then you are never going to see any of the numbers that the model was trained on—check will have slightly different variations of writing to deal with. You will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occurring (i.e., you have actually observed the validation accuracy getting worse during training). We often see practitioners using over-fitting avoidance techniques even when they have enough data that they didn't need to do so, ending up with a model that may be less accurate than what they could have achieved."
]
},
{
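The advice to confirm overfitting before acting on it (by observing validation accuracy getting worse during training) can be sketched as a check over per-epoch validation accuracies. The function name, the `patience` parameter, and the accuracy histories below are all hypothetical; this is one simple way to phrase the check, not a fastai API.

```python
def best_epoch_before_decline(valid_accuracy, patience=2):
    """Return the index of the best epoch if validation accuracy then stays
    below that best for `patience` epochs in a row; otherwise return None."""
    best_epoch = 0
    for epoch, acc in enumerate(valid_accuracy):
        if acc >= valid_accuracy[best_epoch]:
            best_epoch = epoch              # still improving (or holding steady)
        elif epoch - best_epoch >= patience:
            return best_epoch               # accuracy has been worse for a while
    return None


# Hypothetical validation accuracies, one value per epoch.
declining = [0.90, 0.93, 0.95, 0.94, 0.93]
improving = [0.90, 0.92, 0.94]

print(best_epoch_before_decline(declining))
print(best_epoch_before_decline(improving))
```

A result of `None` means there is no evidence of overfitting yet, which is the case where the text recommends leaving the avoidance techniques alone.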
@ -1614,7 +1614,7 @@
"\n",
"This is the key to deep learning—determining how to fit the parameters of a model to get it to solve your problem. In order to fit a model, we have to provide at least one piece of information: how many times to look at each image (known as number of *epochs*). The number of epochs you select will largely depend on how much time you have available, and how long you find it takes in practice to fit your model. If you select a number that is too small, you can always train for more epochs later.\n",
"\n",
"But why is the method called `fine_tune`, and not `fit`? fastai actually *does* have a method called `fit`, which does indeed fit a model (i.e. look at images in the training set multiple times, each time updating the parameters to make the predictions closer and closer to the target labels). But in this case, we've started with a pretrained model, and we don't want to throw away all those capabilities that it already has. As you'll learn in this book, there are some important tricks to adapt a pretrained model for a new dataset--a process called *fine-tuning*."
"But why is the method called `fine_tune`, and not `fit`? fastai actually *does* have a method called `fit`, which does indeed fit a model (i.e. look at images in the training set multiple times, each time updating the parameters to make the predictions closer and closer to the target labels). But in this case, we've started with a pretrained model, and we don't want to throw away all those capabilities that it already has. As you'll learn in this book, there are some important tricks to adapt a pretrained model for a new dataset—a process called *fine-tuning*."
]
},
{
@ -1714,7 +1714,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This article was studying an older model called *AlexNet* that only contained five layers. Networks developed since then can have hundreds of layers--so you can imagine how rich the features developed by these models can be! \n",
"This article was studying an older model called *AlexNet* that only contained five layers. Networks developed since then can have hundreds of layers—so you can imagine how rich the features developed by these models can be! \n",
"\n",
"When we fine-tuned our pretrained model earlier, we adapted what those last layers focus on (flowers, humans, animals) to specialize on the cats versus dogs problem. More generally, we could specialize such a pretrained model on many different tasks. Let's have a look at some examples. "
]
@ -1855,9 +1855,9 @@
"\n",
"Every model starts with a choice of *architecture*, a general template for how that kind of model works internally. The process of *training* (or *fitting*) the model is the process of finding a set of *parameter values* (or *weights*) that specialize that general architecture into a model that works well for our particular kind of data. In order to define how well a model does on a single prediction, we need to define a *loss function*, which determines how we score a prediction as good or bad.\n",
"\n",
"To make the training process go faster, we might start with a *pretrained model*--a model that has already been trained on someone else's data. We can then adapt it to our data by training it a bit more on our data, a process called *fine-tuning*.\n",
"To make the training process go faster, we might start with a *pretrained model*—a model that has already been trained on someone else's data. We can then adapt it to our data by training it a bit more on our data, a process called *fine-tuning*.\n",
"\n",
"When we train a model, a key concern is to ensure that our model *generalizes*--that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called *overfitting*. In order to avoid this, we always divide our data into two parts, the *training set* and the *validation set*. We train the model by showing it only the training set and then we evaluate how well the model is doing by seeing how well it performs on items from the validation set. In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order for a person to assess how well the model is doing on the validation set overall, we define a *metric*. During the training process, when the model has seen every item in the training set, we call that an *epoch*.\n",
"When we train a model, a key concern is to ensure that our model *generalizes*—that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called *overfitting*. In order to avoid this, we always divide our data into two parts, the *training set* and the *validation set*. We train the model by showing it only the training set and then we evaluate how well the model is doing by seeing how well it performs on items from the validation set. In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order for a person to assess how well the model is doing on the validation set overall, we define a *metric*. During the training process, when the model has seen every item in the training set, we call that an *epoch*.\n",
"\n",
"All these concepts apply to machine learning in general. That is, they apply to all sorts of schemes for defining a model by training it with data. What makes deep learning distinctive is a particular class of architectures: the architectures based on *neural networks*. In particular, tasks like image classification rely heavily on *convolutional neural networks*, which we will discuss shortly."
]
@ -2226,7 +2226,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In a Jupyter notebook, the order in which you execute each cell is very important. It's not like Excel, where everything gets updated as soon as you type something anywhere--it has an inner state that gets updated each time you execute a cell. For instance, when you run the first cell of the notebook (with the \"CLICK ME\" comment), you create an object called `learn` that contains a model and data for an image classification problem. If we were to run the cell just shown in the text (the one that predicts if a review is good or not) straight after, we would get an error as this `learn` object does not contain a text classification model. This cell needs to be run after the one containing:\n",
"In a Jupyter notebook, the order in which you execute each cell is very important. It's not like Excel, where everything gets updated as soon as you type something anywhere—it has an inner state that gets updated each time you execute a cell. For instance, when you run the first cell of the notebook (with the \"CLICK ME\" comment), you create an object called `learn` that contains a model and data for an image classification problem. If we were to run the cell just shown in the text (the one that predicts if a review is good or not) straight after, we would get an error as this `learn` object does not contain a text classification model. This cell needs to be run after the one containing:\n",
"\n",
"```python\n",
"from fastai2.text.all import *\n",
@ -2700,11 +2700,11 @@
"source": [
"The problem is that even though the ordinary training process is only looking at predictions on the training data when it learns values for the weight parameters, the same is not true of us. We, as modelers, are evaluating the model by looking at predictions on the validation data when we decide to explore new hyperparameter values! So subsequent versions of the model are, indirectly, shaped by us having seen the validation data. Just as the automatic training process is in danger of overfitting the training data, we are in danger of overfitting the validation data through human trial and error and exploration.\n",
"\n",
"The solution to this conundrum is to introduce another level of even more highly reserved data, the *test set*. Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves. It cannot be used to improve the model; it can only be used to evaluate the model at the very end of our efforts. In effect, we define a hierarchy of cuts of our data, based on how fully we want to hide it from training and modeling processes: training data is fully exposed, the validation data is less exposed, and test data is totally hidden. This hierarchy parallels the different kinds of modeling and evaluation processes themselves--the automatic training process with back propagation, the more manual process of trying different hyper-parameters between training sessions, and the assessment of our final result.\n",
"The solution to this conundrum is to introduce another level of even more highly reserved data, the *test set*. Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves. It cannot be used to improve the model; it can only be used to evaluate the model at the very end of our efforts. In effect, we define a hierarchy of cuts of our data, based on how fully we want to hide it from training and modeling processes: training data is fully exposed, the validation data is less exposed, and test data is totally hidden. This hierarchy parallels the different kinds of modeling and evaluation processes themselves—the automatic training process with back propagation, the more manual process of trying different hyper-parameters between training sessions, and the assessment of our final result.\n",
"\n",
"The test and validation sets should have enough data to ensure that you get a good estimate of your accuracy. If you're creating a cat detector, for instance, you generally want at least 30 cats in your validation set. That means that if you have a dataset with thousands of items, using the default 20% validation set size may be more than you need. On the other hand, if you have lots of data, using some of it for validation probably doesn't have any downsides.\n",
"\n",
"Having two levels of \"reserved data\"--a validation set and a test set, with one level representing data that you are virtually hiding from yourself--may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set--if you have very little data, you may need to just have a validation set--but generally it's best to use one if at all possible.\n",
"Having two levels of \"reserved data\"—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set—if you have very little data, you may need to just have a validation set—but generally it's best to use one if at all possible.\n",
"\n",
"This same discipline can be critical if you intend to hire a third party to perform modeling work on your behalf. A third party might not understand your requirements accurately, or their incentives might even encourage them to misunderstand them. A good test set can greatly mitigate these risks and let you evaluate whether their work solves your actual problem.\n",
"\n",

@ -119,7 +119,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many domains in which deep learning has not been used to analyze images yet, but those where it has been tried have nearly universally shown that computers can recognize what items are in an image at least as well as people can—even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing where objects in an image are, and can highlight their locations and name each found object. This is known as *object detection* (there is also a variant of this that we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part of--this is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. Similarly, if the training data did not contain hand-drawn images, then the model will probably do poorly on hand-drawn images. There is no general way to check what types of images are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out-of-domain* data).\n",
"There are many domains in which deep learning has not been used to analyze images yet, but those where it has been tried have nearly universally shown that computers can recognize what items are in an image at least as well as people can—even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing where objects in an image are, and can highlight their locations and name each found object. This is known as *object detection* (there is also a variant of this that we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part ofthis is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. Similarly, if the training data did not contain hand-drawn images, then the model will probably do poorly on hand-drawn images. There is no general way to check what types of images are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out-of-domain* data).\n",
"\n",
"One major challenge for object detection systems is that image labelling can be slow and expensive. There is a lot of work at the moment going into tools to try to make this labelling faster and easier, and to require fewer handcrafted labels to train accurate object detection models. One approach that is particularly helpful is to synthetically generate variations of input images, such as by rotating them or changing their brightness and contrast; this is called *data augmentation* and also works well for text and other types of models. We will be discussing it in detail in this chapter.\n",
"\n",
@ -137,7 +137,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Computers are very good at classifying both short and long documents based on categories such as spam or not spam, sentiment (e.g., is the review positive or negative), author, source website, and so forth. We are not aware of any rigorous work done in this area to compare them to humans, but anecdotally it seems to us that deep learning performance is similar to human performance on these tasks. Deep learning is also very good at generating context-appropriate text, such as replies to social media posts, and imitating a particular author's style. It's good at making this content compelling to humans too--in fact, even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content that appears to a layman to be compelling, but actually is entirely incorrect.\n",
"Computers are very good at classifying both short and long documents based on categories such as spam or not spam, sentiment (e.g., is the review positive or negative), author, source website, and so forth. We are not aware of any rigorous work done in this area to compare them to humans, but anecdotally it seems to us that deep learning performance is similar to human performance on these tasks. Deep learning is also very good at generating context-appropriate text, such as replies to social media posts, and imitating a particular author's style. It's good at making this content compelling to humans tooin fact, even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content that appears to a layman to be compelling, but actually is entirely incorrect.\n",
"\n",
"Another concern is that context-appropriate, highly compelling responses on social media could be used at massive scale—thousands of times greater than any troll farm previously seen—to spread disinformation, create unrest, and encourage conflict. As a rule of thumb, text generation models will always be technologically a bit ahead of models recognizing automatically generated text. For instance, it is possible to use a model that can recognize artificially generated content to actually improve the generator that creates that content, until the classification model is no longer able to complete its task.\n",
"\n",
@ -171,7 +171,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For analyzing time series and tabular data, deep learning has recently been making great strides. However, deep learning is generally used as part of an ensemble of multiple types of model. If you already have a system that is using random forests or gradient boosting machines (popular tabular modeling tools that you will learn about soon), then switching to or adding deep learning may not result in any dramatic improvement. Deep learning does greatly increase the variety of columns that you can include--for example, columns containing natural language (book titles, reviews, etc.), and high-cardinality categorical columns (i.e., something that contains a large number of discrete choices, such as zip code or product ID). On the down side, deep learning models generally take longer to train than random forests or gradient boosting machines, although this is changing thanks to libraries such as [RAPIDS](https://rapids.ai/), which provides GPU acceleration for the whole modeling pipeline. We cover the pros and cons of all these methods in detail in <<chapter_tabular>>."
"For analyzing time series and tabular data, deep learning has recently been making great strides. However, deep learning is generally used as part of an ensemble of multiple types of model. If you already have a system that is using random forests or gradient boosting machines (popular tabular modeling tools that you will learn about soon), then switching to or adding deep learning may not result in any dramatic improvement. Deep learning does greatly increase the variety of columns that you can includefor example, columns containing natural language (book titles, reviews, etc.), and high-cardinality categorical columns (i.e., something that contains a large number of discrete choices, such as zip code or product ID). On the down side, deep learning models generally take longer to train than random forests or gradient boosting machines, although this is changing thanks to libraries such as [RAPIDS](https://rapids.ai/), which provides GPU acceleration for the whole modeling pipeline. We cover the pros and cons of all these methods in detail in <<chapter_tabular>>."
]
},
{
@ -187,7 +187,7 @@
"source": [
"Recommendation systems are really just a special type of tabular data. In particular, they generally have a high-cardinality categorical variable representing users, and another one representing products (or something similar). A company like Amazon represents every purchase that has ever been made by its customers as a giant sparse matrix, with customers as the rows and products as the columns. Once they have the data in this format, data scientists apply some form of collaborative filtering to *fill in the matrix*. For example, if customer A buys products 1 and 10, and customer B buys products 1, 2, 4, and 10, the engine will recommend that A buy 2 and 4. Because deep learning models are good at handling high-cardinality categorical variables, they are quite good at handling recommendation systems. They particularly come into their own, just like for tabular data, when combining these variables with other kinds of data, such as natural language or images. They can also do a good job of combining all of these types of information with additional metadata represented as tables, such as user information, previous transactions, and so forth.\n",
"\n",
"However, nearly all machine learning approaches have the downside that they only tell you what products a particular user might like, rather than what recommendations would be helpful for a user. Many kinds of recommendations for products a user might like may not be at all helpful--for instance, if the user is already familiar with the products, or if they are simply different packagings of products they have already purchased (such as a boxed set of novels, when they already have each of the items in that set). Jeremy likes reading books by Terry Pratchett, and for a while Amazon was recommending nothing but Terry Pratchett books to him (see <<pratchett>>), which really wasn't helpful because he already was aware of these books!"
"However, nearly all machine learning approaches have the downside that they only tell you what products a particular user might like, rather than what recommendations would be helpful for a user. Many kinds of recommendations for products a user might like may not be at all helpfulfor instance, if the user is already familiar with the products, or if they are simply different packagings of products they have already purchased (such as a boxed set of novels, when they already have each of the items in that set). Jeremy likes reading books by Terry Pratchett, and for a while Amazon was recommending nothing but Terry Pratchett books to him (see <<pratchett>>), which really wasn't helpful because he already was aware of these books!"
]
},
{
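The customer A and customer B example above can be sketched in a few lines. This is only a toy overlap heuristic, assumed here for illustration: real collaborative filtering (including the deep learning variety) learns to fill in the sparse matrix rather than counting shared purchases, and the `recommend` function and `purchases` dictionary are hypothetical names.

```python
from collections import Counter

def recommend(purchases, customer):
    """Suggest products bought by customers who share a purchase with `customer`."""
    owned = purchases[customer]
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)  # number of shared products, used as a similarity weight
        if overlap == 0:
            continue
        for item in items - owned:    # only products the customer does not have yet
            scores[item] += overlap
    # Highest score first; ties are broken by product ID so the order is stable
    return [item for item, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

# Customer A bought products 1 and 10; customer B bought 1, 2, 4, and 10
purchases = {"A": {1, 10}, "B": {1, 2, 4, 10}}
print(recommend(purchases, "A"))
```

Products the customer already owns are excluded from the suggestions, but the sketch still shares the downside described above: it knows what a customer might like, not whether the suggestion is actually helpful.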
@ -270,7 +270,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For many types of projects, you may be able to find all the data you need online. The project we'll be completing in this chapter is a *bear detector*. It will discriminate between three types of bear: grizzly, black, and teddy bears. There are many images on the internet of each type of bear that we can use. We just need a way to find them and download them. We've provided a tool you can use for this purpose, so you can follow along with this chapter and create your own image recognition application for whatever kinds of objects you're interested in. In the fast.ai course, thousands of students have presented their work in the course forums, displaying everything from hummingbird varieties in Trinidad to bus types in Panama--one student even created an application that would help his fiancée recognize his 16 cousins during Christmas vacation!"
"For many types of projects, you may be able to find all the data you need online. The project we'll be completing in this chapter is a *bear detector*. It will discriminate between three types of bear: grizzly, black, and teddy bears. There are many images on the internet of each type of bear that we can use. We just need a way to find them and download them. We've provided a tool you can use for this purpose, so you can follow along with this chapter and create your own image recognition application for whatever kinds of objects you're interested in. In the fast.ai course, thousands of students have presented their work in the course forums, displaying everything from hummingbird varieties in Trinidad to bus types in Panamaone student even created an application that would help his fiancée recognize his 16 cousins during Christmas vacation!"
]
},
{
@ -1684,7 +1684,7 @@
"source": [
"#hide\n",
"# !pip install voila\n",
"# !jupyter serverextension enable voila --sys-prefix"
"# !jupyter serverextension enable voila sys-prefix"
]
},
{
@ -1696,7 +1696,7 @@
"Next, install Voilà if you haven't already, by copying these lines into a notebook cell and executing it:\n",
"\n",
" !pip install voila\n",
" !jupyter serverextension enable voila --sys-prefix\n",
" !jupyter serverextension enable voila sys-prefix\n",
"\n",
"Cells that begin with a `!` do not contain Python code, but instead contain code that is passed to your shell (bash, Windows PowerShell, etc.). If you are comfortable using the command line, which we'll discuss more later in this book, you can of course simply type these two lines (without the `!` prefix) directly into your terminal. In this case, the first line installs the `voila` library and application, and the second connects it to your existing Jupyter notebook.\n",
"\n",
@ -1764,7 +1764,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You may well want to deploy your application onto mobile devices, or edge devices such as a Raspberry Pi. There are a lot of libraries and frameworks that allow you to integrate a model directly into a mobile application. However, these approaches tend to require a lot of extra steps and boilerplate, and do not always support all the PyTorch and fastai layers that your model might use. In addition, the work you do will depend on what kind of mobile devices you are targeting for deployment--you might need to do some work to run on iOS devices, different work to run on newer Android devices, different work for older Android devices, etc. Instead, we recommend wherever possible that you deploy the model itself to a server, and have your mobile or edge application connect to it as a web service.\n",
"You may well want to deploy your application onto mobile devices, or edge devices such as a Raspberry Pi. There are a lot of libraries and frameworks that allow you to integrate a model directly into a mobile application. However, these approaches tend to require a lot of extra steps and boilerplate, and do not always support all the PyTorch and fastai layers that your model might use. In addition, the work you do will depend on what kind of mobile devices you are targeting for deploymentyou might need to do some work to run on iOS devices, different work to run on newer Android devices, different work for older Android devices, etc. Instead, we recommend wherever possible that you deploy the model itself to a server, and have your mobile or edge application connect to it as a web service.\n",
"\n",
"There are quite a few upsides to this approach. The initial installation is easier, because you only have to deploy a small GUI application, which connects to the server to do all the heavy lifting. More importantly perhaps, upgrades of that core logic can happen on your server, rather than needing to be distributed to all of your users. Your server will have a lot more memory and processing capacity than most edge devices, and it is far easier to scale those resources if your model becomes more demanding. The hardware that you will have on a server is also going to be more standard and more easily supported by fastai and PyTorch, so you don't have to compile your model into a different form.\n",
"\n",
@ -1810,7 +1810,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A big part of the issue is that the kinds of photos that people are most likely to upload to the internet are the kinds of photos that do a good job of clearly and artistically displaying their subject matter--which isn't the kind of input this system is going to be getting. So, we may need to do a lot of our own data collection and labelling to create a useful system.\n",
"A big part of the issue is that the kinds of photos that people are most likely to upload to the internet are the kinds of photos that do a good job of clearly and artistically displaying their subject matterwhich isn't the kind of input this system is going to be getting. So, we may need to do a lot of our own data collection and labelling to create a useful system.\n",
"\n",
"This is just one example of the more general problem of *out-of-domain* data. That is to say, there may be data that our model sees in production which is very different to what it saw during training. There isn't really a complete technical solution to this problem; instead, we have to be careful about our approach to rolling out the technology.\n",
"\n",
@ -1895,7 +1895,7 @@
"\n",
 : You are best positioned">
"> : You are best positioned to help people one step behind you. The material is still fresh in your mind. Many experts have forgotten what it was like to be a beginner (or an intermediate) and have forgotten why the topic is hard to understand when you first hear it. The context of your particular background, your particular style, and your knowledge level will give a different twist to what you're writing about.\n",
"\n",
"We've provided full details on how to set up a blog in <<appendix_blog>>. If you don't have a blog already, take a look at that now, because we've got a really great approach set up for you to start blogging for free, with no ads--and you can even use Jupyter Notebook!"
"We've provided full details on how to set up a blog in <<appendix_blog>>. If you don't have a blog already, take a look at that now, because we've got a really great approach set up for you to start blogging for free, with no adsand you can even use Jupyter Notebook!"
]
},
{

View File

@ -82,9 +82,9 @@
"source": [
"We are going to start with three specific examples that illustrate three common ethical issues in tech:\n",
"\n",
"1. *Recourse processes*--Arkansas's buggy healthcare algorithms left patients stranded.\n",
"2. *Feedback loops*--YouTube's recommendation system helped unleash a conspiracy theory boom.\n",
"3. *Bias*--When a traditionally African-American name is searched for on Google, it displays ads for criminal background checks.\n",
"1. *Recourse processes*Arkansas's buggy healthcare algorithms left patients stranded.\n",
"2. *Feedback loops*YouTube's recommendation system helped unleash a conspiracy theory boom.\n",
"3. *Bias*When a traditionally African-American name is searched for on Google, it displays ads for criminal background checks.\n",
"\n",
"In fact, for every concept that we introduce in this chapter, we are going to provide at least one specific example. For each one, think about what you could have done in this situation, and what kinds of obstructions there might have been to you getting that done. How would you deal with them? What would you look out for?"
]
@ -180,7 +180,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But this was not an isolated incident--the organization's involvement was extensive. IBM and its subsidiaries provided regular training and maintenance onsite at the concentration camps: printing off cards, configuring machines, and repairing them as they broke frequently. IBM set up categorizations on its punch card system for the way that each person was killed, which group they were assigned to, and the logistical information necessary to track them through the vast Holocaust system. IBM's code for Jews in the concentration camps was 8: some 6,000,000 were killed. Its code for Romanis was 12 (they were labeled by the Nazis as \"asocials,\" with over 300,000 killed in the *Zigeunerlager*, or “Gypsy camp”). General executions were coded as 4, death in the gas chambers as 6."
"But this was not an isolated incidentthe organization's involvement was extensive. IBM and its subsidiaries provided regular training and maintenance onsite at the concentration camps: printing off cards, configuring machines, and repairing them as they broke frequently. IBM set up categorizations on its punch card system for the way that each person was killed, which group they were assigned to, and the logistical information necessary to track them through the vast Holocaust system. IBM's code for Jews in the concentration camps was 8: some 6,000,000 were killed. Its code for Romanis was 12 (they were labeled by the Nazis as \"asocials,\" with over 300,000 killed in the *Zigeunerlager*, or “Gypsy camp”). General executions were coded as 4, death in the gas chambers as 6."
]
},
{
@ -200,13 +200,13 @@
"\n",
"It's not just a moral burden, either. Sometimes technologists pay very directly for their actions. For instance, the first person who was jailed as a result of the Volkswagen scandal, where the car company was revealed to have cheated on its diesel emissions tests, was not the manager that oversaw the project, or an executive at the helm of the company. It was one of the engineers, James Liang, who just did what he was told.\n",
"\n",
"Of course, it's not all bad--if a project you are involved in turns out to make a huge positive impact on even one person, this is going to make you feel pretty great!\n",
"Of course, it's not all badif a project you are involved in turns out to make a huge positive impact on even one person, this is going to make you feel pretty great!\n",
"\n",
"Okay, so hopefully we have convinced you that you ought to care. But what should you do? As data scientists, we're naturally inclined to focus on making our models better by optimizing some metric or other. But optimizing that metric may not actually lead to better outcomes. And even if it *does* help create better outcomes, it almost certainly won't be the only thing that matters. Consider the pipeline of steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. This entire pipeline needs to be considered *as a whole* if we're to have a hope of getting the kinds of outcomes we want.\n",
"\n",
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher, where you might not even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no one is better placed to inform everyone involved in this chain about the capabilities, constraints, and details of your work than you are. Although there's no \"silver bullet\" that can ensure your work is used the right way, by getting involved in the process, and asking the right questions, you can at the very least ensure that the right issues are being considered.\n",
"\n",
"Sometimes, the right response to being asked to do a piece of work is to just say \"no.\" Often, however, the response we hear is, \"If I dont do it, someone else will.\" But consider this: if youve been picked for the job, youre the best person theyve found to do it--so if you dont do it, the best person isnt working on that project. If the first five people they ask all say no too, even better!"
"Sometimes, the right response to being asked to do a piece of work is to just say \"no.\" Often, however, the response we hear is, \"If I dont do it, someone else will.\" But consider this: if youve been picked for the job, youre the best person theyve found to do itso if you dont do it, the best person isnt working on that project. If the first five people they ask all say no too, even better!"
]
},
{
@ -816,7 +816,7 @@
"source": [
"The hiring process is particularly broken in tech. One study indicative of the disfunction comes from Triplebyte, a company that helps place software engineers in companies, conducting a standardized technical interview as part of this process. They have a fascinating dataset: the results of how over 300 engineers did on their exam, coupled with the results of how those engineers did during the interview process for a variety of companies. The number one finding from [Triplebytes research](https://triplebyte.com/blog/who-y-combinator-companies-want) is that “the types of programmers that each company looks for often have little to do with what the company needs or does. Rather, they reflect company culture and the backgrounds of the founders.”\n",
"\n",
"This is a challenge for those trying to break into the world of deep learning, since most companies' deep learning groups today were founded by academics. These groups tend to look for people \"like them\"--that is, people that can solve complex math problems and understand dense jargon. They don't always know how to spot people who are actually good at solving real problems using deep learning.\n",
"This is a challenge for those trying to break into the world of deep learning, since most companies' deep learning groups today were founded by academics. These groups tend to look for people \"like them\"that is, people that can solve complex math problems and understand dense jargon. They don't always know how to spot people who are actually good at solving real problems using deep learning.\n",
"\n",
"This leaves a big opportunity for companies that are ready to look beyond status and pedigree, and focus on results!"
]
@ -945,7 +945,7 @@
"\n",
"It's reassuring that Angwin thinks we are largely still in the diagnosis phase: if your understanding of these problems feels incomplete, that is normal and natural. Nobody has a “cure” yet, although it is vital that we continue working to better understand and address the problems we are facing.\n",
"\n",
"One of our reviewers for this book, Fred Monroe, used to work in hedge fund trading. He told us, after reading this chapter, that many of the issues discussed here (distribution of data being dramatically different than what a model was trained on, the impact feedback loops on a model once deployed and at scale, and so forth) were also key issues for building profitable trading models. The kinds of things you need to do to consider societal consequences are going to have a lot of overlap with things you need to do to consider organizational, market, and customer consequences--so thinking carefully about ethics can also help you think carefully about how to make your data product successful more generally!"
"One of our reviewers for this book, Fred Monroe, used to work in hedge fund trading. He told us, after reading this chapter, that many of the issues discussed here (distribution of data being dramatically different than what a model was trained on, the impact feedback loops on a model once deployed and at scale, and so forth) were also key issues for building profitable trading models. The kinds of things you need to do to consider societal consequences are going to have a lot of overlap with things you need to do to consider organizational, market, and customer consequencesso thinking carefully about ethics can also help you think carefully about how to make your data product successful more generally!"
]
},
{
@ -1004,7 +1004,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Section 1: That's a Wrap!"
"## Deep Learning in Practice: That's a Wrap!"
]
},
{

View File

@ -67,7 +67,7 @@
"source": [
"The story of deep learning is one of tenacity and grit by a handful of dedicated researchers. After early hopes (and hype!) neural networks went out of favor in the 1990's and 2000's, and just a handful of researchers kept trying to make them work well. Three of them, Yann Lecun, Yoshua Bengio, and Geoffrey Hinton, were awarded the highest honor in computer science, the Turing Award (generally considered the \"Nobel Prize of computer science\"), in 2018 after triumphing despite the deep skepticism and disinterest of the wider machine learning and statistics community.\n",
"\n",
"Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected by top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read handwritten text--something that had never been achieved before. However, his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n",
"Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected by top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read handwritten textsomething that had never been achieved before. However, his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n",
"\n",
"In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working with his student Sepp Hochreiter on the long short-term memory (LSTM) architecture (widely used for speech recognition and other text modeling tasks, and used in the IMDb example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is considered the most important foundation of modern AI.\n",
"\n",
@ -1391,7 +1391,7 @@
"\n",
"Let's create a tensor containing all of our 3s stacked together. We already know how to create a tensor containing a single image. To create a tensor containing all the images in a directory, we will first use a Python list comprehension to create a plain list of the single image tensors.\n",
"\n",
"We will use Jupyter to do some little checks of our work along the way--in this case, making sure that the number of returned items seems reasonable:"
"We will use Jupyter to do some little checks of our work along the wayin this case, making sure that the number of returned items seems reasonable:"
]
},
{
@ -1797,7 +1797,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Python is slow compared to many languages. Anything fast in Python, NumPy, or PyTorch is likely to be a wrapper for a compiled object written (and optimized) in another language--specifically C. In fact, **NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than using pure Python.**\n",
"Python is slow compared to many languages. Anything fast in Python, NumPy, or PyTorch is likely to be a wrapper for a compiled object written (and optimized) in another languagespecifically C. In fact, **NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than using pure Python.**\n",
"\n",
"A NumPy array is a multidimensional table of data, with all items of the same type. Since that can be any type at all, they can even be arrays of arrays, with the innermost arrays potentially being different sizes—this is called a \"jagged array.\" By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as integer or float, then NumPy will store them as a compact C data structure in memory. This is where NumPy shines. NumPy has a wide variety of operators and methods that can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.\n",
"\n",
@ -2056,7 +2056,7 @@
"source": [
"Recall that a metric is a number that is calculated based on the predictions of our model, and the correct labels in our dataset, in order to tell us how good our model is. For instance, we could use either of the functions we saw in the previous section, mean squared error, or mean absolute error, and take the average of them over the whole dataset. However, neither of these are numbers that are very understandable to most people; in practice, we normally use *accuracy* as the metric for classification models.\n",
"\n",
"As we've discussed, we want to calculate our metric over a *validation set*. This is so that we don't inadvertently overfit--that is, train a model to work well only on our training data. This is not really a risk with the pixel similarity model we're using here as a first try, since it has no trained components, but we'll use a validation set anyway to follow normal practices and to be ready for our second try later.\n",
"As we've discussed, we want to calculate our metric over a *validation set*. This is so that we don't inadvertently overfitthat is, train a model to work well only on our training data. This is not really a risk with the pixel similarity model we're using here as a first try, since it has no trained components, but we'll use a validation set anyway to follow normal practices and to be ready for our second try later.\n",
"\n",
"To get a validation set we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called *valid*? That's what this directory is for!\n",
"\n",
@ -2095,7 +2095,7 @@
"source": [
"It's good to get in the habit of checking shapes as you go. Here we see two tensors, one representing the 3s validation set of 1,010 images of size 28\\*28, and one representing the 7s validation set of 1,028 images of size 28\\*28.\n",
"\n",
"We ultimately want to write a function, `is_3`, that will decide if an arbitrary image is a 3 or a 7. It will do this by deciding which of our two \"ideal digits\" this arbitrary image is closer to. For that we need to define a notion of distance--that is, a function that calculates the distance between two images.\n",
"We ultimately want to write a function, `is_3`, that will decide if an arbitrary image is a 3 or a 7. It will do this by deciding which of our two \"ideal digits\" this arbitrary image is closer to. For that we need to define a notion of distancethat is, a function that calculates the distance between two images.\n",
"\n",
"We can write a simple function that calculates the mean absolute error using an experssion very similar to the one we wrote in the last section:"
]
@ -2335,7 +2335,7 @@
"\n",
"But let's be honest: 3s and 7s are very different-looking digits. And we're only classifying 2 out of the 10 possible digits so far. So we're going to need to do better!\n",
"\n",
"To do better, perhaps it is time to try a system that does some real learning--that is, that can automatically modify itself to improve its performance. In other words, it's time to talk about the training process, and SGD."
"To do better, perhaps it is time to try a system that does some real learningthat is, that can automatically modify itself to improve its performance. In other words, it's time to talk about the training process, and SGD."
]
},
{
@ -2355,7 +2355,7 @@
"\n",
"As we discussed, this is the key to allowing us to have a model that can get better and better—that can learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters. In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n",
"\n",
"Instead of trying to find the similarity between an image and an \"ideal image,\" we could instead look at each individual pixel and come up with a set of weights for each one, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels toward the bottom right are not very likely to be activated for a 7, so they should have a low weight for a 7, but they are likely to be activated for an 8, so they should have a high weight for an 8. This can be represented as a function and set of weight values for each possible category--for instance the probability of being the number 8:\n",
"Instead of trying to find the similarity between an image and an \"ideal image,\" we could instead look at each individual pixel and come up with a set of weights for each one, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels toward the bottom right are not very likely to be activated for a 7, so they should have a low weight for a 7, but they are likely to be activated for an 8, so they should have a high weight for an 8. This can be represented as a function and set of weight values for each possible categoryfor instance the probability of being the number 8:\n",
"\n",
"```\n",
"def pr_eight(x,w) = (x*w).sum()\n",
@ -2366,7 +2366,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are assuming that `x` is the image, represented as a vector--in other words, with all of the rows stacked up end to end into a single long line. And we are assuming that the weights are a vector `w`. If we have this function, then we just need some way to update the weights to make them a little bit better. With such an approach, we can repeat that step a number of times, making the weights better and better, until they are as good as we can make them.\n",
"Here we are assuming that `x` is the image, represented as a vectorin other words, with all of the rows stacked up end to end into a single long line. And we are assuming that the weights are a vector `w`. If we have this function, then we just need some way to update the weights to make them a little bit better. With such an approach, we can repeat that step a number of times, making the weights better and better, until they are as good as we can make them.\n",
"\n",
"We want to find the specific values for the vector `w` that causes the result of our function to be high for those images that are actually 8s, and low for those images that are not. Searching for the best vector `w` is a way to search for the best function for recognising 8s. (Because we are not yet using a deep neural network, we are limited by what our function can actually do—we are going to fix that constraint later in this chapter.) \n",
"\n",
@ -2510,7 +2510,7 @@
"source": [
"There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details that make a big difference for deep learning practitioners, but it turns out that the general approach to each one generally follows some basic principles. Here are a few guidelines:\n",
"\n",
"- Initialize:: We initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initializing them to the percentage of times that pixel is activated for that category--but since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well.\n",
"- Initialize:: We initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initializing them to the percentage of times that pixel is activated for that categorybut since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well.\n",
"- Loss:: This is what Samuel referred to when he spoke of *testing the effectiveness of any current weight assignment in terms of actual performance*. We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention).\n",
"- Step:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it: increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount that works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out in which direction, and by roughly how much, to change each weight, without having to try all these small changes. The way to do this is by calculating *gradients*. This is just a performance optimization, we would get exactly the same results by using the slower manual process as well.\n",
"- Stop:: Once we've decided how many epochs to train the model for (a few suggestions for this were given in the earlier list), we apply that decision. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
@ -2874,7 +2874,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But picking a learning rate that's too high is even worse--it can actually result in the loss getting *worse*, as we see in <<descent_div>>!"
"But picking a learning rate that's too high is even worseit can actually result in the loss getting *worse*, as we see in <<descent_div>>!"
]
},
{
@ -2990,7 +2990,7 @@
"source": [
"In other words, we've restricted the problem of finding the best imaginable function that fits the data, to finding the best *quadratic* function. This greatly simplifies the problem, since every quadratic function is fully defined by the three parameters `a`, `b`, and `c`. Thus, to find the best quadratic function, we only need to find the best values for `a`, `b`, and `c`.\n",
"\n",
"If we can solve this problem for the three parameters of a quadratic function, we'll be able to apply the same approach for other, more complex functions with more parameters--such as a neural net. Let's find the parameters for `f` first, and then we'll come back and do the same thing for the MNIST dataset with a neural net.\n",
"If we can solve this problem for the three parameters of a quadratic function, we'll be able to apply the same approach for other, more complex functions with more parameterssuch as a neural net. Let's find the parameters for `f` first, and then we'll come back and do the same thing for the MNIST dataset with a neural net.\n",
"\n",
"We need to define first what we mean by \"best.\" We define this precisely by choosing a *loss function*, which will return a value based on a prediction and a target, where lower values of the function correspond to \"better\" predictions. For continuous data, it's common to use *mean squared error*:"
]
@ -3113,7 +3113,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't look very close--our random parameters suggest that the roller coaster will end up going backwards, since we have negative speeds!"
"This doesn't look very closeour random parameters suggest that the roller coaster will end up going backwards, since we have negative speeds!"
]
},
{
@ -3269,7 +3269,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> a: Understanding this bit depends on remembering recent history. To calculate the gradients we call `backward` on the `loss`. But this `loss` was itself calculated by `mse`, which in turn took `preds` as an input, which was calculated using `f` taking as an input `params`, which was the object on which we originally called `required_grads_`--which is the original call that now allows us to call `backward` on `loss`. This chain of function calls represents the mathematical composition of functions, which enables PyTorch to use calculus's chain rule under the hood to calculate these gradients."
 a: Understanding">
"> a: Understanding this bit depends on remembering recent history. To calculate the gradients we call `backward` on the `loss`. But this `loss` was itself calculated by `mse`, which in turn took `preds` as an input, which was calculated using `f` taking as an input `params`, which was the object on which we originally called `required_grads_`—which is the original call that now allows us to call `backward` on `loss`. This chain of function calls represents the mathematical composition of functions, which enables PyTorch to use calculus's chain rule under the hood to calculate these gradients."
]
},
{
@ -3595,7 +3595,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We already have our dependent variables `x`--these are the images themselves. We'll concatenate them all into a single tensor, and also change them from a list of matrices (a rank-3 tensor) to a list of vectors (a rank-2 tensor). We can do this using `view`, which is a PyTorch method that changes the shape of a tensor without changing its contents. `-1` is a special parameter to `view` that means \"make this axis as big as necessary to fit all the data\":"
"We already have our dependent variables `x`these are the images themselves. We'll concatenate them all into a single tensor, and also change them from a list of matrices (a rank-3 tensor) to a list of vectors (a rank-2 tensor). We can do this using `view`, which is a PyTorch method that changes the shape of a tensor without changing its contents. `-1` is a special parameter to `view` that means \"make this axis as big as necessary to fit all the data\":"
]
},
{
@ -3704,7 +3704,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `weights*pixels` won't be flexible enough--it is always equal to 0 when the pixels are equal to 0 (i.e., its *intercept* is 0). You might remember from high school math that the formula for a line is `y=w*x+b`; we still need the `b`. We'll initialize it to a random number too:"
"The function `weights*pixels` won't be flexible enoughit is always equal to 0 when the pixels are equal to 0 (i.e., its *intercept* is 0). You might remember from high school math that the formula for a line is `y=w*x+b`; we still need the `b`. We'll initialize it to a random number too:"
]
},
{
@ -3763,7 +3763,7 @@
"source": [
"While we could use a Python `for` loop to calculate the prediction for each image, that would be very slow. Because Python loops don't run on the GPU, and because Python is a slow language for loops in general, we need to represent as much of the computation in a model as possible using higher-level functions.\n",
"\n",
"In this case, there's an extremely convenient mathematical operation that calculates `w*x` for every row of a matrix--it's called *matrix multiplication*. <<matmul>> shows what matrix multiplication looks like."
"In this case, there's an extremely convenient mathematical operation that calculates `w*x` for every row of a matrixit's called *matrix multiplication*. <<matmul>> shows what matrix multiplication looks like."
]
},
{
@ -3916,14 +3916,14 @@
"\n",
"So, we need to choose a loss function. The obvious approach would be to use accuracy, which is our metric, as our loss function as well. In this case, we would calculate our prediction for each image, collect these values to calculate an overall accuracy, and then calculate the gradients of each weight with respect to that overall accuracy.\n",
"\n",
"Unfortunately, we have a significant technical problem here. The gradient of a function is its *slope*, or its steepness, which can be defined as *rise over run*--that is, how much the value of the function goes up or down, divided by how much we changed the input. We can write this in mathematically as: `(y_new - y_old) / (x_new - x_old)`. This gives us a good approximation of the gradient when `x_new` is very similar to `x_old`, meaning that their difference is very small. But accuracy only changes at all when a prediction changes from a 3 to a 7, or vice versa. The problem is that a small change in weights from `x_old` to `x_new` isn't likely to cause any prediction to change, so `(y_new - y_old)` will almost always be 0. In other words, the gradient is 0 almost everywhere."
"Unfortunately, we have a significant technical problem here. The gradient of a function is its *slope*, or its steepness, which can be defined as *rise over run*that is, how much the value of the function goes up or down, divided by how much we changed the input. We can write this in mathematically as: `(y_new - y_old) / (x_new - x_old)`. This gives us a good approximation of the gradient when `x_new` is very similar to `x_old`, meaning that their difference is very small. But accuracy only changes at all when a prediction changes from a 3 to a 7, or vice versa. The problem is that a small change in weights from `x_old` to `x_new` isn't likely to cause any prediction to change, so `(y_new - y_old)` will almost always be 0. In other words, the gradient is 0 almost everywhere."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function--if we do, most of the time our gradients will actually be 0, and the model will not be able to learn from that number.\n",
"A very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss functionif we do, most of the time our gradients will actually be 0, and the model will not be able to learn from that number.\n",
"\n",
"> S: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5), so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are 0 or infinite, which are useless for updating the model.\n",
"\n",
@ -3933,7 +3933,7 @@
"\n",
"The loss function receives not the images themseles, but the predictions from the model. Let's make one argument, `prds`, of values between 0 and 1, where each value is the prediction that an image is a 3. It is a vector (i.e., a rank-1 tensor), indexed over the images.\n",
"\n",
"The purpose of the loss function is to measure the difference between predicted values and the true values -- that is, the targets (aka labels). Let's make another argument, `trgts`, with values of 0 or 1 which tells whether an image actually is a 3 or not. It is also a vector (i.e., another rank-1 tensor), indexed over the images.\n",
"The purpose of the loss function is to measure the difference between predicted values and the true values that is, the targets (aka labels). Let's make another argument, `trgts`, with values of 0 or 1 which tells whether an image actually is a 3 or not. It is also a vector (i.e., another rank-1 tensor), indexed over the images.\n",
"\n",
"So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence (`0.9`) that the first was a 3, with slight confidence (`0.4`) that the second was a 7, and with fair confidence (`0.2`), but incorrectly, that the last was a 7. This would mean our loss function would receive these values as its inputs:"
]
@ -4059,7 +4059,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One problem with `mnist_loss` as currently defined is that it assumes that predictions are always between 0 and 1. We need to ensure, then, that this is actually the case! As it happens, there is a function that does exactly that--let's take a look."
"One problem with `mnist_loss` as currently defined is that it assumes that predictions are always between 0 and 1. We need to ensure, then, that this is actually the case! As it happens, there is a function that does exactly thatlet's take a look."
]
},
{
@ -4164,7 +4164,7 @@
"\n",
"So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini-batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your dataset's gradients from the loss function, but it will take longer, and you will process fewer mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n",
"\n",
"Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time, so it's helpful if we can give them lots of data items to work on. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!\n",
"Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time, so it's helpful if we can give them lots of data items to work on. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memorymaking GPUs happy is also tricky!\n",
"\n",
"As we saw in our discussion of data augmentation in <<chapter_production>>, we get better generalization if we can vary things during training. One simple and effective thing we can vary is what data items we put in each mini-batch. Rather than simply enumerating our dataset in order for every epoch, instead what we normally do is randomly shuffle it on every epoch, before we create mini-batches. PyTorch and fastai provide a class that will do the shuffling and mini-batch collation for you, called `DataLoader`.\n",
"\n",
@ -4550,7 +4550,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our only remaining step is to update the weights and biases based on the gradient and learning rate. When we do so, we have to tell PyTorch not to take the gradient of this step too--otherwise things will get very confusing when we try to compute the derivative at the next batch! If we assign to the `data` attribute of a tensor then PyTorch will not take the gradient of that step. Here's our basic training loop for an epoch:"
"Our only remaining step is to update the weights and biases based on the gradient and learning rate. When we do so, we have to tell PyTorch not to take the gradient of this step toootherwise things will get very confusing when we try to compute the derivative at the next batch! If we assign to the `data` attribute of a tensor then PyTorch will not take the gradient of that step. Here's our basic training loop for an epoch:"
]
},
{
@ -5116,7 +5116,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So far we have a general procedure for optimizing the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add something nonlinear between two linear classifiers--this is what gives us a neural network.\n",
"So far we have a general procedure for optimizing the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add something nonlinear between two linear classifiersthis is what gives us a neural network.\n",
"\n",
"Here is the entire definition of a basic neural network:"
]
@ -5161,7 +5161,7 @@
"source": [
"The key point about this is that `w1` has 30 output activations (which means that `w2` must have 30 input activations, so they match). That means that the first layer can construct 30 different features, each representing some different mix of pixels. You can change that `30` to anything you like, to make the model more or less complex.\n",
"\n",
"That little function `res.max(tensor(0.0))` is called a *rectified linear unit*, also known as *ReLU*. We think we can all agree that *rectified linear unit* sounds pretty fancy and complicated... But actually, there's nothing more to it than `res.max(tensor(0.0))`--in other words, replace every negative number with a zero. This tiny function is also available in PyTorch as `F.relu`:"
"That little function `res.max(tensor(0.0))` is called a *rectified linear unit*, also known as *ReLU*. We think we can all agree that *rectified linear unit* sounds pretty fancy and complicated... But actually, there's nothing more to it than `res.max(tensor(0.0))`in other words, replace every negative number with a zero. This tiny function is also available in PyTorch as `F.relu`:"
]
},
{
@ -5632,7 +5632,7 @@
"1. A function that can solve any problem to any level of accuracy (the neural network) given the correct set of parameters\n",
"1. A way to find the best set of parameters for any function (stochastic gradient descent)\n",
"\n",
"This is why deep learning can do things which seem rather magical such fantastic things. Believing that this combination of simple techniques can really solve any problem is one of the biggest steps that we find many students have to take. It seems too good to be true--surely things should be more difficult and complicated than this? Our recommendation: try it out! We just tried it on the MNIST dataset and you have seen the results. And since we are doing everything from scratch ourselves (except for calculating the gradients) you know that there is no special magic hiding behind the scenes."
"This is why deep learning can do things which seem rather magical such fantastic things. Believing that this combination of simple techniques can really solve any problem is one of the biggest steps that we find many students have to take. It seems too good to be truesurely things should be more difficult and complicated than this? Our recommendation: try it out! We just tried it on the MNIST dataset and you have seen the results. And since we are doing everything from scratch ourselves (except for calculating the gradients) you know that there is no special magic hiding behind the scenes."
]
},
{
@ -5728,7 +5728,7 @@
"\n",
"We will often talk in this book about activations and parameters. Remember that they have very specific meanings. They are numbers. They are not abstract concepts, but they are actual specific numbers that are in your model. Part of becoming a good deep learning practitioner is getting used to the idea of actually looking at your activations and parameters, and plotting them and testing whether they are behaving correctly.\n",
"\n",
"Our activations and parameters are all contained in *tensors*. These are simply regularly shaped arrays--for example, a matrix. Matrices have rows and columns; we call these the *axes* or *dimensions*. The number of dimensions of a tensor is its *rank*. There are some special tensors:\n",
"Our activations and parameters are all contained in *tensors*. These are simply regularly shaped arraysfor example, a matrix. Matrices have rows and columns; we call these the *axes* or *dimensions*. The number of dimensions of a tensor is its *rank*. There are some special tensors:\n",
"\n",
"- Rank zero: scalar\n",
"- Rank one: vector\n",

View File

@ -51,7 +51,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In our very first model we learned how to classify dogs versus cats. Just a few years ago this was considered a very challenging task--but today, it's far too easy! We will not be able to show you the nuances of training models with this problem, because we get a nearly perfect result without worrying about any of the details. But it turns out that the same dataset also allows us to work on a much more challenging problem: figuring out what breed of pet is shown in each image.\n",
"In our very first model we learned how to classify dogs versus cats. Just a few years ago this was considered a very challenging taskbut today, it's far too easy! We will not be able to show you the nuances of training models with this problem, because we get a nearly perfect result without worrying about any of the details. But it turns out that the same dataset also allows us to work on a much more challenging problem: figuring out what breed of pet is shown in each image.\n",
"\n",
"In <<chapter_intro>> we presented the applications as already-solved problems. But this is not how things work in real life. We start with some dataset that we know nothing about. We then have to figure out how it is put together, how to extract the data we need from it, and what that data looks like. For the rest of this book we will be showing you how to solve these problems in practice, including all of the intermediate steps necessary to understand the data that you are working with and test your modeling as you go.\n",
"\n",
@ -77,7 +77,7 @@
"- Individual files representing items of data, such as text documents or images, possibly organized into folders or with filenames representing information about those items\n",
"- A table of data, such as in CSV format, where each row is an item which may include filenames providing a connection between the data in the table and data in other formats, such as text documents and images\n",
"\n",
"There are exceptions to these rules--particularly in domains such as genomics, where there can be binary database formats or even network streams--but overall the vast majority of the datasets you'll work with will use some combination of these two formats.\n",
"There are exceptions to these rulesparticularly in domains such as genomics, where there can be binary database formats or even network streamsbut overall the vast majority of the datasets you'll work with will use some combination of these two formats.\n",
"\n",
"To see what is in our dataset we can use the `ls` method:"
]
@ -198,7 +198,7 @@
"source": [
"This regular expression plucks out all the characters leading up to the last underscore character, as long as the subsequence characters are numerical digits and then the JPEG file extension.\n",
"\n",
"Now that we confirmed the regular expression works for the example, let's use it to label the whole dataset. Fastai comes with many classes to help with labeling. For labeling with regular expressions, we can use the `RegexLabeller` class. In this example we use the data block API we saw in <<chapter_production>> (in fact, we nearly always use the data block API--it's so much more flexible than the simple factory methods we saw in <<chapter_intro>>):"
"Now that we confirmed the regular expression works for the example, let's use it to label the whole dataset. fastai comes with many classes to help with labeling. For labeling with regular expressions, we can use the `RegexLabeller` class. In this example we use the data block API we saw in <<chapter_production>> (in fact, we nearly always use the data block APIit's so much more flexible than the simple factory methods we saw in <<chapter_intro>>):"
]
},
{
@ -247,7 +247,7 @@
"\n",
"To work around these challenges, presizing adopts two strategies that are shown in <<presizing>>:\n",
"\n",
"1. Resize images to relatively \"large\" dimensions--that is, dimensions significantly larger than the target training dimensions. \n",
"1. Resize images to relatively \"large\" dimensionsthat is, dimensions significantly larger than the target training dimensions. \n",
"1. Compose all of the common augmentation operations (including a resize to the final target size) into one, and perform the combined operation on the GPU only once at the end of processing, rather than performing the operations individually and interpolating multiple times.\n",
"\n",
"The first step, the resize, creates images large enough that they have spare margin to allow further augmentation transforms on their inner regions without creating empty zones. This transformation works by resizing to a square, using a large crop size. On the training set, the crop area is chosen randomly, and the size of the crop is selected to cover the entire width or height of the image, whichever is smaller.\n",
@ -607,7 +607,7 @@
"source": [
"As we've briefly discussed before, the table shown when we fit a model shows us the results after each epoch of training. Remember, an epoch is one complete pass through all of the images in the data. The columns shown are the average loss over the items of the training set, the loss on the validation set, and any metrics that we requested—in this case, the error rate.\n",
"\n",
"Remember that *loss* is whatever function we've decided to use to optimize the parameters of our model. But we haven't actually told fastai what loss function we want to use. So what is it doing? Fastai will generally try to select an appropriate loss function based on what kind of data and model you are using. In this case we have image data and a categorical outcome, so fastai will default to using *cross-entropy loss*."
"Remember that *loss* is whatever function we've decided to use to optimize the parameters of our model. But we haven't actually told fastai what loss function we want to use. So what is it doing? fastai will generally try to select an appropriate loss function based on what kind of data and model you are using. In this case we have image data and a categorical outcome, so fastai will default to using *cross-entropy loss*."
]
},
{
@ -798,7 +798,7 @@
"source": [
"We can apply this function to a single column of activations from a neural network, and get back a column of numbers between 0 and 1, so it's a very useful activation function for our final layer.\n",
"\n",
"Now think about what happens if we want to have more categories in our target (such as our 37 pet breeds). That means we'll need more activations than just a single column: we need an activation *per category*. We can create, for instance, a neural net that predicts 3s and 7s that returns two activations, one for each class--this will be a good first step toward creating the more general approach. Let's just use some random numbers with a standard deviation of 2 (so we multiply `randn` by 2) for this example, assuming we have 6 images and 2 possible categories (where the first column represents 3s and the second is 7s):"
"Now think about what happens if we want to have more categories in our target (such as our 37 pet breeds). That means we'll need more activations than just a single column: we need an activation *per category*. We can create, for instance, a neural net that predicts 3s and 7s that returns two activations, one for each classthis will be a good first step toward creating the more general approach. Let's just use some random numbers with a standard deviation of 2 (so we multiply `randn` by 2) for this example, assuming we have 6 images and 2 possible categories (where the first column represents 3s and the second is 7s):"
]
},
{
@ -875,7 +875,7 @@
"source": [
"In <<chapter_mnist_basics>>, our neural net created a single activation per image, which we passed through the `sigmoid` function. That single activation represented the model's confidence that the input was a 3. Binary problems are a special case of classification problems, because the target can be treated as a single boolean value, as we did in `mnist_loss`. But binary problems can also be thought of in the context of the more general group of classifiers with any number of categories: in this case, we happen to have two categories. As we saw in the bear classifier, our neural net will return one activation per category.\n",
"\n",
"So in the binary case, what do those activations really indicate? A single pair of activations simply indicates the *relative* confidence of the input being a 3 versus being a 7. The overall values, whether they are both high, or both low, don't matter--all that matters is which is higher, and by how much.\n",
"So in the binary case, what do those activations really indicate? A single pair of activations simply indicates the *relative* confidence of the input being a 3 versus being a 7. The overall values, whether they are both high, or both low, don't matterall that matters is which is higher, and by how much.\n",
"\n",
"We would expect that since this is just another way of representing the same problem, that we would be able to use `sigmoid` directly on the two-activation version of our neural net. And indeed we can! We can just take the *difference* between the neural net activations, because that reflects how much more sure we are of the input being a 3 than a 7, and then take the sigmoid of that:"
]
@ -955,7 +955,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`softmax` is the multi-category equivalent of `sigmoid`--we have to use it any time we have more than two categories and the probabilities of the categories must add to 1, and we often use it even when there are just two categories, just to make things a bit more consistent. We could create other functions that have the properties that all activations are between 0 and 1, and sum to 1; however, no other function has the same relationship to the sigmoid function, which we've seen is smooth and symmetric. Also, we'll see shortly that the softmax function works well hand-in-hand with the loss function we will look at in the next section.\n",
"`softmax` is the multi-category equivalent of `sigmoid`we have to use it any time we have more than two categories and the probabilities of the categories must add to 1, and we often use it even when there are just two categories, just to make things a bit more consistent. We could create other functions that have the properties that all activations are between 0 and 1, and sum to 1; however, no other function has the same relationship to the sigmoid function, which we've seen is smooth and symmetric. Also, we'll see shortly that the softmax function works well hand-in-hand with the loss function we will look at in the next section.\n",
"\n",
"If we have three output activations, such as in our bear classifier, calculating softmax for a single bear image would then look like something like <<bear_softmax>>."
]
@ -975,7 +975,7 @@
"\n",
"Intuitively, the softmax function *really* wants to pick one class among the others, so it's ideal for training a classifier when we know each picture has a definite label. (Note that it may be less ideal during inference, as you might want your model to sometimes tell you it doesn't recognize any of the classes that it has seen during training, and not pick a class because it has a slightly bigger activation score. In this case, it might be better to train a model using multiple binary output columns, each using a sigmoid activation.)\n",
"\n",
"Softmax is the first part of the cross-entropy loss--the second part is log likeklihood. "
"Softmax is the first part of the cross-entropy lossthe second part is log likeklihood. "
]
},
{
@ -997,7 +997,7 @@
" return torch.where(targets==1, 1-inputs, inputs).mean()\n",
"```\n",
"\n",
"Just as we moved from sigmoid to softmax, we need to extend the loss function to work with more than just binary classification--it needs to be able to classify any number of categories (in this case, we have 37 categories). Our activations, after softmax, are between 0 and 1, and sum to 1 for each row in the batch of predictions. Our targets are integers between 0 and 36.\n",
"Just as we moved from sigmoid to softmax, we need to extend the loss function to work with more than just binary classificationit needs to be able to classify any number of categories (in this case, we have 37 categories). Our activations, after softmax, are between 0 and 1, and sum to 1 for each row in the batch of predictions. Our targets are integers between 0 and 36.\n",
"\n",
"In the binary case, we used `torch.where` to select between `inputs` and `1-inputs`. When we treat a binary classification as a general classification problem with two categories, it actually becomes even easier, because (as we saw in the previous section) we now have two columns, containing the equivalent of `inputs` and `1-inputs`. So, all we need to do is select from the appropriate column. Let's try to implement this in PyTorch. For our synthetic 3s and 7s example, let's say these are our labels:"
]
@ -1224,7 +1224,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The function we saw in the previous section works quite well as a loss function, but we can make it a bit better. The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1. That means that our model will not care whether it predicts 0.99 or 0.999. Indeed, those numbers are so close together--but in another sense, 0.999 is 10 times more confident than 0.99. So, we want to transform our numbers between 0 and 1 to instead be between negative infinity and infinity. There is a mathematical function that does exactly this: the *logarithm* (available as `torch.log`). It is not defined for numbers less than 0, and looks like this:"
"The function we saw in the previous section works quite well as a loss function, but we can make it a bit better. The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1. That means that our model will not care whether it predicts 0.99 or 0.999. Indeed, those numbers are so close togetherbut in another sense, 0.999 is 10 times more confident than 0.99. So, we want to transform our numbers between 0 and 1 to instead be between negative infinity and infinity. There is a mathematical function that does exactly this: the *logarithm* (available as `torch.log`). It is not defined for numbers less than 0, and looks like this:"
]
},
{
@ -1457,7 +1457,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh dear--in this case, a confusion matrix is very hard to read. We have 37 different breeds of pet, which means we have 37×37 entries in this giant matrix! Instead, we can use the `most_confused` method, which just shows us the cells of the confusion matrix with the most incorrect predictions (here, with at least 5 or more):"
"Oh dearin this case, a confusion matrix is very hard to read. We have 37 different breeds of pet, which means we have 37×37 entries in this giant matrix! Instead, we can use the `most_confused` method, which just shows us the cells of the confusion matrix with the most incorrect predictions (here, with at least 5 or more):"
]
},
{
@ -1604,7 +1604,7 @@
"source": [
"That doesn't look good. Here's what happened. The optimizer stepped in the correct direction, but it stepped so far that it totally overshot the minimum loss. Repeating that multiple times makes it get further and further away, not closer and closer!\n",
"\n",
"What do we do to find the perfect learning rate--not too high, and not too low? In 2015 the researcher Leslie Smith came up with a brilliant idea, called the *learning rate finder*. His idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini-batch, find what the losses are afterwards, and then increase the learning rate by some percentage (e.g., doubling it each time). Then we do another mini-batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better. This is the point where we know we have gone too far. We then select a learning rate a bit lower than this point. Our advice is to pick either:\n",
"What do we do to find the perfect learning ratenot too high, and not too low? In 2015 the researcher Leslie Smith came up with a brilliant idea, called the *learning rate finder*. His idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini-batch, find what the losses are afterwards, and then increase the learning rate by some percentage (e.g., doubling it each time). Then we do another mini-batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better. This is the point where we know we have gone too far. We then select a learning rate a bit lower than this point. Our advice is to pick either:\n",
"\n",
"- One order of magnitude less than where the minimum loss was achieved (i.e., the minimum divided by 10)\n",
"- The last point where the loss was clearly decreasing \n",
@ -1946,7 +1946,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the graph is a little different from when we had random weights: we don't have that sharp descent that indicates the model is training. That's because our model has been trained already. Here we have a somewhat flat area before a sharp increase, and we should take a point well before that sharp increase--for instance, 1e-5. The point with the maximum gradient isn't what we look for here and should be ignored.\n",
"Note that the graph is a little different from when we had random weights: we don't have that sharp descent that indicates the model is training. That's because our model has been trained already. Here we have a somewhat flat area before a sharp increase, and we should take a point well before that sharp increasefor instance, 1e-5. The point with the maximum gradient isn't what we look for here and should be ignored.\n",
"\n",
"Let's train at a suitable learning rate:"
]
@ -2031,7 +2031,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This has improved our model a bit, but there's more we can do. The deepest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for those--this is known as using *discriminative learning rates*."
"This has improved our model a bit, but there's more we can do. The deepest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for thosethis is known as using *discriminative learning rates*."
]
},
{
@ -2063,7 +2063,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fastai lets you pass a Python `slice` object anywhere that a learning rate is expected. The first value passed will be the learning rate in the earliest layer of the neural network, and the second value will be the learning rate in the final layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range. Let's use this approach to replicate the previous training, but this time we'll only set the *lowest* layer of our net to a learning rate of 1e-6; the other layers will scale up to 1e-4. Let's train for a while and see what happens:"
"fastai lets you pass a Python `slice` object anywhere that a learning rate is expected. The first value passed will be the learning rate in the earliest layer of the neural network, and the second value will be the learning rate in the final layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range. Let's use this approach to replicate the previous training, but this time we'll only set the *lowest* layer of our net to a learning rate of 1e-6; the other layers will scale up to 1e-4. Let's train for a while and see what happens:"
]
},
{
@ -2238,7 +2238,7 @@
"source": [
"Now the fine-tuning is working great!\n",
"\n",
"Fastai can show us a graph of the training and validation loss:"
"fastai can show us a graph of the training and validation loss:"
]
},
{
@ -2294,7 +2294,7 @@
"\n",
"Before the days of 1cycle training it was very common to save the model at the end of each epoch, and then select whichever model had the best accuracy out of all of the models saved in each epoch. This is known as *early stopping*. However, this is very unlikely to give you the best answer, because those epochs in the middle occur before the learning rate has had a chance to reach the small values, where it can really find the best result. Therefore, if you find that you have overfit, what you should actually do is retrain your model from scratch, and this time select a total number of epochs based on where your previous best results were found.\n",
"\n",
"If you have the time to train for more epochs, you may want to instead use that time to train more parameters--that is, use a deeper architecture."
"If you have the time to train for more epochs, you may want to instead use that time to train more parametersthat is, use a deeper architecture."
]
},
{
@ -2324,7 +2324,7 @@
"\n",
"The other downside of deeper architectures is that they take quite a bit longer to train. One technique that can speed things up a lot is *mixed-precision training*. This refers to using less-precise numbers (*half-precision floating point*, also called *fp16*) where possible during training. As we are writing these words in early 2020, nearly all current NVIDIA GPUs support a special feature called *tensor cores* that can dramatically speed up neural network training, by 2-3x. They also require a lot less GPU memory. To enable this feature in fastai, just add `to_fp16()` after your `Learner` creation (you also need to import the module).\n",
"\n",
"You can't really know ahead of time what the best architecture for your particular problem is--you need to try training some. So let's try a ResNet-50 now with mixed precision:"
"You can't really know ahead of time what the best architecture for your particular problem isyou need to try training some. So let's try a ResNet-50 now with mixed precision:"
]
},
{
@ -2456,7 +2456,7 @@
"source": [
"You'll see here we've gone back to using `fine_tune`, since it's so handy! We can pass `freeze_epochs` to tell fastai how many epochs to train for while frozen. It will automatically change learning rates appropriately for most datasets.\n",
"\n",
"In this case, we're not seeing a clear win from the deeper model. This is useful to remember--bigger models aren't necessarily better models for your particular case! Make sure you try small models before you start scaling up."
"In this case, we're not seeing a clear win from the deeper model. This is useful to rememberbigger models aren't necessarily better models for your particular case! Make sure you try small models before you start scaling up."
]
},
{
@ -2474,7 +2474,7 @@
"\n",
"We also discussed cross-entropy loss. This part of the book is worth spending plenty of time on. You aren't likely to need to actually implement cross-entropy loss from scratch yourself in practice, but it's really important you understand the inputs to and output from that function, because it (or a variant of it, as we'll see in the next chapter) is used in nearly every classification model. So when you want to debug a model, or put a model in production, or improve the accuracy of a model, you're going to need to be able to look at its activations and loss, and understand what's going on, and why. You can't do that properly if you don't understand your loss function.\n",
"\n",
"If cross-entropy loss hasn't \"clicked\" for you just yet, don't worry--you'll get there! First, go back to the last chapter and make sure you really understand `mnist_loss`. Then work gradually through the cells of the notebook for this chapter, where we step through each piece of cross-entropy loss. Make sure you understand what each calculation is doing, and why. Try creating some small tensors yourself and pass them into the functions, to see what they return.\n",
"If cross-entropy loss hasn't \"clicked\" for you just yet, don't worryyou'll get there! First, go back to the last chapter and make sure you really understand `mnist_loss`. Then work gradually through the cells of the notebook for this chapter, where we step through each piece of cross-entropy loss. Make sure you understand what each calculation is doing, and why. Try creating some small tensors yourself and pass them into the functions, to see what they return.\n",
"\n",
"Remember: the choices made in the implementation of cross-entropy loss are not the only possible choices that could have been made. Just like when we looked at regression we could choose between mean squared error and mean absolute difference (L1). If you have other ideas for possible functions that you think might work, feel free to give them a try in this chapter's notebook! (Fair warning though: you'll probably find that the model will be slower to train, and less accurate. That's because the gradient of cross-entropy loss is proportional to the difference between the activation and the target, so SGD always gets a nicely scaled step for the weights.)"
]

View File

@ -30,7 +30,7 @@
"source": [
"In the previous chapter you learned some important practical techniques for training models in practice. COnsiderations like selecting learning rates and the number of epochs are very important to getting good results.\n",
"\n",
"In this chapter we are going to look at two other types of computer vision problems: multi-label classification and regression. The first one is when you want to predict more than one label per image (or sometimes none at all), and the second is when your labels are one or several numbers--a quantity instead of a category.\n",
"In this chapter we are going to look at two other types of computer vision problems: multi-label classification and regression. The first one is when you want to predict more than one label per image (or sometimes none at all), and the second is when your labels are one or several numbersa quantity instead of a category.\n",
"\n",
"In the process will study more deeply the output activations, targets, and loss functions in deep learning models."
]
@ -50,7 +50,7 @@
"\n",
"For instance, this would have been a great approach for our bear classifier. One problem with the bear classifier that we rolled out in <<chapter_production>> was that if a user uploaded something that wasn't any kind of bear, the model would still say it was either a grizzly, black, or teddy bear—it had no ability to predict \"not a bear at all.\" In fact, after we have completed this chapter, it would be a great exercise for you to go back to your image classifier application, and try to retrain it using the multi-label technique, then test it by passing in an image that is not of any of your recognized classes.\n",
"\n",
"In practice, we have not seen many examples of people training multi-label classifiers for this purpose--but we very often see both users and developers complaining about this problem. It appears that this simple solution is not at all widely understood or appreciated! Because in practice it is probably more common to have some images with zero matches or more than one match, we should probably expect in practice that multi-label classifiers are more widely applicable than single-label classifiers.\n",
"In practice, we have not seen many examples of people training multi-label classifiers for this purposebut we very often see both users and developers complaining about this problem. It appears that this simple solution is not at all widely understood or appreciated! Because in practice it is probably more common to have some images with zero matches or more than one match, we should probably expect in practice that multi-label classifiers are more widely applicable than single-label classifiers.\n",
"\n",
"First, let's see what a multi-label dataset looks like, then we'll explain how to get it ready for our model. You'll see that the architecture of the model does not change from the last chapter; only the loss function does. Let's start with the data."
]
@ -497,7 +497,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a `Datasets` object from this. The only thing needed is a source--in this case, our DataFrame:"
"We can create a `Datasets` object from this. The only thing needed is a sourcein this case, our DataFrame:"
]
},
{
@ -918,7 +918,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Think about why `activs` has this shape--we have a batch size of 64, and we need to calculate the probability of each of 20 categories. Heres what one of those activations looks like:"
"Think about why `activs` has this shapewe have a batch size of 64, and we need to calculate the probability of each of 20 categories. Heres what one of those activations looks like:"
]
},
{
@ -953,7 +953,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"They arent yet scaled to between 0 and 1, but we learned how to do that in <<chapter_mnist_basics>>, using the `sigmoid` function. We also saw how to calculate a loss based on this--this is our loss function from <<chapter_mnist_basics>>, with the addition of `log` as discussed in the last chapter:"
"They arent yet scaled to between 0 and 1, but we learned how to do that in <<chapter_mnist_basics>>, using the `sigmoid` function. We also saw how to calculate a loss based on thisthis is our loss function from <<chapter_mnist_basics>>, with the addition of `log` as discussed in the last chapter:"
]
},
{

View File

@ -28,7 +28,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter introduces more advanced techniques for training an image classification model and getting state-of-the-art results. You can skip it if you want to learn more about other applications of deep learning and come back to it later--knowledge of this material will not be assumed in later chapters.\n",
"This chapter introduces more advanced techniques for training an image classification model and getting state-of-the-art results. You can skip it if you want to learn more about other applications of deep learning and come back to it laterknowledge of this material will not be assumed in later chapters.\n",
"\n",
"We will look at what normalization is, a powerful data augmentation technique called mixup, the progressive resizing approach and test time augmentation. To show all of this, we are going to train a model from scratch (not using transfer learning) using a subset of ImageNet called [Imagenette](https://github.com/fastai/imagenette). It contains a subset of 10 very different categories from the original ImageNet dataset, making for quicker training when we want to experiment.\n",
"\n",
@ -190,7 +190,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When training a model, it helps if your input data is normalized--that is, has a mean of 0 and a standard deviation of 1. But most images and computer vision libraries use values between 0 and 255 for pixels, or between 0 and 1; in either case, your data is not going to have a mean of 0 and a standard deviation of 1.\n",
"When training a model, it helps if your input data is normalizedthat is, has a mean of 0 and a standard deviation of 1. But most images and computer vision libraries use values between 0 and 255 for pixels, or between 0 and 1; in either case, your data is not going to have a mean of 0 and a standard deviation of 1.\n",
"\n",
"Let's grab a batch of our data and look at those values, by averaging over all axes except for the channel axis, which is axis 1:"
]
@ -579,7 +579,7 @@
"source": [
"As you can see, we're getting much better performance, and the initial training on small images was much faster on each epoch.\n",
"\n",
"You can repeat the process of increasing size and training more epochs as many times as you like, for as big an image as you wish--but of course, you will not get any benefit by using an image size larger than the size of your images on disk.\n",
"You can repeat the process of increasing size and training more epochs as many times as you like, for as big an image as you wishbut of course, you will not get any benefit by using an image size larger than the size of your images on disk.\n",
"\n",
"Note that for transfer learning, progressive resizing may actually hurt performance. This is most likely to happen if your pretrained model was quite similar to your transfer learning task and dataset and was trained on similar-sized images, so the weights don't need to be changed much. In that case, training on smaller images may damage the pretrained weights.\n",
"\n",
@ -699,7 +699,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, using TTA gives us good a boost in performance, with no additional training required. However, it does make inference slower--if you're averaging five images for TTA, inference will be five times slower.\n",
"As we can see, using TTA gives us good a boost in performance, with no additional training required. However, it does make inference slowerif you're averaging five images for TTA, inference will be five times slower.\n",
"\n",
"We've seen examples of how data augmentation helps train better models. Let's now focus on a new data augmentation technique called *Mixup*."
]
@ -849,13 +849,13 @@
"\n",
"Mixup requires far more epochs to train to get better accuracy, compared to other augmentation approaches we've seen. You can try training Imagenette with and without Mixup by using the *examples/train_imagenette.py* script in the [fastai repo](https://github.com/fastai/fastai). At the time of writing, the leaderboard in the [Imagenette repo](https://github.com/fastai/imagenette/) is showing that Mixup is used for all leading results for trainings of >80 epochs, and for fewer epochs Mixup is not being used. This is in line with our experience of using Mixup too.\n",
"\n",
"One of the reasons that Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations *inside* their models, not just on inputs--this allows Mixup to be used for NLP and other data types too.\n",
"One of the reasons that Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations *inside* their models, not just on inputsthis allows Mixup to be used for NLP and other data types too.\n",
"\n",
"There's another subtle issue that Mixup deals with for us, which is that it's not actually possible with the models we've seen before for our loss to ever be perfect. The problem is that our labels are 1s and 0s, but the outputs of softmax and sigmoid can never equal 1 or 0. This means training our model pushes our activations ever closer to those values, such that the more epochs we do, the more extreme our activations become.\n",
"\n",
"With Mixup we no longer have that problem, because our labels will only be exactly 1 or 0 if we happen to \"mix\" with another image of the same class. The rest of the time our labels will be a linear combination, such as the 0.7 and 0.3 we got in the church and gas station example earlier.\n",
"\n",
"One issue with this, however, is that Mixup is \"accidentally\" making the labels bigger than 0, or smaller than 1. That is to say, we're not *explicitly* telling our model that we want to change the labels in this way. So, if we want to change to make the labels closer to, or further away from 0 and 1, we have to change the amount of Mixup--which also changes the amount of data augmentation, which might not be what we want. There is, however, a way to handle this more directly, which is to use *label smoothing*."
"One issue with this, however, is that Mixup is \"accidentally\" making the labels bigger than 0, or smaller than 1. That is to say, we're not *explicitly* telling our model that we want to change the labels in this way. So, if we want to change to make the labels closer to, or further away from 0 and 1, we have to change the amount of Mixupwhich also changes the amount of data augmentation, which might not be what we want. There is, however, a way to handle this more directly, which is to use *label smoothing*."
]
},
{
@ -895,7 +895,7 @@
"source": [
"Here is how the reasoning behind label smoothing was explained in the paper by Christian Szegedy et al.:\n",
"\n",
"> : This maximum is not achievable for finite $z_k$ but is approached if $z_y\\gg z_k$ for all $k\\neq y$--that is, if the logit corresponding to the ground-truth label is much great than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\\frac{\\partial\\ell}{\\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions."
 : This maximum">
"> : This maximum is not achievable for finite $z_k$ but is approached if $z_y\\gg z_k$ for all $k\\neq y$—that is, if the logit corresponding to the ground-truth label is much great than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\\frac{\\partial\\ell}{\\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions."
]
},
{

View File

@ -1844,7 +1844,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use these to replicate any of the analyses we did in the previous section--for instance:"
"We can use these to replicate any of the analyses we did in the previous sectionfor instance:"
]
},
{
@ -2122,7 +2122,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fastai provides this model in `fastai.collab` if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), and it lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:"
"fastai provides this model in `fastai.collab` if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), and it lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:"
]
},
{
@ -2230,7 +2230,7 @@
"\n",
"We're using `**kwargs` in `EmbeddingNN` to avoid having to write all the arguments to `TabularModel` a second time, and keep them in sync. However, this makes our API quite difficult to work with, because now Jupyter Notebook doesn't know what parameters are available. Consequently things like tab completion of parameter names and pop-up lists of signatures won't work.\n",
"\n",
"Fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature."
"fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature."
]
},
{
@ -2315,7 +2315,7 @@
"\n",
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
"1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.\n",
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas).\n",
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full datasetsee if you can use those too (the next chapter might give you ideas).\n",
"1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter."
]
},

View File

@ -510,7 +510,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The most important data column is the dependent variable--that is, the one we want to predict. Recall that a model's metric is a function that reflects how good the predictions are. It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you. If no variable represents that metric, you should see if you can build the metric from the variables that are available.\n",
"The most important data column is the dependent variablethat is, the one we want to predict. Recall that a model's metric is a function that reflects how good the predictions are. It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you. If no variable represents that metric, you should see if you can build the metric from the variables that are available.\n",
"\n",
"However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need do only a small amount of processing to use this: we take the log of the prices, so that `rmse` of that value will give us what we ultimately need:"
]
@ -565,7 +565,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the same group as all the other training data items that yielded the same set of answers to the questions. But what good is this? The goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value is that we can now assign a prediction value for each of these groups--for regression, we take the target mean of the items in the group.\n",
"This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the same group as all the other training data items that yielded the same set of answers to the questions. But what good is this? The goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value is that we can now assign a prediction value for each of these groupsfor regression, we take the target mean of the items in the group.\n",
"\n",
"Let's consider how we find the right questions to ask. Of course, we wouldn't want to have to create all these questions ourselves—that's what computers are for! The basic steps to train a decision tree can be written down very easily:\n",
"\n",
@ -600,13 +600,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree that we just described is *bisection*-- dividing a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshhold, and we look at the categorical variables and divide up the dataset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.\n",
"The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree that we just described is *bisection* dividing a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshhold, and we look at the categorical variables and divide up the dataset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.\n",
"\n",
"But how does this apply to a common data type, the date? You might want to treat a date as an ordinal value, because it is meaningful to say that one date is greater than another. However, dates are a bit different from most ordinal values in that some dates are qualitatively different from others in a way that that is often relevant to the systems we are modeling.\n",
"\n",
"In order to help our algorithm handle dates intelligently, we'd like our model to know more than whether a date is more recent or less recent than another. We might want our model to make decisions based on that date's day of the week, on whether a day is a holiday, on what month it is in, and so forth. To do this, we replace every date column with a set of date metadata columns, such as holiday, day of week, and month. These columns provide categorical data that we suspect will be useful.\n",
"\n",
"Fastai comes with a function that will do this for us—we just have to pass a column name that contains dates:"
"fastai comes with a function that will do this for us—we just have to pass a column name that contains dates:"
]
},
{
@ -1371,7 +1371,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since it takes a minute or so to process the data to get to this point, we should save it--that way in the future we can continue our work from here without rerunning the previous steps. fastai provides a `save` method that uses Python's *pickle* system to save nearly any Python object:"
"Since it takes a minute or so to process the data to get to this point, we should save itthat way in the future we can continue our work from here without rerunning the previous steps. fastai provides a `save` method that uses Python's *pickle* system to save nearly any Python object:"
]
},
{
@ -7318,7 +7318,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Oops--it looks like we might be overfitting pretty badly. Here's why:"
"Oopsit looks like we might be overfitting pretty badly. Here's why:"
]
},
{
@ -7408,7 +7408,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> A: Here's my intuition for an overfitting decision tree with more leaf nodes than data items. Consider the game Twenty Questions. In that game, the chooser secretly imagines an object (like, \"our television set\"), and the guesser gets to pose 20 yes or no questions to try to guess what the object is (like \"Is it bigger than a breadbox?\"). The guesser is not trying to predict a numerical value, but just to identify a particular object out of the set of all imaginable objects. When your decision tree has more leaves than there are possible objects in your domain, then it is essentially a well-trained guesser. It has learned the sequence of questions needed to identify a particular data item in the training set, and it is \"predicting\" only by describing that item's value. This is a way of memorizing the training set--i.e., of overfitting."
 A: Here's my intuition">
"> A: Here's my intuition for an overfitting decision tree with more leaf nodes than data items. Consider the game Twenty Questions. In that game, the chooser secretly imagines an object (like, \"our television set\"), and the guesser gets to pose 20 yes or no questions to try to guess what the object is (like \"Is it bigger than a breadbox?\"). The guesser is not trying to predict a numerical value, but just to identify a particular object out of the set of all imaginable objects. When your decision tree has more leaves than there are possible objects in your domain, then it is essentially a well-trained guesser. It has learned the sequence of questions needed to identify a particular data item in the training set, and it is \"predicting\" only by describing that item's value. This is a way of memorizing the training set—i.e., of overfitting."
]
},
{
@ -7431,7 +7431,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous chapter, when working with deep learning networks, we dealt with categorical variables by one-hot encoding them and feeding them to an embedding layer. The embedding layer helped the model to discover the meaning of the different levels of these variables (the levels of a categorical variable do not have an intrinsic meaning, unless we manually specify an ordering using Pandas). In a decision tree, we don't have embeddings layers--so how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?\n",
"In the previous chapter, when working with deep learning networks, we dealt with categorical variables by one-hot encoding them and feeding them to an embedding layer. The embedding layer helped the model to discover the meaning of the different levels of these variables (the levels of a categorical variable do not have an intrinsic meaning, unless we manually specify an ordering using Pandas). In a decision tree, we don't have embeddings layersso how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?\n",
"\n",
"The short answer is: it just works! Think about a situation where there is one product code that is far more expensive at auction than any other one. In that case, any binary split will result in that one product code being in some group, and that group will be more expensive than the other group. Therefore, our simple decision tree building algorithm will choose that split. Later during training the algorithm will be able to further split the subgroup that contains the expensive product code, and over time, the tree will home in on that one expensive product.\n",
"\n",
@ -7490,7 +7490,7 @@
"outputs": [],
"source": [
"#hide\n",
"# pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn --U"
"# pip install —pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn —U"
]
},
{
@ -7562,7 +7562,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most important properties of random forests is that they aren't very sensitive to the hyperparameter choices, such as `max_features`. You can set `n_estimators` to as high a number as you have time to train--the more trees you have, the more accurate the model will be. `max_samples` can often be left at its default, unless you have over 200,000 data points, in which case setting it to 200,000 will make it train faster with little impact on accuracy. `max_features=0.5` and `min_samples_leaf=4` both tend to work well, although sklearn's defaults work well too.\n",
"One of the most important properties of random forests is that they aren't very sensitive to the hyperparameter choices, such as `max_features`. You can set `n_estimators` to as high a number as you have time to trainthe more trees you have, the more accurate the model will be. `max_samples` can often be left at its default, unless you have over 200,000 data points, in which case setting it to 200,000 will make it train faster with little impact on accuracy. `max_features=0.5` and `min_samples_leaf=4` both tend to work well, although sklearn's defaults work well too.\n",
"\n",
"The sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of the effects different `max_features` choices, with increasing numbers of trees. In the plot, the blue plot line uses the fewest features and the green line uses the most (it uses all the features). As you can see in <<max_features>>, the models with the lowest error result from using a subset of features but with a larger number of trees."
]
@ -7704,7 +7704,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is one way to interpret our model's predictions--let's focus on more of those now."
"This is one way to interpret our model's predictionslet's focus on more of those now."
]
},
{
@ -7740,7 +7740,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We saw how the model averages the individual tree's predictions to get an overall prediction--that is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the *relative* confidence of predictions. In general, we would want to be more cautious of using the results for rows where trees give very different results (higher standard deviations), compared to cases where they are more consistent (lower standard deviations).\n",
"We saw how the model averages the individual tree's predictions to get an overall predictionthat is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the *relative* confidence of predictions. In general, we would want to be more cautious of using the results for rows where trees give very different results (higher standard deviations), compared to cases where they are more consistent (lower standard deviations).\n",
"\n",
"In the earlier section on creating a random forest, we saw how to get predictions over the validation set, using a Python list comprehension to do this for each tree in the forest:"
]
@ -7796,7 +7796,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the standard deviations for the predictions for the first five auctions--that is, the first five rows of the validation set:"
"Here are the standard deviations for the predictions for the first five auctionsthat is, the first five rows of the validation set:"
]
},
{
@ -7837,7 +7837,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's not normally enough to just to know that a model can make accurate predictions--we also want to know *how* it's making predictions. *feature importance* gives us insight into this. We can get these directly from sklearn's random forest by looking in the `feature_importances_` attribute. Here's a simple function we can use to pop them into a DataFrame and sort them:"
"It's not normally enough to just to know that a model can make accurate predictionswe also want to know *how* it's making predictions. *feature importance* gives us insight into this. We can get these directly from sklearn's random forest by looking in the `feature_importances_` attribute. Here's a simple function we can use to pop them into a DataFrame and sort them:"
]
},
{
@ -8125,7 +8125,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've found that generally the first step to improving a model is simplifying it--78 columns was too many for us to study them all in depth! Furthermore, in practice often a simpler, more interpretable model is easier to roll out and maintain.\n",
"We've found that generally the first step to improving a model is simplifying it78 columns was too many for us to study them all in depth! Furthermore, in practice often a simpler, more interpretable model is easier to roll out and maintain.\n",
"\n",
"This also makes our feature importance plot easier to interpret. Let's look at it again:"
]
@ -8203,7 +8203,7 @@
"\n",
"> note: Determining Similarity: The most similar pairs are found by calculating the _rank correlation_, which means that all the values are replaced with their _rank_ (i.e., first, second, third, etc. within the column), and then the _correlation_ is calculated. (Feel free to skip over this minor detail though, since it's not going to come up again in the book!)\n",
"\n",
"Let's try removing some of these closely related features to see if the model can be simplified without impacting the accuracy. First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and higher `min_samples_leaf`. The OOB score is a number returned by sklearn that ranges between 1.0 for a perfect model and 0.0 for a random model. (In statistics it's called *R^2*, although the details aren't important for this explanation.) We don't need it to be very accurate--we're just going to use it to compare different models, based on removing some of the possibly redundant columns:"
"Let's try removing some of these closely related features to see if the model can be simplified without impacting the accuracy. First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and higher `min_samples_leaf`. The OOB score is a number returned by sklearn that ranges between 1.0 for a perfect model and 0.0 for a random model. (In statistics it's called *R^2*, although the details aren't important for this explanation.) We don't need it to be very accuratewe're just going to use it to compare different models, based on removing some of the possibly redundant columns:"
]
},
{
@ -8511,7 +8511,7 @@
"source": [
"Looking first of all at the `YearMade` plot, and specifically at the section covering the years after 1990 (since as we noted this is where we have the most data), we can see a nearly linear relationship between year and price. Remember that our dependent variable is after taking the logarithm, so this means that in practice there is an exponential increase in price. This is what we would expect: depreciation is generally recognized as being a multiplicative factor over time, so, for a given sale date, varying year made ought to show an exponential relationship with sale price.\n",
"\n",
"The `ProductSize` partial plot is a bit concerning. It shows that the final group, which we saw is for missing values, has the lowest price. To use this insight in practice, we would want to find out *why* it's missing so often, and what that *means*. Missing values can sometimes be useful predictors--it entirely depends on what causes them to be missing. Sometimes, however, they can indicate *data leakage*."
"The `ProductSize` partial plot is a bit concerning. It shows that the final group, which we saw is for missing values, has the lowest price. To use this insight in practice, we would want to find out *why* it's missing so often, and what that *means*. Missing values can sometimes be useful predictorsit entirely depends on what causes them to be missing. Sometimes, however, they can indicate *data leakage*."
]
},
{
@ -8553,7 +8553,7 @@
"- Look for important predictors that don't make sense in practice.\n",
"- Look for partial dependence plot results that don't make sense in practice.\n",
"\n",
"Thinking back to our bear detector, this mirrors the advice that we provided in <<chapter_production>>--it is often a good idea to build a model first and then do your data cleaning, rather than vice versa. The model can help you identify potentially problematic data issues.\n",
"Thinking back to our bear detector, this mirrors the advice that we provided in <<chapter_production>>it is often a good idea to build a model first and then do your data cleaning, rather than vice versa. The model can help you identify potentially problematic data issues.\n",
"\n",
"It can also help you identifyt which factors influence specific predictions, with tree interpreters."
]
@ -8637,7 +8637,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`prediction` is simply the prediction that the random forest makes. `bias` is the prediction based on taking the mean of the dependent variable (i.e., the *model* that is the root of every tree). `contributions` is the most interesting bit--it tells us the total change in predicition due to each of the independent variables. Therefore, the sum of `contributions` plus `bias` must equal the `prediction`, for each row. Let's look just at the first row:"
"`prediction` is simply the prediction that the random forest makes. `bias` is the prediction based on taking the mean of the dependent variable (i.e., the *model* that is the root of every tree). `contributions` is the most interesting bitit tells us the total change in predicition due to each of the independent variables. Therefore, the sum of `contributions` plus `bias` must equal the `prediction`, for each row. Let's look just at the first row:"
]
},
{
@ -9294,7 +9294,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalization. A random forest does not need any normalization--the tree building procedure cares only about the order of values in a variable, not at all about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the `Normalize` processor when we build our `TabularPandas` object:"
"We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalization. A random forest does not need any normalizationthe tree building procedure cares only about the order of values in a variable, not at all about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the `Normalize` processor when we build our `TabularPandas` object:"
]
},
{
@ -9577,7 +9577,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Another thing that can help with generalization is to use several models and average their predictions--a technique, as mentioned earlier, known as *ensembling*."
"Another thing that can help with generalization is to use several models and average their predictionsa technique, as mentioned earlier, known as *ensembling*."
]
},
{
@ -9595,7 +9595,7 @@
"\n",
"In our case, we have two very different models, trained using very different algorithms: a random forest, and a neural network. It would be reasonable to expect that the kinds of errors that each one makes would be quite different. Therefore, we might expect that the average of their predictions would be better than either one's individual predictions.\n",
"\n",
"As we saw earlier, a random forest is itself an ensemble. But we can then include a random forest in *another* ensemble--an ensemble of the random forest and the neural network! While ensembling won't make the difference between a successful and an unsuccessful modeling process, it can certainly add a nice little boost to any models that you have built.\n",
"As we saw earlier, a random forest is itself an ensemble. But we can then include a random forest in *another* ensemblean ensemble of the random forest and the neural network! While ensembling won't make the difference between a successful and an unsuccessful modeling process, it can certainly add a nice little boost to any models that you have built.\n",
"\n",
"One minor issue we have to be aware of is that our PyTorch model and our sklearn model create data of different types: PyTorch gives us a rank-2 tensor (i.e, a column matrix), whereas NumPy gives us a rank-1 array (a vector). `squeeze` removes any unit axes from a tensor, and `to_np` converts it into a NumPy array:"
]

View File

@ -213,7 +213,7 @@
"source": [
"As we write this book, the default English word tokenizer for fastai uses a library called *spaCy*. It has a sophisticated rules engine with special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not necessarily be spaCy, depending when you're reading this).\n",
"\n",
"Let's try it out. We'll use fastai's `coll_repr(collection, n)` function to display the results. This displays the first *`n`* items of *`collection`*, along with the full size--it's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
"Let's try it out. We'll use fastai's `coll_repr(collection, n)` function to display the results. This displays the first *`n`* items of *`collection`*, along with the full sizeit's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
]
},
{
@ -239,7 +239,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense; these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. Fortunately, spaCy handles these pretty well for us--for instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
"As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense; these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. Fortunately, spaCy handles these pretty well for usfor instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
]
},
{
@ -295,7 +295,7 @@
"\n",
"For example, the first item in the list, `xxbos`, is a special token that indicates the start of a new text (\"BOS\" is a standard NLP acronym that means \"beginning of stream\"). By recognizing this start token, the model will be able to learn it needs to \"forget\" what was said previously and focus on upcoming words.\n",
"\n",
"These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized language--a language that is designed to be easy for a model to learn.\n",
"These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized languagea language that is designed to be easy for a model to learn.\n",
"\n",
"For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
"\n",
@ -1285,7 +1285,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenize` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debugging--but you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.\n",
"fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenize` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debuggingbut you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.\n",
"\n",
"Here's how we use `TextBlock` to create a language model, using fastai's defaults:"
]
@ -1313,7 +1313,7 @@
"- It save the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once\n",
"- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs\n",
"\n",
"We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing--that's what `from_folder` does.\n",
"We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessingthat's what `from_folder` does.\n",
"\n",
"`show_batch` then works in the usual way:"
]
@ -1412,7 +1412,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights--i.e., embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
"It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weightsi.e., embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
]
},
{
@ -1743,7 +1743,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label--in the case of IMDb, it's the sentiment of a document.\n",
"We're now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external labelin the case of IMDb, it's the sentiment of a document.\n",
"\n",
"This means that the structure of our `DataBlock` for NLP classification will look very familiar. It's actually nearly the same as we've seen for the many image classification datasets we've worked with:"
]
@ -2150,7 +2150,7 @@
"source": [
"Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
"\n",
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the necessary tools at your disposal to create a compelling language model--that is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dilaogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending."
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the necessary tools at your disposal to create a compelling language modelthat is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dilaogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending."
]
},
{
@ -2182,7 +2182,7 @@
"source": [
"Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see was auto-generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Center for Strategic and International Studies.\n",
"\n",
"Many people assume or hope that algorithms will come to our defense here--that we will develop classification algorithms that can automatically recognise autogenerated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
"Many people assume or hope that algorithms will come to our defense herethat we will develop classification algorithms that can automatically recognise autogenerated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
]
},
{


@ -1026,7 +1026,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The important thing with transforms that we saw before is that they dispatch over tuples or their subclasses. That's precisely why we chose to subclass `Tuple` in this instance--this way we can apply any transform that works on images to our `SiameseImage` and it will be applied on each image in the tuple:"
"The important thing with transforms that we saw before is that they dispatch over tuples or their subclasses. That's precisely why we chose to subclass `Tuple` in this instancethis way we can apply any transform that works on images to our `SiameseImage` and it will be applied on each image in the tuple:"
]
},
{
@ -1187,7 +1187,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we need to pass more transforms than usual--that's because the data block API usually adds them automatically:\n",
"Note that we need to pass more transforms than usualthat's because the data block API usually adds them automatically:\n",
"\n",
"- `ToTensor` is the one that converts images to tensors (again, it's applied on every part of the tuple).\n",
"- `IntToFloatTensor` converts the tensor of images containing integers from 0 to 255 to a tensor of floats, and divides by 255 to make the values between 0 and 1."
@ -1261,7 +1261,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Becoming a Deep Learning Practitioner"
"## Understanding fastai's Applications: Wrap Up"
]
},
{


@ -566,7 +566,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the code for our module, we could simplify it by replacing the duplicated code that calls the layers with a `for` loop. As well as making our code simpler, this will also have the benefit that we will be able to apply our module equally well to token sequences of different lengths--we won't be restricted to token lists of length three:"
"Looking at the code for our module, we could simplify it by replacing the duplicated code that calls the layers with a `for` loop. As well as making our code simpler, this will also have the benefit that we will be able to apply our module equally well to token sequences of different lengthswe won't be restricted to token lists of length three:"
]
},
{
@ -1504,7 +1504,7 @@
"\n",
"The reason this is challenging is because of what happens when you multiply by a matrix many times. Think about what happens when you multiply by a number many times. For example, if you multiply by 2, starting at 1, you get the sequence 1, 2, 4, 8,... after 32 steps you are already at 4,294,967,296. A similar issue happens if you multiply by 0.5: you get 0.5, 0.25, 0.125… and after 32 steps it's 0.00000000023. As you can see, multiplying by a number even slightly higher or lower than 1 results in an explosion or disappearance of our starting number, after just a few repeated multiplications.\n",
"\n",
"Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And that's all a deep neural network is --each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large or extremely small numbers.\n",
"Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And that's all a deep neural network is each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large or extremely small numbers.\n",
"\n",
"This is a problem, because the way computers store numbers (known as \"floating point\") means that they become less and less accurate the further away the numbers get from zero. The diagram in <<float_prec>>, from the excellent article [\"What You Never Wanted to Know About Floating Point but Will Be Forced to Find Out\"](http://www.volkerschatz.com/science/float.html), shows how the precision of floating-point numbers varies over the number line."
]
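The arithmetic in this hunk is easy to check directly. The snippet below is a small sketch using only the standard library: `repeat_multiply` is a helper name invented here, and `math.ulp` (available from Python 3.9) is used as one way to see how the gap between neighbouring floating-point values widens as numbers move away from zero.

```python
import math


def repeat_multiply(factor, steps, start=1.0):
    """Multiply `start` by `factor` a total of `steps` times."""
    value = start
    for _ in range(steps):
        value *= factor
    return value


grown = repeat_multiply(2.0, 32)   # doubling 32 times
shrunk = repeat_multiply(0.5, 32)  # halving 32 times
print(grown, shrunk)

# Gap between a float and the next representable float, at several magnitudes.
for x in (1.0, 1e8, 1e16):
    print(x, math.ulp(x))
```

The same growth or decay applies to activations and gradients when a layer's multiplication is applied over and over.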
@ -1574,7 +1574,7 @@
"\n",
"where $\\sigma$ is the sigmoid function. The green circles are elementwise operations. What goes out on the right is the new hidden state ($h_{t}$) and new cell state ($c_{t}$), ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.\n",
"\n",
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram--but before this, notice how very little the cell state (at the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
"Let's go over the four neural nets (called *gates*) one by one and explain the diagrambut before this, notice how very little the cell state (at the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
"\n",
"First, the arrows for input and old hidden state are joined together. In the RNN we wrote earlier in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`; thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.\n",
"\n",
@ -1960,7 +1960,7 @@
"source": [
"The `bernoulli_` method is creating a tensor of random zeros (with probability `p`) and ones (with probability `1-p`), which is then multiplied with our input before dividing by `1-p`. Note the use of the `training` attribute, which is available in any PyTorch `nn.Module`, and tells us if we are doing training or inference.\n",
"\n",
"> note: Do Your Own Experiments: In previous chapters of the book we'd be adding a code example for `bernoulli_` here, so you can see exactly how it works. But now that you know enough to do this yourself, we're going to be doing fewer and fewer examples for you, and instead expecting you to do your own experiments to see how things work. In this case, you'll see in the end-of-chapter questionnaire that we're asking you to experiment with `bernoulli_`--but don't wait for us to ask you to experiment to develop your understanding of the code we're studying; go ahead and do it anyway!\n",
 note: Do Your Own">
"> note: Do Your Own Experiments: In previous chapters of the book we'd be adding a code example for `bernoulli_` here, so you can see exactly how it works. But now that you know enough to do this yourself, we're going to be doing fewer and fewer examples for you, and instead expecting you to do your own experiments to see how things work. In this case, you'll see in the end-of-chapter questionnaire that we're asking you to experiment with `bernoulli_`, but don't wait for us to ask you to experiment to develop your understanding of the code we're studying; go ahead and do it anyway!\n",
"\n",
"Using dropout before passing the output of our LSTM to the final layer will help reduce overfitting. Dropout is also used in many other models, including the default CNN head used in `fastai.vision`, and is available in `fastai.tabular` by passing the `ps` parameter (where each \"p\" is passed to each added `Dropout` layer), as we'll see in <<chapter_arch_details>>."
]


@ -31,7 +31,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In <<chapter_mnist_basics>> we learned how to create a neural network recognizing images. We were able to achieve a bit over 98% accuracy at distinguishing 3s from 7s--but we also saw that fastai's built-in classes were able to get close to 100%. Let's start trying to close the gap.\n",
"In <<chapter_mnist_basics>> we learned how to create a neural network recognizing images. We were able to achieve a bit over 98% accuracy at distinguishing 3s from 7sbut we also saw that fastai's built-in classes were able to get close to 100%. Let's start trying to close the gap.\n",
"\n",
"In this chapter, we will begin by digging into what convolutions are and building a CNN from scratch. We will then study a range of techniques to improve training stability and learn all the tweaks the library usually applies for us to get great results."
]
@ -198,7 +198,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Not very interesting so far--all the pixels in the top-left corner are white. But let's pick a couple of more interesting spots:"
"Not very interesting so farall the pixels in the top-left corner are white. But let's pick a couple of more interesting spots:"
]
},
{
@ -1329,7 +1329,7 @@
"\n",
"$$\\begin{matrix} a1 & a2 & a3 \\\\ a4 & a5 & a6 \\\\ a7 & a8 & a9 \\end{matrix}$$\n",
"\n",
"it will return $a1+a2+a3-a7-a8-a9$. If we are in a part of the image where $a1$, $a2$, and $a3$ add up to the same as $a7$, $a8$, and $a9$, then the terms will cancel each other out and we will get 0. However, if $a1$ is greater than $a7$, $a2$ is greater than $a8$, and $a3$ is greater than $a9$, we will get a bigger number as a result. So this filter detects horizontal edges--more precisely, edges where we go from bright parts of the image at the top to darker parts at the bottom.\n",
"it will return $a1+a2+a3-a7-a8-a9$. If we are in a part of the image where $a1$, $a2$, and $a3$ add up to the same as $a7$, $a8$, and $a9$, then the terms will cancel each other out and we will get 0. However, if $a1$ is greater than $a7$, $a2$ is greater than $a8$, and $a3$ is greater than $a9$, we will get a bigger number as a result. So this filter detects horizontal edgesmore precisely, edges where we go from bright parts of the image at the top to darker parts at the bottom.\n",
"\n",
"Changing our filter to have the row of `1`s at the top and the `-1`s at the bottom would detect horizonal edges that go from dark to light. Putting the `1`s and `-1`s in columns versus rows would give us filters that detect vertical edges. Each set of weights will produce a different kind of outcome.\n",
"\n",
@ -1633,7 +1633,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One batch contains 64 images, each of 1 channel, with 28\\*28 pixels. `F.conv2d` can handle multichannel (i.e., color) images too. A *channel* is a single basic color in an image--for regular full-color images there are three channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions `[channels, rows, columns]`.\n",
"One batch contains 64 images, each of 1 channel, with 28\\*28 pixels. `F.conv2d` can handle multichannel (i.e., color) images too. A *channel* is a single basic color in an imagefor regular full-color images there are three channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions `[channels, rows, columns]`.\n",
"\n",
"We'll see how to handle more than one channel later in this chapter. Kernels passed to `F.conv2d` need to be rank-4 tensors: `[channels_in, features_out, rows, columns]`. `edge_kernels` is currently missing one of these. We need to tell PyTorch that the number of input channels in the kernel is one, which we can do by inserting an axis of size one (this is known as a *unit axis*) in the first location, where the PyTorch docs show `in_channels` is expected. To insert a unit axis into a tensor, we use the `unsqueeze` method:"
]
@ -1728,7 +1728,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallel--that is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these operations one at a time, we'd often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we'd be millions of times slower!). Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time."
"The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallelthat is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these operations one at a time, we'd often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we'd be millions of times slower!). Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time."
]
},
{
@ -1893,7 +1893,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's an interesting insight--a convolution can be represented as a special kind of matrix multiplication, as illustrated in <<conv_matmul>>. The weight matrix is just like the ones from traditional neural networks. However, this weight matrix has two special properties:\n",
"Here's an interesting insighta convolution can be represented as a special kind of matrix multiplication, as illustrated in <<conv_matmul>>. The weight matrix is just like the ones from traditional neural networks. However, this weight matrix has two special properties:\n",
"\n",
"1. The zeros shown in gray are untrainable. This means that theyll stay zero throughout the optimization process.\n",
"1. Some of the weights are equal, and while they are trainable (i.e., changeable), they must remain equal. These are called *shared weights*.\n",
@ -2418,7 +2418,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the cell with the green border is the cell we clicked on, and the blue highlighted cells are its *precedents*--that is, the cells used to calculate its value. These cells are the corresponding 3\\*3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *trace precedents* again, to see what cells are used to calculate these inputs. <<preced2>> shows what happens."
"Here, the cell with the green border is the cell we clicked on, and the blue highlighted cells are its *precedents*that is, the cells used to calculate its value. These cells are the corresponding 3\\*3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *trace precedents* again, to see what cells are used to calculate these inputs. <<preced2>> shows what happens."
]
},
{
@ -3111,7 +3111,7 @@
"\n",
"1cycle training allows us to use a much higher maximum learning rate than other types of training, which gives two benefits:\n",
"\n",
"- By training with higher learning rates, we train faster--a phenomenon Smith named *super-convergence*.\n",
"- By training with higher learning rates, we train fastera phenomenon Smith named *super-convergence*.\n",
"- By training with higher learning rates, we overfit less because we skip over the sharp local minima to end up in a smoother (and therefore more generalizable) part of the loss.\n",
"\n",
"The second point is an interesting and subtle one; it is based on the observation that a model that generalizes well is one whose loss would not change very much if you changed the input by a small amount. If a model trains at a large learning rate for quite a while, and can find a good loss when doing so, it must have found an area that also generalizes well, because it is jumping around a lot from batch to batch (that is basically the definition of a high learning rate). The problem is that, as we have discussed, just jumping to a high learning rate is more likely to result in diverging losses, rather than seeing your losses improve. So we don't jump straight to a high learning rate. Instead, we start at a low learning rate, where our losses do not diverge, and we allow the optimizer to gradually find smoother and smoother areas of our parameters by gradually going to higher and higher learning rates.\n",
@ -3348,7 +3348,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows a classic picture of \"bad training.\" We start with nearly all activations at zero--that's what we see at the far left, with all the dark blue. The bright yellow at the bottom represents the near-zero activations. Then, over the first few batches we see the number of nonzero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and collapse again. After repeating this a few times, eventually we see a spread of activations throughout the range.\n",
"This shows a classic picture of \"bad training.\" We start with nearly all activations at zerothat's what we see at the far left, with all the dark blue. The bright yellow at the bottom represents the near-zero activations. Then, over the first few batches we see the number of nonzero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and collapse again. After repeating this a few times, eventually we see a spread of activations throughout the range.\n",
"\n",
"It's much better if training can be smooth from the start. The cycles of exponential increase and then collapse tend to result in a lot of near-zero activations, resulting in slow training and poor final results. One way to solve this problem is to use batch normalization."
]


@ -30,24 +30,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we now know how to create state-of-the-art architectures for computer vision, natural image processing, tabular analysis, and collaborative filtering, and we know how to train them quickly, we're done, right? Not quite yet. We still have to explorea little bit more the training process.\n",
"You now know how to create state-of-the-art architectures for computer vision, natural image processing, tabular analysis, and collaborative filtering, and you know how to train them quickly. So we're done, right? Not quite yet. We still have to explorea little bit more the training process.\n",
"\n",
"We explained in <<chapter_mnist_basics>> the basis of Stochastic Gradient Descent: pass a minibatch in the model, compare it to our target with the loss function then compute the gradients of this loss function with regards to each weight before updating the weights with the formula:\n",
"We explained in <<chapter_mnist_basics>> the basis of stochastic gradient descent: pass a mini-batch to the model, compare it to our target with the loss function, then compute the gradients of this loss function with regard to each weight before updating the weights with the formula:\n",
"\n",
"```python\n",
"new_weight = weight - lr * weight.grad\n",
"```\n",
"\n",
"We implemented this from scratch in a training loop, and also saw that Pytorch provides a simple `nn.SGD` class that does this calculation for each parameter for us. In this chapter, we will build some faster optimizers, using a flexible foundation. But that's not all what we might want to change in the training process. For any tweak of the training loop, we will need a way to add some code to the basis of SGD. The fastai library has a system of callbacks to do this, and we will teach you all about it.\n",
"We implemented this from scratch in a training loop, and also saw that PyTorch provides a simple `nn.SGD` class that does this calculation for each parameter for us. In this chapter we will build some faster optimizers, using a flexible foundation. But that's not all we might want to change in the training process. For any tweak of the training loop, we will need a way to add some code to the basis of SGD. The fastai library has a system of callbacks to do this, and we will teach you all about it.\n",
"\n",
"Firs things first, let's start with standard SGD to get a baseline, then we will introduce most commonly used optimizers."
"Let's start with standard SGD to get a baseline, then we will introduce the most commonly used optimizers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's Start with SGD"
"## Establishing a Baseline"
]
},
{
@ -88,7 +88,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create a ResNet34 without pretraining, and pass along any arguments received:"
"We'll create a ResNet-34 without pretraining, and pass along any arguments received:"
]
},
{
@ -296,9 +296,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"(Because accelerated SGD using momentum with is such a good idea, fastai uses it by default in `fit_one_cycle`, so we turn it off with `moms=(0,0,0)`; we'll be learning about momentum shortly.)\n",
"Because accelerating SGD with momentum is such a good idea, fastai does this by default in `fit_one_cycle`, so we turn it off with `moms=(0,0,0)`. We'll be discussing momentum shortly.)\n",
"\n",
"Clearly, plain SGD isn't training as fast as we'd like. So let's learn the tricks to get accelerated training!"
"Clearly, plain SGD isn't training as fast as we'd like. So let's learn some tricks to get accelerated training!"
]
},
{
@ -312,7 +312,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to build up our accelerated SGD tricks, we'll need to start with a nice flexible optimizer foundation. No library prior to fastai provided such a foundation, but during fastai's development we realized that all optimizer improvements we'd seen in the academic literature could be handled using *optimizer callbacks*. These are small pieces of code that an optimizer can add to the optimizer `step`. They are called by fastai's `Optimizer` class. This is a small class (less than a screen of code); these are the definitions in `Optimizer` of the two key methods that we've been using in this book:\n",
"To build up our accelerated SGD tricks, we'll need to start with a nice flexible optimizer foundation. No library prior to fastai provided such a foundation, but during fastai's development we realized that all the optimizer improvements we'd seen in the academic literature could be handled using *optimizer callbacks*. These are small pieces of code that we can compose, mix and match in an optimizer to build the optimizer `step`. They are called by fastai's lightweight `Optimizer` class. These are the definitions in `Optimizer` of the two key methods that we've been using in this book:\n",
"\n",
"```python\n",
"def zero_grad(self):\n",
@ -334,9 +334,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The more interesting method is `step`, which loops through the callbacks (`cbs`) and calls them to update the parameters (the `_update` function just calls `state.update` if there's anything returned by `cb(...)`). As you can see, `Optimizer` doesn't actually do any SGD steps itself. Let's see how we can add SGD to `Optimizer`.\n",
"The more interesting method is `step`, which loops through the callbacks (`cbs`) and calls them to update the parameters (the `_update` function just calls `state.update` if there's anything returned by `cb`). As you can see, `Optimizer` doesn't actually do any SGD steps itself. Let's see how we can add SGD to `Optimizer`.\n",
"\n",
"Here's an optimizer callback that does a single SGD step, by multiplying `-lr` by the gradients, and adding that to the parameter (when `Tensor.add_` in PyTorch is passed two parameters, they are multiplied together before the addition): "
"Here's an optimizer callback that does a single SGD step, by multiplying `-lr` by the gradients and adding that to the parameter (when `Tensor.add_` in PyTorch is passed two parameters, they are multiplied together before the addition): "
]
},
{
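To make the callback idea in this hunk concrete, here is a minimal pure-Python sketch. It is not fastai's `Optimizer`: the `Param` and `TinyOptimizer` classes, the per-parameter `state` dictionary, and the way hyperparameters are passed as keyword arguments are all assumptions chosen to mirror the description above (loop over callbacks, merge anything a callback returns into the state).

```python
class Param:
    """A scalar parameter with a value and a gradient (stand-in for a tensor)."""
    def __init__(self, value):
        self.value, self.grad = value, 0.0


class TinyOptimizer:
    """Holds parameters and callbacks; performs no update logic of its own."""
    def __init__(self, params, cbs, **hypers):
        self.params, self.cbs, self.hypers = list(params), list(cbs), hypers
        self.state = {}  # per-parameter state, keyed by id(param)

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0

    def step(self):
        for p in self.params:
            state = self.state.setdefault(id(p), {})
            for cb in self.cbs:
                returned = cb(p, **{**state, **self.hypers})
                if returned:  # merge anything the callback hands back
                    state.update(returned)


def sgd_cb(p, lr, **kwargs):
    """One SGD step: add -lr times the gradient to the parameter."""
    p.value += -lr * p.grad


# Minimise (w - 3)^2, whose gradient is 2 * (w - 3).
w = Param(0.0)
opt = TinyOptimizer([w], cbs=[sgd_cb], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    w.grad = 2 * (w.value - 3.0)
    opt.step()
print(w.value)
```

Keeping the update rule in a callback means a different optimizer is just a different list of callbacks, with the surrounding loop unchanged.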
@ -431,7 +431,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's working! So that's how we create SGD from scratch in fastai. Now let's see see what this momentum is exactly."
"It's working! So that's how we create SGD from scratch in fastai. Now let's see see what \"momentum\"."
]
},
{
@ -445,20 +445,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"SGD is the idea of taking a step in the direction of the steepest slope at each point of time. But what if we have a ball rolling down the mountain? It won't, at each given point, exactly follow the direction of the gradient, as it will have *momentum*. A ball with more momentum (for instance, a heavier ball) will skip over little bumps and holes, and be more likely to get to the bottom of a bumpy mountain. A ping pong ball, on the other hand, will get stuck in every little crevice.\n",
"As described in <<chapter_mnist_basics>>, SGD can be thought of as standing at the top of a mountain and working your way down by taking a step in the direction of the steepest slope at each point in time. But what if we have a ball rolling down the mountain? It won't, at each given point, exactly follow the direction of the gradient, as it will have *momentum*. A ball with more momentum (for instance, a heavier ball) will skip over little bumps and holes, and be more likely to get to the bottom of a bumpy mountain. A ping pong ball, on the other hand, will get stuck in every little crevice.\n",
"\n",
"So how could we bring this idea over to SGD? We can use a moving average, instead of only the current gradient, to make our step:\n",
"So how can we bring this idea over to SGD? We can use a moving average, instead of only the current gradient, to make our step:\n",
"\n",
"```python\n",
"weight.avg = beta * weight.avg + (1-beta) * weight.grad\n",
"new_weight = weight - lr * weight.avg\n",
"```\n",
"\n",
"Here `beta` is some number we choose which defines how much momentum to use. If `beta` is zero, then the first equation above becomes `weight.avg = weight.grad`, so we end up with plain SGD. But if it's a number close to one, then the main direction chosen is an average of previous steps. (If you have done a bit of statistics, you may recognize in the first equation an *exponentially weighted moving average*, which is very often used to denoise data and get the underlying tendency.)\n",
"Here `beta` is some number we choose which defines how much momentum to use. If `beta` is 0, then the first equation becomes `weight.avg = weight.grad`, so we end up with plain SGD. But if it's a number close to 1, then the main direction chosen is an average of the previous steps. (If you have done a bit of statistics, you may recognize in the first equation an *exponentially weighted moving average*, which is very often used to denoise data and get the underlying tendency.)\n",
"\n",
"Note that we are writing `weight.avg` to highlight the fact we need to store thoe moving averages for each parameter of the model (and they all their own independent moving averages).\n",
"Note that we are writing `weight.avg` to highlight the fact that we need to store the moving averages for each parameter of the model (they all their own independent moving averages).\n",
"\n",
"<<img_momentum>> shows an example of noisy data for a single parameter, with the momentum curve plotted in red, and the gradients of the parameter plotted in blue. The gradients increase, and then decrease, and the momentum does a good job of following the general trend, without getting too influenced by noise."
"<<img_momentum>> shows an example of noisy data for a single parameter, with the momentum curve plotted in red, and the gradients of the parameter plotted in blue. The gradients increase, then decrease, and the momentum does a good job of following the general trend without getting too influenced by noise."
]
},
{
@ -503,9 +503,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It works particularly well if the loss function has narrow canyons we need to navigate: vanilla SGD would send us from one side to the other while SGD with momentum will average those to roll down inside. The parameter `beta` determines the strength of that momentum we are using: with a small beta we stay closer to the actual gradient values whereas with a high beta, we will mostly go in the direction of the average of the gradients and it will take a while before any change in the gradients makes that trend move.\n",
"It works particularly well if the loss function has narrow canyons we need to navigate: vanilla SGD would send us bouncing from one side to the other, while SGD with momentum will average those to roll smoothly down the side. The parameter `beta` determines the strength of the momentum we are using: with a small `beta` we stay closer to the actual gradient values, whereas with a high `beta` we will mostly go in the direction of the average of the gradients and it will take a while before any change in the gradients makes that trend move.\n",
"\n",
"With a large beta, we might miss that the gradients have changed directions and roll over a small local minima which is a desired side-effect: intuitively, when we show a new picture/text/data to our model, it will look like something in the training set but won't be exactly like it. That means it will correspond to a point in the loss function that is closest to the minimum we ended up with at the end of training, but not exactly *at* that minimum. We then would rather end up training in a wide minimum, where nearby points have approximately the same loss (or if you prefer, a point where the loss is as flat as possible). <<img_betas>> shows how the chart in <<img_momentum>> varies as we change beta."
"With a large `beta`, we might miss that the gradients have changed directions and roll over a small local minima. This is a desired side effect: intuitively, when we show a new input to our model, it will look like something in the training set but won't be *exactly* like it. That means it will correspond to a point in the loss function that is close to the minimum we ended up with at the end of training, but not exactly *at* that minimum. So, we would rather end up training in a wide minimum, where nearby points have approximately the same loss (or if you prefer, a point where the loss is as flat as possible). <<img_betas>> shows how the chart in <<img_momentum>> varies as we change `beta`."
]
},
{
@ -554,16 +554,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see in these examples that a beta that's too high results in the overall changes in gradient getting ignored. In SGD with momentum, a value of `beta` that is often used is 0.9.\n",
"We can see in these examples that a `beta` that's too high results in the overall changes in gradient getting ignored. In SGD with momentum, a value of `beta` that is often used is 0.9.\n",
"\n",
"`fit_one_cycle` by default starts with a beta of 0.95, gradually adjusts it to 0.85, and then gradually moves it back to 0.95 at the end of training. Let's see how our training goes with momentum added to plain SGD:"
"`fit_one_cycle` by default starts with a `beta` of 0.95, gradually adjusts it to 0.85, and then gradually moves it back to 0.95 at the end of training. Let's see how our training goes with momentum added to plain SGD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to add momentum to our optimizer, we'll first need to keep track of the moving average gradient, which we can do with another callback. When an optimizer callback returns a dict, it is used to update the state of the optimizer, and is passed back to the optimizer on the next step. So this callback will keep track of the gradient averages in a parameter called `grad_avg`:"
"In order to add momentum to our optimizer, we'll first need to keep track of the moving average gradient, which we can do with another callback. When an optimizer callback returns a `dict`, it is used to update the state of the optimizer and is passed back to the optimizer on the next step. So this callback will keep track of the gradient averages in a parameter called `grad_avg`:"
]
},
{
@ -606,7 +606,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`Learner` will automatically schedule `mom` and `lr`, so fit_one_cycle will even work with our custom Optimizer:"
"`Learner` will automatically schedule `mom` and `lr`, so `fit_one_cycle` will even work with our custom `Optimizer`:"
]
},
{
@ -705,18 +705,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"RMSProp is another variant of SGD introduced by Geoffrey Hinton in [Lecture 6e of his Coursera class](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). The main difference with SGD is that it uses an adaptive learning rate: instead of using the same learning rate for every parameter, each parameter gets it's own specific learning rate controlled by a global learning rate. That way we can speed up training by giving a high learning rate to the weights that needs to change a lot while the ones that are good enough get a lower learning rate.\n",
"RMSProp is another variant of SGD introduced by Geoffrey Hinton in Lecture 6e of his Coursera class [\"Neural Networks for Machine Learning\"](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). The main difference from SGD is that it uses an adaptive learning rate: instead of using the same learning rate for every parameter, each parameter gets its own specific learning rate controlled by a global learning rate. That way we can speed up training by giving a higher learning rate to the weights that need to change a lot while the ones that are good enough get a lower learning rate.\n",
"\n",
"How do we decide which parameter should have a high learning rate and which should not? We can look at the gradients to get an idea. Not just the one we computed, but all of them: if they have been close to 0 for a while, it means this parameter will need a higher learning rate because the loss is very flat. On the opposite, if they are all over the place, we should probably be careful and pick a low learning rate to avoid divergence. We can't just average the gradients to see if they're changing a lot, since the average of a large positive and a large negative number is close to zero. So we can use the usual trick of either taking the absolute value, or the squared values (and then taking the square root after the mean).\n",
"How do we decide which parameters should have a high learning rate and which should not? We can look at the gradients to get an idea. If a parameter's gradients have been close to zero for a while, that parameter will need a higher learning rate because the loss is flat. On the other hand, if the gradients are all over the place, we should probably be careful and pick a low learning rate to avoid divergence. We can't just average the gradients to see if they're changing a lot, because the average of a large positive and a large negative number is close to zero. Instead, we can use the usual trick of either taking the absolute value or the squared values (and then taking the square root after the mean).\n",
"\n",
"Once again, to pick the general tendency behind the noise, we will use a moving average, specifically the moving average of the gradients squared. Then, we will update the corresponding weight by using the current gradient (for the direction) divided by the square root of this moving average (that way if it's low, the effective learning rate will be higher, and if it's big, the effective learning rate will be lower).\n",
"Once again, to determine the general tendency behind the noise, we will use a moving average—specifically the moving average of the gradients squared. Then we will update the corresponding weight by using the current gradient (for the direction) divided by the square root of this moving average (that way if it's low, the effective learning rate will be higher, and if it's high, the effective learning rate will be lower):\n",
"\n",
"```python\n",
"w.square_avg = alpha * w.square_avg + (1-alpha) * (w.grad ** 2)\n",
"new_w = w - lr * w.grad / math.sqrt(w.square_avg + eps)\n",
"```\n",
"\n",
"The `eps` (*epsilon*) is added for numerical stability (usually set at 1e-8) and the default value for `alpha` is usually 0.99."
"The `eps` (*epsilon*) is added for numerical stability (usually set at 1e-8), and the default value for `alpha` is usually 0.99."
]
},
{
@ -841,14 +841,14 @@
"source": [
"Adam mixes the ideas of SGD with momentum and RMSProp together: it uses the moving average of the gradients as a direction and divides by the square root of the moving average of the gradients squared to give an adaptive learning rate to each parameter.\n",
"\n",
"There is one other difference with how Adam calculates moving averages, is that it takes the *unbiased* moving average which is:\n",
"There is one other difference in how Adam calculates moving averages. It takes the *unbiased* moving average, which is:\n",
"\n",
"``` python\n",
"w.avg = beta * w.avg + (1-beta) * w.grad\n",
"unbias_avg = w.avg / (1 - (beta**(i+1)))\n",
"```\n",
"\n",
"if we are the `i`-th iteration (starting at 0 like python does). This divisor of `1 - (beta**(i+1))` makes sure the unbiased average looks more like the gradients at the beginning (since `beta < 1` the denominator is very quickly very close to 1).\n",
"if we are the `i`-th iteration (starting at 0 like Python does). This divisor of `1 - (beta**(i+1))` makes sure the unbiased average looks more like the gradients at the beginning (since `beta < 1`, the denominator is very quickly close to 1).\n",
"\n",
"Putting everything together, our update step looks like:\n",
"``` python\n",
@ -858,11 +858,11 @@
"new_w = w - lr * unbias_avg / sqrt(w.sqr_avg + eps)\n",
"```\n",
"\n",
"Like for RMSProp, `eps` is usually set to 1e-8, and the default for `(beta1,beta2)` suggested by the literature `(0.9,0.999)`. \n",
"Like for RMSProp, `eps` is usually set to 1e-8, and the default for `(beta1,beta2)` suggested by the literature is `(0.9,0.999)`. \n",
"\n",
"In fastai, Adam is the default optimizer we use since it allows faster training, but we found that `beta2=0.99` is better suited for the type of schedule we are using. `beta1` is the momentum parameter, which we specify with the argument `moms` in our call to `fit_one_cycle`. As for `eps`, fastai uses a default of 1e-5. `eps` is not just useful for numerical stability. A higher `eps` limits the maximum value of the adjusted learning rate. To take an extreme example, if `eps` is 1, then the adjusted learning will never be higher than the base learning rate. \n",
"In fastai, Adam is the default optimizer we use since it allows faster training, but we've found that `beta2=0.99` is better suited to the type of schedule we are using. `beta1` is the momentum parameter, which we specify with the argument `moms` in our call to `fit_one_cycle`. As for `eps`, fastai uses a default of 1e-5. `eps` is not just useful for numerical stability. A higher `eps` limits the maximum value of the adjusted learning rate. To take an extreme example, if `eps` is 1, then the adjusted learning will never be higher than the base learning rate. \n",
"\n",
"Rather than show all the code for this in the book, we'll let you look at the optimizer notebook in fastai's GitHub repository--you'll see all the code we've seen so far, along with Adam and other optimizers, and lots of examples and tests.\n",
"Rather than show all the code for this in the book, we'll let you look at the optimizer notebook in [fastai's GitHub repository](https://github.com/fastai/fastai) (browse the *nbs* folder and search for the notebook called optimizer). You'll see all the code we've shown so far, along with Adam and other optimizers, and lots of examples and tests.\n",
"\n",
"One thing that changes when we go from SGD to Adam is the way we apply weight decay, and it can have important consequences."
]
@ -878,32 +878,28 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've discussed weight decay before, which is equivalent to (in the case of vanilla SGD) updating the parameters\n",
"Weight decay, which we discussed in <<chapter_collab>>, is equivalent to (in the case of vanilla SGD) updating the parameters\n",
"with:\n",
"\n",
"``` python\n",
"new_weight = weight - lr*weight.grad - lr*wd*weight\n",
"```\n",
"\n",
"This last formula explains why the name of this technique is weight decay, as each weight is decayed by a factor `lr * wd`. \n",
"The last part of this formula explains the name of this technique: each weight is decayed by a factor `lr * wd`. \n",
"\n",
"However, this only works correctly for standard SGD, because we have seen that with momentum, RMSProp or in Adam, the update has some additional formulas around the gradient. In those cases, the formula that comes from L2 regularization:\n",
"The other name of weight decay is L2 regularization, which consists in adding the sum of all squared weights to the loss (multiplied by the weight decay). As we have seen in <<chapter_collab>>, this can be directly expressed on the gradients with:\n",
"\n",
"``` python\n",
"weight.grad += wd*weight\n",
"```\n",
"\n",
"is different than weight decay:\n",
"For SGD, those two formulas are equivalent. However, this equivalence only holds for standard SGD, because we have seen that with momentum, RMSProp or in Adam, the update has some additional formulas around the gradient. \n",
"\n",
"``` python\n",
"new_weight = weight - lr*weight.grad - lr*wd*weight\n",
"```\n",
"\n",
"Most libraries use the first formulation, but it was pointed out in [Decoupled Weight Regularization](https://arxiv.org/pdf/1711.05101.pdf) by Ilya Loshchilov and Frank Hutter, second one is the only correct approach with the Adam optimizer or momentum, which is why fastai makes it its default.\n",
"Most libraries use the second formulation, but it was pointed out in [\"Decoupled Weight Decay Regularization\"](https://arxiv.org/pdf/1711.05101.pdf) by Ilya Loshchilov and Frank Hutter, that the first one is the only correct approach with the Adam optimizer or momentum, which is why fastai makes it its default.\n",
"\n",
"Now you know everything that is hidden behind the line `learn.fit_one_cycle`!\n",
"\n",
"OPtimizers are only one part of the training process. When you need to change the training loop with fastai, you can't directly change the code inside the library. Instead, we have designed a system of callbacks to let you write any tweak in independent blocks you can then mix and match. "
"Optimizers are only one part of the training process, however when you need to change the training loop with fastai, you can't directly change the code inside the library. Instead, we have designed a system of callbacks to let you write any tweaks you like in independent blocks that you can then mix and match. "
]
},
{
@ -917,7 +913,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes you need to change how things work a little bit. In fact, we have already seen examples of this: mixup, FP16 training, resetting the model after each epoch for training RNNs, and so forth. How do we go about making these kinds of tweaks to the training process?\n",
"Sometimes you need to change how things work a little bit. In fact, we have already seen examples of this: Mixup, fp16 training, resetting the model after each epoch for training RNNs, and so forth. How do we go about making these kinds of tweaks to the training process?\n",
"\n",
"We've seen the basic training loop, which, with the help of the `Optimizer` class, looks like this for a single epoch:\n",
"\n",
@ -943,13 +939,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The usual way for deep learning practitioners to customise the training loop is to make a copy of an existing training loop, and then insert their code necessary for their particular changes into it. This is how nearly all code that you find online will look. But it has some very serious problems.\n",
"The usual way for deep learning practitioners to customize the training loop is to make a copy of an existing training loop, and then insert the code necessary for their particular changes into it. This is how nearly all code that you find online will look. But it has some very serious problems.\n",
"\n",
"It's not very likely that some particular tweaked training loop is going to meet your particular needs. There are hundreds of changes that can be made to a training loop, which means there are billions and billions of possible permutations. You can't just copy one tweak from a training loop here, another from a training loop there, and expect them all to work together. Each will be based on different assumptions about the environment that it's working in, use different naming conventions, and expect the data to be in different formats.\n",
"\n",
"We need a way to allow users to insert their own code at any part of the training loop, but in a consistent and well-defined way. Computer scientists have already come up with an answer to this question: the callback. A callback is a piece of code that you write, and inject into another piece of code at some predefined point. In fact, callbacks have been used with deep learning training loops for years. The problem is that only a small subset of places that may require code injection have been available in previous libraries, and, more importantly, callbacks were not able to do all the things they needed to do.\n",
"We need a way to allow users to insert their own code at any part of the training loop, but in a consistent and well-defined way. Computer scientists have already come up with an elegant solution: the callback. A callback is a piece of code that you write, and inject into another piece of code at some predefined point. In fact, callbacks have been used with deep learning training loops for years. The problem is that in previous libraries it was only possible to inject code in a small subset of places where this may have been required, and, more importantly, callbacks were not able to do all the things they needed to do.\n",
"\n",
"In order to be just as flexible as manually copying and pasting a training loop and directly inserting code into it, a callback must be able to read every possible piece of information available in the training loop, modify all of it as needed, and fully control when a batch, epoch, or even all the whole training loop should be terminated. fastai is the first library to provide all of this functionality. It modifies the training loop so it looks like <<cb_loop>>."
"In order to be just as flexible as manually copying and pasting a training loop and directly inserting code into it, a callback must be able to read every possible piece of information available in the training loop, modify all of it as needed, and fully control when a batch, epoch, or even the whole training loop should be terminated. fastai is the first library to provide all of this functionality. It modifies the training loop so it looks like <<cb_loop>>."
]
},
{
@ -963,7 +959,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The real test of whether this works has been borne out over the last couple of years — it has turned out that every single new paper implemented, or use a request fulfilled, for modifying the training loop has successfully been achieved entirely by using the fastai callback system. The training loop itself has not required modifications. <<some_cbs>> shows just a few of the callbacks that have been added."
"The real effectiveness of this approach has been borne out over the last couple of years—it has turned out that, by using the fastai callback system, we were able to implement every single new paper we tried and fulfilled every user request for modifying the training loop. The training loop itself has not required modifications. <<some_cbs>> shows just a few of the callbacks that have been added."
]
},
{
@ -977,7 +973,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The reason that this is important for all of us is that it means that whatever idea we have in our head, we can implement it. We need never dig into the source code of PyTorch or fastai and act together some one-off system to try out our ideas. And when we do implement our own callbacks to develop our own ideas, we know that they will work together with all of the other functionality provided by fastai so we will get progress bars, mixed precision training, hyperparameter annealing, and so forth.\n",
"The reason that this is important is because it means that whatever idea we have in our head, we can implement it. We need never dig into the source code of PyTorch or fastai and hack together some one-off system to try out our ideas. And when we do implement our own callbacks to develop our own ideas, we know that they will work together with all of the other functionality provided by fastaiso we will get progress bars, mixed-precision training, hyperparameter annealing, and so forth.\n",
"\n",
"Another advantage is that it makes it easy to gradually remove or add functionality and perform ablation studies. You just need to adjust the list of callbacks you pass along to your fit function."
]
@ -1001,9 +997,9 @@
"finally: self('after_batch')\n",
"```\n",
"\n",
"The calls of the form `self('...')` are where the callbacks are called. As you see, after every step a callback is called. The callback will receive the entire state of training, and can also modify it. For instance, as you see above, the input data and target labels are in `self.xb` and `self.yb` respectively. A callback can modify these to modify the data the training loop sees. It can also modify `self.loss`, or even modify the gradients.\n",
"The calls of the form `self('...')` are where the callbacks are called. As you see, this happens after every step. The callback will receive the entire state of training, and can also modify it. For instance, the input data and target labels are in `self.xb` and `self.yb`, respectively; a callback can modify these to \\the data the training loop sees. It can also modify `self.loss`, or even the gradients.\n",
"\n",
"Let's see how this work in practice by writing a `Callback`."
"Let's see how this work in practice by writing a callback."
]
},
{
@ -1019,29 +1015,29 @@
"source": [
"When you want to write your own callback, the full list of available events is:\n",
"\n",
"- `begin_fit`:: called before doing anything, ideal for initial setup.\n",
"- `begin_epoch`:: called at the beginning of each epoch, useful for any behavior you need to reset at each epoch.\n",
"- `begin_fit`:: called before doing anything; ideal for initial setup.\n",
"- `begin_epoch`:: called at the beginning of each epoch; useful for any behavior you need to reset at each epoch.\n",
"- `begin_train`:: called at the beginning of the training part of an epoch.\n",
"- `begin_batch`:: called at the beginning of each batch, just after drawing said batch. It can be used to do any setup necessary for the batch (like hyper-parameter scheduling) or to change the input/target before it goes in the model (change of the input with techniques like mixup for instance).\n",
"- `after_pred`:: called after computing the output of the model on the batch. It can be used to change that output before it's fed to the loss.\n",
"- `after_loss`:: called after the loss has been computed, but before the backward pass. It can be used to add any penalty to the loss (AR or TAR in RNN training for instance).\n",
"- `after_backward`:: called after the backward pass, but before the update of the parameters. It can be used to do any change to the gradients before said update (gradient clipping for instance).\n",
"- `begin_batch`:: called at the beginning of each batch, just after drawing said batch. It can be used to do any setup necessary for the batch (like hyperparameter scheduling) or to change the input/target before it goes into the model (for instance, apply Mixup).\n",
"- `after_pred`:: called after computing the output of the model on the batch. It can be used to change that output before it's fed to the loss function.\n",
"- `after_loss`:: called after the loss has been computed, but before the backward pass. It can be used to add penalty to the loss (AR or TAR in RNN training, for instance).\n",
"- `after_backward`:: called after the backward pass, but before the update of the parameters. It can be used to make changes to the gradients before said update (via gradient clipping, for instance).\n",
"- `after_step`:: called after the step and before the gradients are zeroed.\n",
"- `after_batch`:: called at the end of a batch, for any clean-up before the next one.\n",
"- `after_batch`:: called at the end of a batch, for to perform any required cleanup before the next one.\n",
"- `after_train`:: called at the end of the training phase of an epoch.\n",
"- `begin_validate`:: called at the beginning of the validation phase of an epoch, useful for any setup needed specifically for validation.\n",
"- `begin_validate`:: called at the beginning of the validation phase of an epoch; useful for any setup needed specifically for validation.\n",
"- `after_validate`:: called at the end of the validation part of an epoch.\n",
"- `after_epoch`:: called at the end of an epoch, for any clean-up before the next one.\n",
"- `after_fit`:: called at the end of training, for final clean-up.\n",
"- `after_epoch`:: called at the end of an epoch, for any cleanup before the next one.\n",
"- `after_fit`:: called at the end of training, for final cleanup.\n",
"\n",
"This list is available as attributes of the special variable `event`; so just type `event.` and hit `Tab` in your notebook to see a list of all the options"
"This elements of that list are available as attributes of the special variable `event`, so you can just type `event.` and hit Tab in your notebook to see a list of all the options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at an example. Do you recall how in <<chapter_nlp_dive>> we needed to ensure that our special `reset` method was called at the start of training and validation for each epoch? We used the `ModelReseter` callback provided by fastai to do this for us. But how did `ModelReseter` do that exactly? Here's the full actual source code to that class:"
"Let's take a look at an example. Do you recall how in <<chapter_nlp_dive>> we needed to ensure that our special `reset` method was called at the start of training and validation for each epoch? We used the `ModelReseter` callback provided by fastai to do this for us. But how does it owrk? Here's the full source code for that class:"
]
},
{
@ -1059,9 +1055,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes, that's actually it! It just does what we said in the paragraph above: after completing training and epoch or validation for an epoch, call a method named `reset`.\n",
"Yes, that's actually it! It just does what we said in the preceding paragraph: after completing training or validation for an epoch, call a method named `reset`.\n",
"\n",
"Callbacks are often \"short and sweet\" like this one. In fact, let's look at one more. Here's the fastai source for the callback that add RNN regularization (*AR* and *TAR*):"
"Callbacks are often \"short and sweet\" like this one. In fact, let's look at one more. Here's the fastai source for the callback that adds RNN regularization (AR and TAR):"
]
},
{
@ -1092,14 +1088,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> stop: Go back to where we discussed TAR and AR regularization, and compare to the code here. Made sure you understand what it's doing, and why."
"> note: Code It Yourself: Go back and reread \"Activation Regularization and Temporal Activation Regularization\" in <<chapter_nlp_dive>> then take another look at the code here. Make sure you understand what it's doing, and why."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In both of these examples, notice how we can access attributes of the training loop by directly checking `self.model` or `self.pred`. That's because a `Callback` will always try to get an attribute it doesn't have inside the `Learner` associated to it. This is a shortcut for `self.learn.model` or `self.learn.pred`. Note that this shortcut works for reading attributes, but not for writing them, which is why when `RNNRegularizer` changes the loss or the predictions, you see `self.learn.loss = ` or `self.learn.pred = `. "
"In both of these examples, notice how we can access attributes of the training loop by directly checking `self.model` or `self.pred`. That's because a `Callback` will always try to get an attribute it doesn't have inside the `Learner` associated with it. These are shortcuts for `self.learn.model` or `self.learn.pred`. Note that they work for reading attributes, but not for writing them, which is why when `RNNRegularizer` changes the loss or the predictions you see `self.learn.loss = ` or `self.learn.pred = `. "
]
},
{
@ -1108,31 +1104,31 @@
"source": [
"When writing a callback, the following attributes of `Learner` are available:\n",
"\n",
"- `model`: the model used for training/validation\n",
"- `data`: the underlying `DataLoaders`\n",
"- `loss_func`: the loss function used\n",
"- `opt`: the optimizer used to udpate the model parameters\n",
"- `opt_func`: the function used to create the optimizer\n",
"- `cbs`: the list containing all `Callback`s\n",
"- `dl`: current `DataLoader` used for iteration\n",
"- `x`/`xb`: last input drawn from `self.dl` (potentially modified by callbacks). `xb` is always a tuple (potentially with one element) and `x` is detuplified. You can only assign to `xb`.\n",
"- `y`/`yb`: last target drawn from `self.dl` (potentially modified by callbacks). `yb` is always a tuple (potentially with one element) and `y` is detuplified. You can only assign to `yb`.\n",
"- `pred`: last predictions from `self.model` (potentially modified by callbacks)\n",
"- `loss`: last computed loss (potentially modified by callbacks)\n",
"- `n_epoch`: the number of epochs in this training\n",
"- `n_iter`: the number of iterations in the current `self.dl`\n",
"- `epoch`: the current epoch index (from 0 to `n_epoch-1`)\n",
"- `iter`: the current iteration index in `self.dl` (from 0 to `n_iter-1`)\n",
"- `model`:: The model used for training/validation.\n",
"- `data`:: The underlying `DataLoaders`.\n",
"- `loss_func`:: The loss function used.\n",
"- `opt`:: The optimizer used to udpate the model parameters.\n",
"- `opt_func`:: The function used to create the optimizer.\n",
"- `cbs`:: The list containing all the `Callback`s.\n",
"- `dl`:: The current `DataLoader` used for iteration.\n",
"- `x`/`xb`:: The last input drawn from `self.dl` (potentially modified by callbacks). `xb` is always a tuple (potentially with one element) and `x` is detuplified. You can only assign to `xb`.\n",
"- `y`/`yb`:: The last target drawn from `self.dl` (potentially modified by callbacks). `yb` is always a tuple (potentially with one element) and `y` is detuplified. You can only assign to `yb`.\n",
"- `pred`:: The last predictions from `self.model` (potentially modified by callbacks).\n",
"- `loss`:: The last computed loss (potentially modified by callbacks).\n",
"- `n_epoch`:: The number of epochs in this training.\n",
"- `n_iter`:: The number of iterations in the current `self.dl`.\n",
"- `epoch`:: The current epoch index (from 0 to `n_epoch-1`).\n",
"- `iter`:: The current iteration index in `self.dl` (from 0 to `n_iter-1`).\n",
"\n",
"The following attributes are added by `TrainEvalCallback` and should be available unless you went out of your way to remove that callback:\n",
"\n",
"- `train_iter`: the number of training iterations done since the beginning of this training\n",
"- `pct_train`: from 0. to 1., the percentage of training iterations completed\n",
"- `training`: flag to indicate if we're in training mode or not\n",
"- `train_iter`:: The number of training iterations done since the beginning of this training\n",
"- `pct_train`:: The percentage of training iterations completed (from 0. to 1.)\n",
"- `training`:: A flag to indicate whether or not we're in training mode\n",
"\n",
"The following attribute is added by `Recorder` and should be available unless you went out of your way to remove that callback:\n",
"\n",
"- `smooth_loss`: an exponentially-averaged version of the training loss"
"- `smooth_loss`:: An exponentially averaged version of the training loss"
]
},
{
@ -1173,40 +1169,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The way it tells the training loop to interrupt training at this point is to `raise CancelFitException`. The training loop catches this exception and does not run any further training or validation. The callback control flow exceptions available are:\n",
"The line `raise CancelFitException` tells the training loop to interrupt training at this point. The training loop catches this exception and does not run any further training or validation. The callback control flow exceptions available are:\n",
"\n",
"- `CancelFitException`:: Skip the rest of this batch and go to `after_batch\n",
"- `CancelEpochException`:: Skip the rest of the training part of the epoch and go to `after_train\n",
"- `CancelTrainException`:: Skip the rest of the validation part of the epoch and go to `after_validate\n",
"- `CancelValidException`:: Skip the rest of this epoch and go to `after_epoch\n",
"- `CancelBatchException`:: Interrupts training and go to `after_fit"
"- `CancelFitException`:: Skip the rest of this batch and go to `after_batch`.\n",
"- `CancelEpochException`:: Skip the rest of the training part of the epoch and go to `after_train`.\n",
"- `CancelTrainException`:: Skip the rest of the validation part of the epoch and go to `after_validate`.\n",
"- `CancelValidException`:: Skip the rest of this epoch and go to `after_epoch`.\n",
"- `CancelBatchException`:: Interrupt training and go to `after_fit`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can detect one of those exceptions occurred and add code that executes right after with the following events:\n",
"You can detect if one of those exceptions has occurred and add code that executes right after with the following events:\n",
"\n",
"- `after_cancel_batch`:: reached immediately after a `CancelBatchException` before proceeding to `after_batch`\n",
"- `after_cancel_train`:: reached immediately after a `CancelTrainException` before proceeding to `after_epoch`\n",
"- `after_cancel_valid`:: reached immediately after a `CancelValidException` before proceeding to `after_epoch`\n",
"- `after_cancel_epoch`:: reached immediately after a `CancelEpochException` before proceeding to `after_epoch`\n",
"- `after_cancel_fit`:: reached immediately after a `CancelFitException` before proceeding to `after_fit`"
"- `after_cancel_batch`:: Reached immediately after a `CancelBatchException` before proceeding to `after_batch`\n",
"- `after_cancel_train`:: Reached immediately after a `CancelTrainException` before proceeding to `after_epoch`\n",
"- `after_cancel_valid`:: Reached immediately after a `CancelValidException` before proceeding to `after_epoch`\n",
"- `after_cancel_epoch`:: Reached immediately after a `CancelEpochException` before proceeding to `after_epoch`\n",
"- `after_cancel_fit`:: Reached immediately after a `CancelFitException` before proceeding to `after_fit`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, callbacks need to be called in a particular order. In the case of `TerminateOnNaNCallback`, it's important that `Recorder` runs its `after_batch` after this callback, to avoid registering an NaN loss. You can specify `run_before` (this callback must run before ...) or `run_after` (this callback must run after ...) in your callback to ensure the ordering that you need."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have seen how to tweak the training loop of fastai to do anything we need, let's take a step back and dig a little bit deeper in the foundations of that training loop."
"Sometimes, callbacks need to be called in a particular order. For example, in the case of `TerminateOnNaNCallback`, it's important that `Recorder` runs its `after_batch` after this callback, to avoid registering an `NaN` loss. You can specify `run_before` (this callback must run before ...) or `run_after` (this callback must run after ...) in your callback to ensure the ordering that you need."
]
},
{
@ -1220,9 +1209,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have dug into the training loop by looking at all the variants of SGD and why they can be more powerful. At the time of writing, research has been very active in developping new optimizers, so by the time you read this chapter, there may be an addendum to this chapter on the book website that presents new variants. Be sure to check how our general optimizer framework can help you implement new optimizers very fast.\n",
"In this chapter we took a close look at the training loop, explorig differnet variants of SGD and why they can be more powerful. At the time of writing developping new optimizers is a very active area of research, so by the time you read this chapter there may be an addendum on the book's website that presents new variants. Be sure to check out how our general optimizer framework can help you implement new optimizers very quickly.\n",
"\n",
"For every tweak of the training loop, we use a powerful `Callback` system that allows you to customize every bit of that loop by being able to inspect and modify any parameter between each step of the training loop."
"We also examined the powerful callback system that allows you to customize every bit of the training loop by enabling you to inspect and modify any parameter you like between each step."
]
},
{
@ -1242,7 +1231,7 @@
"1. What does `zero_grad` do in an optimizer?\n",
"1. What does `step` do in an optimizer? How is it implemented in the general optimizer?\n",
"1. Rewrite `sgd_cb` to use the `+=` operator, instead of `add_`.\n",
"1. What is momentum? Write out the equation.\n",
"1. What is \"momentum\"? Write out the equation.\n",
"1. What's a physical analogy for momentum? How does it apply in our model training settings?\n",
"1. What does a bigger value for momentum do to the gradients?\n",
"1. What are the default values of momentum for 1cycle training?\n",
@ -1250,18 +1239,18 @@
"1. What do the squared values of the gradients indicate?\n",
"1. How does Adam differ from momentum and RMSProp?\n",
"1. Write out the equation for Adam.\n",
"1. Calculate the value of `unbias_avg` and `w.avg` for a few batches of dummy values.\n",
"1. What's the impact of having a high eps in Adam?\n",
"1. Calculate the values of `unbias_avg` and `w.avg` for a few batches of dummy values.\n",
"1. What's the impact of having a high `eps` in Adam?\n",
"1. Read through the optimizer notebook in fastai's repo, and execute it.\n",
"1. In what situations do dynamic learning rate methods like Adam change the behaviour of weight decay?\n",
"1. In what situations do dynamic learning rate methods like Adam change the behavior of weight decay?\n",
"1. What are the four steps of a training loop?\n",
"1. Why is the use of callbacks better than writing a new training loop for each tweak you want to add?\n",
"1. What are the necessary points in the design of the fastai's callback system that make it as flexible as copying and pasting bits of code?\n",
"1. Why is using callbacks better than writing a new training loop for each tweak you want to add?\n",
"1. What aspects of the design of fastai's callback system make it as flexible as copying and pasting bits of code?\n",
"1. How can you get the list of events available to you when writing a callback?\n",
"1. Write the `ModelResetter` callback (without peeking).\n",
"1. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcut that goes with it?\n",
"1. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcuts that go with them?\n",
"1. How can a callback influence the control flow of the training loop.\n",
"1. Write the `TerminateOnNaN` callback (without peeking if possible).\n",
"1. Write the `TerminateOnNaN` callback (without peeking, if possible).\n",
"1. How do you make sure your callback runs after or before another callback?"
]
},
@ -1276,8 +1265,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Look up the \"rectified Adam\" paper and implement it using the general optimizer framework, and try it out. Search for other recent optimizers that work well in practice, and pick one to implement.\n",
"1. Look at the mixed precision callback with the documentation. Try to understand what each event and line of code does.\n",
"1. Look up the \"Rectified Adam\" paper, implement it using the general optimizer framework, and try it out. Search for other recent optimizers that work well in practice, and pick one to implement.\n",
"1. Look at the mixed-precision callback with the documentation. Try to understand what each event and line of code does.\n",
"1. Implement your own version of ther learning rate finder from scratch. Compare it with fastai's version.\n",
"1. Look at the source code of the callbacks that ship with fastai. See if you can find one that's similar to what you're looking to do, to get some inspiration."
]
@ -1293,11 +1282,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations, you have made it to the end of the \"foundations of deep learning\" section. You now understand how all of fastai's applications and most important architectures are built, and the recommended ways to train them, and have all the information you need to build these from scratch. Whilst you probably won't need to create your own training loop, or batchnorm layer, for instance, knowing what is going on behind the scenes is very helpful for debugging, profiling, and deploying your solutions.\n",
"Congratulations, you have made it to the end of the \"foundations of deep learning\" section of the book! You now understand how all of fastai's applications and most important architectures are built, and the recommended ways to train them—and you have all the information you need to build these from scratch. While you probably won't need to create your own training loop, or batchnorm layer, for instance, knowing what is going on behind the scenes is very helpful for debugging, profiling, and deploying your solutions.\n",
"\n",
"Since you understand all of the foundations of fastai's applications now, be sure to spend some time digging through fastai's source notebooks, and running and experimenting with parts of them, since you can and see exactly how everything in fastai is developed.\n",
"Since you understand the foundations of fastai's applications now, be sure to spend some time digging through the source notebooks and running and experimenting with parts of them. This will give you a better idea of how everything in fastai is developed.\n",
"\n",
"In the next section, we will be looking even further under the covers, to see how the actual forward and backward passes of a neural network are done, and we will see what tools are at our disposal to get better performance. We will then finish up with a project that brings together everything we have learned throughout the book, which we will use to build a method for interpreting convolutional neural networks."
"In the next section, we will be looking even further under the covers: we'll explore how the actual forward and backward passes of a neural network are done, and we will see what tools are at our disposal to get better performance. We will then continue with a project that brings together all the material in the book, which we will use to build a tool for interpreting convolutional neural networks. Last but not least, we'll finish by building fastai's `Learner` class from scratch."
]
},
{

View File

@ -983,7 +983,7 @@
"source": [
"#hide\n",
"# !pip install voila\n",
"# !jupyter serverextension enable voila --sys-prefix"
"# !jupyter serverextension enable voila sys-prefix"
]
},
{

View File

@ -266,7 +266,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Section 1: That's a Wrap!"
"## Deep Learning in Practice: That's a Wrap!"
]
},
{

View File

@ -1719,7 +1719,7 @@
"\n",
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
"1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.\n",
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas).\n",
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full datasetsee if you can use those too (the next chapter might give you ideas).\n",
"1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter."
]
},

View File

@ -6859,7 +6859,7 @@
"outputs": [],
"source": [
"#hide\n",
"# pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn --U"
"# pip install —pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn —U"
]
},
{

View File

@ -851,7 +851,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Becoming a Deep Learning Practitioner"
"## Understanding fastai's Applications: Wrap Up"
]
},
{

View File

@ -23,7 +23,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's Start with SGD"
"## Establishing a Baseline"
]
},
{
@ -687,7 +687,7 @@
"1. What does `zero_grad` do in an optimizer?\n",
"1. What does `step` do in an optimizer? How is it implemented in the general optimizer?\n",
"1. Rewrite `sgd_cb` to use the `+=` operator, instead of `add_`.\n",
"1. What is momentum? Write out the equation.\n",
"1. What is \"momentum\"? Write out the equation.\n",
"1. What's a physical analogy for momentum? How does it apply in our model training settings?\n",
"1. What does a bigger value for momentum do to the gradients?\n",
"1. What are the default values of momentum for 1cycle training?\n",
@ -695,18 +695,18 @@
"1. What do the squared values of the gradients indicate?\n",
"1. How does Adam differ from momentum and RMSProp?\n",
"1. Write out the equation for Adam.\n",
"1. Calculate the value of `unbias_avg` and `w.avg` for a few batches of dummy values.\n",
"1. What's the impact of having a high eps in Adam?\n",
"1. Calculate the values of `unbias_avg` and `w.avg` for a few batches of dummy values.\n",
"1. What's the impact of having a high `eps` in Adam?\n",
"1. Read through the optimizer notebook in fastai's repo, and execute it.\n",
"1. In what situations do dynamic learning rate methods like Adam change the behaviour of weight decay?\n",
"1. In what situations do dynamic learning rate methods like Adam change the behavior of weight decay?\n",
"1. What are the four steps of a training loop?\n",
"1. Why is the use of callbacks better than writing a new training loop for each tweak you want to add?\n",
"1. What are the necessary points in the design of the fastai's callback system that make it as flexible as copying and pasting bits of code?\n",
"1. Why is using callbacks better than writing a new training loop for each tweak you want to add?\n",
"1. What aspects of the design of fastai's callback system make it as flexible as copying and pasting bits of code?\n",
"1. How can you get the list of events available to you when writing a callback?\n",
"1. Write the `ModelResetter` callback (without peeking).\n",
"1. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcut that goes with it?\n",
"1. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcuts that go with them?\n",
"1. How can a callback influence the control flow of the training loop.\n",
"1. Write the `TerminateOnNaN` callback (without peeking if possible).\n",
"1. Write the `TerminateOnNaN` callback (without peeking, if possible).\n",
"1. How do you make sure your callback runs after or before another callback?"
]
},
@ -721,8 +721,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Look up the \"rectified Adam\" paper and implement it using the general optimizer framework, and try it out. Search for other recent optimizers that work well in practice, and pick one to implement.\n",
"1. Look at the mixed precision callback with the documentation. Try to understand what each event and line of code does.\n",
"1. Look up the \"Rectified Adam\" paper, implement it using the general optimizer framework, and try it out. Search for other recent optimizers that work well in practice, and pick one to implement.\n",
"1. Look at the mixed-precision callback with the documentation. Try to understand what each event and line of code does.\n",
"1. Implement your own version of ther learning rate finder from scratch. Compare it with fastai's version.\n",
"1. Look at the source code of the callbacks that ship with fastai. See if you can find one that's similar to what you're looking to do, to get some inspiration."
]
@ -738,11 +738,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations, you have made it to the end of the \"foundations of deep learning\" section. You now understand how all of fastai's applications and most important architectures are built, and the recommended ways to train them, and have all the information you need to build these from scratch. Whilst you probably won't need to create your own training loop, or batchnorm layer, for instance, knowing what is going on behind the scenes is very helpful for debugging, profiling, and deploying your solutions.\n",
"Congratulations, you have made it to the end of the \"foundations of deep learning\" section of the book! You now understand how all of fastai's applications and most important architectures are built, and the recommended ways to train them—and you have all the information you need to build these from scratch. While you probably won't need to create your own training loop, or batchnorm layer, for instance, knowing what is going on behind the scenes is very helpful for debugging, profiling, and deploying your solutions.\n",
"\n",
"Since you understand all of the foundations of fastai's applications now, be sure to spend some time digging through fastai's source notebooks, and running and experimenting with parts of them, since you can and see exactly how everything in fastai is developed.\n",
"Since you understand the foundations of fastai's applications now, be sure to spend some time digging through the source notebooks and running and experimenting with parts of them. This will give you a better idea of how everything in fastai is developed.\n",
"\n",
"In the next section, we will be looking even further under the covers, to see how the actual forward and backward passes of a neural network are done, and we will see what tools are at our disposal to get better performance. We will then finish up with a project that brings together everything we have learned throughout the book, which we will use to build a method for interpreting convolutional neural networks."
"In the next section, we will be looking even further under the covers: we'll explore how the actual forward and backward passes of a neural network are done, and we will see what tools are at our disposal to get better performance. We will then continue with a project that brings together all the material in the book, which we will use to build a tool for interpreting convolutional neural networks. Last but not least, we'll finish by building fastai's `Learner` class from scratch."
]
},
{