Merge pull request #1 from fastai/master

Updating from source
This commit is contained in:
Happy Sugar Life 2020-08-21 09:08:23 -12:00 committed by GitHub
commit 927b05d537
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
69 changed files with 47163 additions and 10641 deletions

7
.gitignore vendored
View File

@ -1 +1,8 @@
tmp/
*.bak
*.pkl
bears/
__pycache__/
.last_checked
.gitconfig
.ipynb_checkpoints/

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -7,7 +7,19 @@
"outputs": [],
"source": [
"#hide\n",
"from utils import *"
"!pip install -Uqq fastbook\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from fastbook import *"
]
},
{
@ -21,18 +33,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training a state-of-the-art model"
"# Training a State-of-the-Art Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter introduces more advanced techniques for training an image classification model and get state-of-the-art results. You can skip it if you want to learn more about other applications of deep learning and come back to it later--nothing in this chapter will be assumed in later chapters.\n",
"This chapter introduces more advanced techniques for training an image classification model and getting state-of-the-art results. You can skip it if you want to learn more about other applications of deep learning and come back to it later—knowledge of this material will not be assumed in later chapters.\n",
"\n",
"We will look at powerful data augmentation techniques, the *progressive resizing* approach and test time augmentation. To show all of this, we are going to train a model from scratch (not transfer learning) using a subset of ImageNet called [Imagenette](https://github.com/fastai/imagenette). It contains ten very different categories from the original ImageNet dataset, making for quicker training when we want to experiment.\n",
"We will look at what normalization is, a powerful data augmentation technique called mixup, the progressive resizing approach and test time augmentation. To show all of this, we are going to train a model from scratch (not using transfer learning) using a subset of ImageNet called [Imagenette](https://github.com/fastai/imagenette). It contains a subset of 10 very different categories from the original ImageNet dataset, making for quicker training when we want to experiment.\n",
"\n",
"This is going to be much harder to do well than our previous datasets because we're using full-size, full-color images, which are photos of objects of different sizes, in different orientations, in different lighting, and so forth... So in this chapter we're going to introduce some important techniques for getting the most out of your dataset, especially when you're training from scratch, or transfer learning to a very different kind of dataset to what the pretrained model used."
"This is going to be much harder to do well than with our previous datasets because we're using full-size, full-color images, which are photos of objects of different sizes, in different orientations, in different lighting, and so forth. So, in this chapter we're going to introduce some important techniques for getting the most out of your dataset, especially when you're training from scratch, or using transfer learning to train a model on a very different kind of dataset than the pretrained model used."
]
},
{
@ -48,17 +60,17 @@
"source": [
"When fast.ai first started there were three main datasets that people used for building and testing computer vision models:\n",
"\n",
"- *ImageNet*: 1.3 million images of various sizes around 500 pixels across, in 1000 categories, which took a few days to train\n",
"- *MNIST*: 50,000 28x28 pixel greyscale handwritten digits\n",
"- *CIFAR10*: 60,000 32x32 colour images in 10 classes\n",
"- ImageNet:: 1.3 million images of various sizes around 500 pixels across, in 1,000 categories, which took a few days to train\n",
"- MNIST:: 50,000 28×28-pixel grayscale handwritten digits\n",
"- CIFAR10:: 60,000 32×32-pixel color images in 10 classes\n",
"\n",
"The problem is that the small datasets didn't actually generalise effectively to the large ImageNet dataset. The approaches that worked well on ImageNet generally had to be developed and trained on ImageNet. This led to many people believing that only researchers with access to giant computing resources could effectively contribute to developing image classification algorithms.\n",
"The problem was that the smaller datasets didn't actually generalize effectively to the large ImageNet dataset. The approaches that worked well on ImageNet generally had to be developed and trained on ImageNet. This led to many people believing that only researchers with access to giant computing resources could effectively contribute to developing image classification algorithms.\n",
"\n",
"We thought that seemed very unlikely to be true. We had never actually seen a study that showed that ImageNet happen to be exactly the right size, and that other datasets could not be developed which would provide useful insights. So we thought we would try to create a new dataset which researchers could test their algorithms on quickly and cheaply, but which would also provide insights likely to work on the full ImageNet dataset.\n",
"We thought that seemed very unlikely to be true. We had never actually seen a study that showed that ImageNet happen to be exactly the right size, and that other datasets could not be developed which would provide useful insights. So we thought we would try to create a new dataset that researchers could test their algorithms on quickly and cheaply, but which would also provide insights likely to work on the full ImageNet dataset.\n",
"\n",
"About three hours later we had created Imagenette. We selected 10 classes from the full ImageNet which look very different to each other. We hope that it would be possible to create a classifier that worked to recognise these classes quickly and cheaply. When we tried it out, we discovered we were right. We then tried out a few algorithmic tweaks to see how they impacted Imagenette, found some which worked pretty well, and tested them on ImageNet as well — we were very pleased to find that our tweaks worked well on ImageNet too!\n",
"About three hours later we had created Imagenette. We selected 10 classes from the full ImageNet that looked very different from one another. As we had hopep, we were able to quickly and cheaply create a classifier capable of recognizing these classes. We then tried out a few algorithmic tweaks to see how they impacted Imagenette. We found some that worked pretty well, and tested them on ImageNet as well—and we were very pleased to find that our tweaks worked well on ImageNet too!\n",
"\n",
"There is an important message here: the dataset you get given is not necessarily the dataset you want; it's particularly unlikely to be the dataset that you want to do your development and prototyping in. You should aim to have an iteration speed of no more than a couple of minutes that is, when you come up with a new idea you want to try out, you should be able to train a model and see how it goes within a couple of minutes. If it's taking longer to do an experiment, think about how you could cut down your dataset, or simplify your model, to improve your experimentation speed. The more experiments you can do, the better!\n",
"There is an important message here: the dataset you get given is not necessarily the dataset you want. It's particularly unlikely to be the dataset that you want to do your development and prototyping in. You should aim to have an iteration speed of no more than a couple of minutes—that is, when you come up with a new idea you want to try out, you should be able to train a model and see how it goes within a couple of minutes. If it's taking longer to do an experiment, think about how you could cut down your dataset, or simplify your model, to improve your experimentation speed. The more experiments you can do, the better!\n",
"\n",
"Let's get started with this dataset:"
]
@ -69,7 +81,7 @@
"metadata": {},
"outputs": [],
"source": [
"from fastai2.vision.all import *\n",
"from fastai.vision.all import *\n",
"path = untar_data(URLs.IMAGENETTE)"
]
},
@ -77,7 +89,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First we'll get our dataset into a `DataLoaders` object, using the *presizing* trick we saw in <<chapter_pet_breeds>>:"
"First we'll get our dataset into a `DataLoaders` object, using the *presizing* trick introduced in <<chapter_pet_breeds>>:"
]
},
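{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what that looks like (the specific sizes, scale, and batch size below are illustrative assumptions, not the notebook's exact values), presizing resizes each item to a generous size on the CPU, then augments down to the final size on the GPU, a whole batch at a time:\n",
"\n",
"```python\n",
"dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),\n",
"                   get_items=get_image_files,\n",
"                   get_y=parent_label,\n",
"                   item_tfms=Resize(460),                                # large resize per item, on the CPU\n",
"                   batch_tfms=aug_transforms(size=224, min_scale=0.75))  # augment whole batches on the GPU\n",
"dls = dblock.dataloaders(path, bs=64)\n",
"```"
]
},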
{
@ -98,7 +110,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
" ...and do a training that will serve as a baseline:"
"and do a training run that will serve as a baseline:"
]
},
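{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of such a baseline, assuming the `dls` built above (the learning rate and epoch count are assumptions): train an `xresnet50` from scratch with plain cross-entropy and track accuracy.\n",
"\n",
"```python\n",
"model = xresnet50(n_out=dls.c)   # n_out set to the number of classes in dls\n",
"learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)\n",
"learn.fit_one_cycle(5, 3e-3)\n",
"```"
]
},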
{
@ -176,7 +188,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a good baseline, since we are not using a pretrained model, but we can do better. When working with models that are being trained from scratch, or fine-tuned to a very different dataset to that used for the pretraining, there are additional techniques that are really important. In the rest of the chapter we'll consider some of the key approaches you'll want to be familiar with. The first one is normalizing your data."
"That's a good baseline, since we are not using a pretrained model, but we can do better. When working with models that are being trained from scratch, or fine-tuned to a very different dataset than the one used for the pretraining, there are some additional techniques that are really important. In the rest of the chapter we'll consider some of the key approaches you'll want to be familiar with. The first one is *normalizing* your data."
]
},
{
@ -190,7 +202,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When training a model, it helps if your input data is normalized, that is, as a mean of 0 and a standard deviation of 1. But most images and computer vision libraries will use values between 0 and 255 for pixels, or between 0 and 1; in either case, your data is not going to have a mean of zero and a standard deviation of one.\n",
"When training a model, it helps if your input data is normalized—that is, has a mean of 0 and a standard deviation of 1. But most images and computer vision libraries use values between 0 and 255 for pixels, or between 0 and 1; in either case, your data is not going to have a mean of 0 and a standard deviation of 1.\n",
"\n",
"Let's grab a batch of our data and look at those values, by averaging over all axes except for the channel axis, which is axis 1:"
]
@ -221,9 +233,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we expected, its mean and standard deviation is not very close to the desired values of zero and one. This is easy to do in fastai by adding the `Normalize` transform. This acts on a whole mini batch at once, so you can add it to the `batch_tfms` section of your data block. You need to pass to this transform the mean and standard deviation that you want to use; fastai comes with the standard ImageNet mean and standard deviation already defined. (If you do not pass any statistics to the Normalize transform, fastai will automatically calculate them from a single batch of your data.)\n",
"As we expected, the mean and standard deviation are not very close to the desired values. Fortunately, normalizing the data is easy to do in fastai by adding the `Normalize` transform. This acts on a whole mini-batch at once, so you can add it to the `batch_tfms` section of your data block. You need to pass to this transform the mean and standard deviation that you want to use; fastai comes with the standard ImageNet mean and standard deviation already defined. (If you do not pass any statistics to the `Normalize` transform, fastai will automatically calculate them from a single batch of your data.)\n",
"\n",
"Let's add this transform (using `imagenet_stats` as Imagenette is a subset of ImageNet) and have a look at one batch now:"
"Let's add this transform (using `imagenet_stats` as Imagenette is a subset of ImageNet) and take a look at one batch now:"
]
},
{
@ -277,7 +289,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check how normalization helps training our model here:"
"Let's check how what effet this had on training our model:"
]
},
{
@ -355,27 +367,27 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Although it only helped a little here, normalization becomes especially important when using pretrained models. The pretrained model only knows how to work with data of the type that it has seen before. If the average pixel was zero in the data it was trained with, but your data has zero as the minimum possible value of a pixel, then the model is going to be seeing something very different to what is intended! \n",
"Although it only helped a little here, normalization becomes especially important when using pretrained models. The pretrained model only knows how to work with data of the type that it has seen before. If the average pixel value was 0 in the data it was trained with, but your data has 0 as the minimum possible value of a pixel, then the model is going to be seeing something very different to what is intended! \n",
"\n",
"This means that when you distribute a model, you need to also distribute the statistics used for normalization, since anyone using it for inference, or transfer learning, will need to use the same statistics. By the same token, if you're using a model that someone else has trained, make sure you find out what normalization statistics they used, and match them.\n",
"\n",
"We didn't have to handle normalization in previous chapters because when using a pretrained model through `cnn_learner`, the fastai library automatically adds the proper `Normalize` transform; the model has been pretrained with certain statistics in `Normalize` (usually coming from the ImageNet dataset), so the library can fill those for you. Note that this only applies with pretrained models, which is why we need to add it manually here, when training from scratch.\n",
"We didn't have to handle normalization in previous chapters because when using a pretrained model through `cnn_learner`, the fastai library automatically adds the proper `Normalize` transform; the model has been pretrained with certain statistics in `Normalize` (usually coming from the ImageNet dataset), so the library can fill those in for you. Note that this only applies with pretrained models, which is why we need to add this information manually here, when training from scratch.\n",
"\n",
"All our training up until now have been done at size 224. We could have begun training at a smaller size before going to that. This is called *progressive resizing*."
"All our training up until now has been done at size 224. We could have begun training at a smaller size before going to that. This is called *progressive resizing*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Progressive resizing"
"## Progressive Resizing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When fast.ai and its team of students [won the DAWNBench competition](https://www.theverge.com/2018/5/7/17316010/fast-ai-speed-test-stanford-dawnbench-google-intel), one of the most important innovations was something very simple: start training using small images, and end training using large images. By spending most of the epochs training with small images, training completed much faster. By completing training using large images, the final accuracy was much higher. We call this approach *progressive resizing*."
"When fast.ai and its team of students [won the DAWNBench competition](https://www.theverge.com/2018/5/7/17316010/fast-ai-speed-test-stanford-dawnbench-google-intel) in 2018, one of the most important innovations was something very simple: start training using small images, and end training using large images. Spending most of the epochs training with small images, helps training complete much faster. Completing training using large images makes the final accuracy much higher. We call this approach *progressive resizing*."
]
},
{
@ -389,15 +401,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen, the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image early layers find things like edges and gradients, and later layers may find things like noses and sunsets. So, when we change image size in the middle of training, it doesn't mean that we have two find totally different parameters for our model.\n",
"As we have seen, the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image—early layers find things like edges and gradients, and later layers may find things like noses and sunsets. So, when we change image size in the middle of training, it doesn't mean that we have to find totally different parameters for our model.\n",
"\n",
"But clearly there are some differences between small images and big ones, so we shouldn't expect our model to continue working exactly as well, with no changes at all. Does this remind you of something? When we developed this idea, it reminded us of transfer learning! We are trying to get our model to learn to do something a little bit different to what it has learned to do before. Therefore, we should be able to use the `fine_tune` method after we resize our images.\n",
"But clearly there are some differences between small images and big ones, so we shouldn't expect our model to continue working exactly as well, with no changes at all. Does this remind you of something? When we developed this idea, it reminded us of transfer learning! We are trying to get our model to learn to do something a little bit different from what it has learned to do before. Therefore, we should be able to use the `fine_tune` method after we resize our images.\n",
"\n",
"There is an additional benefit to progressive resizing: it is another form of data augmentation. Therefore, you should expect to see better generalisation of your models that are trained with progressive resizing.\n",
"There is an additional benefit to progressive resizing: it is another form of data augmentation. Therefore, you should expect to see better generalization of your models that are trained with progressive resizing.\n",
"\n",
"To implement progressive resizing it is most convenient if you first create a `get_dls` function which takes an image size and a batch size, and returns your `DataLoaders`:\n",
"To implement progressive resizing it is most convenient if you first create a `get_dls` function which takes an image size and a batch size as we did in the section before, and returns your `DataLoaders`:\n",
"\n",
"Now you can create your `DataLoaders` with a small size, and `fit_one_cycle` in the usual way, for a few less epochs than you might otherwise do:"
"Now you can create your `DataLoaders` with a small size and use `fit_one_cycle` in the usual way, training for a few less epochs than you might otherwise do:"
]
},
{
@ -460,7 +472,8 @@
],
"source": [
"dls = get_dls(128, 128)\n",
"learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)\n",
"learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(), \n",
" metrics=accuracy)\n",
"learn.fit_one_cycle(4, 3e-3)"
]
},
@ -468,7 +481,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then you can replace the DataLoaders inside the Learner, and `fine_tune`:"
"Then you can replace the `DataLoaders` inside the `Learner`, and fine-tune:"
]
},
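{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of that step (the size, batch size, epoch count, and learning rate are assumptions):\n",
"\n",
"```python\n",
"learn.dls = get_dls(64, 224)   # swap in larger images, with a smaller batch size to fit in memory\n",
"learn.fine_tune(5, 1e-3)       # continue training the same model at the new resolution\n",
"```"
]
},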
{
@ -578,56 +591,49 @@
"source": [
"As you can see, we're getting much better performance, and the initial training on small images was much faster on each epoch.\n",
"\n",
"You can repeat the process of increasing size and training more epochs as many times as you like, for as big an image as you wish--but of course, you will not get any benefit by using an image size larger than the size of your images on disk.\n",
"You can repeat the process of increasing size and training more epochs as many times as you like, for as big an image as you wishbut of course, you will not get any benefit by using an image size larger than the size of your images on disk.\n",
"\n",
"Note that for transfer learning, progressive resizing may actually hurt performance. This would happen if your pretrained model was quite similar to your transfer learning task and dataset, and was trained on similar sized images, so the weights don't need to be changed much. In that case, training on smaller images may damage the pretrained weights.\n",
"Note that for transfer learning, progressive resizing may actually hurt performance. This is most likely to happen if your pretrained model was quite similar to your transfer learning task and dataset and was trained on similar-sized images, so the weights don't need to be changed much. In that case, training on smaller images may damage the pretrained weights.\n",
"\n",
"On the other hand, if the transfer learning task is going to be on images that are of different sizes, shapes, or style to those used in the pretraining tasks, progressive resizing will probably help. As always, the answer to \"does it help?\" is \"try it!\".\n",
"On the other hand, if the transfer learning task is going to use images that are of different sizes, shapes, or styles than those used in the pretraining task, progressive resizing will probably help. As always, the answer to \"Will it help?\" is \"Try it!\"\n",
"\n",
"Another thing we could try is applying data augmentation to the validation set: up until now, we have only applied it on the training set and the validation set always gets the same images. But maybe we could try to make predictions for a few augmented versions of the validation set and average them. This is called *test time augmentation*."
"Another thing we could try is applying data augmentation to the validation set. Up until now, we have only applied it on the training set; the validation set always gets the same images. But maybe we could try to make predictions for a few augmented versions of the validation set and average them. We'll consider this approach next."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test time augmentation"
"## Test Time Augmentation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have been using random cropping as a way to get some useful data augmentation, which leads to better generalisation, and results in a need for less training data. When we use random cropping, fastai will automatically use centre-cropping for the validation set — that is, it will select the largest square area it can in the centre of the image, such that it does not go past the image edges.\n",
"We have been using random cropping as a way to get some useful data augmentation, which leads to better generalization, and results in a need for less training data. When we use random cropping, fastai will automatically use center cropping for the validation set—that is, it will select the largest square area it can in the center of the image, without going past the image's edges.\n",
"\n",
"This can often be problematic. For instance, in a multi-label dataset sometimes there are small objects towards the edges of an image; these could be entirely cropped out by the centre cropping. Even for datasets such as the pet breed classification data we're working on now, it's possible that some critical feature necessary for identifying the correct breed, such as the colour of the nose, could be cropped out.\n",
"This can often be problematic. For instance, in a multi-label dataset sometimes there are small objects toward the edges of an image; these could be entirely cropped out by center cropping. Even for problems such as our pet breed classification example, it's possible that some critical feature necessary for identifying the correct breed, such as the color of the nose, could be cropped out.\n",
"\n",
"One solution to this is to avoid random cropping entirely. Instead, we could simply squish or stretch the rectangular images to fit into a square space. But then we miss out on a very useful data augmentation, and we also make the image recognition more difficult for our model, because it has to learn how to recognise squished and squeezed images, rather than just correctly proportioned images.\n",
"One solution to this problem is to avoid random cropping entirely. Instead, we could simply squish or stretch the rectangular images to fit into a square space. But then we miss out on a very useful data augmentation, and we also make the image recognition more difficult for our model, because it has to learn how to recognize squished and squeezed images, rather than just correctly proportioned images.\n",
"\n",
"Another solution is to not just centre crop for validation, but instead to select a number of areas to crop from the original rectangular image, pass each of them through our model, and take the maximum or average of the predictions. In fact, we could do this not just for different crops, but for different values across all of our test time augmentation parameters. This is known as *test time augmentation* (TTA)."
"Another solution is to not just center crop for validation, but instead to select a number of areas to crop from the original rectangular image, pass each of them through our model, and take the maximum or average of the predictions. In fact, we could do this not just for different crops, but for different values across all of our test time augmentation parameters. This is known as *test time augmentation* (TTA)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: test time augmentation (TTA): during inference or validation, creating multiple versions of each image, using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image"
"> jargon: test time augmentation (TTA): During inference or validation, creating multiple versions of each image, using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK pic of TTA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the dataset, test time augmentation can result in dramatic improvements in accuracy. It does not change the time required to train at all, but will increase the amount of time for validation or inference by the number of test time augmented images requested. By default, fastai will use the unaugmented centre crop image, plus four randomly augmented images.\n",
"Depending on the dataset, test time augmentation can result in dramatic improvements in accuracy. It does not change the time required to train at all, but will increase the amount of time required for validation or inference by the number of test-time-augmented images requested. By default, fastai will use the unaugmented center crop image plus four randomly augmented images.\n",
"\n",
"You can pass any DataLoader to fastai's `tta` method; by default, it will use your validation set:"
"You can pass any `DataLoader` to fastai's `tta` method; by default, it will use your validation set:"
]
},
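{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that call, scoring the averaged predictions with the same `accuracy` metric used above:\n",
"\n",
"```python\n",
"preds, targs = learn.tta()       # predictions averaged over augmented versions of the validation set\n",
"accuracy(preds, targs).item()    # compare this against the plain validation accuracy\n",
"```"
]
},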
{
@ -705,9 +711,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, using TTA gives us good a boost of performance, with no additional training required. However, it does make inference slower--if you're averaging 5 images for TTA, inference will be 5x slower.\n",
"As we can see, using TTA gives us good a boost in performance, with no additional training required. However, it does make inference slower—if you're averaging five images for TTA, inference will be five times slower.\n",
"\n",
"Data augmentation helps train better models as we saw. Let's now focus on a new data augmentation technique called *Mixup*."
"We've seen examples of how data augmentation helps train better models. Let's now focus on a new data augmentation technique called *Mixup*."
]
},
{
@ -721,16 +727,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Mixup, introduced in the 2017 paper [mixup: Beyond Empirical Risk Minimization](https://arxiv.org/abs/1710.09412), is a very powerful data augmentation technique which can provide dramatically higher accuracy, especially when you don't have much data, and don't have a pretrained model that was trained on data similar to your dataset. The paper explains: \"While data augmentation consistently leads to improved generalization, the procedure is dataset-dependent, and thus requires the use of expert knowledge.\" For instance, it's common to flip images as part of data augmentation, but should you flip only horizontally, or also vertically? The answer is that it depends on your dataset. In addition, if flipping (for instance) doesn't provide enough data augmentation for you, you can't \"flip more\". It's helpful to have data augmentation techniques where you can \"dial up\" or \"dial down\" the amount of data augmentation, to see what works best for you.\n",
"Mixup, introduced in the 2017 paper [\"*mixup*: Beyond Empirical Risk Minimization\"](https://arxiv.org/abs/1710.09412) byHongyi Zhang et al., is a very powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don't have much data and don't have a pretrained model that was trained on data similar to your dataset. The paper explains: \"While data augmentation consistently leads to improved generalization, the procedure is dataset-dependent, and thus requires the use of expert knowledge.\" For instance, it's common to flip images as part of data augmentation, but should you flip only horizontally, or also vertically? The answer is that it depends on your dataset. In addition, if flipping (for instance) doesn't provide enough data augmentation for you, you can't \"flip more.\" It's helpful to have data augmentation techniques where you can \"dial up\" or \"dial down\" the amount of change, to see what works best for you.\n",
"\n",
"Mixup works as follows, for each image:\n",
"\n",
"1. Select another image from your dataset at random\n",
"1. Pick a weight at random\n",
"1. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable\n",
"1. Take a weighted average (with the same weight) of this image's labels with your image's labels; this will be your dependent variable\n",
"1. Select another image from your dataset at random.\n",
"1. Pick a weight at random.\n",
"1. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable.\n",
"1. Take a weighted average (with the same weight) of this image's labels with your image's labels; this will be your dependent variable.\n",
"\n",
"In pseudo-code, we're doing (where `t` is the weight for our weighted average):\n",
"In pseudocode, we're doing this (where `t` is the weight for our weighted average):\n",
"\n",
"```\n",
"image2,target2 = dataset[randint(0,len(dataset)]\n",
@ -739,7 +745,7 @@
"new_target = t * target1 + (1-t) * target2\n",
"```\n",
"\n",
"For this to work, our targets need to be one-hot encoded. The paper describes this using these equations (where $\\lambda$ is the same as `t` in our code above):"
"For this to work, our targets need to be one-hot encoded. The paper describes this using the equations shown in <<mixup>> where $\\lambda$ is the same as `t` in our pseudocode:"
]
},
{
@ -753,16 +759,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sidebar: Papers and math"
"### Sidebar: Papers and Math"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to be looking at more and more research papers from here on in the book. Now that you have the basic jargon, you might be surprised to discover how much of them you can understand, with a little practice! One issue you'll notice is that greek letters, such as $\\lambda$, appear in most papers. It's a very good idea to learn the names of all the greek letters, since otherwise it's very hard to read the papers to yourself, and remember them, and it's also hard to read code based on them (since code often uses the name of the greek letter spelled out, such as `lambda`).\n",
"We're going to be looking at more and more research papers from here on in the book. Now that you have the basic jargon, you might be surprised to discover how much of them you can understand, with a little practice! One issue you'll notice is that Greek letters, such as $\\lambda$, appear in most papers. It's a very good idea to learn the names of all the Greek letters, since otherwise it's very hard to read the papers to yourself, and remember them (or to read code based on them, since code often uses the names of the Greek letters spelled out, such as `lambda`).\n",
"\n",
"The bigger issue with papers is that they use math, instead of code, to explain what's going on. If you don't have much of a math background, this will likely be intimidating and confusing at first. But remember: what is being shown in the math, is something that will be implemented in code. It's just another way of talking about the same thing! After reading a few papers, you'll pick up more and more of the notation. If you don't know what a symbol is, try looking it up on Wikipedia's [list of mathematical symbols](https://en.wikipedia.org/wiki/List_of_mathematical_symbols) or draw it on [Detexify](http://detexify.kirelabs.org/classify.html) which (using machine learning!) will find the name of your hard-drawn symbol. Then you can search online for that name to find out what it's for."
"The bigger issue with papers is that they use math, instead of code, to explain what's going on. If you don't have much of a math background, this will likely be intimidating and confusing at first. But remember: what is being shown in the math, is something that will be implemented in code. It's just another way of talking about the same thing! After reading a few papers, you'll pick up more and more of the notation. If you don't know what a symbol is, try looking it up in Wikipedia's [list of mathematical symbols](https://en.wikipedia.org/wiki/List_of_mathematical_symbols) or drawing it in [Detexify](http://detexify.kirelabs.org/classify.html), which (using machine learning!) will find the name of your hand-drawn symbol. Then you can search online for that name to find out what it's for."
]
},
{
@ -776,7 +782,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what it looks like when we take a *linear combination* of images, as done in Mixup:"
"<<mixup_example>> shows what it looks like when we take a *linear combination* of images, as done in Mixup."
]
},
{
@ -802,7 +808,7 @@
"source": [
"#hide_input\n",
"#id mixup_example\n",
"#caption Mixing a chruch and a gas station\n",
"#caption Mixing a church and a gas station\n",
"#alt An image of a church, a gas station and the two mixed up.\n",
"church = PILImage.create(get_image_files_sorted(path/'train'/'n03028079')[0])\n",
"gas = PILImage.create(get_image_files_sorted(path/'train'/'n03425413')[0])\n",
@ -821,11 +827,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The third image is built by adding 0.3 times the first one and 0.7 times the second. In this example, should the model predict church? gas station? The right answer is 30% church and 70% gas station since that's what we'll get if we take the linear combination of the one-hot encoded targets. For instance, if *church* has for index 2 and *gas station* as for index 7, the one-hot-encoded representations are\n",
"The third image is built by adding 0.3 times the first one and 0.7 times the second. In this example, should the model predict \"church\" or \"gas station\"? The right answer is 30% church and 70% gas station, since that's what we'll get if we take the linear combination of the one-hot-encoded targets. For instance, suppose we have 10 classes and \"church\" is represented by the index 2 and \"gas station\" is reprsented by the index 7, the one-hot-encoded representations are:\n",
"```\n",
"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]\n",
"```\n",
"(since we have ten classes in total) so our final target is\n",
"so our final target is:\n",
"```\n",
"[0, 0, 0.3, 0, 0, 0, 0, 0.7, 0, 0]\n",
"```"
@ -835,13 +841,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This all done for us inside fastai by adding a `Callback` to our `Learner`. `Callback`s are what is used inside fastai to inject custom behavior in the training loop (like a learning rate schedule, or training in mixed precision). We'll be learning all about callbacks, including how to make your own, in <<chapter_callbacks>>. For now, all you need to know is that you use the `cbs` parameter to `Learner` to pass callbacks.\n",
"This all done for us inside fastai by adding a *callback* to our `Learner`. `Callback`s are what is used inside fastai to inject custom behavior in the training loop (like a learning rate schedule, or training in mixed precision). We'll be learning all about callbacks, including how to make your own, in <<chapter_accel_sgd>>. For now, all you need to know is that you use the `cbs` parameter to `Learner` to pass callbacks.\n",
"\n",
"Here is how you train a model with Mixup:\n",
"Here is how we train a model with Mixup:\n",
"\n",
"```\n",
"```python\n",
"model = xresnet50()\n",
"learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=Mixup)\n",
"learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), \n",
" metrics=accuracy, cbs=Mixup)\n",
"learn.fit_one_cycle(5, 3e-3)\n",
"```"
]
@ -850,70 +857,70 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So what happens if we train a model where our data is \"mixed up\" in this way? Clearly, it's going to be harder to train, because it's harder to see what's in each image. And the model has to predict two labels per image, rather than just one, as well as figuring out how much each one is weighted. Overfitting seems less likely to be a problem, because we're not showing the same image each epoch, but are instead showing a random combination of two images.\n",
"What happens when we train a model with data that's \"mixed up\" in this way? Clearly, it's going to be harder to train, because it's harder to see what's in each image. And the model has to predict two labels per image, rather than just one, as well as figuring out how much each one is weighted. Overfitting seems less likely to be a problem, however, because we're not showing the same image in each epoch, but are instead showing a random combination of two images.\n",
"\n",
"Mixup requires far more epochs to train to a better accuracy, compared to other augmentation approaches we've seen. You can try training Imagenette with and without Mixup by using the `examples/train_imagenette.py` script in the fastai repo. At the time of writing, the leaderboard in the [Imagenette repo](https://github.com/fastai/imagenette/) is showing that mixup is used for all leading results for trainings of >80 epochs, and for few epochs Mixup is not being used. This is inline with our experience of using Mixup too.\n",
"Mixup requires far more epochs to train to get better accuracy, compared to other augmentation approaches we've seen. You can try training Imagenette with and without Mixup by using the *examples/train_imagenette.py* script in the [fastai repo](https://github.com/fastai/fastai). At the time of writing, the leaderboard in the [Imagenette repo](https://github.com/fastai/imagenette/) is showing that Mixup is used for all leading results for trainings of >80 epochs, and for fewer epochs Mixup is not being used. This is in line with our experience of using Mixup too.\n",
"\n",
"One of the reasons that mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using mixup on activations *inside* their model, not just on inputs--these allows Mixup to be used for NLP and other data types too.\n",
"One of the reasons that Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations *inside* their models, not just on inputs—this allows Mixup to be used for NLP and other data types too.\n",
"\n",
"There's another subtle issue that Mixup deals with for us, which is that it's not actually possible with the models we've seen before for our loss to ever be perfect. The problem is that our labels are ones and zeros, but softmax and sigmoid *never* can equal one or zero. So when we train our model, it causes it to push our activations ever closer to zero and one, such that the more epochs we do, the more extreme our activations become.\n",
"There's another subtle issue that Mixup deals with for us, which is that it's not actually possible with the models we've seen before for our loss to ever be perfect. The problem is that our labels are 1s and 0s, but the outputs of softmax and sigmoid can never equal 1 or 0. This means training our model pushes our activations ever closer to those values, such that the more epochs we do, the more extreme our activations become.\n",
"\n",
"With Mixup, we no longer have that problem, because our labels will only be exactly one or zero if we happen to \"mix\" with another image of the same class. The rest of the time, our labels will be a linear combination, such as the 0.7 and 0.3 we got in the church and gas station example above.\n",
"With Mixup we no longer have that problem, because our labels will only be exactly 1 or 0 if we happen to \"mix\" with another image of the same class. The rest of the time our labels will be a linear combination, such as the 0.7 and 0.3 we got in the church and gas station example earlier.\n",
"\n",
"One issue with this, however, is that Mixup is \"accidentally\" making the labels bigger than zero, or smaller than one. That is to say, we're not *explicitly* telling our model that we want to change the labels in this way. So if we want to change to make the labels closer, or further away, from zero and one, we have to change the amount of Mixup--which also changes the amount of data augmentation, which might not be what we want. There is, however, a way to handle this more directly, which is to use *label smoothing*."
"One issue with this, however, is that Mixup is \"accidentally\" making the labels bigger than 0, or smaller than 1. That is to say, we're not *explicitly* telling our model that we want to change the labels in this way. So, if we want to change to make the labels closer to, or further away from 0 and 1, we have to change the amount of Mixup—which also changes the amount of data augmentation, which might not be what we want. There is, however, a way to handle this more directly, which is to use *label smoothing*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Label smoothing"
"## Label Smoothing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the theoretical expression of the loss, in classification problems, our targets are one-hot encoded (in practice we tend to avoid doing it to save memory, but what we compute is the same loss as if we had used one-hot encoding). That means the model is trained to return 0 for all categories but one, for which it is trained to return 1. Even 0.999 is not *good enough*, the model will get gradients and learn to predict activations that are even more confident. This encourages overfitting and gives you at inference time a model that is not going to give meaningful probabilities: it will always say 1 for the predicted category even if it's not too sure, just because it was trained this way.\n",
"In the theoretical expression of loss, in classification problems, our targets are one-hot encoded (in practice we tend to avoid doing this to save memory, but what we compute is the same loss as if we had used one-hot encoding). That means the model is trained to return 0 for all categories but one, for which it is trained to return 1. Even 0.999 is not \"good enough\", the model will get gradients and learn to predict activations with even higher confidence. This encourages overfitting and gives you at inference time a model that is not going to give meaningful probabilities: it will always say 1 for the predicted category even if it's not too sure, just because it was trained this way.\n",
"\n",
"It can become very harmful if your data is not perfectly labeled. In the bear classifier we studied in <<chapter_production>>, we saw that some of the images were mislabeled, or contained two different kinds of bears. In general, your data will never be perfect. Even if the labels were manually produced by humans, they could make mistakes, or have differences of opinions on images harder to label.\n",
"This can become very harmful if your data is not perfectly labeled. In the bear classifier we studied in <<chapter_production>>, we saw that some of the images were mislabeled, or contained two different kinds of bears. In general, your data will never be perfect. Even if the labels were manually produced by humans, they could make mistakes, or have differences of opinions on images that are harder to label.\n",
"\n",
"Instead, we could replace all our `1`s by a number a bit less than `1`, and our `0`s by a number a bit more than `0`, and then train. This is called *label smoothing*. By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data, and will produce a model that generalizes better at inference.\n",
"Instead, we could replace all our 1s with a number a bit less than 1, and our 0s by a number a bit more than 0, and then train. This is called *label smoothing*. By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better.\n",
"\n",
"This is how label smoothing works in practice: we start with one-hot encoded labels, then replace all zeros by $\\frac{\\epsilon}{N}$ (that's the greek letter *epsilon*, which is what was used in the [paper which introduced label smoothing](https://arxiv.org/abs/1512.00567), and is used in the fastai code) where $N$ is the number of classes and $\\epsilon$ is a parameter (usually 0.1, which would mean we are 10% unsure of our labels). Since you want the labels to add up to 1, replace the 1 by $1-\\epsilon + \\frac{\\epsilon}{N}$. This way, we don't encourage the model to predict something overconfident: in our Imagenette example where we have 10 classes, the targets become something like:\n",
"This is how label smoothing works in practice: we start with one-hot-encoded labels, then replace all 0s with $\\frac{\\epsilon}{N}$ (that's the Greek letter *epsilon*, which is what was used in the [paper that introduced label smoothing](https://arxiv.org/abs/1512.00567) and is used in the fastai code), where $N$ is the number of classes and $\\epsilon$ is a parameter (usually 0.1, which would mean we are 10% unsure of our labels). Since we want the labels to add up to 1, replace the 1 by $1-\\epsilon + \\frac{\\epsilon}{N}$. This way, we don't encourage the model to predict something overconfidently. In our Imagenette example where we have 10 classes, the targets become something like (here for a target that corresponds to the index 3):\n",
"```\n",
"[0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]\n",
"```\n",
"(here for a target that corresponds to the index 3). In practice, we don't want to one-hot encode the labels, and fortunately we won't need too (the one-hot encoding is just good to explain what label smoothing is and visualize it)."
"In practice, we don't want to one-hot encode the labels, and fortunately we won't need to (the one-hot encoding is just good to explain what label smoothing is and visualize it)."
]
},
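{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the arithmetic concrete, here is a small illustration (the helper function below is hypothetical, just for visualization, and not part of fastai):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"def smoothed_target(idx, n_classes=10, eps=0.1):\n",
"    t = torch.full((n_classes,), eps / n_classes)   # every class starts at eps/N\n",
"    t[idx] = 1 - eps + eps / n_classes              # the true class gets 1 - eps + eps/N\n",
"    return t\n",
"\n",
"smoothed_target(3)\n",
"# tensor([0.0100, 0.0100, 0.0100, 0.9100, 0.0100, 0.0100, 0.0100, 0.0100, 0.0100, 0.0100])\n",
"```"
]
},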
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sidebar: Label smoothing, the paper"
"### Sidebar: Label Smoothing, the Paper"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is how the reasoning behind label smoothing was explained in the paper:\n",
"Here is how the reasoning behind label smoothing was explained in the paper by Christian Szegedy et al.:\n",
"\n",
"\"This maximum is not achievable for finite $z_k$ but is approached if $z_y\\gg z_k$ for all $k\\neq y$ -- that is, if the logit corresponding to the ground-truth label is much great than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\\frac{\\partial\\ell}{\\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.\""
"> : This maximum is not achievable for finite $z_k$ but is approached if $z_y\\gg z_k$ for all $k\\neq y$that is, if the logit corresponding to the ground-truth label is much great than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\\frac{\\partial\\ell}{\\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's practice our paper reading skills to try to interpret this. \"This maximum\" is refering to the previous section of the paper, which talked about the fact that `1` is the value of the label for the positive class. So any value (except infinity) can't result in `1` after sigmoid or softmax. In a paper, you won't normally see \"any value\" written, but instead it would get a symbol; in this case, it's $z_k$. This is helpful in a paper, because it can be refered to again later, and the reader knows what value is being discussed.\n",
"Let's practice our paper-reading skills to try to interpret this. \"This maximum\" is refering to the previous part of the paragraph, which talked about the fact that 1 is the value of the label for the positive class. So it's not possible for any value (except infinity) to result in 1 after sigmoid or softmax. In a paper, you won't normally see \"any value\" written; instead it will get a symbol, which in this case is $z_k$. This shorthand is helpful in a paper, because it can be refered to again later and the reader will know what value is being discussed.\n",
"\n",
"The it says: $z_y\\gg z_k$ for all $k\\neq y$. In this case, the paper immediately follows with \"that is...\", which is handy, because you can just read the English instead of the math. In the math, the $y$ is refering to the target ($y$ is defined earlier in the paper; sometimes it's hard to find where symbols are defined, but nearly all papers will define all their symbols somewhere), and $z_y$ is the activation corresponding to the target. So to get close to `1`, this activation needs to be much higher than all the others for that prediction.\n",
"Then it says \"if $z_y\\gg z_k$ for all $k\\neq y$.\" In this case, the paper immediately follows the math with an English description, which is handy because you can just read that. In the math, the $y$ is refering to the target ($y$ is defined earlier in the paper; sometimes it's hard to find where symbols are defined, but nearly all papers will define all their symbols somewhere), and $z_y$ is the activation corresponding to the target. So to get close to 1, this activation needs to be much higher than all the others for that prediction.\n",
"\n",
"Next up is \"if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize\". This is saying that making $z_y$ really big means we'll need large weights and large activations throughout our model. Large weights lead to \"bumpy\" functions, where a small change in input results in a big change to predictions. This is really bad for generalization, because it means just one pixel changing a bit could change our prediction entirely!\n",
"Next, consider the statement \"if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize.\" This is saying that making $z_y$ really big means we'll need large weights and large activations throughout our model. Large weights lead to \"bumpy\" functions, where a small change in input results in a big change to predictions. This is really bad for generalization, because it means just one pixel changing a bit could change our prediction entirely!\n",
"\n",
"Finally, we have \"it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\\frac{\\partial\\ell}{\\partial z_k}$, reduces the ability of the model to adapt\". The gradient of cross entropy, remember, is basically `output-target`, and both `output` and `target` are between zero and one. So the difference is between `-1` and `1`, which is why the paper says the gradient is \"bounded\" (it can't be infinite). Therefore our SGD steps are bounded too. \"Reduces the ability of the model to adapt\" means that it is hard for it to be updated in a transfer learning setting. This follows because the difference in loss due to incorrect predictions is unbounded, but we can only take a limited step each time."
"Finally, we have \"it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\\frac{\\partial\\ell}{\\partial z_k}$, reduces the ability of the model to adapt.\" The gradient of cross-entropy, remember, is basically `output - target`. Both `output` and `target` are between 0 and 1, so the difference is between `-1` and `1`, which is why the paper says the gradient is \"bounded\" (it can't be infinite). Therefore our SGD steps are bounded too. \"Reduces the ability of the model to adapt\" means that it is hard for it to be updated in a transfer learning setting. This follows because the difference in loss due to incorrect predictions is unbounded, but we can only take a limited step each time."
]
},
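{
"cell_type": "markdown",
"metadata": {},
"source": [
"To spell out where that bound comes from: for softmax outputs $p = \\mathrm{softmax}(z)$ and a one-hot target $y$, the gradient of the cross-entropy loss with respect to each logit is $\\frac{\\partial\\ell}{\\partial z_k} = p_k - y_k$, and since both $p_k$ and $y_k$ lie between 0 and 1, each component of the gradient lies between -1 and 1."
]
},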
{
@ -927,15 +934,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To use it in practice, we just have to change the loss function in our call to `Learner`:\n",
"To use this in practice, we just have to change the loss function in our call to `Learner`:\n",
"\n",
"```python\n",
"model = xresnet50()\n",
"learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)\n",
"learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), \n",
" metrics=accuracy)\n",
"learn.fit_one_cycle(5, 3e-3)\n",
"```\n",
"\n",
"Like Mixup, you won't generally see significant improvements from label smoothing until you train more epochs. Try it yourself and see: how many epochs do you have to train before label smoothing shows an improvement?"
"Like with Mixup, you won't generally see significant improvements from label smoothing until you train more epochs. Try it yourself and see: how many epochs do you have to train before label smoothing shows an improvement?"
]
},
{
@ -953,7 +961,7 @@
"\n",
"Most importantly, remember that if your dataset is big, there is no point prototyping on the whole thing. Find a small subset that is representative of the whole, like we did with Imagenette, and experiment on it.\n",
"\n",
"In the next three chapters, we will look at the other applications directly supported by fastai: collaborative filtering, tabular and text. We will go back to computer vision in the next section of the book, with a deep dive in convolutional neural networks in <<chapter_convolutions>>. "
"In the next three chapters, we will look at the other applications directly supported by fastai: collaborative filtering, tabular modeling and working with text. We will go back to computer vision in the next section of the book, with a deep dive into convolutional neural networks in <<chapter_convolutions>>. "
]
},
{
@ -976,23 +984,23 @@
"1. Is using TTA at inference slower or faster than regular inference? Why?\n",
"1. What is Mixup? How do you use it in fastai?\n",
"1. Why does Mixup prevent the model from being too confident?\n",
"1. Why does a training with Mixup for 5 epochs end up worse than a training without Mixup?\n",
"1. Why does training with Mixup for five epochs end up worse than training without Mixup?\n",
"1. What is the idea behind label smoothing?\n",
"1. What problems in your data can label smoothing help with?\n",
"1. When using label smoothing with 5 categories, what is the target associated with the index 1?\n",
"1. What is the first step to take when you want to prototype quick experiments on a new dataset."
"1. When using label smoothing with five categories, what is the target associated with the index 1?\n",
"1. What is the first step to take when you want to prototype quick experiments on a new dataset?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research\n",
"### Further Research\n",
"\n",
"1. Use the fastai documentation to build a function that crops an image to a square in the four corners, then implement a TTA method that averages the predictions on a center crop and those four crops. Did it help? Is it better than the TTA method of fastai?\n",
"1. Find the Mixup paper on arxiv and read it. Pick one or two more recent articles introducing variants of Mixup and read them, then try to implement them on your problem.\n",
"1. Find the script training Imagenette using Mixup and use it as an example to build a script for a long training on your own project. Execute it and see if it helped.\n",
"1. Read the sidebar on the math of label smoothing, and look at the relevant section of the original paper, and see if you can follow it. Don't be afraid to ask for help!"
"1. Use the fastai documentation to build a function that crops an image to a square in each of the four corners, then implement a TTA method that averages the predictions on a center crop and those four crops. Did it help? Is it better than the TTA method of fastai?\n",
"1. Find the Mixup paper on arXiv and read it. Pick one or two more recent articles introducing variants of Mixup and read them, then try to implement them on your problem.\n",
"1. Find the script training Imagenette using Mixup and use it as an example to build a script for a long training on your own project. Execute it and see if it helps.\n",
"1. Read the sidebar \"Label Smoothing, the Paper\", look at the relevant section of the original paper and see if you can follow it. Don't be afraid to ask for help!"
]
},
{

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

1310
11_midlevel_data.ipynb Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

2365
12_nlp_dive.ipynb Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,5 +1,17 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"!pip install -Uqq fastbook\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -9,7 +21,7 @@
"outputs": [],
"source": [
"#hide\n",
"from utils import *"
"from fastbook import *"
]
},
{
@ -23,23 +35,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Resnets"
"# ResNets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Going back to Imagenette"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's going to be tough to judge any improvement we do to our models when we are already at an accuracy that is as high as we saw on MNIST in the previous chapter, so we will tackle a tougher problem by going back to Imagenette. We'll stick with small images to keep things reasonably fast.\n",
"In this chapter, we will build on top of the CNNs introduced in the previous chapter and explain to you the ResNet (residual network) architecture. It was introduced in 2015 by Kaiming He et al. in the article [\"Deep Residual Learning for Image Recognition\"](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in image models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
"\n",
"Let's grab the data--we'll use the already-resized 160px version to make things faster still, and will random crop to 128px:"
"We will first show you the basic ResNet as it was first designed, then explain to you what modern tweaks make it more performant. But first, we will need a problem a little bit more difficult than the MNIST dataset, since we are already close to 100% accuracy with a regular CNN on it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Going Back to Imagenette"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's going to be tough to judge any improvements we make to our models when we are already at an accuracy that is as high as we saw on MNIST in the previous chapter, so we will tackle a tougher image classification problem by going back to Imagenette. We'll stick with small images to keep things reasonably fast.\n",
"\n",
"Let's grab the data—we'll use the already-resized 160 px version to make things faster still, and will random crop to 128 px:"
]
},
{
@ -94,14 +115,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images, and later on would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"When we looked at MNIST we were dealing with 28×28-pixel images. For Imagenette we are going to be training with 128×128-pixel images. Later, we would like to be able to use larger images as well—at least as big as 224×224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"\n",
"The approach we used was to ensure that there was enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
"The approach we used was to ensure that there were enough stride-2convolutions such that the final layer would have a grid size of 1. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so, a matrix of activations for a mini-batch). We could do the same thing for Imagenette, but that's would cause two problems:\n",
"\n",
"- We are going to need lots of stride two layers to make our grid one by one at the end — perhaps more than we would otherwise choose\n",
"- The model will not work on images of any size other than the size we originally trained on.\n",
"- We'd need lots of stride-2 layers to make our grid 1×1 at the end—perhaps more than we would otherwise choose.\n",
"- The model would not work on images of any size other than the size we originally trained on.\n",
"\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous, still sometimes used today, is the 2013 ImageNet winner VGG. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than 1×1. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always took. The most famous example is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only did it not work with images other than those of the same size used in the training set, but it required a lot of memory, because flattening out the convolutional layer resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"\n",
"This problem was solved through the creation of *fully convolutional networks*. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:"
]
@ -119,9 +140,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see, it is taking the mean over the X and Y axes. This function will always convert a grid of activations into a single activation per image. PyTorch provides a slightly more versatile module called `nn.AdaptiveAvgPool2d`, which averages a grid of activations into whatever sized destination you require (although we nearly always use the size of one).\n",
"As you see, it is taking the mean over the x- and y-axes. This function will always convert a grid of activations into a single activation per image. PyTorch provides a slightly more versatile module called `nn.AdaptiveAvgPool2d`, which averages a grid of activations into whatever sized destination you require (although we nearly always use a size of 1).\n",
"\n",
"A fully convolutional network, therefore, has a number of convolutional layers, some of which will be stride two, at the end of which is an adaptive average pooling layer, a flatten layer to remove the unit axes, and finally a linear layer. Here is our first fully convolutional network:"
"A fully convolutional network, therefore, has a number of convolutional layers, some of which will be stride 2, at the end of which is an adaptive average pooling layer, a flatten layer to remove the unit axes, and finally a linear layer. Here is our first fully convolutional network:"
]
},
{
@ -147,25 +168,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to be replacing the implementation of `block` in the network with other variants in a moment, which is why we're not calling it `conv` any more. We're saving some time by taking advantage of fastai's `ConvLayer` that already provides the functionality of `conv` from the last chapter (plus a lot more!)"
"We're going to be replacing the implementation of `block` in the network with other variants in a moment, which is why we're not calling it `conv` any more. We're also saving some time by taking advantage of fastai's `ConvLayer`, which that already provides the functionality of `conv` from the last chapter (plus a lot more!)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> stop: Consider this question: Would this approach makes sense for an optical character recognition (OCR) problem such as MNIST? We see the vast majority of practitioners tackling OCR and similar problems tend to use fully convolutional networks, because that's what nearly everybody learns nowadays. But it really doesn't make any sense! You can't decide whether, for instance, whether a number is a \"3\" or an \"8\" by slicing it into small pieces, jumbling them up, and deciding whether on average each piece looks like a \"3\" or an \"8\". But that's what adaptive average pooling effectively does! Fully convolutional networks are only really a good choice for objects that don't have a single correct orientation or size (i.e. like most natural photos)."
"> stop: Consider this question: would this approach makes sense for an optical character recognition (OCR) problem such as MNIST? The vast majority of practitioners tackling OCR and similar problems tend to use fully convolutional networks, because that's what nearly everybody learns nowadays. But it really doesn't make any sense! You can't decide, for instance, whether a number is a 3 or an 8 by slicing it into small pieces, jumbling them up, and deciding whether on average each piece looks like a 3 or an 8. But that's what adaptive average pooling effectively does! Fully convolutional networks are only really a good choice for objects that don't have a single correct orientation or size (e.g., like most natural photos)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing `1 x 1` dimension like we did in our previous model. \n",
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height, and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing 1×1 dimension like we did in our previous model. \n",
"\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size. For instance, max pooling layers of size 2, which were very popular in older CNNs, reduce the size of our image by half on each dimension by taking the maximum of each 2×2 window (with a stride of 2).\n",
"\n",
"As before, we can define a `Learner` with our custom model and the data we grabbed before then train it:"
"As before, we can define a `Learner` with our custom model and then train it on the data we grabbed earlier:"
]
},
{
@ -227,7 +248,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`3e-3` is very often a good learning rate for CNNs, and that appears to be the case here too, so let's try that:"
"3e-3 is often a good learning rate for CNNs, and that appears to be the case here too, so let's try that:"
]
},
{
@ -303,72 +324,72 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a pretty good start, considering we have to pick the correct one of ten categories, and we're training from scratch for just 5 epochs!"
"That's a pretty good start, considering we have to pick the correct one of 10 categories, and we're training from scratch for just 5 epochs! We can do way better than this using a deeper mode, but just stacking new layers won't really improve our results (you can try and see for yourself!). To work around this problem, ResNets introduce the idea of *skip connections*. We'll explore those and other aspects of ResNets in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building a modern CNN: ResNet"
"## Building a Modern CNN: ResNet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We introduce the main idea behind them and show how it improves accuracy Imagenette compared to our previous model, before building a version with all the recent tweaks."
"We now have all the pieces we need to build the models we have been using in our computer vision tasks since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy on Imagenette compared to our previous model, before building a version with all the recent tweaks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Skip-connections"
"### Skip Connections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In 2015 the authors of ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
"In 2015, the authors of the ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using fewer layers—and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalization issue, but a training issue. As the paper explains:\n",
"\n",
"> : Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as [previously reported] and thoroughly verified by our experiments.\n",
"\n",
"This is the graph they showed, with training error on the left, and test on the right:"
"This phenomenon was illustrated by the graph in <<resnet_depth>>, with training error on the left and test error on the right."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Training of networks of different depth\" width=\"700\" caption=\"Training of networks of different depth\" id=\"resnet_depth\" src=\"images/att_00042.png\">"
"<img alt=\"Training of networks of different depth\" width=\"700\" caption=\"Training of networks of different depth (courtesy of Kaiming He et al.)\" id=\"resnet_depth\" src=\"images/att_00042.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the authors mention here, they are not the first people to have noticed this curious fact. But they were the 1st to make a very important leap:\n",
"As the authors mention here, they are not the first people to have noticed this curious fact. But they were the first to make a very important leap:\n",
"\n",
"> : Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model.\n",
"\n",
"Being an academic paper, this process written in a rather inaccessible way — but it's actually saying something very simple: start with the 20 layer neural network that is trained well, and add another 36 layers that do nothing at all (for instance, they linear layer with a single weight equal to one, and bias equal to 0). This would be a 56 layer network which does exactly the same thing as the 20 layer network. This shows that there are always deep networks which should be *at least as good* as any shallow network. But for some reason, SGD does not seem able to find them.\n",
"As this is an academic paper this process is described in a rather inaccessible way, but the concept is actually very simple: start with a 20-layer neural network that is trained well, and add another 36 layers that do nothing at all (for instance, they could be linear layers with a single weight equal to 1, and bias equal to 0). The result will be a 56-layer network that does exactly the same thing as the 20-layer network, proving that there are always deep networks that should be *at least as good* as any shallow network. But for some reason, SGD does not seem able to find them.\n",
"\n",
"> jargon: Identity mapping: a function that just returns its input without changing it at all. Also known as *identity function*.\n",
"> jargon: Identity mapping: Returning the input without changing it at all. This process is performed by an _identity function_.\n",
"\n",
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter which does a 2nd convolution, then relu, then batchnorm. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` for every one of these batchnorm layers to zero? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter that adds a second convolution, then a ReLU, then a batchnorm layer. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` to zero for every one of those final batchnorm layers? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
"\n",
"What has that gained us, then? The key thing is that those 36 extra layers, as they stand, are an *identity mapping*, but they have *parameters*, which means they are *trainable*. So, we can start with our best 20 layer model, add these 36 extra layers which initially do nothing at all, and then *fine tune the whole 56 layer model*. If those extra 36 layers can be useful, then they can learn parameters to do so!\n",
"What has that gained us? The key thing is that those 36 extra layers, as they stand, are an *identity mapping*, but they have *parameters*, which means they are *trainable*. So, we can start with our best 20-layer model, add these 36 extra layers which initially do nothing at all, and then *fine-tune the whole 56-layer model*. Those extra 36 layers can then learn the parameters that make them most useful.\n",
"\n",
"The ResNet paper actually proposed a variant of this, which is to instead \"skip over\" every 2nd convolution, so effectively we get `x+conv2(conv1(x))`. Or In diagram form (from the paper):"
"The ResNet paper actually proposed a variant of this, which is to instead \"skip over\" every second convolution, so effectively we get `x+conv2(conv1(x))`. This is shown by the diagram in <<resnet_block>> (from the paper)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"A simple ResNet block\" width=\"331\" caption=\"A simple ResNet block\" id=\"resnet_block\" src=\"images/att_00043.png\">"
"<img alt=\"A simple ResNet block\" width=\"331\" caption=\"A simple ResNet block (courtesy of Kaiming He et al.)\" id=\"resnet_block\" src=\"images/att_00043.png\">"
]
},
{
@ -377,24 +398,24 @@
"source": [
"That arrow on the right is just the `x` part of `x+conv2(conv1(x))`, and is known as the *identity branch* or *skip connection*. The path on the left is the `conv2(conv1(x))` part. You can think of the identity path as providing a direct route from the input to the output.\n",
"\n",
"In a ResNet, we don't actually train it by first training a smaller number of layers, and then add new layers on the end and fine-tune. Instead, we use ResNet blocks (like the above) throughout the CNN, initialized from scratch in the usual way, and trained with SGD in the usual way. We rely on the skip connections to make the network easier to train for SGD."
"In a ResNet, we don't actually proceed by first training a smaller number of layers, and then adding new layers on the end and fine-tuning. Instead, we use ResNet blocks like the one in <<resnet_block>> throughout the CNN, initialized from scratch in the usual way, and trained with SGD in the usual way. We rely on the skip connections to make the network easier to train with SGD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's another (largely equivalent) way to think of these \"ResNet blocks\". This is how the paper describes it:\n",
"There's another (largely equivalent) way to think of these ResNet blocks. This is how the paper describes it:\n",
"\n",
"> : Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.\n",
"\n",
"Again, this is rather inaccessible prose—so let's try to restate it in plain English! If the outcome of a given layer is `x`, when using a ResNet block that return `y = x+block(x)`, we're not asking the block to predict `y`, we are asking it to predict the difference between `y-x`. So the job of those blocks isn't to predict certain features anymore, but a little extra step that will minimize the error between `x` and the desired `y`. ResNet is, therefore, good at learning about slight differences between doing nothing and some other feature that the layer learns. Since we predict residuals (reminder: \"residual\" is predictions minus targets), this is why those kinds of models were named ResNets.\n",
"Again, this is rather inaccessible prose—so let's try to restate it in plain English! If the outcome of a given layer is `x`, when using a ResNet block that returns `y = x+block(x)` we're not asking the block to predict `y`, we are asking it to predict the difference between `y` and `x`. So the job of those blocks isn't to predict certain features, but to minimize the error between `x` and the desired `y`. A ResNet is, therefore, good at learning about slight differences between doing nothing and passing though a block of two convolutional layers (with trainable weights). This is how these models got their name: they're predicting residuals (reminder: \"residual\" is prediction minus target).\n",
"\n",
"One key concept that both of these two ways of thinking about ResNets share is the idea of \"easy to learn\". This is an important theme. Recall the universal approximation theorem, which states that a sufficiently large network *can* learn anything. This is still true. But there turns out to be a very important difference between what a network *can learn* in principle, and what it is *easy for it to learn* under realistic data and training regimes. Many of the advances in neural networks over the last decade have been like the ResNet block: the result of realizing how to make something which was always possible actually feasible.\n",
"One key concept that both of these two ways of thinking about ResNets share is the idea of ease of learning. This is an important theme. Recall the universal approximation theorem, which states that a sufficiently large network can learn anything. This is still true, but there turns out to be a very important difference between what a network *can learn* in principle, and what it is *easy for it to learn* with realistic data and training regimes. Many of the advances in neural networks over the last decade have been like the ResNet block: the result of realizing how to make something yjay was always possible actually feasible.\n",
"\n",
"> note: The original paper didn't actually do the trick of using zero for the initial value of gamma in the batchnorm layer; that came a couple of years later. So the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to \"navigate through\" the skip connections did indeed make it train better. Adding the batchnorm gamma init trick made the models train at even higher learning rates.\n",
"> note: True Identity Path: The original paper didn't actually do the trick of using zero for the initial value of `gamma` in the last batchnorm layer of each block; that came a couple of years later. So, the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to \"navigate through\" the skip connections did indeed make it train better. Adding the batchnorm `gamma` init trick made the models train at even higher learning rates.\n",
"\n",
"Here's the definition of a simple ResNet block (where `norm_type=NormType.BatchZero` causes fastai to init the `gamma` weights of that batchnorm layer to zero):"
"Here's the definition of a simple ResNet block (where `norm_type=NormType.BatchZero` causes fastai to init the `gamma` weights of the last batchnorm layer to zero):"
]
},
{
@ -416,27 +437,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One problem with this, however, is that it can't handle a stride other than `1`, and it requires that `ni==nf`. Stop for a moment, to think carefully about why this is...\n",
"There are two problems with this, however: it can't handle a stride other than 1, and it requires that `ni==nf`. Stop for a moment to think carefully about why this is.\n",
"\n",
"The issue is that with a stride of, say, `2`, on one of the convolutions, the grid size of the output activations will be half the size on each axis of the input. So then we can't add that back to `x` in `forward` because `x` and the output activations have different dimensions. The same basic issue occurs if `ni!=nf`: the shapes of the input and output connections won't allow us to add them together.\n",
"The issue is that with a stride of, say, 2 on one of the convolutions, the grid size of the output activations will be half the size on each axis of the input. So then we can't add that back to `x` in `forward` because `x` and the output activations have different dimensions. The same basic issue occurs if `ni!=nf`: the shapes of the input and output connections won't allow us to add them together.\n",
"\n",
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer which takes 2x2 patches from the input, and replaces them with their average.\n",
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer that takes 2×2 patches from the input and replaces them with their average.\n",
"\n",
"Changing the number of channels can be done by using a convolution. We want this skip connection to be as close to an identity map as possible, however, which means making this convolution as simple as possible. The simplest possible convolution is one where the kernel size is `1`. That means that the kernel is size `ni*nf*1*1`, so it's only doing a dot product over the channels of each input pixel--it's not combining across pixels at all. This kind of *1x1 convolution* is very widely used in modern CNNs, so take a moment to think about how it works."
"Changing the number of channels can be done by using a convolution. We want this skip connection to be as close to an identity map as possible, however, which means making this convolution as simple as possible. The simplest possible convolution is one where the kernel size is 1. That means that the kernel is size `ni*nf*1*1`, so it's only doing a dot product over the channels of each input pixelit's not combining across pixels at all. This kind of *1x1 convolution* is very widely used in modern CNNs, so take a moment to think about how it works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> question: Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: 1x1 convolution: A convolution with a kernel size of one."
"> jargon: 1x1 convolution: A convolution with a kernel size of 1."
]
},
{
@ -480,7 +494,7 @@
"source": [
"Note that we're using the `noop` function here, which simply returns its input unchanged (*noop* is a computer science term that stands for \"no operation\"). In this case, `idconv` does nothing at all if `nf==nf`, and `pool` does nothing if `stride==1`, which is what we wanted in our skip connection.\n",
"\n",
"Also, you'll see that we've removed relu (`act_cls=None`) from the final convolution in `convs` and from `idconv`, and moved it to *after* we add the skip connection. The thinking behind this is that the whole ResNet block is like a layer, and you want your activation to be *after* your layer.\n",
"Also, you'll see that we've removed the ReLU (`act_cls=None`) from the final convolution in `convs` and from `idconv`, and moved it to *after* we add the skip connection. The thinking behind this is that the whole ResNet block is like a layer, and you want your activation to be after your layer.\n",
"\n",
"Let's replace our `block` with `ResBlock`, and try it out:"
]
@ -568,7 +582,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's not much better. But the whole point of this was to allow us to train *deeper* models, and we're not really taking advantage of that yet. To create a deeper model that's, say, twice as deep, all we need to do is replace our `block` with two `ResBlock`s in a row:"
"It's not much better. But the whole point of this was to allow us to train *deeper* models, and we're not really taking advantage of that yet. To create a model that's, say, twice as deep, all we need to do is replace our `block` with two `ResBlock`s in a row:"
]
},
{
@ -657,44 +671,51 @@
"source": [
"Now we're making good progress!\n",
"\n",
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be strong practitioner, not just a theoretician.\n",
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting points for the breakthroughs were experimental observations: observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kinds of networks can be trained, in the case of the ResNet authors. This ability to design and analyze thoughtful experiments, or even just to see an unexpected result, say \"Hmmm, that's interesting,\" and then, most importantly, set about figuring out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be a strong practitioner, not just a theoretician.\n",
"\n",
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids us falling into a very sharp area. Here's a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right):"
"Since the ResNet was introduced, it's been widely studied and applied to many domains. One of the most interesting papers, published in 2018, is Hao Li et al.'s [\"Visualizing the Loss Landscape of Neural Nets\"](https://arxiv.org/abs/1712.09913). It shows that using skip connections helps smooth the loss function, which makes training easier as it avoids falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, illustrating the difference between the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Impact of ResNet on loss landscape\" width=\"600\" caption=\"Impact of ResNet on loss landscape\" id=\"resnet_surface\" src=\"images/att_00044.png\">"
"<img alt=\"Impact of ResNet on loss landscape\" width=\"600\" caption=\"Impact of ResNet on loss landscape (courtesy of Hao Li et al.)\" id=\"resnet_surface\" src=\"images/att_00044.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A state-of-the-art ResNet"
"Our first model is already good, but further research has discovered more tricks we can apply to make it better. We'll look at those next."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [Bag of Tricks for Image Classification with Convolutional Neural Networks](https://arxiv.org/abs/1812.01187), the authors study different variations of the ResNet architecture that come at almost no additional cost in terms of number of parameters or computation. By using this tweaked ResNet50 architecture and Mixup they achieve 94.6% top-5 accuracy on ImageNet, instead of 92.2% with a regular ResNet50 without Mixup. This result is better than regular ResNet models that are twice as deep (and twice as slow, and much more likely to overfit)."
"### A State-of-the-Art ResNet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: top-5 accuracy: A metric testing how often the label we want is in the top 5 predictions of our model. It was used in the Imagenet competition, since many images contained multiple objects, or contained objects that could be easily confused or may even have been mislabeled with a similar label. In these situations, looking at top-1 accuracy may be inappropriate. However, recently CNNs have been getting so good that top-5 accuracy is nearly 100%, so some researchers are using top-1 accuracy for Imagenet too now."
"In [\"Bag of Tricks for Image Classification with Convolutional Neural Networks\"](https://arxiv.org/abs/1812.01187), Tong He et al. study different variations of the ResNet architecture that come at almost no additional cost in terms of number of parameters or computation. By using a tweaked ResNet-50 architecture and Mixup they achieved 94.6% top-5 accuracy on ImageNet, in comparison to 92.2% with a regular ResNet-50 without Mixup. This result is better than that achieved by regular ResNet models that are twice as deep (and twice as slow, and much more likely to overfit)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our the implementation we had before in that it begins with a few convolutional layers followed by a max pooling layer, instead of just starting with ResNet blocks. This is what the first layers look like:"
"> jargon: top-5 accuracy: A metric testing how often the label we want is in the top 5 predictions of our model. It was used in the ImageNet competition because many of the images contained multiple objects, or contained objects that could be easily confused or may even have been mislabeled with a similar label. In these situations, looking at top-1 accuracy may be inappropriate. However, recently CNNs have been getting so good that top-5 accuracy is nearly 100%, so some researchers are using top-1 accuracy for ImageNet too now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use this tweaked version as we scale up to the full ResNet, because it's substantially better. It differs a little bit from our previous implementation, in that instead of just starting with ResNet blocks, it begins with a few convolutional layers followed by a max pooling layer. This is what the first layers, called the *stem* of the network, look like:"
]
},
{
@ -768,7 +789,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Stem: The stem of a CNN are its first few layers. Generally, the stem has a different structure to the main body of the CNN."
"> jargon: Stem: The first few layers of a CNN. Generally, the stem has a different structure than the main body of the CNN."
]
},
{
@ -777,13 +798,13 @@
"source": [
"The reason that we have a stem of plain convolutional layers, instead of ResNet blocks, is based on a very important insight about all deep convolutional neural networks: the vast majority of the computation occurs in the early layers. Therefore, we should keep the early layers as fast and simple as possible.\n",
"\n",
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128 pixel input image. If it is a stride one convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4x4 or even 2x2. So there are far fewer kernel applications to do.\n",
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128-pixel input image. If it is a stride-1 convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4×4 or even 2×2, so there are far fewer kernel applications to do.\n",
"\n",
"On the other hand, the first layer convolution only has three input features, and 32 output features. Since it is a 3x3 kernel, this is 3×32×3×3 = 864 parameters in the weights. On the other hand, the last convolution will be 256 input features and 512 output features, which will be 1,179,648 weights! So the first layers contain vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
"On the other hand, the first-layer convolution only has 3 input features and 32 output features. Since it is a 3×3 kernel, this is 3×32×3×3 = 864 parameters in the weights. But the last convolution will have 256 input features and 512 output features, resulting in 1,179,648 weights! So the first layers contain the vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
"\n",
"A ResNet block takes more computation than a plain convolutional block, since (in the stride two case) a ResNet block has three convolutions and a pooling layer. That's why we want to have plain convolutions to start off our ResNet.\n",
"A ResNet block takes more computation than a plain convolutional block, since (in the stride-2 case) a ResNet block has three convolutions and a pooling layer. That's why we want to have plain convolutions to start off our ResNet.\n",
"\n",
"We're now ready to show the implementation of a modern ResNet, with the \"bag of tricks\". The ResNet use four groups of ResNet blocks, with 64, 128, 256 then 512 filters. Each groups starts with a stride 2 block, except for the first one, since it's just after a `MaxPooling` layer."
"We're now ready to show the implementation of a modern ResNet, with the \"bag of tricks.\" It uses four groups of ResNet blocks, with 64, 128, 256, then 512 filters. Each group starts with a stride-2 block, except for the first one, since it's just after a `MaxPooling` layer:"
]
},
{
@ -815,9 +836,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `_make_layer` function is just there to create a series of `nl` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section' for now, it'll be `1`, so it doesn't do anything.)\n",
"The `_make_layer` function is just there to create a series of `n_layers` blocks. The first one is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now; we'll discuss it in the next section. For now, it'll be `1`, so it doesn't do anything.)\n",
"\n",
"The various versions of the models (ResNet 18, 34, 50, etc) just change the number of blocks in each of those groups. This is the definition of a ResNet18:"
"The various versions of the models (ResNet-18, -34, -50, etc.) just change the number of blocks in each of those groups. This is the definition of a ResNet-18:"
]
},
{
@ -910,37 +931,39 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Even although we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bottleneck layers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Things are a tiny bit more complicated for deeper models like `resnet50` as they don't use the same resnet blocks: instead of stacking two convolutions with a kernel size of 3, they use three different convolutions: two 1x1 (at the beginning and the end) and one 3x3, as shown in the right of this image from the ResNet paper (using an example of 64 channel output, comparing to the regular ResBlock on the left):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Comparison of regular and bottleneck ResNet blocks\" width=\"550\" caption=\"Comparison of regular and bottleneck ResNet blocks\" id=\"resnet_compare\" src=\"images/att_00045.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first resnet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"Even though we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem.\n",
"\n",
"Let's try replacing our ResBlock with this bottleneck design:"
"To make our model deeper without taking too much compute or memory, we can use another kind of layer introduced by the ResNet paper for ResNets with a depth of 50 or more: the bottleneck layer. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bottleneck Layers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of stacking two convolutions with a kernel size of 3, bottleneck layers use three different convolutions: two 1×1 (at the beginning and the end) and one 3×3, as shown on the right in <<resnet_compare>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Comparison of regular and bottleneck ResNet blocks\" width=\"550\" caption=\"Comparison of regular and bottleneck ResNet blocks (courtesy of Kaiming He et al.)\" id=\"resnet_compare\" src=\"images/att_00045.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is that useful? 1×1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see in the illustration, the number of filters in and out is 4 times higher (256 instead of 64) diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"\n",
"Let's try replacing our `ResBlock` with this bottleneck design:"
]
},
{
@ -960,7 +983,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use this to create a ResNet50, which uses this bottleneck block, and uses group sizes of `(3,4,6,3)`. We now need to pass `4` in to the `expansion` parameter of `ResNet`, since we need to start with four times less channels, and we'll end with four times more channels.\n",
"We'll use this to create a ResNet-50 with group sizes of `(3,4,6,3)`. We now need to pass `4` in to the `expansion` parameter of `ResNet`, since we need to start with four times less channels and we'll end with four times more channels.\n",
"\n",
"Deeper networks like this don't generally show improvements when training for only 5 epochs, so we'll bump it up to 20 epochs this time to make the most of our bigger model. And to really get great results, let's use bigger images too:"
]
@ -978,7 +1001,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We don't have to do anything to account for the larger 224 pixel images--thanks to our fully convolutional network, it just works. This is also why we were able to do *progressive resizing* earlier in the book--the models we used were fully convolutional, so we were even able to fine-tune models trained with different sizes."
"We don't have to do anything to account for the larger 224-pixel images; thanks to our fully convolutional network, it just works. This is also why we were able to do *progressive resizing* earlier in the bookthe models we used were fully convolutional, so we were even able to fine-tune models trained with different sizes. We can now train our model and see the effects:"
]
},
{
@ -1171,7 +1194,21 @@
"source": [
"We're getting a great result now! Try adding Mixup, and then training this for a hundred epochs while you go get lunch. You'll have yourself a very accurate image classifier, trained from scratch.\n",
"\n",
"The bottleneck design we've shown here is only used in ResNet50, 101, and 152 in all official models we've seen. ResNet18 and 34 use the non-bottleneck design seen in the previous section. However, we've noticed that the bottleneck layer generally works better even for the shallower networks. This just goes to show that the little details in papers tend to stick around for years, even if they're actually not quite the best design! Questioning assumptions and \"stuff everyone knows\" is always a good idea, because this is still a new field, and there's lots of details that aren't always done well."
"The bottleneck design we've shown here is typically only used in ResNet-50, -101, and -152 models. ResNet-18 and -34 models usually use the non-bottleneck design seen in the previous section. However, we've noticed that the bottleneck layer generally works better even for the shallower networks. This just goes to show that the little details in papers tend to stick around for years, even if they're actually not quite the best design! Questioning assumptions and \"stuff everyone knows\" is always a good idea, because this is still a new field, and there are lots of details that aren't always done well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You have now seen how the models we have been using for computer vision since the first chapter are built, using skip connections to allow deeper models to be trained. Even if there has been a lot of research into better architectures, they all use one version or another of this trick, to make a direct path from the input to the end of the network. When using transfer learning, the ResNet is the pretrained model. In the next chapter, we will look at the final details of how the models we actually used were built from it."
]
},
{
@ -1185,43 +1222,44 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. How did we get to a single vector of activations in the convnets used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
"1. How did we get to a single vector of activations in the CNNs used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
"1. What do we do for Imagenette instead?\n",
"1. What is adaptive pooling?\n",
"1. What is average pooling?\n",
"1. What is \"adaptive pooling\"?\n",
"1. What is \"average pooling\"?\n",
"1. Why do we need `Flatten` after an adaptive average pooling layer?\n",
"1. What is a skip connection?\n",
"1. What is a \"skip connection\"?\n",
"1. Why do skip connections allow us to train deeper models?\n",
"1. What does <<resnet_depth>> show? How did that lead to the idea of skip connections?\n",
"1. What is an identity mapping?\n",
"1. What is the basic equation for a resnet block (ignoring batchnorm and relu layers)?\n",
"1. What do ResNets have to do with \"residuals\"?\n",
"1. How do we deal with the skip connection when there is a stride 2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1x1 convolution in terms of a vector dot product?\n",
"1. What is \"identity mapping\"?\n",
"1. What is the basic equation for a ResNet block (ignoring batchnorm and ReLU layers)?\n",
"1. What do ResNets have to do with residuals?\n",
"1. How do we deal with the skip connection when there is a stride-2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1×1 convolution in terms of a vector dot product?\n",
"1. Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?\n",
"1. What does the `noop` function return?\n",
"1. Explain what is shown in <<resnet_surface>>.\n",
"1. When is top-5 accuracy a better metric than top-1 accuracy?\n",
"1. What is the stem of a CNN?\n",
"1. Why use plain convs in the CNN stem, instead of resnet blocks?\n",
"1. How does a bottleneck block differ from a plain resnet block?\n",
"1. What is the \"stem\" of a CNN?\n",
"1. Why do we use plain convolutions in the CNN stem, instead of ResNet blocks?\n",
"1. How does a bottleneck block differ from a plain ResNet block?\n",
"1. Why is a bottleneck block faster?\n",
"1. How do fully convolution nets (and nets with adaptive pooling in general) allow for progressive resizing?"
"1. How do fully convolutional nets (and nets with adaptive pooling in general) allow for progressive resizing?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research"
"### Further Research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride 2 layers). How does it compare to a network without such a pooling layer?\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1x1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. Write a \"top 5 accuracy\" function using plain PyTorch or plain Python.\n",
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride-2 layers). How does it compare to a network without such a pooling layer?\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1×1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. Write a \"top-5 accuracy\" function using plain PyTorch or plain Python.\n",
"1. Train a model on Imagenette for more epochs, with and without label smoothing. Take a look at the Imagenette leaderboards and see how close you can get to the best results shown. Read the linked pages describing the leading approaches."
]
},
@ -1230,9 +1268,7 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s"
]
"source": []
}
],
"metadata": {
@ -1243,33 +1279,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

816
15_arch_details.ipynb Normal file
View File

@ -0,0 +1,816 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"!pip install -Uqq fastbook\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from fastbook import *"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[chapter_arch_details]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Application Architectures Deep Dive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now in the exciting position that we can fully understand the architectures that we have been using for our state-of-the-art models for computer vision, natural language processing, and tabular analysis. In this chapter, we're going to fill in all the missing details on how fastai's application models work and show you how to build the models they use.\n",
"\n",
"We will also go back to the custom data preprocessing pipeline we saw in <<chapter_midlevel_data>> for Siamese networks and show you how you can use the components in the fastai library to build custom pretrained models for new tasks.\n",
"\n",
"We'll start with computer vision."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computer Vision"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For computer vision application we use the functions `cnn_learner` and `unet_learner` to build our models, depending on the task. In this section we'll explore how to build the `Learner` objects we used in Parts 1 and 2 of this book."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### cnn_learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at what happens when we use the `cnn_learner` function. We begin by passing this function an architecture to use for the *body* of the network. Most of the time we use a ResNet, which you already know how to create, so we don't need to delve into that any further. Pretrained weights are downloaded as required and loaded into the ResNet.\n",
"\n",
"Then, for transfer learning, the network needs to be *cut*. This refers to slicing off the final layer, which is only responsible for ImageNet-specific categorization. In fact, we do not slice off only this layer, but everything from the adaptive average pooling layer onwards. The reason for this will become clear in just a moment. Since different architectures might use different types of pooling layers, or even completely different kinds of *heads*, we don't just search for the adaptive pooling layer to decide where to cut the pretrained model. Instead, we have a dictionary of information that is used for each model to determine where its body ends, and its head starts. We call this `model_meta`—here it is for resnet-50:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cut': -2,\n",
" 'split': <function fastai.vision.learner._resnet_split(m)>,\n",
" 'stats': ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_meta[resnet50]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Body and Head: The \"head\" of a neural net is the part that is specialized for a particular task. For a CNN, it's generally the part after the adaptive average pooling layer. The \"body\" is everything else, and includes the \"stem\" (which we learned about in <<chapter_resnet>>)."
]
},
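{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of \"cutting\" concrete, here is a minimal sketch in plain PyTorch, assuming a torchvision `resnet50` (in practice you would start from the pretrained weights, and fastai's `create_body`, which we will use later in this chapter, takes care of this for the supported architectures):\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"from torchvision.models import resnet50\n",
"\n",
"model = resnet50()  # in practice, a pretrained model\n",
"# A cut of -2 keeps everything except the last two children:\n",
"# the adaptive average pooling layer and the final linear layer\n",
"body = nn.Sequential(*list(model.children())[:-2])\n",
"\n",
"x = torch.randn(2, 3, 224, 224)\n",
"print(body(x).shape)  # torch.Size([2, 2048, 7, 7]): a grid of activations, no classifier\n",
"```"
]
},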
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we take all of the layers prior to the cut point of `-2`, we get the part of the model that fastai will keep for transfer learning. Now, we put on our new head. This is created using the function `create_head`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sequential(\n",
" (0): AdaptiveConcatPool2d(\n",
" (ap): AdaptiveAvgPool2d(output_size=1)\n",
" (mp): AdaptiveMaxPool2d(output_size=1)\n",
" )\n",
" (1): full: False\n",
" (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (3): Dropout(p=0.25, inplace=False)\n",
" (4): Linear(in_features=20, out_features=512, bias=False)\n",
" (5): ReLU(inplace=True)\n",
" (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (7): Dropout(p=0.5, inplace=False)\n",
" (8): Linear(in_features=512, out_features=2, bias=False)\n",
")"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#hide_output\n",
"create_head(20,2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"Sequential(\n",
" (0): AdaptiveConcatPool2d(\n",
" (ap): AdaptiveAvgPool2d(output_size=1)\n",
" (mp): AdaptiveMaxPool2d(output_size=1)\n",
" )\n",
" (1): Flatten()\n",
" (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)\n",
" (3): Dropout(p=0.25, inplace=False)\n",
" (4): Linear(in_features=20, out_features=512, bias=False)\n",
" (5): ReLU(inplace=True)\n",
" (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)\n",
" (7): Dropout(p=0.5, inplace=False)\n",
" (8): Linear(in_features=512, out_features=2, bias=False)\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this function you can choose how many additional linear layers are added to the end, how much dropout to use after each one, and what kind of pooling to use. By default, fastai will apply both average pooling, and max pooling, and will concatenate the two together (this is the `AdaptiveConcatPool2d` layer). This is not a particularly common approach, but it was developed independently at fastai and other research labs in recent years, and tends to provide some small improvement over using just average pooling.\n",
"\n",
"fastai is a bit different from most libraries in that by default it adds two linear layers, rather than one, in the CNN head. The reason for this is that transfer learning can still be useful even, as we have seen, when transferring the pretrained model to very different domains. However, just using a single linear layer is unlikely to be enough in these cases; we have found that using two linear layers can allow transfer learning to be used more quickly and easily, in more situations."
]
},
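{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the concat pooling idea, here is an illustrative re-implementation in plain PyTorch (not fastai's actual class); concatenating the two pooling results simply doubles the number of channels:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"class ConcatPool2d(nn.Module):\n",
"    # Illustrative sketch: concatenate adaptive max pooling and adaptive average pooling\n",
"    def __init__(self, size=1):\n",
"        super().__init__()\n",
"        self.ap = nn.AdaptiveAvgPool2d(size)\n",
"        self.mp = nn.AdaptiveMaxPool2d(size)\n",
"    def forward(self, x): return torch.cat([self.mp(x), self.ap(x)], dim=1)\n",
"\n",
"x = torch.randn(2, 512, 7, 7)\n",
"print(ConcatPool2d()(x).shape)  # torch.Size([2, 1024, 1, 1]): twice the input channels\n",
"```\n",
"\n",
"This doubling is the \"concat-pool trick\" that comes up again when we build the head for the Siamese model later in this chapter."
]
},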
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: One Last Batchnorm?: One parameter to `create_head` that is worth looking at is `bn_final`. Setting this to `true` will cause a batchnorm layer to be added as your final layer. This can be useful in helping your model scale appropriately for your output activations. We haven't seen this approach published anywhere as yet, but we have found that it works well in practice wherever we have used it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now take a look at what `unet_learner` did in the segmentation problem we showed in <<chapter_intro>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### unet_learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most interesting architectures in deep learning is the one that we used for segmentation in <<chapter_intro>>. Segmentation is a challenging task, because the output required is really an image, or a pixel grid, containing the predicted label for every pixel. There are other tasks that share a similar basic design, such as increasing the resolution of an image (*super-resolution*), adding color to a black-and-white image (*colorization*), or converting a photo into a synthetic painting (*style transfer*)—these tasks are covered by an [online](https://book.fast.ai/) chapter of this book, so be sure to check it out after you've read this chapter. In each case, we are starting with an image and converting it to some other image of the same dimensions or aspect ratio, but with the pixels altered in some way. We refer to these as *generative vision models*.\n",
"\n",
"The way we do this is to start with the exact same approach to developing a CNN head as we saw in the previous problem. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. Then we replace those layers with our custom head, which does the generative task.\n",
"\n",
"There was a lot of handwaving in that last sentence! How on earth do we create a CNN head that generates an image? If we start with, say, a 224-pixel input image, then at the end of the ResNet body we will have a 7×7 grid of convolutional activations. How can we convert that into a 224-pixel segmentation mask?\n",
"\n",
"Naturally, we do this with a neural network! So we need some kind of layer that can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7×7 grid with four pixels in a 2×2 square. Each of those four pixels will have the same value—this is known as *nearest neighbor interpolation*. PyTorch provides a layer that does this for us, so one option is to create a head that contains stride-1 convolutional layers (along with batchnorm and ReLU layers as usual) interspersed with 2×2 nearest neighbor interpolation layers. In fact, you can try this now! See if you can create a custom head designed like this, and try it on the CamVid segmentation task. You should find that you get some reasonable results, although they won't be as good as our <<chapter_intro>> results.\n",
"\n",
"Another approach is to replace the nearest neighbor and convolution combination with a *transposed convolution*, otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between all the pixels in the input. This is easiest to see with a picture—<<transp_conv>> shows a diagram from the excellent [convolutional arithmetic paper](https://arxiv.org/abs/1603.07285) we discussed in <<chapter_convolutions>>, showing a 3×3 transposed convolution applied to a 3×3 image."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"A transposed convolution\" width=\"815\" caption=\"A transposed convolution (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"transp_conv\" src=\"images/att_00051.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see, the result of this is to increase the size of the input. You can try this out now by using fastai's `ConvLayer` class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.\n",
"\n",
"Neither of these approaches, however, works really well. The problem is that our 7×7 grid simply doesn't have enough information to create a 224×224-pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use *skip connections*, like in a ResNet, but skipping from the activations in the body of the ResNet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This approach, illustrated in <<unet>>, was developed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the 2015 paper [\"U-Net: Convolutional Networks for Biomedical Image Segmentation\"](https://arxiv.org/abs/1505.04597). Although the paper focused on medical applications, the U-Net has revolutionized all kinds of generative vision models."
]
},
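{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before looking at the U-Net architecture itself, here is a minimal sketch (plain PyTorch, with hypothetical channel sizes) of the two upsampling building blocks just described; both double the size of the grid:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"x = torch.randn(1, 256, 7, 7)  # hypothetical activations from the end of a CNN body\n",
"\n",
"# Option 1: nearest neighbor interpolation followed by a stride-1 convolution\n",
"nn_upsample = nn.Sequential(\n",
"    nn.Upsample(scale_factor=2, mode='nearest'),\n",
"    nn.Conv2d(256, 128, kernel_size=3, stride=1, padding=1),\n",
"    nn.BatchNorm2d(128),\n",
"    nn.ReLU(inplace=True),\n",
")\n",
"print(nn_upsample(x).shape)  # torch.Size([1, 128, 14, 14])\n",
"\n",
"# Option 2: a transposed (stride-half) convolution increases the grid size directly\n",
"transposed = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)\n",
"print(transposed(x).shape)   # torch.Size([1, 128, 14, 14])\n",
"```\n",
"\n",
"Stacking five blocks like these would take a 7×7 grid back up to 224×224, which is the kind of custom head you could try for the CamVid exercise mentioned above."
]
},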
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"The U-Net architecture\" width=\"630\" caption=\"The U-Net architecture (courtesy of Olaf Ronneberger, Philipp Fischer, and Thomas Brox)\" id=\"unet\" src=\"images/att_00052.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2×2 max pooling instead of stride-2 convolutions, since this paper was written before ResNets came along) and the transposed convolutional (\"up-conv\") layers on the right. Then extra skip connections are shown as gray arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-Net!\"\n",
"\n",
"With this architecture, the input to the transposed convolutions is not just the lower-resolution grid in the preceding layer, but also the higher-resolution grid in the ResNet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class that autogenerates an architecture of the right size based on the data provided.\n",
"\n",
"Let's focus now on an example where we leverage the fastai library to write a custom model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A Siamese Network"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from fastai.vision.all import *\n",
"path = untar_data(URLs.PETS)\n",
"files = get_image_files(path/\"images\")\n",
"\n",
"class SiameseImage(Tuple):\n",
" def show(self, ctx=None, **kwargs): \n",
" img1,img2,same_breed = self\n",
" if not isinstance(img1, Tensor):\n",
" if img2.size != img1.size: img2 = img2.resize(img1.size)\n",
" t1,t2 = tensor(img1),tensor(img2)\n",
" t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)\n",
" else: t1,t2 = img1,img2\n",
" line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)\n",
" return show_image(torch.cat([t1,line,t2], dim=2), \n",
" title=same_breed, ctx=ctx)\n",
" \n",
"def label_func(fname):\n",
" return re.match(r'^(.*)_\\d+.jpg$', fname.name).groups()[0]\n",
"\n",
"class SiameseTransform(Transform):\n",
" def __init__(self, files, label_func, splits):\n",
" self.labels = files.map(label_func).unique()\n",
" self.lbl2files = {l: L(f for f in files if label_func(f) == l) for l in self.labels}\n",
" self.label_func = label_func\n",
" self.valid = {f: self._draw(f) for f in files[splits[1]]}\n",
" \n",
" def encodes(self, f):\n",
" f2,t = self.valid.get(f, self._draw(f))\n",
" img1,img2 = PILImage.create(f),PILImage.create(f2)\n",
" return SiameseImage(img1, img2, t)\n",
" \n",
" def _draw(self, f):\n",
" same = random.random() < 0.5\n",
" cls = self.label_func(f)\n",
" if not same: cls = random.choice(L(l for l in self.labels if l != cls)) \n",
" return random.choice(self.lbl2files[cls]),same\n",
" \n",
"splits = RandomSplitter()(files)\n",
"tfm = SiameseTransform(files, label_func, splits)\n",
"tls = TfmdLists(files, tfm, splits=splits)\n",
"dls = tls.dataloaders(after_item=[Resize(224), ToTensor], \n",
" after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's go back to the input pipeline we set up in <<chapter_midlevel_data>> for a Siamese network. If you remember, it consisted of pair of images with the label being `True` or `False`, depending on if they were in the same class or not.\n",
"\n",
"Using what we just saw, let's build a custom model for this task and train it. How? We will use a pretrained architecture and pass our two images through it. Then we can concatenate the results and send them to a custom head that will return two predictions. In terms of modules, this looks like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class SiameseModel(Module):\n",
" def __init__(self, encoder, head):\n",
" self.encoder,self.head = encoder,head\n",
" \n",
" def forward(self, x1, x2):\n",
" ftrs = torch.cat([self.encoder(x1), self.encoder(x2)], dim=1)\n",
" return self.head(ftrs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create our encoder, we just need to take a pretrained model and cut it, as we explained before. The function `create_body` does that for us; we just have to pass it the place where we want to cut. As we saw earlier, per the dictionary of metadata for pretrained models, the cut value for a resnet is `-2`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"encoder = create_body(resnet34, cut=-2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can create our head. A look at the encoder tells us the last layer has 512 features, so this head will need to receive `512*4`. Why 4? First we have to multiply by 2 because we have two images. Then we need a second multiplication by 2 because of our concat-pool trick. So we create the head as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"head = create_head(512*4, 2, ps=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With our encoder and head, we can now build our model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = SiameseModel(encoder, head)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before using `Learner`, we have two more things to define. First, we must define the loss function we want to use. It's regular cross-entropy, but since our targets are Booleans, we need to convert them to integers or PyTorch will throw an error:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def loss_func(out, targ):\n",
" return nn.CrossEntropyLoss()(out, targ.long())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More importantly, to take full advantage of transfer learning, we have to define a custom *splitter*. A splitter is a function that tells the fastai library how to split the model into parameter groups. These are used behind the scenes to train only the head of a model when we do transfer learning. \n",
"\n",
"Here we want two parameter groups: one for the encoder and one for the head. We can thus define the following splitter (`params` is just a function that returns all parameters of a given module):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def siamese_splitter(model):\n",
" return [params(model.encoder), params(model.head)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can define our `Learner` by passing the data, model, loss function, splitter, and any metric we want. Since we are not using a convenience function from fastai for transfer learning (like `cnn_learner`), we have to call `learn.freeze` manually. This will make sure only the last parameter group (in this case, the head) is trained:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = Learner(dls, model, loss_func=loss_func, \n",
" splitter=siamese_splitter, metrics=accuracy)\n",
"learn.freeze()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can directly train our model with the usual methods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.367015</td>\n",
" <td>0.281242</td>\n",
" <td>0.885656</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.307688</td>\n",
" <td>0.214721</td>\n",
" <td>0.915426</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.275221</td>\n",
" <td>0.170615</td>\n",
" <td>0.936401</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.223771</td>\n",
" <td>0.159633</td>\n",
" <td>0.943843</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(4, 3e-3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before unfreezing and fine-tuning the whole model a bit more with discriminative learning rates (that is: a lower learning rate for the body and a higher one for the head):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.212744</td>\n",
" <td>0.159033</td>\n",
" <td>0.944520</td>\n",
" <td>00:35</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.201893</td>\n",
" <td>0.159615</td>\n",
" <td>0.942490</td>\n",
" <td>00:35</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.204606</td>\n",
" <td>0.152338</td>\n",
" <td>0.945196</td>\n",
" <td>00:36</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.213203</td>\n",
" <td>0.148346</td>\n",
" <td>0.947903</td>\n",
" <td>00:36</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.unfreeze()\n",
"learn.fit_one_cycle(4, slice(1e-6,1e-4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"94.8\\% is very good when we remember a classifier trained the same way (with no data augmentation) had an error rate of 7%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've seen how to create complete state-of-the-art computer vision models, let's move on to NLP."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural Language Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting an AWD-LSTM language model into a transfer learning classifier, as we did in <<chapter_nlp>>, follows a very similar process to what we did with `cnn_learner` in the first section of this chapter. We do not need a \"meta\" dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.\n",
"\n",
"To create a classifier from this we use an approach described in the [ULMFiT paper](https://arxiv.org/abs/1801.06146) as \"BPTT for Text Classification (BPT3C)\":"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> : We divide the document into fixed-length batches of size *b*. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In other words, the classifier contains a `for` loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models—but this time, we do not pool over CNN grid cells, but over RNN sequences.\n",
"\n",
"For this `for` loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own labels. However, it's very likely that those texts won't all be of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
"\n",
"That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greatest length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid extreme cases where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation), we alter the randomness by making sure texts of comparable size are put together. The texts will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely so.\n",
"\n",
"This is done automatically behind the scenes by the fastai library when creating our `DataLoaders`."
]
},
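{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of that padding and sorting idea, using made-up token IDs and a hypothetical pad ID (fastai does this for you, with the special `xxpad` token, when it builds the `DataLoaders`):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"pad_id = 1  # hypothetical ID for the xxpad token\n",
"texts = [[12, 7, 9, 4, 25], [3, 8], [14, 2, 2, 6]]  # made-up token IDs of varying lengths\n",
"\n",
"# Sort by length so that texts of comparable size end up together,\n",
"# then pad each one up to the length of the longest text in the batch\n",
"texts = sorted(texts, key=len, reverse=True)\n",
"max_len = len(texts[0])\n",
"batch = torch.tensor([t + [pad_id] * (max_len - len(t)) for t in texts])\n",
"print(batch)\n",
"# tensor([[12,  7,  9,  4, 25],\n",
"#         [14,  2,  2,  6,  1],\n",
"#         [ 3,  8,  1,  1,  1]])\n",
"```"
]
},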
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tabular"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's take a look at `fastai.tabular` models. (We don't need to look at collaborative filtering separately, since we've already seen that these models are just tabular models, or use the dot product approach, which we've implemented earlier from scratch.)\n",
"\n",
"Here is the `forward` method for `TabularModel`:\n",
"\n",
"```python\n",
"if self.n_emb != 0:\n",
" x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
" x = torch.cat(x, 1)\n",
" x = self.emb_drop(x)\n",
"if self.n_cont != 0:\n",
" x_cont = self.bn_cont(x_cont)\n",
" x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
"return self.layers(x)\n",
"```\n",
"\n",
"We won't show `__init__` here, since it's not that interesting, but we will look at each line of code in `forward` in turn. The first line:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"if self.n_emb != 0:\n",
"```\n",
"\n",
"is just testing whether there are any embeddings to deal with—we can skip this section if we only have continuous variables. `self.embeds` contains the embedding matrices, so this gets the activations of each:\n",
" \n",
"```python\n",
" x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
"```\n",
"\n",
"and concatenates them into a single tensor:\n",
"\n",
"```python\n",
" x = torch.cat(x, 1)\n",
"```\n",
"\n",
"Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value:\n",
"\n",
"```python\n",
" x = self.emb_drop(x)\n",
"```\n",
"\n",
"Now we test whether there are any continuous variables to deal with:\n",
"\n",
"```python\n",
"if self.n_cont != 0:\n",
"```\n",
"\n",
"They are passed through a batchnorm layer:\n",
"\n",
"```python\n",
" x_cont = self.bn_cont(x_cont)\n",
"```\n",
"\n",
"and concatenated with the embedding activations, if there were any:\n",
"\n",
"```python\n",
" x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
"```\n",
"\n",
"Finally, this is passed through the linear layers (each of which includes batchnorm, if `use_bn` is `True`, and dropout, if `ps` is set to some value or list of values):\n",
"\n",
"```python\n",
"return self.layers(x)\n",
"\n",
"```\n",
"\n",
"Congratulations! Now you know every single piece of the architectures used in the fastai library!"
]
},
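{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to see the shapes involved, here is a self-contained walkthrough of those same steps with made-up sizes (two categorical columns with cardinalities 10 and 7, two continuous columns; none of these numbers come from the book):\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"bs = 4  # hypothetical batch size\n",
"embeds = nn.ModuleList([nn.Embedding(10, 6), nn.Embedding(7, 4)])  # made-up embedding sizes\n",
"emb_drop = nn.Dropout(0.1)\n",
"bn_cont = nn.BatchNorm1d(2)  # two continuous columns\n",
"layers = nn.Sequential(nn.Linear(6 + 4 + 2, 8), nn.ReLU(), nn.Linear(8, 1))\n",
"\n",
"x_cat = torch.randint(0, 7, (bs, 2))  # categorical codes, valid for both embeddings\n",
"x_cont = torch.randn(bs, 2)           # continuous values\n",
"\n",
"x = [e(x_cat[:, i]) for i, e in enumerate(embeds)]  # one activation tensor per embedding\n",
"x = torch.cat(x, 1)                                 # shape: (4, 10)\n",
"x = emb_drop(x)\n",
"x_cont = bn_cont(x_cont)\n",
"x = torch.cat([x, x_cont], 1)                       # shape: (4, 12)\n",
"print(layers(x).shape)                              # torch.Size([4, 1])\n",
"```"
]
},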
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrapping Up Architectures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the details of deep learning architectures need not scare you now. You can look inside the code of fastai and PyTorch and see just what is going on. More importantly, try to understand *why* it's going on. Take a look at the papers that are being referenced in the code, and try to see how the code matches up to the algorithms that are described.\n",
"\n",
"Now that we have investigated all of the pieces of a model and the data that is passed into it, we can consider what this means for practical deep learning. If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. But the reason that deep learning is not straightforward is because your data, memory, and time are typically limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.\n",
"\n",
"So, step one is to get to the point where you can overfit. Then the question is how to reduce that overfitting. <<reduce_overfit>> shows how we recommend prioritizing the steps from there."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Steps to reducing overfitting\" width=\"400\" caption=\"Steps to reducing overfitting\" id=\"reduce_overfit\" src=\"images/att_00047.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many practitioners, when faced with an overfitting model, start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularization. Using a smaller model should be absolutely the last step you take, unless training your model is taking up too much time or memory. Reducing the size of your model reduces the ability of your model to learn subtle relationships in your data.\n",
"\n",
"Instead, your first step should be to seek to *create more data*. That could involve adding more labels to data that you already have, finding additional tasks that your model could be asked to solve (or, to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data by using more or different data augmentation techniques. Thanks to the development of Mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.\n",
"\n",
"Once you've got as much data as you think you can reasonably get hold of, and are using it as effectively as possible by taking advantage of all the labels that you can find and doing all the augmentation that makes sense, if you are still overfitting you should think about using more generalizable architectures. For instance, adding batch normalization may improve generalization.\n",
"\n",
"If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularization. Generally speaking, adding dropout to the last layer or two will do a good job of regularizing your model. However, as we learned from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help even more. Generally speaking, a larger model with more regularization is more flexible, and can therefore be more accurate than a smaller model with less regularization.\n",
"\n",
"Only after considering all of these options would we recommend that you try using a smaller version of your architecture."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is the \"head\" of a neural net?\n",
"1. What is the \"body\" of a neural net?\n",
"1. What is \"cutting\" a neural net? Why do we need to do this for transfer learning?\n",
"1. What is `model_meta`? Try printing it to see what's inside.\n",
"1. Read the source code for `create_head` and make sure you understand what each line does.\n",
"1. Look at the output of `create_head` and make sure you understand why each layer is there, and how the `create_head` source created it.\n",
"1. Figure out how to change the dropout, layer size, and number of layers created by `create_cnn`, and see if you can find values that result in better accuracy from the pet recognizer.\n",
"1. What does `AdaptiveConcatPool2d` do?\n",
"1. What is \"nearest neighbor interpolation\"? How can it be used to upsample convolutional activations?\n",
"1. What is a \"transposed convolution\"? What is another name for it?\n",
"1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.\n",
"1. Draw the U-Net architecture.\n",
"1. What is \"BPTT for Text Classification\" (BPT3C)?\n",
"1. How do we handle different length sequences in BPT3C?\n",
"1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.\n",
"1. How is `self.layers` defined in `TabularModel`?\n",
"1. What are the five steps for preventing over-fitting?\n",
"1. Why don't we reduce architecture complexity before trying other approaches to preventing overfitting?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further Research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.\n",
"1. Try switching between `AdaptiveConcatPool2d` and `AdaptiveAvgPool2d` in a CNN head and see what difference it makes.\n",
"1. Write your own custom splitter to create a separate parameter group for every ResNet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
"1. Read the online chapter about generative image models, and create your own colorizer, super-resolution model, or style transfer model.\n",
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on CamVid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

1324
16_accel_sgd.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

@ -1,476 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from utils import *"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[chapter_arch_details]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Application architectures deep dive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now in the exciting position that we can fully understand the entire architectures that we have been using for our state-of-the-art models for computer vision, natural language processing, and tabular analysis. In this chapter, we're going to fill in all the missing details on how fastai's application models work."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computer vision"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### cnn_learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at what happens when we use the `cnn_learner` function. We pass it an architecture to use for the *body* of the network. Most of the time we use a resnet, which we already know how to create, so we don't need to delve into that any further. Pretrained weights are downloaded as required and loaded into the resnet.\n",
"\n",
"Then, for transfer learning, the network needs to be *cut*. This refers to slicing off the final layer, which is only responsible for ImageNet-specific categorisation. In fact, we do not only slice off this layer, but everything from the adaptive average pooling layer onwards. The reason for this will become clear in just a moment. Since different architectures might use different types of pooling layers, or even completely different kinds of *heads*, we don't just search for the adaptive pooling layer to decide where to cut the pretrained model. Instead, we have a dictionary of information that is used for each model to know where its body ends, and its head starts. We call this `model_meta` — here it is for resnet 50:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cut': -2,\n",
" 'split': <function fastai2.vision.learner._resnet_split(m)>,\n",
" 'stats': ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_meta[resnet50]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Body and Head: The \"head\" of a neural net is the part that is specialized for a particular task. For a convnet, it's generally the part after the adaptive average pooling layer. The \"body\" is everything else, and includes the \"stem\" (which we learned about in <<chapter_resnet>>)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we take all of the layers prior to the cutpoint of `-2`, we get the part of the model which fastai will keep for transfer learning. Now, we put on our new head. This is created using the function create_head:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sequential(\n",
" (0): AdaptiveConcatPool2d(\n",
" (ap): AdaptiveAvgPool2d(output_size=1)\n",
" (mp): AdaptiveMaxPool2d(output_size=1)\n",
" )\n",
" (1): full: False\n",
" (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (3): Dropout(p=0.25, inplace=False)\n",
" (4): Linear(in_features=20, out_features=512, bias=False)\n",
" (5): ReLU(inplace=True)\n",
" (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (7): Dropout(p=0.5, inplace=False)\n",
" (8): Linear(in_features=512, out_features=2, bias=False)\n",
")"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#hide_output\n",
"create_head(20,2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"Sequential(\n",
" (0): AdaptiveConcatPool2d(\n",
" (ap): AdaptiveAvgPool2d(output_size=1)\n",
" (mp): AdaptiveMaxPool2d(output_size=1)\n",
" )\n",
" (1): Flatten()\n",
" (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)\n",
" (3): Dropout(p=0.25, inplace=False)\n",
" (4): Linear(in_features=20, out_features=512, bias=False)\n",
" (5): ReLU(inplace=True)\n",
" (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)\n",
" (7): Dropout(p=0.5, inplace=False)\n",
" (8): Linear(in_features=512, out_features=2, bias=False)\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this function you can choose how many additional linear layers are added to the end, how much dropout to use after each one, and what kind of pooling to use. By default, fastai will apply both average pooling, and max pooling, and will concatenate the two together (this is the `AdaptiveConcatPool2d` layer). This is not a particularly common approach, but it was developed independently at fastai and at other research labs in recent years, and tends to provide some small improvement over using just average pooling.\n",
"\n",
"Fastai is also a bit different to most libraries in adding two linear layers, rather than one, by default in the CNN head. The reason for this is that transfer learning can still be useful even, as we have seen, and transferring two very different domains to the pretrained model. However, just using a single linear layer is unlikely to be enough. So we have found that using two linear layers can allow transfer learning to be used more quickly and easily, in more situations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: One parameter to create_head that is worth looking at is bn_final. Setting this to true will cause a batchnorm layer to be added as your final layer. This can be useful in helping your model to more easily ensure that it is scaled appropriately for your output activations. We haven't seen this approach published anywhere, as yet, but we have found that it works well in practice, wherever we have used it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### unet_learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most interesting architectures in deep learning is the one that we used for segmentation in <<chapter_intro>>. Segmentation is a challenging task, because the output required is really an image, or a pixel grid, containing the predicted label for every pixel. There are other tasks which share a similar basic design, such as increasing the resolution of an image (*super resolution*), adding colour to a black-and-white image (*colorization*), or converting a photo into a synthetic painting (*style transfer*)--these tasks are covered by an online chapter of this book, so be sure to check it out after you've read this chapter. In each case, we are starting with an image, and converting it to some other image of the same dimensions or aspect ratio, but with the pixels converted in some way. We refer to these as *generative vision models*.\n",
"\n",
"The way we do this is to start with the exact same approach to developing a CNN head as we saw above. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. And then we replace that with our custom head which does the generative task.\n",
"\n",
"There was a lot of handwaving in that last sentence! How on earth do we create a CNN head which generates an image? If we start with, say, a 224 pixel input image, then at the end of the resnet body we will have a 7x7 grid of convolutional activations. How can we convert that into a 224 pixel segmentation mask?\n",
"\n",
"We will (naturally) do this with a neural network! So we need some kind of layer which can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7x7 grid with four pixels in a 2x2 square. Each of those four pixels would have the same value — this is known as nearest neighbour interpolation. PyTorch provides a layer which does this for us, so we could create a head which contains stride one convolutional layers (along with batchnorm and ReLU as usual) interspersed with 2x2 nearest neighbour interpolation layers. In fact, you could try this now! See if you can create a custom head designed like this, and see if it can complete the CamVid segmentation task. You should find that you get some reasonable results, although it won't be as good as our <<chapter_intro>> results.\n",
"\n",
"Another approach is to replace the nearest neighbour and convolution combination with a *transposed convolution* otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between every pixel in the input. This is easiest to see with a picture — here's a diagram from the excellent convolutional arithmetic paper we have seen before, showing a 3x3 transposed convolution applied to a 3x3 image:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"A transposed convolution\" width=\"815\" caption=\"A transposed convolution\" id=\"transp_conv\" src=\"images/att_00051.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see, the result of this is to increase the size of the input. You can try this out now, by using fastai's ConvLayer class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.\n",
"\n",
"Neither of these approaches, however, works really well. The problem is that our 7x7 grid simply doesn't have enough information to create a 224x224 pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use skip connections, like in a resnet, but skipping from the activations in the body of the resnet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This is known as a U-Net, and it was developed in the 2015 paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). Although the paper focussed on medical applications, the U-Net has revolutionized all kinds of generation vision models.\n",
"\n",
"The U-Net paper shows the architecture like this:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"The U-net architecture\" width=\"630\" caption=\"The U-net architecture\" id=\"unet\" src=\"images/att_00052.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2x2 max pooling instead of stride 2 convolutions, since this paper was written before ResNets came along) and it shows the transposed convolutional layers on the right (they're called \"up-conv\" in this picture). Then then extra skip connections are shown as grey arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-net\" when you see this picture!\n",
"\n",
"With this architecture, the input to the transposed convolutions is not just the lower resolution grid in the preceding layer, but also the higher resolution grid in the resnet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class which auto-generates an architecture of the right size based on the data provided."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural language processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've seen how to create complete state of the art computer vision models, let's move on to NLP.\n",
"\n",
"Converting an AWD-LSTM language model into a transfer learning classifier follows a very similar process to what we saw for `cnn_learner` in the first section of this chapter. We do not need a \"meta\" dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is to select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.\n",
"\n",
"To create a classifier from this we use an approach described in the ULMFiT paper as \"BPTT for Text Classification (BPT3C)\". The paper describes this:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> In order to make fine-tuning a classifier for large documents feasible, we propose BPTT for Text Classification (BPT3C): We divide the document into fixed-length batches of size `b`. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, what this is saying is that the classifier contains a for loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models — but this time, we do not pool over CNN grid cells, but over RNN sequences.\n",
"\n",
"For this for loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own label. However, it's very likely that those texts won't have the good taste of being all of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
"\n",
"That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greater length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid having an extreme case where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation) we alter the randomness by making sure texts of comparable size are put together. It will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely random.\n",
"\n",
"This is done automatically behind the scenes by the fastai library when creating our `DataLoaders`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tabular"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can look at `fastai.tabular` models. (We don't need to look at collaborative filtering separately, since we've already seen that these models are just tabular models, or use dot product, which we've implemented earlier from scratch.\n",
"\n",
"Here is the forward method for `TabularModel`:\n",
"\n",
"```python\n",
"if self.n_emb != 0:\n",
" x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
" x = torch.cat(x, 1)\n",
" x = self.emb_drop(x)\n",
"if self.n_cont != 0:\n",
" x_cont = self.bn_cont(x_cont)\n",
" x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
"return self.layers(x)\n",
"```\n",
"\n",
"We won't show `__init__` here, since it's not that interesting, but will look at each line of code in turn in `forward`:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"if self.n_emb != 0:\n",
"```\n",
"\n",
"This is just testing whether there are any embeddings to deal with — we can skip this section if we only have continuous variables.\n",
"\n",
"```python\n",
" x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
"```\n",
"\n",
"`self.embeds` contains the embedding matrices, so this gets the activations of each…\n",
"\n",
"```python\n",
" x = torch.cat(x, 1)\n",
"```\n",
"\n",
"…and concatenates them into a single tensor.\n",
"\n",
"```python\n",
" x = self.emb_drop(x)\n",
"```\n",
"\n",
"Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value.\n",
"\n",
"```python\n",
"if self.n_cont != 0:\n",
"```\n",
"\n",
"Now we test whether there are any continuous variables to deal with.\n",
"\n",
"```python\n",
" x_cont = self.bn_cont(x_cont)\n",
"```\n",
"\n",
"They are passed through a batchnorm layer…\n",
"\n",
"```python\n",
" x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
"```\n",
"\n",
"…and concatenated with the embedding activations, if there were any.\n",
"\n",
"```python\n",
"return self.layers(x)\n",
"\n",
"```\n",
"\n",
"Finally, this is passed through the linear layers (each of which includes batchnorm, if `use_bn` is True, and dropout, if `ps` is set to some value or list of values)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrapping up architectures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the details of deep learning architectures need not scare you now. You can look inside the code of fastai and PyTorch and see just what is going on. More importantly, try to understand why that is going on. Take a look at the papers that are being implemented in the code, and try to see how the code matches up to the algorithms that are described.\n",
"\n",
"Now that we have investigated all of the pieces of a model and the data that is passed into it, we can consider what this means for practical deep learning. If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. The reason that deep learning is not straightforward is because your data, memory, and time is limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.\n",
"\n",
"So step one is to get to the point that you can overfit. Then, the question is how to reduce that overfitting. Here is how we recommend prioritising the steps from there:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Steps to reducing over-fitting\" width=\"400\" caption=\"Steps to reducing over-fitting\" id=\"reduce_overfit\" src=\"images/att_00047.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many practitioners when faced with an overfitting model start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularisation. Using a smaller model should be absolutely the last step you take, unless your model is taking up too much time or memory. Reducing the size of your model as reducing the ability of your model to learn subtle relationships in your data.\n",
"\n",
"Instead, your first step should be to seek to create more data. That could involve adding more labels to data that you already have in your organisation, finding additional tasks that your model could be asked to solve (or to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data via using more or different data augmentation. Thanks to the development of mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.\n",
"\n",
"Once you've got as much data as you think you can reasonably get a hold of, and are using it as effectively as possible by taking advantage of all of the labels that you can find, and all of the augmentation that make sense, if you are still overfitting and you should think about using more generalisable architectures. For instance, adding batch normalisation may improve generalisation.\n",
"\n",
"If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularisation. Generally speaking, adding dropout to the last layer or two will do a good job of regularising your model. However, as we learnt from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help regularise even better. Generally speaking, a larger model with more regularisation is more flexible, and can therefore be more accurate, and a smaller model with less regularisation.\n",
"\n",
"Only after considering all of these options would be recommend that you try using smaller versions of your architectures."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is the head of a neural net?\n",
"1. What is the body of a neural net?\n",
"1. What is \"cutting\" a neural net? Why do we need to do this for transfer learning?\n",
"1. What is \"model_meta\"? Try printing it to see what's inside.\n",
"1. Read the source code for `create_head` and make sure you understand what each line does.\n",
"1. Look at the output of create_head and make sure you understand why each layer is there, and how the create_head source created it.\n",
"1. Figure out how to change the dropout, layer size, and number of layers created by create_cnn, and see if you can find values that result in better accuracy from the pet recognizer.\n",
"1. What does AdaptiveConcatPool2d do?\n",
"1. What is nearest neighbor interpolation? How can it be used to upsample convolutional activations?\n",
"1. What is a transposed convolution? What is another name for it?\n",
"1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.\n",
"1. Draw the u-net architecture.\n",
"1. What is BPTT for Text Classification (BPT3C)?\n",
"1. How do we handle different length sequences in BPT3C?\n",
"1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.\n",
"1. How is `self.layers` defined in `TabularModel`?\n",
"1. What are the five steps for preventing over-fitting?\n",
"1. Why don't we reduce architecture complexity before trying other approaches to preventing over-fitting?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.\n",
"1. Try switching between AdaptiveConcatPool2d and AdaptiveAvgPool2d in a CNN head and see what difference it makes.\n",
"1. Write your own custom splitter to create a separate parameter group for every resnet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
"1. Read the online chapter about generative image models, and create your own colorizer, super resolution model, or style transfer model.\n",
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on Camvid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because it is too large Load Diff

678
18_CAM.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

@ -1,424 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from utils import *"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[chapter_callbacks]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Callbacks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to callbacks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we now know how to create state-of-the-art architectures for computer vision, natural image processing, tabular analysis, and collaborative filtering, and we know how to train them quickly with accelerated optimisers, and we know how to regularise them effectively, we're done, right?\n",
"\n",
"Well… Yes, sort of. But other things come up. Sometimes you need to change how things work a little bit. In fact, we have already seen examples of this: mixup, FP16 training, resetting the model after each epoch for training RNNs, and so forth. How do we go about making these kinds of tweaks to the training process?\n",
"\n",
"We've seen the basic training loop, which, with the help of the `Optimizer` class, looks like this for a single epoch:\n",
"\n",
"```python\n",
"for xb,yb in dl:\n",
" loss = loss_func(model(xb), yb)\n",
" loss.backward()\n",
" opt.step()\n",
" opt.zero_grad()\n",
"```\n",
"\n",
"Here's one way to picture that:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Basic training loop\" width=\"300\" caption=\"Basic training loop\" id=\"basic_loop\" src=\"images/att_00048.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The usual way for deep learning practitioners to customise the training loop is to make a copy of an existing training loop, and then insert their code necessary for their particular changes into it. This is how nearly all code that you find online will look. But it has some very serious problems.\n",
"\n",
"It's not very likely that some particular tweaked training loop is going to meet your particular needs. There are hundreds of changes that can be made to a training loop, which means there are billions and billions of possible permutations. You can't just copy one tweak from a training loop here, another from a training loop there, and expect them all to work together. Each will be based on different assumptions about the environment that it's working in, use different naming conventions, and expect the data to be in different formats.\n",
"\n",
"We need a way to allow users to insert their own code at any part of the training loop, but in a consistent and well-defined way. Computer scientists have already come up with an answer to this question: the callback. A callback is a piece of code that you write, and inject into another piece of code at some predefined point. In fact, callbacks have been used with deep learning training loops for years. The problem is that only a small subset of places that may require code injection have been available in previous libraries, and, more importantly, callbacks were not able to do all the things they needed to do.\n",
"\n",
"In order to be just as flexible as manually copying and pasting a training loop and directly inserting code into it, a callback must be able to read every possible piece of information available in the training loop, modify all of it as needed, and fully control when a batch, epoch, or even all the whole training loop should be terminated. fastai is the first library to provide all of this functionality. It modifies the training loop so it looks like this:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Training loop with callbacks\" width=\"550\" caption=\"Training loop with callbacks\" id=\"cb_loop\" src=\"images/att_00049.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The real test of whether this works has been borne out over the last couple of years — it has turned out that every single new paper implemented, or use a request fulfilled, for modifying the training loop has successfully been achieved entirely by using the fastai callback system. The training loop itself has not required modifications. Here are just a few of the callbacks that have been added:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Some fastai callbacks\" width=\"500\" caption=\"Some fastai callbacks\" id=\"some_cbs\" src=\"images/att_00050.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reason that this is important for all of us is that it means that whatever idea we have in our head, we can implement it. We need never dig into the source code of PyTorch or fastai and act together some one-off system to try out our ideas. And when we do implement our own callbacks to develop our own ideas, we know that they will work together with all of the other functionality provided by fastai so we will get progress bars, mixed precision training, hyperparameter annealing, and so forth.\n",
"\n",
"Another advantage is that it makes it easy to gradually remove or add functionality and perform ablation studies. You just need to adjust the list of callbacks you pass along to your fit function."
]
},
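{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of such an ablation (assuming `dls` is a `DataLoaders` object like the ones built earlier in the book, and with an arbitrary choice of callback and epoch count), you might run two otherwise identical trainings, with and without mixup:\n",
"\n",
"```python\n",
"# with mixup\n",
"cnn_learner(dls, resnet34, metrics=error_rate).fit_one_cycle(5, cbs=MixUp())\n",
"# without mixup: the only change is the callback list passed to fit\n",
"cnn_learner(dls, resnet34, metrics=error_rate).fit_one_cycle(5)\n",
"```"
]
},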
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an example, here is the fastai source code that is run for each batch of the training loop:\n",
"\n",
"```python\n",
"try:\n",
" self._split(b); self('begin_batch')\n",
" self.pred = self.model(*self.xb); self('after_pred')\n",
" self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')\n",
" if not self.training: return\n",
" self.loss.backward(); self('after_backward')\n",
" self.opt.step(); self('after_step')\n",
" self.opt.zero_grad()\n",
"except CancelBatchException: self('after_cancel_batch')\n",
"finally: self('after_batch')\n",
"```\n",
"\n",
"The calls of the form `self('...')` are where the callbacks are called. As you see, after every step a callback is called. The callback will receive the entire state of training, and can also modify it. For instance, as you see above, the input data and target labels are in `self.xb` and `self.yb` respectively. A callback can modify these to modify the data the training loop sees. It can also modify `self.loss`, or even modify the gradients."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a callback"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full list of available callback events is:\n",
"\n",
"- `begin_fit`: called before doing anything, ideal for initial setup.\n",
"- `begin_epoch`: called at the beginning of each epoch, useful for any behavior you need to reset at each epoch.\n",
"- `begin_train`: called at the beginning of the training part of an epoch.\n",
"- `begin_batch`: called at the beginning of each batch, just after drawing said batch. It can be used to do any setup necessary for the batch (like hyper-parameter scheduling) or to change the input/target before it goes in the model (change of the input with techniques like mixup for instance).\n",
"- `after_pred`: called after computing the output of the model on the batch. It can be used to change that output before it's fed to the loss.\n",
"- `after_loss`: called after the loss has been computed, but before the backward pass. It can be used to add any penalty to the loss (AR or TAR in RNN training for instance).\n",
"- `after_backward`: called after the backward pass, but before the update of the parameters. It can be used to do any change to the gradients before said update (gradient clipping for instance).\n",
"- `after_step`: called after the step and before the gradients are zeroed.\n",
"- `after_batch`: called at the end of a batch, for any clean-up before the next one.\n",
"- `after_train`: called at the end of the training phase of an epoch.\n",
"- `begin_validate`: called at the beginning of the validation phase of an epoch, useful for any setup needed specifically for validation.\n",
"- `after_validate`: called at the end of the validation part of an epoch.\n",
"- `after_epoch`: called at the end of an epoch, for any clean-up before the next one.\n",
"- `after_fit`: called at the end of training, for final clean-up.\n",
"\n",
"This list is available as attributes of the special variable `event`; so just type `event.` and hit `Tab` in your notebook to see a list of all the options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at an example. Do you recall how in <<chapter_nlp_dive>> we needed to ensure that our special `reset` method was called at the start of training and validation for each epoch? We used the `ModelReseter` callback provided by fastai to do this for us. But how did `ModelReseter` do that exactly? Here's the full actual source code to that class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class ModelReseter(Callback):\n",
" def begin_train(self): self.model.reset()\n",
" def begin_validate(self): self.model.reset()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes, that's actually it! It just does what we said in the paragraph above: after completing training and epoch or validation for an epoch, call a method named `reset`.\n",
"\n",
"Callbacks are often \"short and sweet\" like this one. In fact, let's look at one more. Here's the fastai source for the callback that add RNN regularization (*AR* and *TAR*):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class RNNRegularizer(Callback):\n",
" def __init__(self, alpha=0., beta=0.): self.alpha,self.beta = alpha,beta\n",
"\n",
" def after_pred(self):\n",
" self.raw_out,self.out = self.pred[1],self.pred[2]\n",
" self.learn.pred = self.pred[0]\n",
"\n",
" def after_loss(self):\n",
" if not self.training: return\n",
" if self.alpha != 0.:\n",
" self.learn.loss += self.alpha * self.out[-1].float().pow(2).mean()\n",
" if self.beta != 0.:\n",
" h = self.raw_out[-1]\n",
" if len(h)>1:\n",
" self.learn.loss += self.beta * (h[:,1:] - h[:,:-1]\n",
" ).float().pow(2).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> stop: Go back to where we discussed TAR and AR regularization, and compare to the code here. Made sure you understand what it's doing, and why."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In both of these examples, notice how we can access attributes of the training loop by directly checking `self.model` or `self.pred`. That's because a `Callback` will always try to get an attribute it doesn't have inside the `Learner` associated to it. This is a shortcut for `self.learn.model` or `self.learn.pred`. Note that this shortcut works for reading attributes, but not for writing them, which is why when `RNNRegularizer` changes the loss or the predictions, you see `self.learn.loss = ` or `self.learn.pred = `. "
]
},
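{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, here is a sketch of a hypothetical callback (not one that ships with fastai) that adds a little Gaussian noise to the inputs during training. Notice that it *reads* `self.training` and `self.xb` through the shortcut, but *writes* via `self.learn.xb`:\n",
"\n",
"```python\n",
"class InputNoiseCallback(Callback):\n",
"    def __init__(self, std=0.01): self.std = std\n",
"    def begin_batch(self):\n",
"        if not self.training: return\n",
"        # reading uses the shortcut; writing must go through self.learn\n",
"        self.learn.xb = tuple(x + self.std*torch.randn_like(x) for x in self.xb)\n",
"```"
]
},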
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When writing a callback, the following attributes of `Learner` are available:\n",
"\n",
"- `model`: the model used for training/validation\n",
"- `data`: the underlying `DataLoaders`\n",
"- `loss_func`: the loss function used\n",
"- `opt`: the optimizer used to udpate the model parameters\n",
"- `opt_func`: the function used to create the optimizer\n",
"- `cbs`: the list containing all `Callback`s\n",
"- `dl`: current `DataLoader` used for iteration\n",
"- `x`/`xb`: last input drawn from `self.dl` (potentially modified by callbacks). `xb` is always a tuple (potentially with one element) and `x` is detuplified. You can only assign to `xb`.\n",
"- `y`/`yb`: last target drawn from `self.dl` (potentially modified by callbacks). `yb` is always a tuple (potentially with one element) and `y` is detuplified. You can only assign to `yb`.\n",
"- `pred`: last predictions from `self.model` (potentially modified by callbacks)\n",
"- `loss`: last computed loss (potentially modified by callbacks)\n",
"- `n_epoch`: the number of epochs in this training\n",
"- `n_iter`: the number of iterations in the current `self.dl`\n",
"- `epoch`: the current epoch index (from 0 to `n_epoch-1`)\n",
"- `iter`: the current iteration index in `self.dl` (from 0 to `n_iter-1`)\n",
"\n",
"The following attributes are added by `TrainEvalCallback` and should be available unless you went out of your way to remove that callback:\n",
"\n",
"- `train_iter`: the number of training iterations done since the beginning of this training\n",
"- `pct_train`: from 0. to 1., the percentage of training iterations completed\n",
"- `training`: flag to indicate if we're in training mode or not\n",
"\n",
"The following attribute is added by `Recorder` and should be available unless you went out of your way to remove that callback:\n",
"\n",
"- `smooth_loss`: an exponentially-averaged version of the training loss"
]
},
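{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of these attributes, here is a sketch of a callback (again, not a fastai built-in) that records the smoothed training loss against the fraction of training completed, using the `pct_train` and `smooth_loss` attributes from the list above:\n",
"\n",
"```python\n",
"class LossTrackerCallback(Callback):\n",
"    def begin_fit(self): self.progress,self.losses = [],[]\n",
"    def after_batch(self):\n",
"        if not self.training: return\n",
"        self.progress.append(self.pct_train)\n",
"        self.losses.append(self.smooth_loss.item())\n",
"```"
]
},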
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Callback ordering and exceptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, callbacks need to be able to tell fastai to skip over a batch, or an epoch, or stop training altogether. For instance, consider `TerminateOnNaNCallback`. This handy callback will automatically stop training any time the loss becomes infinite or `NaN` (*not a number*). Here's the fastai source for this callback:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class TerminateOnNaNCallback(Callback):\n",
" run_before=Recorder\n",
" def after_batch(self):\n",
" if torch.isinf(self.loss) or torch.isnan(self.loss):\n",
" raise CancelFitException"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The way it tells the training loop to interrupt training at this point is to `raise CancelFitException`. The training loop catches this exception and does not run any further training or validation. The callback control flow exceptions available are:\n",
"\n",
"- `CancelFitException`: Skip the rest of this batch and go to `after_batch\n",
"- `CancelEpochException`: Skip the rest of the training part of the epoch and go to `after_train\n",
"- `CancelTrainException`: Skip the rest of the validation part of the epoch and go to `after_validate\n",
"- `CancelValidException`: Skip the rest of this epoch and go to `after_epoch\n",
"- `CancelBatchException`: Interrupts training and go to `after_fit"
]
},
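{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, here is a hypothetical callback (the name and default value are ours, not fastai's) that uses `CancelFitException` to stop a run after a fixed number of training batches, which can be handy for quickly smoke-testing a training loop:\n",
"\n",
"```python\n",
"class ShortRunCallback(Callback):\n",
"    def __init__(self, max_batches=10): self.max_batches = max_batches\n",
"    def after_batch(self):\n",
"        # train_iter is added by TrainEvalCallback (see the attribute list above)\n",
"        if self.train_iter >= self.max_batches: raise CancelFitException()\n",
"```"
]
},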
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can detect one of those exceptions occurred and add code that executes right after with the following events:\n",
"\n",
"- `after_cancel_batch`: reached immediately after a `CancelBatchException` before proceeding to `after_batch`\n",
"- `after_cancel_train`: reached immediately after a `CancelTrainException` before proceeding to `after_epoch`\n",
"- `after_cancel_valid`: reached immediately after a `CancelValidException` before proceeding to `after_epoch`\n",
"- `after_cancel_epoch`: reached immediately after a `CancelEpochException` before proceeding to `after_epoch`\n",
"- `after_cancel_fit`: reached immediately after a `CancelFitException` before proceeding to `after_fit`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, callbacks need to be called in a particular order. In the case of `TerminateOnNaNCallback`, it's important that `Recorder` runs its `after_batch` after this callback, to avoid registering an NaN loss. You can specify `run_before` (this callback must run before ...) or `run_after` (this callback must run after ...) in your callback to ensure the ordering that you need."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have seen how to tweak the training loop of fastai to do anything we need, let's take a step back and dig a little bit deeper in the foundations of that training loop."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What are the four steps of a training loop?\n",
"1. Why is the use of callbacks better than writing a new training loop for each tweak you want to add?\n",
"1. What are the necessary points in the design of the fastai's callback system that make it as flexible as copying and pasting bits of code?\n",
"1. How can you get the list of events available to you when writing a callback?\n",
"1. Write the `ModelResetter` callback (without peeking).\n",
"1. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcut that goes with it?\n",
"1. How can a callback influence the control flow of the training loop.\n",
"1. Write the `TerminateOnNaN` callback (without peeking if possible).\n",
"1. How do you make sure your callback runs after or before another callback?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Look at the mixed precision callback with the documentation. Try to understand what each event and line of code does.\n",
"1. Implement your own version of ther learning rate finder from scratch. Compare it with fastai's version.\n",
"1. Look at the source code of the callbacks that ship with fastai. See if you can find one that's similar to what you're looking to do, to get some inspiration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Foundations of Deep Learning: Wrap up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations, you have made it to the end of the \"foundations of deep learning\" section. You now understand how all of fastai's applications and most important architectures are built, and the recommended ways to train them, and have all the information you need to build these from scratch. Whilst you probably won't need to create your own training loop, or batchnorm layer, for instance, knowing what is going on behind the scenes is very helpful for debugging, profiling, and deploying your solutions.\n",
"\n",
"Since you understand all of the foundations of fastai's applications now, be sure to spend some time digging through fastai's source notebooks, and running and experimenting with parts of them, since you can and see exactly how everything in fastai is developed.\n",
"\n",
"In the next section, we will be looking even further under the covers, to see how the actual forward and backward passes of a neural network are done, and we will see what tools are at our disposal to get better performance. We will then finish up with a project that brings together everything we have learned throughout the book, which we will use to build a method for interpreting convolutional neural networks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}

1929
19_learner.ipynb Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

88
20_conclusion.ipynb Normal file
View File

@ -0,0 +1,88 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"!pip install -Uqq fastbook\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[chapter_conclusion]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Concluding Thoughts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You've made it! If you have worked through all of the notebooks to this point, then you have joined the small, but growing group of people that are able to harness the power of deep learning to solve real problems. You may not feel that way yet—in fact you probably don't. We have seen again and again that students that complete the fast.ai courses dramatically underestimate how effective they are as deep learning practitioners. We've also seen that these people are often underestimated by others with a classic academic background. So if you are to rise above your own expectations and the expectations of others, what you do next, after closing this book, is even more important than what you've done to get to this point.\n",
"\n",
"The most important thing is to keep the momentum going. In fact, as you know from your study of optimizers, momentum is something that can build upon itself! So think about what you can do now to maintain and accelerate your deep learning journey. <<do_next>> can give you a few ideas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"What to do next\" width=\"550\" caption=\"What to do next\" id=\"do_next\" src=\"images/att_00053.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've talked a lot in this book about the value of writing, whether it be code or prose. But perhaps you haven't quite written as much as you had hoped so far. That's okay! Now is a great chance to turn that around. You have a lot to say, at this point. Perhaps you have tried some experiments on a dataset that other people don't seem to have looked at in quite the same way. Tell the world about it! Or perhaps thinking about trying out some ideas that occurred to you while you were reading—now is a great time to turn those ideas into code.\n",
"\n",
"If you'd like to share your ideas, one fairly low-key place to do so is the [fast.ai forums](https://forums.fast.ai/). You will find that the community there is very supportive and helpful, so please do drop by and let us know what you've been up to. Or see if you can answer a few questions for those folks who are earlier in their journey than you.\n",
"\n",
"And if you do have some successes, big or small, in your deep learning journey, be sure to let us know! It's especially helpful if you post about them on the forums, because learning about the successes of other students can be extremely motivating.\n",
"\n",
"Perhaps the most important approach for many people to stay connected with their learning journey is to build a community around it. For instance, you could try to set up a small deep learning meetup in your local neighborhood, or a study group, or even offer to do a talk at a local meetup about what you've learned so far or some particular aspect that interested you. It's okay that you are not the world's leading expert just yet—the important thing to remember is that you now know about plenty of stuff that other people don't, so they are very likely to appreciate your perspective.\n",
"\n",
"Another community event which many people find useful is a regular book club or paper reading club. You might find that there are some in your neighbourhood already, and if not you could try to get one started yourself. Even if there is just one other person doing it with you, it will help give you the support and encouragement to get going.\n",
"\n",
"If you are not in a geography where it's easy to get together with like-minded folks in person, drop by the forums, because there are always people starting up virtual study groups. These generally involve a bunch of folks getting together over video chat once a week or so to discuss some deep learning topic.\n",
"\n",
"Hopefully, by this point, you have a few little projects that you've put together and experiments that you've run. Our recommendation for the next step is to pick one of these and make it as good as you can. Really polish it up into the best piece of work that you can—something you are really proud of. This will force you to go much deeper into a topic, which will really test your understanding and give you the opportunity to see what you can do when you really put your mind to it.\n",
"\n",
"Also, you may want to take a look at the fast.ai free online course that covers the same material as this book. Sometimes, seeing the same material in two different ways can really help to crystallize the ideas. In fact, human learning researchers have found that one of the best ways to learn material is to see the same thing from different angles, described in different ways.\n",
"\n",
"Your final mission, should you choose to accept it, is to take this book and give it to somebody that you know—and get somebody else starte on their own deep learning journey!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,76 +0,0 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[chapter_conclusion]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Concluding thoughts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You've made it! If you have worked through all of the notebooks to this point, then you have joined a small, but growing group of people that are able to harness the power of deep learning to solve real problems. You may not feel that way; in fact you probably do not feel that way. We have seen again and again that students that complete the fast.AI courses dramatically underestimate how effective they are as deep learning practitioners. We've also seen that these people are often underestimated by those that have come out of a classic academic background. So for you to rise above your own expectations and the expectations of others what you do next, after closing this book, is even more important than what you've done to get to this point.\n",
"\n",
"The most important thing is to keep the momentum going. In fact, as you know from your study of optimisers, momentum is something which can build upon itself! So think about what it is you can do now to maintain and accelerate your deep learning journey. Here's a few ideas:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"What to do next\" width=\"550\" caption=\"What to do next\" id=\"do_next\" src=\"images/att_00053.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've talked a lot in this book about the value of writing, whether it be code or prose. But perhaps you haven't quite written as much as you had hoped so far. That's okay! Now is a great chance to turn that around. You have a lot to say, at this point. Perhaps you have tried some experiments on a dataset which other people don't seem to have looked at in quite the same way — so tell the world about it! Or perhaps you are just curious to try out some ideas that you had been thinking about why you are reading; now is a great chance to turn those ideas into code.\n",
"\n",
"One fairly low-key place for your writing is the fast.ai forums at forums.fast.ai. You will find that the community there is very supportive and helpful, so please do drop by and let us know what you've been up to. Or see if you can answer a few questions for those folks who are earlier in their journey then you.\n",
"\n",
"And if you do have some success, big or small, in your deep learning journey, be sure to let us know! It's especially helpful if you post about it on the forums, because for others to learn about the successes of other students can be extremely motivating.\n",
"\n",
"Perhaps the most important approach for many people to stay connected with their learning journey is to build a community around it. For instance, you could try to set up a small deep learning Meetup in your local neighbourhood, or a study group, or even offer to do a talk at a local meet up about what you've learned so far, or some particular aspect that interested you. It is okay that you are not the world's leading expert just yet the important thing to remember is that you now know about plenty of stuff that other people don't, so they are very likely to appreciate your perspective.\n",
"\n",
"Another community event which many people find useful is a regular book club or paper reading club. You might find that there are some in your neighbourhood already, or otherwise you could try to get one started yourself. Even if there is just one other person doing it with you, it will help give you the support and encouragement to get going.\n",
"\n",
"If you are not in a geography where it's easy to get together with like-minded folks in person, drop by the forums, because there are lots of people always starting up virtual study groups. These generally involve a bunch of people getting together over video chat once every week or so, and discussing some deep learning topic.\n",
"\n",
"Hopefully, by this point, you have a few little projects that you put together, and experiments that you've run. Our recommendation is generally to pick one of these and make it as good as you can. Really polish it up into the best piece of work that you can — something you really proud of. This will force you to go much deeper into a topic, which will really test out your understanding, and give you the opportunity to see what you can do when you really put your mind to it.\n",
"\n",
"Also, you may want to take a look at the fast.AI free online course which covers the same material as this book. Sometimes, seeing the same material in two different ways, can really help to crystallise the ideas. In fact, human learning researchers have found that this is one of the best ways to learn material — to see the same thing from different angles, described in different ways.\n",
"\n",
"Your final mission, should you choose to accept it, is to take this book, and give it to somebody that you know — and let somebody else start their way down their own deep learning journey!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -1,3 +1,5 @@
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/fastai/fastbook/master)
# The fastai book - draft
These draft notebooks cover an introduction to deep learning, [fastai](https://docs.fast.ai/), and [PyTorch](https://pytorch.org/). fastai is a layered API for deep learning; for more information, see [the fastai paper](https://www.mdpi.com/2078-2489/11/2/108). Everything in this repo is copyright Jeremy Howard and Sylvain Gugger, 2020 onwards.
@ -12,4 +14,4 @@ If you see someone hosting a copy of these materials somewhere else, please let
This is an early draft. If you get stuck running notebooks, please search the [fastai-v2 forum](https://forums.fast.ai/c/fastai-users/fastai-v2) for answers, and ask for help there if needed. Please don't use GitHub issues for problems running the notebooks.
If you make any pull requests to this repo, then you are assigning copyright of that work to Jeremy Howard and Sylvain Gugger.
If you make any pull requests to this repo, then you are assigning copyright of that work to Jeremy Howard and Sylvain Gugger. (Additionally, if you are making small edits to spelling or text, please specify the name of the file and very brief description of what you're fixing. It's becoming increasingly difficult for reviewers to know which corrections have already been made. Thank you.)

View File

@ -1,29 +1,51 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"!pip install -Uqq fastbook\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[appendix_blog]]\n",
"[appendix]\n",
"[role=\"Creating a blog\"]"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from utils import *\n",
"from fastai2.vision.widgets import *"
"from fastbook import *\n",
"from fastai.vision.widgets import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating a blog"
"# Creating a Blog"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, when it comes to blogging, it seems like you have to make a difficult decision: either use a platform that makes it easy but subjects you and your readers to advertisements, paywalls, and fees, or spend hours setting up your own hosting service and weeks learning about all kinds of intricate details. Perhaps the biggest benefit to the \"do-it-yourself\" approach is that you really own your own posts, rather than being at the whim of a service provider and their decisions about how to monetize your content in the future.\n",
"\n",
"It turns out, however, that you can have the best of both worlds! "
]
},
{
@ -37,244 +59,198 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, when it comes to blogging, it seems like you have to make a decision: either use a platform that makes it easy, but subjects you and your readers to advertisements, pay walls, and fees, or spend hours setting up your own hosting and weeks learning about all kinds of intricate details. Perhaps the biggest benefit to the \"do-it-yourself\" approach is that you really owning your own posts, rather than being at the whim of a service provider, and their decisions about how to monetize your content in the future.\n",
"A great solution is to host your blog on a platform called [GitHub Pages](https://pages.github.com/), which is free, has no ads or pay wall, and makes your data available in a standard way such that you can at any time move your blog to another host. But all the approaches Ive seen to using GitHub Pages have required knowledge of the command line and arcane tools that only software developers are likely to be familiar with. For instance, GitHub's [own documentation](https://help.github.com/en/github/working-with-github-pages/creating-a-github-pages-site-with-jekyll) on setting up a blog includes a long list of instructions that involve installing the Ruby programming language, using the `git` command-line tool, copying over version numbers, and more—17 steps in total!\n",
"\n",
"It turns out, however, that you can have the best of both worlds! You can host on a platform called [GitHub Pages](https://pages.github.com/), which is free, has no ads or pay wall, and makes your data available in a standard way such that you can at any time move your blog to another host. But all the approaches Ive seen to using GitHub Pages have required knowledge of the command line and arcane tools that only software developers are likely to be familiar with. For instance, GitHub's [own documentation](https://help.github.com/en/github/working-with-github-pages/creating-a-github-pages-site-with-jekyll) on setting up a blog requires installing the Ruby programming language, using the git command line tool, copying over version numbers, and more. 17 steps in total!\n",
"\n",
"Weve curated an easy approach, which allows you to use an **entirely browser-based interface** for all your blogging needs. You will be up and running with your new blog within about five minutes. It doesnt cost anything, and you can easily add your own custom domain to it if you wish to. Heres how to do it, using a template we've created called **fast\\_template**. (NB: be sure to check the [book website](https://book.fast.ai) for the latest blog recommendations, since new tools are always coming out; for instance, we're currently working with GitHub on creating a new tool called \"fastpages\" which is a more advanced version of `fast_template` that's particularly designed for people using Jupyter Notebooks)."
"To cut down the hassle, weve created an easy approach that allows you to use an *entirely browser-based interface* for all your blogging needs. You will be up and running with your new blog within about five minutes. It doesnt cost anything, and you can easily add your own custom domain to it if you wish to. In this section, we'll explain how to do it, using a template we've created called *fast\\_template*. (NB: be sure to check the [book's website](https://book.fast.ai) for the latest blog recommendations, since new tools are always coming out)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating the repository"
"### Creating the Repository"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Youll need an account on GitHub. So, head over there now, and create an account if you dont have one already. Make sure that you are logged in. Normally, GitHub is used by software developers for writing code, and they use a sophisticated command line tool to work with it. But I'm going to show you an approach that doesn't use the command line at all!\n",
"Youll need an account on GitHub, so head over there now and create an account if you dont have one already. Make sure that you are logged in. Normally, GitHub is used by software developers for writing code, and they use a sophisticated command-line tool to work with it—but we're going to show you an approach that doesn't use the command line at all!\n",
"\n",
"To get started, click on this link: [https://github.com/fastai/fast_template/generate](https://github.com/fastai/fast_template/generate) . This will allow you to create a place to store your blog, called a \"*repository*\". You will see the following screen; you have to enter your repository name using the **exact form you see below**, that is, the username you used at GitHub followed by `.github.io`.\n",
"To get started, point your browser to [https://github.com/fastai/fast_template/generate](https://github.com/fastai/fast_template/generate) (you need to be logged in to GitHub for the link to work). This will allow you to create a place to store your blog, called a *repository*. You will a screen like the one in <<githup_repo>>. Note that you have to enter your repository name using the *exact* format shown here—that is, your GitHub username followed by `.github.io`.\n",
"\n",
"<img width=\"440\" src=\"images/fast_template/image1.png\" id=\"githup_repo\" caption=\"Creating your repository\" alt=\"Screebshot of the GitHub page for creating a new repository\">\n",
"\n",
"> Important: Note that if you don't use username.github.io as the name, it won't work!\n",
"Once youve entered that, and any description you like, click \"Create repository from template.\" You have the choice to make the repository \"private,\" but since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won't be a problem for you.\n",
"\n",
"Once youve entered that, and any description you like, click on \"create repository from template\". You have the choice to make the repository \"private\" but since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won't be a problem for you."
"Now, let's set up your home page!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting up your homepage"
"### Setting Up Your Home Page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When readers first arrive at your blog the first thing that they will see is the content of a file called \"index.md\". This is a [markdown](https://guides.github.com/features/mastering-markdown/) file. Markdown is a powerful yet simple way of creating formatted text, such as bullet points, italics, hyperlinks, and so forth. It is very widely used, including all the formatting in Jupyter notebooks, nearly every part of the GitHub site, and many other places all over the Internet. To create markdown text, you can just type in plain regular English. But then you can add some special characters to add special behavior. For instance, if you type a `*` character around a word or phrase then that will put it in *italics*. Lets try it now.\n",
"When readers arrive at your blog the first thing that they will see is the content of a file called *index.md*. This is a [markdown](https://guides.github.com/features/mastering-markdown/) file. Markdown is a powerful yet simple way of creating formatted text, such as bullet points, italics, hyperlinks, and so forth. It is very widely used, including for all the formatting in Jupyter notebooks, nearly every part of the GitHub site, and many other places all over the internet. To create markdown text, you can just type in plain English, then add some special characters to add special behavior. For instance, if you type a `*` character before and after a word or phrase, that will put it in *italics*. Lets try it now.\n",
"\n",
"To open the file, click its file name in GitHub.\n",
"To open the file, click its filename in GitHub. To edit it, click on the pencil icon at the far right hand side of the screen as shown in <<fastpage_edit>>.\n",
"\n",
"<img width=\"140\" src=\"images/fast_template/image2.png\" alt=\"Screenshot showing a click on the index.md file\">\n",
"\n",
"To edit it, click on the pencil icon at the far right hand side of the\n",
"screen.\n",
"\n",
"<img width=\"800\" src=\"images/fast_template/image3.png\" alt=\"Screenshot showing where to click to edit the file\">"
"<img width=\"800\" src=\"images/fast_template/image3.png\" alt=\"Screenshot showing where to click to edit the file\" id=\"fastpage_edit\" caption=\"Click Edit this file\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add, edit, or replace the texts that you see. Click on the\n",
"\"preview changes\" button to see how well your markdown text will look\n",
"on your blog. Lines that you have added or changed will appear with a\n",
"green bar on the left-hand side.\n",
"You can add to, edit, or replace the texts that you see. Click \"Preview changes\" (<<fastpage_preview>>) to see what your markdown text will look like in your blog. Lines that you have added or changed will appear with a green bar on the lefthand side.\n",
"\n",
"<img width=\"350\" src=\"images/fast_template/image4.png\" alt=\"Screenshot showing where to click to preview changes\">\n",
"<img width=\"350\" src=\"images/fast_template/image4.png\" alt=\"Screenshot showing where to click to preview changes\" id=\"fastpage_preview\" caption=\"Preview changes to catch any mistake\">\n",
"\n",
"To save your changes to your blog, you must scroll to the bottom and\n",
"click on the \"commit changes\" green button. On GitHub, to \"commit\"\n",
"something means to save it to the GitHub server.\n",
"To save your changes, scroll to the bottom of the page and click \"Commit changes,\" as shown in <<fastpage_commit>>. On GitHub, to \"commit\" something means to save it to the GitHub server.\n",
"\n",
"<img width=\"600\" src=\"images/fast_template/image5.png\" alt=\"Screenshot showing where to click to commit the changes\">"
"<img width=\"600\" src=\"images/fast_template/image5.png\" alt=\"Screenshot showing where to click to commit the changes\" id=\"fastpage_commit\" caption=\"Commit your changes to save them\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you should configure your blogs settings. To do so, click on the\n",
"file called \"\\_config.yml\", and then click on the edit button like you\n",
"did for the index file above. Change the title, description, and GitHub\n",
"username values. You need to leave the names before the colons in place\n",
"and type your new values in after the colon and space on each line. You\n",
"can also add to your email and Twitter username if you wish — but note\n",
"that these will appear on your public blog if you do fill them in here.\n",
"Next, you should configure your blogs settings. To do so, click on the file called *\\_config.yml*, then click the edit button like you did for the index file. Change the title, description, and GitHub username values (see <<github_config>>. You need to leave the names before the colons in place, and type your new values in after the colon (and a space) on each line. You can also add to your email address and Twitter username if you wish, but note that these will appear on your public blog if you fill them in here.\n",
"\n",
"<img width=\"800\" src=\"images/fast_template/image6.png\" id=\"github_config\" caption=\"Fill the config file\" alt=\"Screenshot showing the config file and how to fill it\">\n",
"<img width=\"800\" src=\"images/fast_template/image6.png\" id=\"github_config\" caption=\"Fill in the config file\" alt=\"Screenshot showing the config file and how to fill it in\">\n",
"\n",
"After youre done, commit your changes just like you did with the index\n",
"file before. Then wait about a minute, whilst GitHub processes your new\n",
"blog. Then you will be able to go to your blog in your web browser, by\n",
"opening the URL: username.github.io (replace \"username\" with your\n",
"GitHub username). You should see your blog!\n",
"After youre done, commit your changes just like you did with the index file, then wait a minute or so while GitHub processes your new blog. Point your web browser to *&lt;username> .github.io* (replacing *&lt;username>* with your GitHub username). You should see your blog, which will look something like <<github_blog>>.\n",
"\n",
"<img width=\"540\" src=\"images/fast_template/image7.png\" id=\"github_blog\" caption=\"Your blog is onlyine!\" alt=\"Screenshot showing the website username.github.io\">"
"<img width=\"540\" src=\"images/fast_template/image7.png\" id=\"github_blog\" caption=\"Your blog is online!\" alt=\"Screenshot showing the website username.github.io\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"metadata": {},
"source": [
"### Creating posts"
"### Creating Posts"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"Now youre ready to create your first post. All your posts will go in\n",
"the \"\\_posts\" folder. Click on that now, and then click on the \"create\n",
"file\" button. You need to be careful to name your file in the following\n",
"format: \"year-month-day-name.md\", where year is a four-digit number, and\n",
"month and day are two-digit numbers. \"Name\" can be anything you want,\n",
"that will help you remember what this post was about. The \"md\" extension\n",
"is for markdown documents.\n",
"Now youre ready to create your first post. All your posts will go in the *\\_posts* folder. Click on that now, and then click the \"Create file\" button. You need to be careful to name your file using the format *&lt;year>-&lt;month>-&lt;day>-&lt;name>.md*, as shwon in <<fastpage_name>>, where *&lt;year>* is a four-digit number, and *&lt;month>* and *&lt;day>* are two-digit numbers. *&lt;name>* can be anything you want that will help you remember what this post was about. The *.md* extension is for markdown documents.\n",
"\n",
"<img width=\"440\" src=\"images/fast_template/image8.png\" alt=\"Screenshot showing the right syntax to create a new blog post\">\n",
"<img width=\"440\" src=\"images/fast_template/image8.png\" alt=\"Screenshot showing the right syntax to create a new blog post\" id=\"fastpage_name\" caption=\"Naming your posts\">\n",
"\n",
"You can then type the contents of your first post. The only rule is that\n",
"the first line of your post must be a markdown heading. This is created\n",
"by putting `# ` at the start of a line (that creates a level 1\n",
"heading, which you should just use once at the start of your document;\n",
"you create level 2 headings using `## `, level 3 with `###`, and so forth.)\n",
"You can then type the contents of your first post. The only rule is that the first line of your post must be a markdown heading. This is created by putting `# ` at the start of a line, as shown in <<fastpage_title>> (that creates a level-1 heading, which you should just use once at the start of your document; you can create level-2 headings using `## `, level 3 with `###`, and so forth).\n",
"\n",
"<img width=\"300\" src=\"images/fast_template/image9.png\" alt=\"Screenshot showing the start of a blog post\">"
"<img width=\"300\" src=\"images/fast_template/image9.png\" alt=\"Screenshot showing the start of a blog post\" id=\"fastpage_title\" caption=\"Markdown syntax for a title\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"As before, you can click on the \"preview\" button to see how your\n",
"markdown formatting will look.\n",
"As before, you can click the \"Preview\" button to see how your markdown formatting will look (<<fastpage_preview1>>).\n",
"\n",
"<img width=\"400\" src=\"images/fast_template/image10.png\" alt=\"Screenshot showing the same blog post interpreted in HTML\">\n",
"<img width=\"400\" src=\"images/fast_template/image10.png\" alt=\"Screenshot showing the same blog post interpreted in HTML\" id=\"fastpage_preview1\" caption=\"What the previous mardown syntax will look like on your blog\">\n",
"\n",
"And you will need to click the \"commit new file\" button to save it to\n",
"GitHub.\n",
"And you will need to click the \"Commit new file\" button to save it to GitHub, as shown in <<fastpage_commit1>>.\n",
"\n",
"<img width=\"700\" src=\"images/fast_template/image11.png\" alt=\"Screenshot showing where to click to commit the new file\">"
"<img width=\"700\" src=\"images/fast_template/image11.png\" alt=\"Screenshot showing where to click to commit the new file\" id=\"fastpage_commit1\" caption=\"Commit your changes to save them\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"Have a look at your blog homepage again, and you will see that this post\n",
"has now appeared! (Remember that you will need to wait a minute or so\n",
"for GitHub to process it.)\n",
"Have a look at your blog home page again, and you will see that this post has now appeared--<<fastpage_live>> shows the result with the sample pose we just added. (Remember that you will need to wait a minute or so for GitHub to process the request before the file shows up.)\n",
"\n",
"<img width=\"500\" src=\"images/fast_template/image12.png\" alt=\"Screenshot showing the first post on the blog website\">\n",
"<img width=\"500\" src=\"images/fast_template/image12.png\" alt=\"Screenshot showing the first post on the blog website\" id=\"fastpage_live\" caption=\"Your first post is live!\">\n",
"\n",
"Youll also see that we provided a sample blog post, which you can go\n",
"ahead and delete now. Go to your posts folder, as before, and click on\n",
"\"2020-01-14-welcome.md\". Then click on the trash icon on the far\n",
"right.\n",
"You may have noticed that we provided a sample blog post, which you can go ahead and delete now. Go to your *\\_posts* folder, as before, and click on *2020-01-14-welcome.md*. Then click the trash icon on the far right, as shown in <<fastpage_delete>>.\n",
"\n",
"<img width=\"400\" src=\"images/fast_template/image13.png\" alt=\"Screenshot showing how to delete the mock post\">"
"<img width=\"400\" src=\"images/fast_template/image13.png\" alt=\"Screenshot showing how to delete the mock post\" id=\"fastpage_delete\" caption=\"Delete the sample blog post\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"In GitHub, nothing actually changes until you commit— including deleting\n",
"a file! So, after you click the trash icon, scroll down to the bottom\n",
"and commit your changes.\n",
"In GitHub, nothing actually changes until you commit—including when you delete a file! So, after you click the trash icon, scroll down to the bottom of the page and commit your changes.\n",
"\n",
"You can include images in your posts by adding a line of markdown like\n",
"the following:\n",
"\n",
" ![Image description](images/filename.jpg)\n",
"\n",
"For this to work, you will need to put the image inside your \"images\"\n",
"folder. To do this, click on the images folder to go into it in GitHub,\n",
"and then click the \"upload files\" button.\n",
"For this to work, you will need to put the image inside your *images* folder. To do this, click the *images* folder, them click \"Upload files\" button (<<fastpage_upload>>).\n",
"\n",
"<img width=\"400\" src=\"images/fast_template/image14.png\" alt=\"Screenshot showing how to upload new files\">"
"<img width=\"400\" src=\"images/fast_template/image14.png\" alt=\"Screenshot showing how to upload new files\" id=\"fastpage_upload\" caption=\"Upload a file from your computer\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Synchronizing GitHub and your computer"
"Now let's see how to do all of this directly from your computer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Theres lots of reasons you might want to copy your blog content from GitHub to your computer. Perhaps you want to read or edit your posts offline. Or maybe youd like a backup in case something happens to your GitHub repository.\n",
"\n",
"GitHub does more than just let you copy your repository to your computer; it lets you *synchronize* it with your computer. So, you can make changes on GitHub, and theyll copy over to your computer, and you can make changes on your computer, and theyll copy over to GitHub. You can even let other people access and modify your blog, and their changes and your changes will be automatically combined together next time you sync.\n",
"\n",
"To make this work, you have to install an application called [GitHub Desktop](https://desktop.github.com/) to your computer. It runs on Mac, Windows, and Linux. Follow the directions at the link to install it, then when you run it itll ask you to login to GitHub, and then to select your repository to sync; click \"Clone a repository from the Internet\".\n",
"\n",
"<img src=\"images/gitblog/image1.png\" width=\"400\" alt=\"A screenshot showing how to clone your repository\">"
"### Synchronizing GitHub and Your Computer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once GitHub has finished syncing your repo, youll be able to click \"View the files of your repository in Finder\" (or Explorer), and youll see the local copy of your blog! Try editing one of the files on your computer. Then return to GitHub Desktop, and youll see the \"Sync\" button is waiting for you to press it. When you click it, your changes will be copied over to GitHub, where youll see them reflected on the web site.\n",
"There are lots of reasons you might want to copy your blog content from GitHub to your computer--you might want to be able to read or edit your posts offline, or maybe youd like a backup in case something happens to your GitHub repository.\n",
"\n",
"<img src=\"images/gitblog/image2.png\" width=\"600\" alt=\"A screenshot showing the cloned repository\">"
"GitHub does more than just let you copy your repository to your computer; it lets you *synchronize* it with your computer. That means you can make changes on GitHub, and theyll copy over to your computer, and you can make changes on your computer, and theyll copy over to GitHub. You can even let other people access and modify your blog, and their changes and your changes will be automatically combined together the next time you sync.\n",
"\n",
"To make this work, you have to install an application called [GitHub Desktop](https://desktop.github.com/) on your computer. It runs on Mac, Windows, and Linux. Follow the directions to install it, and when you run it itll ask you to log in to GitHub and select the repository to sync. Click \"Clone a repository from the Internet,\" as shown in <<fastpage_clone>>.\n",
"\n",
"<img src=\"images/gitblog/image1.png\" width=\"400\" alt=\"A screenshot showing how to clone your repository\" id=\"fastpage_clone\" caption=\"Clone your repository on GitHub Desktop\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you haven't used git before, GitHub Desktop and a blog is a great way to get started. As you'll discover, it's a fundamental tool used by most data scientists."
"Once GitHub has finished syncing your repo, youll be able to click \"View the files of your repository in Explorer\" (or Finder), as shown in <<fastpage_explorer>> and youll see the local copy of your blog! Try editing one of the files on your computer. Then return to GitHub Desktop, and youll see the \"Sync\" button is waiting for you to press it. When you click it your changes will be copied over to GitHub, where youll see them reflected on the website.\n",
"\n",
"<img src=\"images/gitblog/image2.png\" width=\"600\" alt=\"A screenshot showing the cloned repository\" id=\"fastpage_explorer\" caption=\"Viewing your files locally\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jupyter for blogging"
"If you haven't used `git` before, GitHub Desktop is a great way to get started. As you'll discover, it's a fundamental tool used by most data scientists. Another tool that we hope you now love is Jupyter Notebooks--and there's a way to write your blog directly with that too!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write blog posts using Jupyter Notebooks! Your markdown cells, code cells, and all outputs will appear in your exported blog post. The best way to do this may have changed by the time you are reading this book, so be sure to check out the [book website](https://book.fast.ai) for the latest information. As we write this, the easiest way to create a blog from notebooks is to use [fastpages](http://fastpages.fast.ai/), which is a more advanced version of `fast_template`. \n",
"## Jupyter for Blogging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write blog posts using Jupyter notebooks. Your markdown cells, code cells, and all the outputs will appear in your exported blog post. The best way to do this may have changed by the time you are reading this book, so be sure to check out the [book's website](https://book.fast.ai) for the latest information. As we write this, the easiest way to create a blog from notebooks is to use [`fastpages`](http://fastpages.fast.ai/), which is a more advanced version of `fast_template`. \n",
"\n",
"To blog with a notebook, just pop it in the `_notebooks` folder in your blog repo, and it will appear in your blog. When you write your notebook, write whatever you want your audience to see. Since most writing platforms make it much harder to include code and outputs, many of us are in a habit of including less real examples than we should. So try to get into a new habit of including lots of examples as you write.\n",
"To blog with a notebook, just pop it in the *\\_notebooks* folder in your blog repo, and it will appear in your list of blog posts. When you write your notebook, write whatever you want your audience to see. Since most writing platforms make it hard to include code and outputs, many of us are in the habit of including fewer real examples than we should. This is a great way to instead get into the habit of including lots of examples as you write.\n",
"\n",
"Often you'll want to hide boilerplate such as import statements. Add `#hide` to the top of any cell to make it not show up in output. Jupyter displays the result of the last line of a cell, so there's no need to include `print()`. (And including extra code that isn't needed means there's more cognitive overhead for the reader; so don't include code that you don't really need!)"
"Often, you'll want to hide boilerplate such as import statements. You can add `#hide` to the top of any cell to make it not show up in output. Jupyter displays the result of the last line of a cell, so there's no need to include `print()`. (Including extra code that isn't needed means there's more cognitive overhead for the reader; so don't include code that you don't really need!)"
]
},
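{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, the top of a hidden setup cell might look something like this (the specific imports are just an illustration; use whatever your post actually needs):\n",
"\n",
"```python\n",
"#hide\n",
"# boilerplate the reader doesn't need to see in the published post\n",
"from fastai.vision.all import *\n",
"import matplotlib.pyplot as plt\n",
"```"
]
},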
{
@ -293,31 +269,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@ -1,5 +1,17 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"!pip install -Uqq fastbook\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "raw",
"metadata": {},
@ -12,16 +24,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Appendix: Jupyter notebook 101"
"# Appendix: Jupyter Notebook 101"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can read this tutorial in the book, but we strongly suggest reading it in a (yes, you guessed it) Jupyter Notebook. This way, you will be able to actually *try* the different commands we will introduce here. If you followed one of our tutorial in the previous section, you should have been left in the course folder. Just click on `nbs` then `dl1` and you should find the tutorial named `00_notebook_tutorial`. Click on it to open a new tab and you'll be ready to go.\n",
"You can read this tutorial in the book, but we strongly suggest reading it in a (yes, you guessed it) Jupyter Notebook. This way, you will be able to actually *try* the different commands we will introduce here. If you followed one of our tutorials in the previous section, you should have been left in the course folder. Just click on `nbs` then `dl1` and you should find the tutorial named `00_notebook_tutorial`. Click on it to open a new tab and you'll be ready to go.\n",
"\n",
"If you are on your personal machine, clone the course repository with and navigate inside before following the same steps.\n"
"If you are on your personal machine, clone the course repository and navigate inside before following the same steps.\n"
]
},
{
@ -35,7 +47,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's build up from the basics, what is a Jupyter Notebook? Well, we wrote this book using jupyter notebooks. It is a document made of cells. You can write in some of them (markdown cells) or your can perform calculations in Python (code cells) and run them like this:"
"Let's build up from the basics: what is a Jupyter Notebook? Well, we wrote this book using Jupyter Notebooks. A notebook is a document made of cells. You can write in some of them (markdown cells) or you can perform calculations in Python (code cells) and run them like this:"
]
},
{
@ -62,9 +74,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Cool huh? This combination of prose and code makes Jupyter Notebook ideal for experimentation: we can see the rationale for each experiment, the code, and the results in one comprehensive document. \n",
"Cool, huh? This combination of prose and code makes Jupyter Notebook ideal for experimentation: we can see the rationale for each experiment, the code, and the results in one comprehensive document. \n",
"\n",
"Other renowned institutions in academy and industry use Jupyter Notebook, including Google, Microsoft, IBM, Bloomberg, Berkeley and NASA among others. Even Nobel-winning economists [use Jupyter Notebooks](https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/) for their experiments and some suggest that Jupyter Notebooks will be the [new format for research papers](https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/).\n"
"Other renowned institutions in academia and industry use Jupyter Notebook, including Google, Microsoft, IBM, Bloomberg, Berkeley and NASA among others. Even Nobel-winning economists [use Jupyter Notebooks](https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/) for their experiments and some suggest that Jupyter Notebooks will be the [new format for research papers](https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/).\n"
]
},
{
@ -138,7 +150,7 @@
"\n",
"- Command Mode:: Allows you to edit the notebook as a whole and use keyboard shortcuts but not edit a cell's content. \n",
"\n",
"You can toggle between these two by either pressing <kbd>ESC</kbd> and <kbd>Enter</kbd> or clicking outside a cell or inside it (you need to double click if its a Markdown cell). You can always tell which mode you're on: the current cell will have a green border if in **Edit Mode** and a blue border in **Command Mode**. Try it!\n"
"You can toggle between these two by either pressing <kbd>ESC</kbd> and <kbd>Enter</kbd> or clicking outside a cell or inside it (you need to double click if it's a Markdown cell). You can always tell which mode you're on: the current cell will have a green border in **Edit Mode** and a blue border in **Command Mode**. Try it!\n"
]
},
{
@ -156,13 +168,13 @@
"\n",
"![Save](images/chapter1_save.png)\n",
"\n",
"To know if your *kernel* (the python engine executing your instructions behind the scene) is computing or not you can check the dot in your upper right corner. If the dot is full, it means that the kernel is working. If not, it is idle. You can place the mouse on it and the state of the kernel will be displayed.\n",
"To know if your *kernel* (the Python engine executing your instructions behind the scenes) is computing or not, you can check the dot in your upper right corner. If the dot is full, it means that the kernel is working. If not, it is idle. You can place the mouse on it and the state of the kernel will be displayed.\n",
"\n",
"![Busy](images/chapter1_busy.png)\n",
"\n",
"There are a couple of shortcuts you must know about which we use **all** the time (always in **Command Mode**). These are:\n",
"\n",
" - Shift+Enter:: Runs the code or markdown on a cell\n",
" - Shift+Enter:: Run the code or markdown on a cell\n",
" \n",
" - Up Arrow+Down Arrow:: Toggle across cells\n",
" \n",
@ -172,7 +184,7 @@
"\n",
"You can find more shortcuts by typing <kbd>h</kbd> (for help).\n",
"\n",
"You may need to use a terminal in a Jupyter Notebook environment (for example to git pull on a repository). That is very easy to do, just press 'New' in your Home directory and 'Terminal'. Don't know how to use the Terminal? We made a tutorial for that as well. You can find it [here](http://course-v3.fast.ai/terminal_tutorial.html).\n",
"You may need to use a terminal in a Jupyter Notebook environment (for example to git pull on a repository). That is very easy to do: just press 'New' in your Home directory and 'Terminal'. Don't know how to use the Terminal? We made a tutorial for that as well. You can find it [here](https://course.fast.ai/terminal_tutorial.html).\n",
"\n",
"![Terminal](images/chapter1_terminal.png)"
]
@ -188,7 +200,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Markdown formatting\n"
"## Markdown Formatting\n"
]
},
{
@ -289,7 +301,7 @@
"outputs": [],
"source": [
"# Import necessary libraries\n",
"from fastai2.vision.all import * \n",
"from fastai.vision.all import * \n",
"import matplotlib.pyplot as plt"
]
},
@ -353,7 +365,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also print images while experimenting. I am watching you."
"We can also print images while experimenting."
]
},
{
@ -381,7 +393,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the app locally"
"## Running the App Locally"
]
},
{
@ -399,14 +411,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a notebook"
"## Creating a Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have your own jupyter notebook running, you will probably want to write your owns. Click on 'New' in the upper left corner and 'Python 3' in the drop-down list (we are going to use a [Python kernel](https://github.com/ipython/ipython) for all our experiments).\n",
"Now that you have your own Jupyter Notebook server running, you will probably want to write your own notebook. Click on 'New' in the upper left corner and 'Python 3' in the drop-down list (we are going to use a [Python kernel](https://github.com/ipython/ipython) for all our experiments).\n",
"\n",
"![new_notebook](images/chapter1_new_notebook.png)\n"
]
@ -415,7 +427,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Shortcuts and tricks"
"## Shortcuts and Tricks"
]
},
{
@ -436,23 +448,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a couple of useful keyboard shortcuts in `Command Mode` that you can leverage to make Jupyter Notebook faster to use. Remember that to switch back and forth between `Command Mode` and `Edit Mode` with <kbd>Esc</kbd> and <kbd>Enter</kbd>.\n",
"There are a couple of useful keyboard shortcuts in `Command Mode` that you can leverage to make Jupyter Notebook faster to use. Remember that you can switch back and forth between `Command Mode` and `Edit Mode` with <kbd>Esc</kbd> and <kbd>Enter</kbd>.\n",
"\n",
"- m:: Convert cell to Markdown\n",
"\n",
"- y:: Convert cell to Code\n",
"\n",
"- D+D:: Delete cell\n",
"- d+d:: Delete cell\n",
"\n",
"- o:: Toggle between hide or show output\n",
"\n",
"- Shift+Arrow up/Arrow down:: Selects multiple cells. Once you have selected them you can operate on them like a batch (run, copy, paste etc).\n",
"- Shift+Arrow up/Arrow down:: Select multiple cells. Once you have selected them you can operate on them like a batch (run, copy, paste etc).\n",
"\n",
"- Shift+M:: Merge selected cells.\n",
"- Shift+M:: Merge selected cells\n",
"\n",
"- Shift+Tab (press once):: Tells you which parameters to pass on a function \n",
"- Shift+Tab (press once):: See which parameters to pass to a function \n",
"\n",
"- Shift+Tab (press three times):: Gives additional information on the method\n"
"- Shift+Tab (press three times):: Get additional information on the method\n"
]
},
{
@ -489,15 +501,15 @@
"source": [
"Line magics are functions that you can run on cells. They should be at the beginning of a line and take as an argument the rest of the line from where they are called. You call them by placing a '%' sign before the command. The most useful ones are:\n",
"\n",
"- `%matplotlib inline`:: This command ensures that all matplotlib plots will be plotted in the output cell within the notebook and will be kept in the notebook when saved.\n",
"- `%matplotlib inline`:: Ensures that all matplotlib plots will be plotted in the output cell within the notebook and will be kept in the notebook when saved.\n",
"\n",
"This command is always called together at the beggining of every notebook of the fast.ai course.\n",
"This command is always called together at the beginning of every notebook of the fast.ai course.\n",
"\n",
"``` python\n",
"%matplotlib inline\n",
"```\n",
"\n",
"- `%timeit`:: Runs a line a ten thousand times and displays the average time it took to run it."
"- `%timeit`:: Runs a line ten thousand times and displays the average time it took to run."
]
},
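{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance (the line being timed here is just an arbitrary example):\n",
"\n",
"```python\n",
"%timeit [i**2 for i in range(1000)]\n",
"```"
]
},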
{
@ -521,43 +533,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`%debug`: Allows to inspect a function which is showing an error using the [Python debugger](https://docs.python.org/3/library/pdb.html). If you type this in a cell just after an error, you will be directed in a console where you can inspect the values of all the variables.\n"
"`%debug`: Inspects a function which is showing an error using the [Python debugger](https://docs.python.org/3/library/pdb.html). If you type this in a cell just after an error, you will be directed to a console where you can inspect the values of all the variables.\n"
]
}
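,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, you could run a toy cell like this one (any cell that raises an error will do):\n",
"\n",
"```python\n",
"def divide(a, b): return a / b\n",
"\n",
"divide(1, 0)  # raises ZeroDivisionError\n",
"```\n",
"\n",
"and then, in the next cell, run:\n",
"\n",
"```python\n",
"%debug\n",
"```"
]
}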
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

1576
clean/01_intro.ipynb Normal file

File diff suppressed because one or more lines are too long

1094
clean/02_production.ipynb Normal file

File diff suppressed because one or more lines are too long

311
clean/03_ethics.ipynb Normal file
View File

@ -0,0 +1,311 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Ethics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sidebar: Acknowledgement: Dr. Rachel Thomas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End sidebar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Examples for Data Ethics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bugs and Recourse: Buggy Algorithm Used for Healthcare Benefits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feedback Loops: YouTube's Recommendation System"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bias: Professor Lantanya Sweeney \"Arrested\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why Does This Matter?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Integrating Machine Learning with Product Design"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Topics in Data Ethics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recourse and Accountability"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feedback Loops"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Historical bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Measurement bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Aggregation bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Representation bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Addressing different types of bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Disinformation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Identifying and Addressing Ethical Issues"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze a Project You Are Working On"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Processes to Implement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ethical lenses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Power of Diversity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fairness, Accountability, and Transparency"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Role of Policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Effectiveness of Regulation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rights and Policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cars: A Historical Precedent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Does ethics provide a list of \"right answers\"?\n",
"1. How can working with people of different backgrounds help when considering ethical questions?\n",
"1. What was the role of IBM in Nazi Germany? Why did the company participate as it did? Why did the workers participate?\n",
"1. What was the role of the first person jailed in the Volkswagen diesel scandal?\n",
"1. What was the problem with a database of suspected gang members maintained by California law enforcement officials?\n",
"1. Why did YouTube's recommendation algorithm recommend videos of partially clothed children to pedophiles, even though no employee at Google had programmed this feature?\n",
"1. What are the problems with the centrality of metrics?\n",
"1. Why did Meetup.com not include gender in its recommendation system for tech meetups?\n",
"1. What are the six types of bias in machine learning, according to Suresh and Guttag?\n",
"1. Give two examples of historical race bias in the US.\n",
"1. Where are most images in ImageNet from?\n",
"1. In the paper [\"Does Machine Learning Automate Moral Hazard and Error\"](https://scholar.harvard.edu/files/sendhil/files/aer.p20171084.pdf) why is sinusitis found to be predictive of a stroke?\n",
"1. What is representation bias?\n",
"1. How are machines and people different, in terms of their use for making decisions?\n",
"1. Is disinformation the same as \"fake news\"?\n",
"1. Why is disinformation through auto-generated text a particularly significant issue?\n",
"1. What are the five ethical lenses described by the Markkula Center?\n",
"1. Where is policy an appropriate tool for addressing data ethics issues?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further Research:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Read the article \"What Happens When an Algorithm Cuts Your Healthcare\". How could problems like this be avoided in the future?\n",
"1. Research to find out more about YouTube's recommendation system and its societal impacts. Do you think recommendation systems must always have feedback loops with negative results? What approaches could Google take to avoid them? What about the government?\n",
"1. Read the paper [\"Discrimination in Online Ad Delivery\"](https://arxiv.org/abs/1301.6822). Do you think Google should be considered responsible for what happened to Dr. Sweeney? What would be an appropriate response?\n",
"1. How can a cross-disciplinary team help avoid negative consequences?\n",
"1. Read the paper \"Does Machine Learning Automate Moral Hazard and Error\". What actions do you think should be taken to deal with the issues identified in this paper?\n",
"1. Read the article \"How Will We Prevent AI-Based Forgery?\" Do you think Etzioni's proposed approach could work? Why?\n",
"1. Complete the section \"Analyze a Project You Are Working On\" in this chapter.\n",
"1. Consider whether your team could be more diverse. If so, what approaches might help?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Learning in Practice: That's a Wrap!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You've made it to the end of the first section of the book. In this section we've tried to show you what deep learning can do, and how you can use it to create real applications and products. At this point, you will get a lot more out of the book if you spend some time trying out what you've learned. Perhaps you have already been doing this as you go along—in which case, great! If not, that's no problem either... Now is a great time to start experimenting yourself.\n",
"\n",
"If you haven't been to the [book's website](https://book.fast.ai) yet, head over there now. It's really important that you get yourself set up to run the notebooks. Becoming an effective deep learning practitioner is all about practice, so you need to be training models. So, please go get the notebooks running now if you haven't already! And also have a look on the website for any important updates or notices; deep learning changes fast, and we can't change the words that are printed in this book, so the website is where you need to look to ensure you have the most up-to-date information.\n",
"\n",
"Make sure that you have completed the following steps:\n",
"\n",
"- Connect to one of the GPU Jupyter servers recommended on the book's website.\n",
"- Run the first notebook yourself.\n",
"- Upload an image that you find in the first notebook; then try a few different images of different kinds to see what happens.\n",
"- Run the second notebook, collecting your own dataset based on image search queries that you come up with.\n",
"- Think about how you can use deep learning to help you with your own projects, including what kinds of data you could use, what kinds of problems may come up, and how you might be able to mitigate these issues in practice.\n",
"\n",
"In the next section of the book you will learn about how and why deep learning works, instead of just seeing how you can use it in practice. Understanding the how and why is important for both practitioners and researchers, because in this fairly new field nearly every project requires some level of customization and debugging. The better you understand the foundations of deep learning, the better your models will be. These foundations are less important for executives, product managers, and so forth (although still useful, so feel free to keep reading!), but they are critical for anybody who is actually training and deploying models themselves."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

4301
clean/04_mnist_basics.ipynb Normal file

File diff suppressed because one or more lines are too long

1769
clean/05_pet_breeds.ipynb Normal file

File diff suppressed because one or more lines are too long

1565
clean/06_multicat.ipynb Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

1746
clean/08_collab.ipynb Normal file

File diff suppressed because one or more lines are too long

8450
clean/09_tabular.ipynb Normal file

File diff suppressed because one or more lines are too long

1568
clean/10_nlp.ipynb Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

2669
clean/13_convolutions.ipynb Normal file

File diff suppressed because one or more lines are too long

893
clean/14_resnet.ipynb Normal file

File diff suppressed because one or more lines are too long

442
clean/15_arch_details.ipynb Normal file
View File

@ -0,0 +1,442 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from utils import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Application Architectures Deep Dive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computer Vision"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### cnn_learner"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cut': -2,\n",
" 'split': <function fastai.vision.learner._resnet_split(m)>,\n",
" 'stats': ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_meta[resnet50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sequential(\n",
" (0): AdaptiveConcatPool2d(\n",
" (ap): AdaptiveAvgPool2d(output_size=1)\n",
" (mp): AdaptiveMaxPool2d(output_size=1)\n",
" )\n",
" (1): full: False\n",
" (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (3): Dropout(p=0.25, inplace=False)\n",
" (4): Linear(in_features=20, out_features=512, bias=False)\n",
" (5): ReLU(inplace=True)\n",
" (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (7): Dropout(p=0.5, inplace=False)\n",
" (8): Linear(in_features=512, out_features=2, bias=False)\n",
")"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"create_head(20,2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### unet_learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A Siamese Network"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from fastai.vision.all import *\n",
"path = untar_data(URLs.PETS)\n",
"files = get_image_files(path/\"images\")\n",
"\n",
"class SiameseImage(Tuple):\n",
" def show(self, ctx=None, **kwargs): \n",
" img1,img2,same_breed = self\n",
" if not isinstance(img1, Tensor):\n",
" if img2.size != img1.size: img2 = img2.resize(img1.size)\n",
" t1,t2 = tensor(img1),tensor(img2)\n",
" t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)\n",
" else: t1,t2 = img1,img2\n",
" line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)\n",
" return show_image(torch.cat([t1,line,t2], dim=2), \n",
" title=same_breed, ctx=ctx)\n",
" \n",
"def label_func(fname):\n",
" return re.match(r'^(.*)_\\d+.jpg$', fname.name).groups()[0]\n",
"\n",
"class SiameseTransform(Transform):\n",
" def __init__(self, files, label_func, splits):\n",
" self.labels = files.map(label_func).unique()\n",
" self.lbl2files = {l: L(f for f in files if label_func(f) == l) for l in self.labels}\n",
" self.label_func = label_func\n",
" self.valid = {f: self._draw(f) for f in files[splits[1]]}\n",
" \n",
" def encodes(self, f):\n",
" f2,t = self.valid.get(f, self._draw(f))\n",
" img1,img2 = PILImage.create(f),PILImage.create(f2)\n",
" return SiameseImage(img1, img2, t)\n",
" \n",
" def _draw(self, f):\n",
" same = random.random() < 0.5\n",
" cls = self.label_func(f)\n",
" if not same: cls = random.choice(L(l for l in self.labels if l != cls)) \n",
" return random.choice(self.lbl2files[cls]),same\n",
" \n",
"splits = RandomSplitter()(files)\n",
"tfm = SiameseTransform(files, label_func, splits)\n",
"tls = TfmdLists(files, tfm, splits=splits)\n",
"dls = tls.dataloaders(after_item=[Resize(224), ToTensor], \n",
" after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class SiameseModel(Module):\n",
" def __init__(self, encoder, head):\n",
" self.encoder,self.head = encoder,head\n",
" \n",
" def forward(self, x1, x2):\n",
" ftrs = torch.cat([self.encoder(x1), self.encoder(x2)], dim=1)\n",
" return self.head(ftrs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"encoder = create_body(resnet34, cut=-2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"head = create_head(512*4, 2, ps=0.5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = SiameseModel(encoder, head)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def loss_func(out, targ):\n",
" return nn.CrossEntropyLoss()(out, targ.long())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def siamese_splitter(model):\n",
" return [params(model.encoder), params(model.head)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = Learner(dls, model, loss_func=loss_func, \n",
" splitter=siamese_splitter, metrics=accuracy)\n",
"learn.freeze()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.367015</td>\n",
" <td>0.281242</td>\n",
" <td>0.885656</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.307688</td>\n",
" <td>0.214721</td>\n",
" <td>0.915426</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.275221</td>\n",
" <td>0.170615</td>\n",
" <td>0.936401</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.223771</td>\n",
" <td>0.159633</td>\n",
" <td>0.943843</td>\n",
" <td>00:26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(4, 3e-3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.212744</td>\n",
" <td>0.159033</td>\n",
" <td>0.944520</td>\n",
" <td>00:35</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.201893</td>\n",
" <td>0.159615</td>\n",
" <td>0.942490</td>\n",
" <td>00:35</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.204606</td>\n",
" <td>0.152338</td>\n",
" <td>0.945196</td>\n",
" <td>00:36</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.213203</td>\n",
" <td>0.148346</td>\n",
" <td>0.947903</td>\n",
" <td>00:36</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.unfreeze()\n",
"learn.fit_one_cycle(4, slice(1e-6,1e-4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural Language Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tabular"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrapping Up Architectures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is the \"head\" of a neural net?\n",
"1. What is the \"body\" of a neural net?\n",
"1. What is \"cutting\" a neural net? Why do we need to do this for transfer learning?\n",
"1. What is `model_meta`? Try printing it to see what's inside.\n",
"1. Read the source code for `create_head` and make sure you understand what each line does.\n",
"1. Look at the output of `create_head` and make sure you understand why each layer is there, and how the `create_head` source created it.\n",
"1. Figure out how to change the dropout, layer size, and number of layers created by `create_cnn`, and see if you can find values that result in better accuracy from the pet recognizer.\n",
"1. What does `AdaptiveConcatPool2d` do?\n",
"1. What is \"nearest neighbor interpolation\"? How can it be used to upsample convolutional activations?\n",
"1. What is a \"transposed convolution\"? What is another name for it?\n",
"1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.\n",
"1. Draw the U-Net architecture.\n",
"1. What is \"BPTT for Text Classification\" (BPT3C)?\n",
"1. How do we handle different length sequences in BPT3C?\n",
"1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.\n",
"1. How is `self.layers` defined in `TabularModel`?\n",
"1. What are the five steps for preventing over-fitting?\n",
"1. Why don't we reduce architecture complexity before trying other approaches to preventing overfitting?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further Research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.\n",
"1. Try switching between `AdaptiveConcatPool2d` and `AdaptiveAvgPool2d` in a CNN head and see what difference it makes.\n",
"1. Write your own custom splitter to create a separate parameter group for every ResNet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
"1. Read the online chapter about generative image models, and create your own colorizer, super-resolution model, or style transfer model.\n",
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on CamVid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -13,46 +13,17 @@
]
},
{
"cell_type": "raw",
"cell_type": "markdown",
"metadata": {},
"source": [
"[[chapter_accel_sgd]]"
"# The Training Process"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variants of SGD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you know all about how the architectures are put together, it's time to start exploring the training process.\n",
"\n",
"We explained earlier the basis of Stochastic Gradient Descent: pass a minibatch in the model, compare it to our target with the loss function then compute the gradients of this loss function with regards to each weight before updating the weights with the formula:\n",
"\n",
"```python\n",
"new_weight = weight - lr * weight.grad\n",
"```\n",
"\n",
"We implemented this from scratch in a training loop, and also saw that Pytorch provides a simple `nn.SGD` class that does this calculation for each parameter for us. Let's now build some faster optimizers, using a flexible foundation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's start with SGD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we'll create a baseline, using plain SGD, and compare it to fastai's default optimizer. We'll start by grabbing Imagenette with the same `get_data` we used in <<chapter_resnet>>:"
"## Establishing a Baseline"
]
},
{
@ -61,7 +32,6 @@
"metadata": {},
"outputs": [],
"source": [
"#hide_input\n",
"def get_data(url, presize, resize):\n",
" path = untar_data(url)\n",
" return DataBlock(\n",
@ -82,13 +52,6 @@
"dls = get_data(URLs.IMAGENETTE_160, 160, 128)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create a ResNet34 without pretraining, and pass along any arguments received:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -100,13 +63,6 @@
" metrics=accuracy, **kwargs).to_fp16()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the default fastai optimizer, with the usual 3e-3 learning rate:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -163,13 +119,6 @@
"learn.fit_one_cycle(3, 0.003)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's try plain SGD. We can pass `opt_func` (optimization function) to `cnn_learner` to get fastai to use any optimizer:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -179,13 +128,6 @@
"learn = get_learner(opt_func=SGD)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing to look at is `lr_find`:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -228,13 +170,6 @@
"learn.lr_find()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we'll need to use a higher learning rate than we normally use:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -294,47 +229,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"(Because accelerated SGD using momentum with is such a good idea, fastai uses it by default in `fit_one_cycle`, so we turn it off with `moms=(0,0,0)`; we'll be learning about momentum shortly.)\n",
"\n",
"Clearly, plain SGD isn't training as fast as we'd like. So let's learn the tricks to get accelerated training!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A generic optimizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to build up our accelerated SGD tricks, we'll need to start with a nice flexible optimizer foundation. No library prior to fastai provided such a foundation, but during fastai's development we realized that all optimizer improvements we'd seen in the academic literature could be handled using *optimizer callbacks*. These are small pieces of code that an optimizer can add to the optimizer `step`. They are called by fastai's `Optimizer` class. This is a small class (less than a screen of code); these are the definitions in `Optimizer` of the two key methods that we've been using in this book:\n",
"\n",
"```python\n",
"def zero_grad(self):\n",
" for p,*_ in self.all_params():\n",
" p.grad.detach_()\n",
" p.grad.zero_()\n",
"\n",
"def step(self):\n",
" for p,pg,state,hyper in self.all_params():\n",
" for cb in self.cbs:\n",
" state = _update(state, cb(p, **{**state, **hyper}))\n",
" self.state[p] = state\n",
"```\n",
"\n",
"As we saw when training an MNIST model from scratch, `zero_grad` just loops through the parameters of the model and sets the gradients to zero. It also calls `detach_`, which removes any history of gradient computation, since it won't be needed after `zero_grad`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The more interesting method is `step`, which loops through the callbacks (`cbs`) and calls them to update the parameters (the `_update` function just calls `state.update` if there's anything returned by `cb(...)`). As you can see, `Optimizer` doesn't actually do any SGD steps itself. Let's see how we can add SGD to `Optimizer`.\n",
"\n",
"Here's an optimizer callback that does a single SGD step, by multiplying `-lr` by the gradients, and adding that to the parameter (when `Tensor.add_` in PyTorch is passed two parameters, they are multiplied together before the addition): "
"## A Generic Optimizer"
]
},
{
@ -346,27 +241,13 @@
"def sgd_cb(p, lr, **kwargs): p.data.add_(-lr, p.grad.data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can pass this to `Optimizer` using the `cbs` parameter; we'll need to use `partial` since `Learner` will call this function to create our optimizer later:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"opt_func = partial(Optimizer, cbs=[sgd_step])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if this trains:"
"opt_func = partial(Optimizer, cbs=[sgd_cb])"
]
},
{
@ -425,13 +306,6 @@
"learn.fit(3, 0.03)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's working! So that's how we create SGD from scratch in fastai."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -439,26 +313,6 @@
"## Momentum"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SGD is the idea of taking a step in the direction of the steepest slope at each point of time. But what if we have a ball rolling down the mountain? It won't, at each given point, exactly follow the direction of the gradient, as it will have *momentum*. A ball with more momentum (for instance, a heavier ball) will skip over little bumps and holes, and be more likely to get to the bottom of a bumpy mountain. A ping pong ball, on the other hand, will get stuck in every little crevice.\n",
"\n",
"So how could we bring this idea over to SGD? We can use a moving average, instead of only the current gradient, to make our step:\n",
"\n",
"```python\n",
"weight.avg = beta * weight.avg + (1-beta) * weight.grad\n",
"new_weight = weight - lr * weight.avg\n",
"```\n",
"\n",
"Here `beta` is some number we choose which defines how much momentum to use. If `beta` is zero, then the first equation above becomes `weight.avg = weight.grad`, so we end up with plain SGD. But if it's a number close to one, then the main direction chosen is an average of previous steps. (If you have done a bit of statistics, you may recognize in the first equation an *exponentially weighted moving average*, which is very often used to denoise data and get the underlying tendency.)\n",
"\n",
"Note that we are writing `weight.avg` to highlight the fact we need to store thoe moving averages for each parameter of the model (and they all their own independent moving averages).\n",
"\n",
"Here is an example of noisy data for a single parameter, with the momentum curve plotted in red, and the gradients of the parameter plotted in blue. The gradients increase, and then decrease, and the momentum does a good job of following the general trend, without getting too influenced by noise:"
]
},
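{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the formula above, here is a standalone sketch (made-up gradient values, not fastai code) that tracks the moving average for a single parameter. Note that the `average_grad` callback defined below omits the `(1-beta)` factor, so its `grad_avg` is a scaled version of this average:\n",
"\n",
"```python\n",
"beta = 0.9\n",
"grads = [1.0, 0.8, -0.2, 0.5, 0.1]  # dummy gradients for five steps\n",
"\n",
"avg = 0.\n",
"for g in grads:\n",
"    avg = beta * avg + (1 - beta) * g  # exponentially weighted moving average\n",
"    print(f'grad={g:+.1f}  avg={avg:+.3f}')\n",
"```"
]
},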
{
"cell_type": "code",
"execution_count": null,
@ -493,15 +347,6 @@
"plt.plot(x1[idx],np.array(res), color='red');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It works particularly well if the loss function has narrow canyons we need to navigate: vanilla SGD would send us from one side to the other while SGD with momentum will average those to roll down inside. The parameter `beta` determines the strength of that momentum we are using: with a small beta we stay closer to the actual gradient values whereas with a high beta, we will mostly go in the direction of the average of the gradients and it will take a while before any change in the gradients makes that trend move.\n",
"\n",
"With a large beta, we might miss that the gradients have changed directions and roll over a small local minima which is a desired side-effect: intuitively, when we show a new picture/text/data to our model, it will look like something in the training set but won't be exactly like it. That means it will correspond to a point in the loss function that is closest to the minimum we ended up with at the end of training, but not exactly *at* that minimum. We then would rather end up training in a wide minimum, where nearby points have approximately the same loss (or if you prefer, a point where the loss is as flat as possible). Here's how the above chart varies as we change beta:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -540,22 +385,6 @@
" ax.set_title(f'beta={beta}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see in these examples that a beta that's too high results in the overall changes in gradient getting ignored. In SGD with momentum, a value of `beta` that is often used is 0.9.\n",
"\n",
"`fit_one_cycle` by default starts with a beta of 0.95, gradually adjusts it to 0.85, and then gradually moves it back to 0.95 at the end of training. Let's see how our training goes with momentum added to plain SGD:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to add momentum to our optimizer, we'll first need to keep track of the moving average gradient, which we can do with another callback. When an optimizer callback returns a dict, it is used to update the state of the optimizer, and is passed back to the optimizer on the next step. So this callback will keep track of the gradient averages in a parameter called `grad_avg`:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -567,13 +396,6 @@
" return {'grad_avg': grad_avg*mom + p.grad.data}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use it, we just have to replace `p.grad.data` with `grad_avg` in our step function:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -592,13 +414,6 @@
"opt_func = partial(Optimizer, cbs=[average_grad,momentum_step], mom=0.9)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Learner` will automatically schedule `mom` and `lr`, so fit_one_cycle will even work with our custom Optimizer:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -677,13 +492,6 @@
"learn.recorder.plot_sched()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're still not getting great results, so let's see what else we can do."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -691,31 +499,6 @@
"## RMSProp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"RMSProp is another variant of SGD introduced by Geoffrey Hinton in [Lecture 6e of his Coursera class](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). The main difference with SGD is that it uses an adaptive learning rate: instead of using the same learning rate for every parameter, each parameter gets it's own specific learning rate controlled by a global learning rate. That way we can speed up training by giving a high learning rate to the weights that needs to change a lot while the ones that are good enough get a lower learning rate.\n",
"\n",
"How do we decide which parameter should have a high learning rate and which should not? We can look at the gradients to get an idea. Not just the one we computed, but all of them: if they have been close to 0 for a while, it means this parameter will need a higher learning rate because the loss is very flat. On the opposite, if they are all over the place, we should probably be careful and pick a low learning rate to avoid divergence. We can't just average the gradients to see if they're changing a lot, since the average of a large positive and a large negative number is close to zero. So we can use the usual trick of either taking the absolute value, or the squared values (and then taking the square root after the mean).\n",
"\n",
"Once again, to pick the general tendency behind the noise, we will use a moving average, specifically the moving average of the gradients squared. Then, we will update the corresponding weight by using the current gradient (for the direction) divided by the square root of this moving average (that way if it's low, the effective learning rate will be higher, and if it's big, the effective learning rate will be lower).\n",
"\n",
"```python\n",
"w.square_avg = alpha * w.square_avg + (1-alpha) * (w.grad ** 2)\n",
"new_w = w - lr * w.grad / math.sqrt(w.square_avg + eps)\n",
"```\n",
"\n",
"The `eps` (*epsilon*) is added for numerical stability (usually set at 1e-8) and the default value for `alpha` is usually 0.99."
]
},
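{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, as a standalone sketch with dummy values (not fastai code), here is what the squared-gradient moving average and the resulting update look like for a single parameter:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"alpha, lr, eps = 0.99, 0.01, 1e-8\n",
"grads = [0.5, 0.4, 0.6, 0.55]  # dummy gradients\n",
"square_avg, w = 0., 1.\n",
"\n",
"for g in grads:\n",
"    square_avg = alpha * square_avg + (1 - alpha) * g**2\n",
"    w -= lr * g / math.sqrt(square_avg + eps)\n",
"    print(f'square_avg={square_avg:.4f}  w={w:.4f}')\n",
"```"
]
},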
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can add this to `Optimizer` by doing much the same thing we did for `avg_grad`, but with an extra `**2`:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -727,13 +510,6 @@
" return {'sqr_avg': sqr_avg*sqr_mom + p.grad.data**2}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can define our step function and optimizer as before:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -748,13 +524,6 @@
" sqr_mom=0.99, eps=1e-7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it out:"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -811,13 +580,6 @@
"learn.fit_one_cycle(3, 0.003)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Much better! Now we just have to bring these ideas together, and we have Adam, fastai's default optimizer."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -829,67 +591,83 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Adam mixes the ideas of SGD with momentum and RMSProp together: it uses the moving average of the gradients as a direction and divides by the square root of the moving average of the gradients squared to give an adaptive learning rate to each parameter.\n",
"\n",
"There is one other difference with how Adam calculates moving averages, is that it takes the *unbiased* moving average which is:\n",
"\n",
"``` python\n",
"w.avg = beta * w.avg + (1-beta) * w.grad\n",
"unbias_avg = w.avg / (1 - (beta**(i+1)))\n",
"```\n",
"\n",
"if we are the `i`-th iteration (starting at 0 like python does). This divisor of `1 - (beta**(i+1))` makes sure the unbiased average looks more like the gradients at the beginning (since `beta < 1` the denominator is very quickly very close to 1).\n",
"\n",
"Putting everything together, our update step looks like:\n",
"``` python\n",
"w.avg = beta1 * w.avg + (1-beta1) * w.grad\n",
"unbias_avg = w.avg / (1 - (beta1**(i+1)))\n",
"w.sqr_avg = beta2 * w.sqr_avg + (1-beta2) * (w.grad ** 2)\n",
"new_w = w - lr * unbias_avg / sqrt(w.sqr_avg + eps)\n",
"```\n",
"\n",
"Like for RMSProp, `eps` is usually set to 1e-8, and the default for `(beta1,beta2)` suggested by the literature `(0.9,0.999)`. \n",
"\n",
"In fastai, Adam is the default optimizer we use since it allows faster training, but we found that `beta2=0.99` is better suited for the type of schedule we are using. `beta1` is the momentum parameter, which we specify with the argument `moms` in our call to `fit_one_cycle`. As for `eps`, fastai uses a default of 1e-5. `eps` is not just useful for numerical stability. A higher `eps` limits the maximum value of the adjusted learning rate. To take an extreme example, if `eps` is 1, then the adjusted learning will never be higher than the base learning rate. \n",
"\n",
"Rather than show all the code for this in the book, we'll let you look at the optimizer notebook in fastai's GitHub repository--you'll see all the code we've seen so far, along with Adam and other optimizers, and lots of examples and tests."
"## Decoupled Weight Decay"
]
},
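{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make Adam's bias correction concrete, here is a small standalone sketch (dummy values, not the fastai implementation) that computes `avg`, `unbias_avg`, and `sqr_avg` for a few batches, following the update written above and the default hyperparameters mentioned in the text:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"beta1, beta2, eps, lr = 0.9, 0.99, 1e-5, 0.01\n",
"grads = [0.5, 0.4, 0.6]  # dummy gradients for three batches\n",
"avg, sqr_avg, w = 0., 0., 1.\n",
"\n",
"for i, g in enumerate(grads):\n",
"    avg = beta1 * avg + (1 - beta1) * g\n",
"    sqr_avg = beta2 * sqr_avg + (1 - beta2) * g**2\n",
"    unbias_avg = avg / (1 - beta1**(i + 1))\n",
"    w -= lr * unbias_avg / math.sqrt(sqr_avg + eps)\n",
"    print(f'step {i}: avg={avg:.3f}  unbias_avg={unbias_avg:.3f}  w={w:.4f}')\n",
"```"
]
},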
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decoupled weight_decay"
"## Callbacks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've discussed weight decay before, which is equivalent to (in the case of vanilla SGD) updating the parameters\n",
"with:\n",
"### Creating a Callback"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class ModelResetter(Callback):\n",
" def begin_train(self): self.model.reset()\n",
" def begin_validate(self): self.model.reset()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class RNNRegularizer(Callback):\n",
" def __init__(self, alpha=0., beta=0.): self.alpha,self.beta = alpha,beta\n",
"\n",
"``` python\n",
"new_weight = weight - lr*weight.grad - lr*wd*weight\n",
"```\n",
" def after_pred(self):\n",
" self.raw_out,self.out = self.pred[1],self.pred[2]\n",
" self.learn.pred = self.pred[0]\n",
"\n",
"This last formula explains why the name of this technique is weight decay, as each weight is decayed by a factor `lr * wd`. \n",
"\n",
"However, this only works correctly for standard SGD, because we have seen that with momentum, RMSProp or in Adam, the update has some additional formulas around the gradient. In those cases, the formula that comes from L2 regularization:\n",
"\n",
"``` python\n",
"weight.grad += wd*weight\n",
"```\n",
"\n",
"is different than weight decay:\n",
"\n",
"``` python\n",
"new_weight = weight - lr*weight.grad - lr*wd*weight\n",
"```\n",
"\n",
"Most libraries use the first formulation, but it was pointed out in [Decoupled Weight Regularization](https://arxiv.org/pdf/1711.05101.pdf) by Ilya Loshchilov and Frank Hutter, second one is the only correct approach with the Adam optimizer or momentum, which is why fastai makes it its default.\n",
"\n",
"Now you know everything that is hidden behind the line `learn.fit_one_cycle`!"
" def after_loss(self):\n",
" if not self.training: return\n",
" if self.alpha != 0.:\n",
" self.learn.loss += self.alpha * self.out[-1].float().pow(2).mean()\n",
" if self.beta != 0.:\n",
" h = self.raw_out[-1]\n",
" if len(h)>1:\n",
" self.learn.loss += self.beta * (h[:,1:] - h[:,:-1]\n",
" ).float().pow(2).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Callback Ordering and Exceptions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class TerminateOnNaNCallback(Callback):\n",
" run_before=Recorder\n",
" def after_batch(self):\n",
" if torch.isinf(self.loss) or torch.isnan(self.loss):\n",
" raise CancelFitException"
]
},
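{
"cell_type": "markdown",
"metadata": {},
"source": [
"The callback above uses `run_before=Recorder` to make sure it runs before the `Recorder` callback, and raises `CancelFitException` to stop training as soon as the loss blows up. As a quick sketch of how callbacks like these are attached (assuming a `dls` and `model` have already been created as in earlier chapters), they can be passed either when building the `Learner` or just for a single call to `fit`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a minimal sketch: `dls` and `model` are assumed to be defined already\n",
"learn = Learner(dls, model, cbs=TerminateOnNaNCallback())\n",
"# callbacks can also be passed just for one call to fit\n",
"learn.fit(10, cbs=ModelResetter())"
]
},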
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
@ -909,7 +687,7 @@
"1. What does `zero_grad` do in an optimizer?\n",
"1. What does `step` do in an optimizer? How is it implemented in the general optimizer?\n",
"1. Rewrite `sgd_cb` to use the `+=` operator, instead of `add_`.\n",
"1. What is momentum? Write out the equation.\n",
"1. What is \"momentum\"? Write out the equation.\n",
"1. What's a physical analogy for momentum? How does it apply in our model training settings?\n",
"1. What does a bigger value for momentum do to the gradients?\n",
"1. What are the default values of momentum for 1cycle training?\n",
@ -917,24 +695,54 @@
"1. What do the squared values of the gradients indicate?\n",
"1. How does Adam differ from momentum and RMSProp?\n",
"1. Write out the equation for Adam.\n",
"1. Calculate the value of `unbias_avg` and `w.avg` for a few batches of dummy values.\n",
"1. What's the impact of having a high eps in Adam?\n",
"1. Calculate the values of `unbias_avg` and `w.avg` for a few batches of dummy values.\n",
"1. What's the impact of having a high `eps` in Adam?\n",
"1. Read through the optimizer notebook in fastai's repo, and execute it.\n",
"1. In what situations do dynamic learning rate methods like Adam change the behaviour of weight decay?"
"1. In what situations do dynamic learning rate methods like Adam change the behavior of weight decay?\n",
"1. What are the four steps of a training loop?\n",
"1. Why is using callbacks better than writing a new training loop for each tweak you want to add?\n",
"1. What aspects of the design of fastai's callback system make it as flexible as copying and pasting bits of code?\n",
"1. How can you get the list of events available to you when writing a callback?\n",
"1. Write the `ModelResetter` callback (without peeking).\n",
"1. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcuts that go with them?\n",
"1. How can a callback influence the control flow of the training loop.\n",
"1. Write the `TerminateOnNaN` callback (without peeking, if possible).\n",
"1. How do you make sure your callback runs after or before another callback?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research"
"### Further Research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Look up the \"rectified Adam\" paper and implement it using the general optimizer framework, and try it out. Search for other recent optimizers that work well in practice, and pick one to implement."
"1. Look up the \"Rectified Adam\" paper, implement it using the general optimizer framework, and try it out. Search for other recent optimizers that work well in practice, and pick one to implement.\n",
"1. Look at the mixed-precision callback with the documentation. Try to understand what each event and line of code does.\n",
"1. Implement your own version of ther learning rate finder from scratch. Compare it with fastai's version.\n",
"1. Look at the source code of the callbacks that ship with fastai. See if you can find one that's similar to what you're looking to do, to get some inspiration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Foundations of Deep Learning: Wrap up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations, you have made it to the end of the \"foundations of deep learning\" section of the book! You now understand how all of fastai's applications and most important architectures are built, and the recommended ways to train them—and you have all the information you need to build these from scratch. While you probably won't need to create your own training loop, or batchnorm layer, for instance, knowing what is going on behind the scenes is very helpful for debugging, profiling, and deploying your solutions.\n",
"\n",
"Since you understand the foundations of fastai's applications now, be sure to spend some time digging through the source notebooks and running and experimenting with parts of them. This will give you a better idea of how everything in fastai is developed.\n",
"\n",
"In the next section, we will be looking even further under the covers: we'll explore how the actual forward and backward passes of a neural network are done, and we will see what tools are at our disposal to get better performance. We will then continue with a project that brings together all the material in the book, which we will use to build a tool for interpreting convolutional neural networks. Last but not least, we'll finish by building fastai's `Learner` class from scratch."
]
},
{
@ -953,31 +761,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

1616
clean/17_foundations.ipynb Normal file

File diff suppressed because it is too large Load Diff

484
clean/18_CAM.ipynb Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

30
clean/20_conclusion.ipynb Normal file
View File

@ -0,0 +1,30 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Concluding Thoughts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

83
clean/app_blog.ipynb Normal file
View File

@ -0,0 +1,83 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from utils import *\n",
"from fastai.vision.widgets import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating a Blog"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Blogging with GitHub Pages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating the Repository"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting Up Your Home Page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating Posts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Synchronizing GitHub and Your Computer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jupyter for Blogging"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

270
clean/app_jupyter.ipynb Normal file

File diff suppressed because one or more lines are too long

1
clean/images Symbolic link
View File

@ -0,0 +1 @@
../images

64
clean/utils.py Normal file
View File

@ -0,0 +1,64 @@
# Numpy and pandas by default assume a narrow screen - this fixes that
from fastai.vision.all import *
from nbdev.showdoc import *
from ipywidgets import widgets
from pandas.api.types import CategoricalDtype
import matplotlib as mpl
# mpl.rcParams['figure.dpi']= 200
mpl.rcParams['savefig.dpi']= 200
mpl.rcParams['font.size']=12
set_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
pd.set_option('display.max_columns',999)
np.set_printoptions(linewidth=200)
torch.set_printoptions(linewidth=200)
import graphviz
def gv(s): return graphviz.Source('digraph G{ rankdir="LR"' + s + '; }')
def get_image_files_sorted(path, recurse=True, folders=None): return get_image_files(path, recurse, folders).sorted()
# +
# pip install azure-cognitiveservices-search-imagesearch
from azure.cognitiveservices.search.imagesearch import ImageSearchClient as api
from msrest.authentication import CognitiveServicesCredentials as auth
def search_images_bing(key, term, min_sz=128):
client = api('https://api.cognitive.microsoft.com', auth(key))
return L(client.images.search(query=term, count=150, min_height=min_sz, min_width=min_sz).value)
# -
def plot_function(f, tx=None, ty=None, title=None, min=-2, max=2, figsize=(6,4)):
x = torch.linspace(min,max)
fig,ax = plt.subplots(figsize=figsize)
ax.plot(x,f(x))
if tx is not None: ax.set_xlabel(tx)
if ty is not None: ax.set_ylabel(ty)
if title is not None: ax.set_title(title)
# +
from sklearn.tree import export_graphviz
def draw_tree(t, df, size=10, ratio=0.6, precision=0, **kwargs):
s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
special_characters=True, rotate=False, precision=precision, **kwargs)
return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))
# +
from scipy.cluster import hierarchy as hc
def cluster_columns(df, figsize=(10,6), font_size=12):
corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=figsize)
hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
plt.show()

12
environment.yml Normal file
View File

@ -0,0 +1,12 @@
name: fastbook
channels:
- fastai
- pytorch
- defaults
dependencies:
- python>=3.6
- pytorch>=1.6
- torchvision
- pip
- pip:
- -r file:requirements.txt

Binary file not shown.

Before

Width:  |  Height:  |  Size: 945 KiB

BIN
images/decision_tree.PNG Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 64 KiB

BIN
images/doc_ex.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

BIN
images/driver.PNG Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.7 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 260 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 261 KiB

File diff suppressed because one or more lines are too long

After

Width:  |  Height:  |  Size: 385 KiB

BIN
images/layer1.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 51 KiB

BIN
images/layer2.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.2 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 163 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 18 KiB

View File

@ -1,4 +1,4 @@
fastai2>=0.0.11
fastai>=0.0.11
graphviz
ipywidgets
matplotlib
@ -6,3 +6,4 @@ nbdev>=0.2.12
pandas
scikit_learn
azure-cognitiveservices-search-imagesearch
sentencepiece

25
settings.ini Normal file
View File

@ -0,0 +1,25 @@
[DEFAULT]
lib_name = fastbook
user = fastai
description = fastbook
keywords = jupyter notebook asciidoc
author = Jeremy Howard and Sylvain Gugger
author_email = info@fast.ai
copyright = fast.ai
branch = master
version = 0.0.1
min_python = 3.6
audience = Developers
language = English
custom_sidebar = False
license = custom
status = 2
nbs_path = .
doc_path = docs
git_url = https://github.com/fastai/fastbook/tree/master/
lib_path = fastbook
title = fastbook
doc_host = https://fastai.github.io
doc_baseurl = /fastbook/
host = github

View File

@ -1,5 +1,5 @@
# Numpy and pandas by default assume a narrow screen - this fixes that
from fastai2.vision.all import *
from fastai.vision.all import *
from nbdev.showdoc import *
from ipywidgets import widgets
from pandas.api.types import CategoricalDtype