proofread ResNet

This commit is contained in:
Brad S 2020-03-07 22:21:07 +08:00
parent b2f1c12d4c
commit 493f0ed52f


@ -30,9 +30,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter, we will build on top of the CNNs introduced in the previous chapter and explain to you the ResNet (for residual network) architecture. It was introduced in 2015 in [this article](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
"In this chapter, we will build on top of the CNNs (Convolutional Neural Networks) introduced in the previous chapter and explain to you the ResNet (for residual network) architecture. It was introduced in 2015 in [this article](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in image models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
"\n",
"We will frist show you the basic ResNet as it was first designed, then explain to you what modern tweaks to it make it more performamt. But first, we will need a problem a little bit more difficult that the MNIST dataset, since we are already close to 100% accuracy with a regular CNN on it."
"We will first show you the basic ResNet as it was first designed, then explain to you what modern tweaks make it more performamt. But first, we will need a problem a little bit more difficult than the MNIST dataset, since we are already close to 100% accuracy with a regular CNN on it."
]
},
{
@ -103,14 +103,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images, and later on would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images. Later on we would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"\n",
"The approach we used was to ensure that there was enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
"The approach we used was to ensure that there were enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
"\n",
"- We are going to need lots of stride two layers to make our grid one by one at the end — perhaps more than we would otherwise choose\n",
"- The model will not work on images of any size other than the size we originally trained on.\n",
"\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous, still sometimes used today, is the 2013 ImageNet winner VGG. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"\n",
"This problem was solved through the creation of *fully convolutional networks*. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:"
]
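A minimal sketch of such a function, assuming activations laid out as `bs x ch x h x w` as in the previous chapter, simply averages over the height and width axes:

```python
def avg_pool(x):
    # bs x ch x h x w  ->  bs x ch: average each channel's grid down to one number
    return x.mean((2, 3))
```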
@ -172,9 +172,9 @@
"source": [
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing `1 x 1` dimension like we did in our previous model. \n",
"\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by half on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
"\n",
"As before, we can define a `Learner` with our custom model and the data we grabbed before then train it:"
"As before, we can define a `Learner` with our custom model and then train it on the data we grabbed before:"
]
},
{
@ -326,7 +326,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy Imagenette compared to our previous model, before building a version with all the recent tweaks."
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy on Imagenette compared to our previous model, before building a version with all the recent tweaks."
]
},
{
@ -340,7 +340,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In 2015 the authors of ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
"In 2015 the authors of the ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
"\n",
"> : Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as [previously reported] and thoroughly verified by our experiments.\n",
"\n",
@ -666,9 +666,9 @@
"source": [
"Now we're making good progress!\n",
"\n",
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be strong practitioner, not just a theoretician.\n",
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be a strong practitioner, not just a theoretician.\n",
"\n",
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids us falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
]
},
{
@ -710,7 +710,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our the implementation we had before in that it begins with a few convolutional layers followed by a max pooling layer, instead of just starting with ResNet blocks. This is what the first layers look like:"
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our previous implementation, in that instead of just starting with ResNet blocks, it begins with a few convolutional layers followed by a max pooling layer. This is what the first layers look like:"
]
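As a rough sketch of what such a stem can look like (plain PyTorch, with illustrative names, assuming the conv + batchnorm + ReLU building block used throughout this chapter):

```python
import torch.nn as nn

def conv_block(ni, nf, stride=1):
    # convolution followed by batchnorm and ReLU
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True))

def resnet_stem(*sizes):
    # a few plain convolutions (the first with stride 2) followed by max pooling
    return nn.Sequential(
        *[conv_block(sizes[i], sizes[i + 1], stride=2 if i == 0 else 1)
          for i in range(len(sizes) - 1)],
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

stem = resnet_stem(3, 32, 32, 64)   # from a 3-channel image up to 64 channels
```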
},
{
@ -831,7 +831,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `_make_layer` function is just there to create a series of `nl` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section' for now, it'll be `1`, so it doesn't do anything.)\n",
"The `_make_layer` function is just there to create a series of `n_layers` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section. For now, it'll be `1`, so it doesn't do anything.)\n",
"\n",
"The various versions of the models (ResNet 18, 34, 50, etc) just change the number of blocks in each of those groups. This is the definition of a ResNet18:"
]
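A sketch of the behaviour just described, assuming the `ResBlock(ch_in, ch_out, stride)` class built earlier in the chapter (the name `make_layer` and its exact signature here are illustrative):

```python
import torch.nn as nn

def make_layer(ch_in, ch_out, n_layers, stride):
    # the first block changes the number of channels and applies the requested
    # stride; the remaining blocks go from ch_out to ch_out with stride 1
    return nn.Sequential(
        ResBlock(ch_in, ch_out, stride=stride),
        *[ResBlock(ch_out, ch_out, stride=1) for _ in range(n_layers - 1)])
```

For instance, a ResNet-18 uses `[2, 2, 2, 2]` blocks across its four groups, while a ResNet-34 uses `[3, 4, 6, 3]`.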
@ -928,7 +928,7 @@
"source": [
"Even although we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem.\n",
"\n",
"To make our model deeper without taking too much compute or memory, the ResNet paper introduced anotehr kind of blocks for ResNets with a depth of 50 or more, using something called a bottleneck. "
"To make our model deeper without taking too much compute or memory, the ResNet paper introduced another kind of block for ResNets with a depth of 50 or more, using something called a bottleneck. "
]
},
{
@ -956,7 +956,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is that useful? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first resnet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"Why is that useful? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"\n",
"Let's try replacing our ResBlock with this bottleneck design:"
]
@ -1219,7 +1219,7 @@
"1. Why do skip connections allow us to train deeper models?\n",
"1. What does <<resnet_depth>> show? How did that lead to the idea of skip connections?\n",
"1. What is an identity mapping?\n",
"1. What is the basic equation for a resnet block (ignoring batchnorm and relu layers)?\n",
"1. What is the basic equation for a ResNet block (ignoring batchnorm and relu layers)?\n",
"1. What do ResNets have to do with \"residuals\"?\n",
"1. How do we deal with the skip connection when there is a stride 2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1x1 convolution in terms of a vector dot product?\n",
@ -1227,8 +1227,8 @@
"1. Explain what is shown in <<resnet_surface>>.\n",
"1. When is top-5 accuracy a better metric than top-1 accuracy?\n",
"1. What is the stem of a CNN?\n",
"1. Why use plain convs in the CNN stem, instead of resnet blocks?\n",
"1. How does a bottleneck block differ from a plain resnet block?\n",
"1. Why use plain convs in the CNN stem, instead of ResNet blocks?\n",
"1. How does a bottleneck block differ from a plain ResNet block?\n",
"1. Why is a bottleneck block faster?\n",
"1. How do fully convolution nets (and nets with adaptive pooling in general) allow for progressive resizing?"
]
@ -1255,9 +1255,7 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s"
]
"source": []
}
],
"metadata": {
@ -1268,20 +1266,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}