{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"hide_input": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"#hide\n",
|
|||
|
"from utils import *"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "raw",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"[[chapter_resnet]]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Resnets"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Going back to Imagenette"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"It's going to be tough to judge any improvement we do to our models when we are already at an accuracy that is as high as we saw on MNIST in the previous chapter, so we will tackle a tougher problem by going back to Imagenette. We'll stick with small images to keep things reasonably fast.\n",
|
|||
|
"\n",
|
|||
|
"Let's grab the data--we'll use the already-resized 160px version to make things faster still, and will random crop to 128px:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def get_data(url, presize, resize):\n",
|
|||
|
" path = untar_data(url)\n",
|
|||
|
" return DataBlock(\n",
|
|||
|
" blocks=(ImageBlock, CategoryBlock), get_items=get_image_files, \n",
|
|||
|
" splitter=GrandparentSplitter(valid_name='val'),\n",
|
|||
|
" get_y=parent_label, item_tfms=Resize(presize),\n",
|
|||
|
" batch_tfms=[*aug_transforms(min_scale=0.5, size=resize),\n",
|
|||
|
" Normalize.from_stats(*imagenet_stats)],\n",
|
|||
|
" ).dataloaders(path, bs=128)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"dls = get_data(URLs.IMAGENETTE_160, 160, 128)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAVkAAAFkCAYAAACKFkioAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOy9eZBtx33f9+nus919m33evP09PKzEQpAgSEqidtlW4khx4rIqkpJyllLJVuJYiuWkYmVxnDjlVGK7HKdSSWTFUSxbsWItJYmhKFEkAZA0Nj4AfHj7MjNv1jv3zt3O1t35o888DB8BiBDxBBB1v1Wn5s7pe/r08utf//q3XWGtZYoppphiinsD+V43YIopppjig4wpk51iiimmuIeYMtkppphiinuIKZOdYooppriHmDLZKaaYYop7iCmTnWKKKaa4h5gy2SmmmGKKe4gPNJMVQnyPEOKCEGIshPh9IcSxQ2V/WwhxSwixL4S4IYT4T9+ijp8QQlghxF88dO9nhRCvCCEGQohrQoifveuZ60KIiRBiWFyffou6P1vU7R26918JIc4LIXIhxC/c9f1PFWU9IcSuEOLXhBDLf8zhmeLbGN8KbQshHhVCPF88+7wQ4tFDZX8UbT8qhPi8EKIvhFgVQvzn76Bdrx5aE8OCxn/jUPkPF+8eCiGeEUI88G6O2XsGa+0H8gJmgD7w54AI+O+B5w6V3wdUis/LwKvAj9xVRwu4ALwC/MVD938OeBzwinpuAH/+UPl14Hv/iPb9GPCHgAW8Q/d/Avgh4F8Av3DXM/PAUvE5BP428Ovv9VhPrz/Z61uhbSAo6PU/KmjoLxf/B0X5H0XbrwF/E1DAKeA28K98M+26qw8CuAr8ePH/GWAf+ETx7p8HLh9eG9+u13vegHeB4K4DfxX4ajHBv1JM8L8HPHPoexVgApx7kzqWgfPAz911/x8CPwX8wWEm+ybP/13g793VprdkskADuAg8dTeTPfSdf3w3k72rPAT+FvDaez0H0+veXPeCtoHvB9YAceg7N4EffIs23E3bY+CBQ///M+Dni8/vpF3fCQx5YzP4aeC3DpXL4tnvea/n4Vu9Pijqgn8D+EHgBPAI8JPAg8DLB1+w1o6AK8V9AIQQf00IMQRWcQTxy4fKPgJ8GMdo3xJCCAF8EictHMb/JYTYFkJ8WgjxobvK/hvgfwY2vvku3nnfUSFED0eAfxUnzU7xwcW7TdsPAl+1BScr8NXDzx6q481o+38EflwI4Qsh7gM+BnzmUN1v265D+AngV4vvgJNsxeHXF9dDb/LstxU+KEz271pr1621XeA3gEeBKm73P4w+UDv4x1r73xb/Pw78nwffF0Io4B8Af8laa/6Id/8Cbhz/j0P3fgw4DhwDfh/4XSFEs6j7w8DHgb/3TjtZtPmmtbaJO5r9Zzh1xhQfXLyrtP3NPHsIv8A30vZvAv86bpO/APxv1tqvvJO6hRDloo5fPHT7/wO+UwjxXUKIAPjrONVG+U3a9W2FDwqTPSwRjnGTPQTqd32vDgwO37AOL+KI5r8obv8Ubrd/9u1eKoT4aeDHgT9trU0O1flFa+3EWju21v4toAd8Ugghccz7Z6y1+Tvt5F3t7gL/CPgXhw1nU3zg8G7T9jf17JvRthCiDfwO8F/i1BYrwA8IIX7qndQN/AjQBT53qK0XcNLt38fpeWdw+t9Vvs3xQWGyb4ZXgTvHdCFEBaeov/tYfwCvKAf4HuBfE0JsCCE2gKeBvyOE+PuH6vt3gL+G0xn9UYRgcUefOk4F8StFvQcSwKoQ4pPvpHOH2jzHNxL2FB9sfCu0/SrwSKEKOMAjh599G9o+CWhr7S9Za/Oi7J8Af+odtusngF+6S2WBtfZXrbUPWWs7wN/AnQS/wrc73mul8Ld6cZeRCXfE+cfALO6o8qO4Xfe/o7B04jaXfx/nPSCAj+B2z79clDeBhUPXM8BfARpF+Y/hJIz736Q9R3HqgKB4788C20CneNfhep/EMeBl3rDu+sVzvwz818VnVZT9CM7iK4v+/VPghfd6DqbXtxVtH3gX/AzOePrTfL13wdvRdh13KvsLxXsWgGeBv1mUv2W7DtVxBMiBU29S/xM4r4VZnJHvl9/rOXhX5vG9bsC9IsTi8/fi9EYTnIfA8UOE+Du4I8sQZ+n/6xyyuN71jj/g6124rgFZ8ezB9Q+LsgdxhoQRsAv8HvDht6j3ON/owvWLxb3D108WZX+pePeoWAj/BDj2Xs/B9Pr2om3gMeD54tkXgMcOlb0lbRfl342TLvsFDf6vQPlQ+Zu261D5zwOff4v+fgGnWugC/wuF58G3+yWKzk0xxRRTTHEP8EHWyU4xxRRTvOeYMtkppphiinuIKZOdYooppriHmDLZKaaYYop7iCmTnWKKKaa4h3jbSKEHHv+oBYMwGaIIUFJCIKUELHmWMR6PGQxcQMdoNCZNUrS23O2zIITA2gP/5wPPpHcfotg3Dv4aYUFZwCCLANnQwGNNxb+9fIQfrrSYkTDOxgCM0zHVWolypYQVAmMl2gau714ZIX2GNmesoDpTp7wyg5x3sQCT3g5rN2+S9Ac0LJQtRLLYx0ROLiZkQYKuWGzVw2v6AESdgKBjsPUBojZElDKk58bbWIswEoFEG0uWZxwEi/m+wPcUWI8s9en1FGlSK8a7zVcvjPj8l9f5yvkh125buiM35pNckBnlIiSERgiLLKbGWjAGrJXgVfGEhXzoygDPC0GEpKnGYBDkxftypBUoAkAhRIwvD2hGYYxE+AG9eHjYCf49w0o1shmQaYPJHWEoJMoKBBapoNmq8+gjLnT+I088yrmzZ5ibbTM/N4OPQacTAPo72/S7O2TxmDyNCXwPody8JyZnNJkw7o+oVjvkssSF1dsAbOSaxpGj/PI//X/wEsOJ1iwbly4D8OTjj9Kea2GUxyQWaO2ztLQIgO8phqMxWnikWrJ2fZX+2hoAdTNBTfbo9bscObJA4oWoShUAqwJSEzGzdAoV1tFGkOeuD+mkiy+HZEmX4bBHWK5za22Xestl0hxsrrNQExw91mLp+CLDScz1G13Xj9spt7f3iXVOfabG4soSO7uOZnLqbNze5dSxEvFgg9s3V+k0m8V4W9A5GxtrZFmCEYa84AtaWJACqSxRYDjSrnJu9igAVRFy8bWv0dcpmVKEYYmKVwJgYf4IslLmM19+DhtJzh6dZzLUrp09w9qgjyoH3Hf/OXbXRszVFgA4ujzL0pEZqrUWzeYx9vYG+KHjCf29Dfp7Y/a6I3KZMcy2ePVrLr5iMkjolKucPVnit1+4/Ka0/bZM1vMkWJDKw5rieaOxgBQSP4ioSIVSjlmEYcR4PGY0mpBlGcbczUgP/r+7Le8ew7V3/UXcfcPdapV9FoOAyGqENpA4YisLSwBoq8mFxQqDwUXMZnmCECEGH2E9SATECkwFgMjPmKk0iXNBKU0JkxwvcYxGCoWRZbJMYbIMkVpE7hplE00y9PDaNWTLh+YIXXUTbIIUKzQCi
VIR0npo7dqjjUZag1IJQSlhrizBFNG9+T6Liw2e/vAKr70+4bkX93nhVZeL4/Jazq0tzd7IklqFFZBbXYyTyxokMdh8gCFACrfJGJsS5zGIBKQPJsCNFlgERubkpDiy8ohtseFpg8BCMv5WpvZdRZ5pMgG5tVAEP1njNhlPOEbWbpR5/PFHAPjI008xPztHrVqlWq2wuXaLSxccQ/zcp3+beqXE/GyHvc0NyqWAzuyMe5FnWN+8ya0bNzhy7AG+44d+lB0yAF744jPsJ2NOLbSJuwMmuxssdhoApPt9+ianMb9Eqqr0RERvt8ilkg+pVSOiMGJrfZM83uP+420ATjZD2mHOZNxFhYLS3CyDYpnvJB5X1nps7a9Ra1lG45Sg4ADlqmTY6yEYMrNQZXt7G8sQP5gDoFKSrHRqHGk36d2+xcvnz7M3cOMm/SXmmrM0OnMM0yG9nX3i1NHT8XMrzK2cQGZ98smYs/c/QrPhBIEXvvwcWTrBeh5SGHSeODoBPGFdhpjc4kmLjTPG+06YO35
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 432x432 with 4 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {
|
|||
|
"needs_background": "light"
|
|||
|
},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"dls.show_batch(max_n=4)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images, and later on would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
|
|||
|
"\n",
|
|||
|
"The approach we used was to ensure that there was enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
|
|||
|
"\n",
|
|||
|
"- We are going to need lots of stride two layers to make our grid one by one at the end — perhaps more than we would otherwise choose\n",
|
|||
|
"- The model will not work on images of any size other than the size we originally trained on.\n",
|
|||
|
"\n",
|
|||
|
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous, still sometimes used today, is the 2013 ImageNet winner VGG. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
|
|||
|
"\n",
|
|||
|
"This problem was solved through the creation of *fully convolutional networks*. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def avg_pool(x): return x.mean((2,3))"
|
|||
|
]
|
|||
|
},
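  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make this concrete, here is a quick shape check on a random activation tensor (a minimal sketch; only the shapes matter, and it assumes `torch` and `nn` are already available from the `utils` import at the top of this notebook). `nn.AdaptiveAvgPool2d(1)` gives the same result as `avg_pool`, just with the trailing unit axes left in place:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = torch.randn(16, 64, 8, 8)  # a mini-batch of 16 items, 64 channels, on an 8x8 grid\n",
    "avg_pool(x).shape, nn.AdaptiveAvgPool2d(1)(x).shape"
   ]
  },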
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"As you see, it is taking the mean over the X and Y axes. This function will always convert a grid of activations into a single activation per image. PyTorch provides a slightly more versatile module called `nn.AdaptiveAvgPool2d`, which averages a grid of activations into whatever sized destination you require (although we nearly always use the size of one).\n",
|
|||
|
"\n",
|
|||
|
"A fully convolutional network, therefore, has a number of convolutional layers, some of which will be stride two, at the end of which is an adaptive average pooling layer, a flatten layer to remove the unit axes, and finally a linear layer. Here is our first fully convolutional network:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def block(ni, nf): return ConvLayer(ni, nf, stride=2)\n",
|
|||
|
"def get_model():\n",
|
|||
|
" return nn.Sequential(\n",
|
|||
|
" block(3, 16),\n",
|
|||
|
" block(16, 32),\n",
|
|||
|
" block(32, 64),\n",
|
|||
|
" block(64, 128),\n",
|
|||
|
" block(128, 256),\n",
|
|||
|
" nn.AdaptiveAvgPool2d(1),\n",
|
|||
|
" Flatten(),\n",
|
|||
|
" nn.Linear(256, dls.c))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We're going to be replacing the implementation of `block` in the network with other variants in a moment, which is why we're not calling it `conv` any more. We're saving some time by taking advantage of fastai's `ConvLayer` that already provides the functionality of `conv` from the last chapter (plus a lot more!)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"> stop: Consider this question: Would this approach makes sense for an optical character recognition (OCR) problem such as MNIST? We see the vast majority of practitioners tackling OCR and similar problems tend to use fully convolutional networks, because that's what nearly everybody learns nowadays. But it really doesn't make any sense! You can't decide whether, for instance, whether a number is a \"3\" or an \"8\" by slicing it into small pieces, jumbling them up, and deciding whether on average each piece looks like a \"3\" or an \"8\". But that's what adaptive average pooling effectively does! Fully convolutional networks are only really a good choice for objects that don't have a single correct orientation or size (i.e. like most natural photos)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing `1 x 1` dimension like we did in our previous model. \n",
|
|||
|
"\n",
|
|||
|
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
|
|||
|
"\n",
|
|||
|
"As before, we can define a `Learner` with our custom model and the data we grabbed before then train it:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def get_learner(m):\n",
|
|||
|
" return Learner(dls, m, loss_func=nn.CrossEntropyLoss(), metrics=accuracy\n",
|
|||
|
" ).to_fp16()\n",
|
|||
|
"\n",
|
|||
|
"learn = get_learner(get_model())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(0.47863011360168456, 3.981071710586548)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEKCAYAAAAIO8L1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deXyU5bn/8c+VnSyEJUEwEMImiAgCEUXUYutWa4u22qrVaquH1rZWWtvT1p7TnqPn2Hq6t9YqVWtrtYt1+VnrRq07ggYEWaLIKmFNwpKErJNcvz9mgBiHACVP5pnk+3695pWZZ5tvhpAr930/z3ObuyMiItJRSqIDiIhIOKlAiIhIXCoQIiISlwqEiIjEpQIhIiJxqUCIiEhcaYkO0JUKCgq8pKQk0TFERJLGokWLqty9MN66HlUgSkpKKCsrS3QMEZGkYWYbDrROXUwiIhKXCoSIiMSlAiEiInGpQIiISFwqECIiEpcKhIiIxKUCISKSxJZv2s38NVUEMXWDCoSISBK7d/56vvLHNwI5tgqEiEgSW75pNxOK8jGzLj+2CoSISJJqaG7lne11TCzKD+T4KhAiIklq5ZYaWtucCSoQIiLS3vJNuwE4fmiSFQgzG2Zmz5lZuZmtMLPr42wz08x2m9mS2OO77datN7NlseW6A5+ISAfLNu2mIDeDwX2zAjl+kHdzjQA3uPtiM8sDFpnZPHdf2WG7l9z9/AMc4wx3rwowo4hI0lpWEdwANQTYgnD3Le6+OPa8FigHioJ6PxGR3iQ6QF0b2AA1dNMYhJmVAJOBhXFWTzezpWb2pJkd1265A8+Y2SIzm93JsWebWZmZlVVWVnZpbhGRsFq5pYY2J7ABauiGCYPMLBd4CJjj7jUdVi8Ghrt7nZmdBzwKjImtm+Hum81sEDDPzN5y9xc7Ht/d5wJzAUpLS7v+UkIRkRAKeoAaAm5BmFk60eJwv7s/3HG9u9e4e13s+RNAupkVxF5vjn3dDjwCTAsyq4hIMnmzItgBagj2LCYD7gbK3f0nB9hmcGw7zGxaLE+1meXEBrYxsxzgbGB5UFlFRJLN8k27OT7AAWoItotpBnAFsMzMlsSW3QgUA7j7HcBFwLVmFgEagEvc3c3sKOCR2DeeBjzg7k8FmFVEJGnsHaA+57ijAn2fwAqEu78MdFra3P024LY4y9cCkwKKJiKS1LpjgBp0JbWISNJZVrELCHaAGlQgRESSzrJNNYEPUIMKhIhI0lmycWfgA9SgAiEiklQ2VO9hTeUeThtTGPh7qUCIiCSRf5RvB+DMY4M9gwlUIEREksqz5dsYMyiX4oHZgb+XCoSISJKoaWzhtXU7+FA3tB5ABUJEJGm88HYlkTbnrPGDuuX9VCBERJLEP8q3MSAngxOG9e+W91OBEBFJApHWNp5/u5Izxg4iNSXY01v3UoEQEUkCZRt2sruhhTOP7Z7uJVCBEBFJCs+WbyMjNYXTjgn++oe9VCBERELO3flH+XZOHjWQ3MzA53nbRwVCRCTEGltamfPnJayr2sP5xw/p1vfuvlIkIiKHZXttI7N/v4glG3fxjXPGcnHp0G59fxUIEZEQqqxt4oLbXmFnfQt3XD6Fcyd0b+sBgp1ydJiZPWdm5Wa2wsyuj7PNTDPbbWZLYo/vtlt3rpm9bWarzexbQeUUEQmjRRt2sHl3I79OUHGAYFsQEeAGd18cm196kZnNc/eVHbZ7yd3Pb7/AzFKBXwFnARXA62b2WJx9RUR6pMaWNgCKBwR/z6UDCawF4e5b3H1x7HktUA4UHeLu04DV7r7W3ZuBPwGzgkkqIhI+TZFWADLTUxOWoVvOYjKzEmAysDDO6ulmttTMnjSz42LLioCN7bap4NCLi4hI0muKRFsQWWmJO9k08EFqM8sFHgLmuHtNh9WLgeHuXmdm5wGPAmOAeNeR+wGOPxuYDVBcXNxluUVEEqkp1sXUY1sQZpZOtDjc7+4Pd1zv7jXuXhd7/gSQbmYFRFsMw9ptOhTYHO893H2uu5e6e2lhYfddYSgiEqR9XUwJbEEEeRaTAXcD5e7+kwNsMzi2HWY2LZanGngdGGNmI8wsA7gEeCyorCIiYdMUaSPFIK2bbswXT5BdTDOAK4BlZrYktuxGoBjA3e8ALgKuNbMI0ABc4u4ORMzsy8DTQCpwj7uvCDCriEioNLa0kpmWSuxv6IQIrEC4+8vEH0tov81twG0HWPcE8EQA0UREQq8p0kZmemLvhqR7MYmIhFBTSxtZaYkboAYVCBGRUGqKtKoFISIi79cUaUvoGUygAiEiEkrRAqEuJhER6aAp0qoWhIiIvF9ji85iEhGROKItCHUxiYhIB00tbWSpBSEiIh1pkFpEROLSILWIiMSl6yBERCSuxpbWhM4FASoQIiKh4+5qQYiIyPu1tDrukKUWhIiItBeG2eRABUJEJHSaIrH5qHtqgTCzYWb2nJmVm9kKM7u+k21PNLNWM7uo3bJWM1sSe2i6URHpNfYXiMR2MQU55WgEuMHdF5tZHrDIzOa5+8r2G5lZKnAr0elF22tw9xMCzCciEkpNLbEupp56JbW7b3H3xbHntUA5UBRn0+uAh4DtQWUREUkmjS09vIupPTMrASYDCzssLwIuBO6Is1uWmZWZ2QIzuyDwkCIiIbF/kLrndjEBYGa5RFsIc9y9psPqnwHfdPdWM+u4a7G7bzazkcA/zWyZu6+Jc/zZwGyA4uLirv8GRES62b4xiJ7axQRgZulEi8P97v5wnE1KgT+Z2XrgIuD2va0Fd98c+7oWeJ5oC+R93H2uu5e6e2lhYWHXfxMiIt0sLIPUQZ7FZMDdQLm7/yTeNu4+wt1L3L0E+CvwRXd/1Mz6m1lm7DgFwAxgZbxjiIj0NPsGqRM8BhFkF9MM4ApgmZktiS27ESgGcPd44w57HQvcaWZtRIvYDzqe/SQi0lPtbUEkej6IwAqEu78MvG9goZPtr2r3fD5wfACxRERCr7ElHIPUupJaRCRkevyV1CIi8q/ZfxaTWhAiItKObtYnIiJxNfWmK6lFROTQNUXayEhLIc4FxN1KBUJEJGQaW1oT3noAFQgRkdCJTjea2AFqUIEQEQmdpkhrwi+SAxUIEZHQibYgEv/rOfEJRETkPZpa1MUkIiJxNEVaE36rb1CBEBEJHXUxiYhIXE0trepiEhGR91MLQkRE4mqKtJGV4Bv1gQqEiEjoNOlKahERiacp0tazz2Iys2Fm9pyZlZvZCjO7vpNtTzSzVjO7qN2yK83sndjjyqByioiETVhutRHknNQR4AZ3X2xmecAiM5vXcW5pM0sFbgWebrdsAPA9oBTw2L6PufvOAPOKiIRCj79Zn7tvcffFsee1QDlQFGfT64CHgO3tlp0DzHP3HbGiMA84N6isIiJhEWltI9LmoWhBdEuJMrMSYDKwsMPyIuBC4I4OuxQBG9u9riB+ccHMZptZmZmVVVZWdlVkEZGEaG6NThbUK27WZ2a5RFsIc9y9psPqnwHfdPfWjrvFOZTHO767z
3X3UncvLSwsPPLAIiIJFJbZ5CDYMQjMLJ1ocbjf3R+Os0kp8KfYrEkFwHlmFiHaYpjZbruhwPNBZhURCYOmSKxAhOA6iMAKhEV/698NlLv7T+Jt4+4j2m1/L/C4uz8aG6S+xcz6x1afDXw7qKwiImHRFIl2qPT0FsQM4ApgmZktiS27ESgGcPeO4w77uPsOM7sZeD226CZ33xFgVhGRUGjc18XUg1sQ7v4y8ccSDrT9VR1e3wPc08WxRERCLUwtiMQnEBGRffaPQST+1/MhJTCzUWaWGXs+08y+Ymb9go0mItL77D2LKZlu1vcQ0Gpmo4kOPI8AHggslYhIL5WMXUxt7h4helHbz9z9q8CQ4GKJiPRO+7qYQjBIfagFosXMLgWuBB6PLUsPJpKISO+VjC2IzwLTgf9193VmNgL4Q3CxRER6p32nuYZgkPqQTnON3YH1KwCxi9fy3P0HQQYTEemNmlr2tiCSpIvJzJ43s76xK5yXAr81s7hXR4uIyL9u7xhEMt2sLz92o72PA79196n
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 432x288 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {
|
|||
|
"needs_background": "light"
|
|||
|
},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"learn.lr_find()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"`3e-3` is very often a good learning rate for CNNs, and that appears to be the case here too, so let's try that:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: left;\">\n",
|
|||
|
" <th>epoch</th>\n",
|
|||
|
" <th>train_loss</th>\n",
|
|||
|
" <th>valid_loss</th>\n",
|
|||
|
" <th>accuracy</th>\n",
|
|||
|
" <th>time</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1.901582</td>\n",
|
|||
|
" <td>2.155090</td>\n",
|
|||
|
" <td>0.325350</td>\n",
|
|||
|
" <td>00:07</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1.559855</td>\n",
|
|||
|
" <td>1.586795</td>\n",
|
|||
|
" <td>0.507771</td>\n",
|
|||
|
" <td>00:07</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.296350</td>\n",
|
|||
|
" <td>1.295499</td>\n",
|
|||
|
" <td>0.571720</td>\n",
|
|||
|
" <td>00:07</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.144139</td>\n",
|
|||
|
" <td>1.139257</td>\n",
|
|||
|
" <td>0.639236</td>\n",
|
|||
|
" <td>00:07</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.049770</td>\n",
|
|||
|
" <td>1.092619</td>\n",
|
|||
|
" <td>0.659108</td>\n",
|
|||
|
" <td>00:07</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"learn.fit_one_cycle(5, 3e-3)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"That's a pretty good start, considering we have to pick the correct one of ten categories, and we're training from scratch for just 5 epochs!"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Building a modern CNN: ResNet"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We introduce the main idea behind them and show how it improves accuracy Imagenette compared to our previous model, before building a version with all the recent tweaks."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Skip-connections"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"In 2015 the authors of ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
|
|||
|
"\n",
|
|||
|
"> : Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as [previously reported] and thoroughly verified by our experiments.\n",
|
|||
|
"\n",
|
|||
|
"This is the graph they showed, with training error on the left, and test on the right:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<img alt=\"Training of networks of different depth\" width=\"700\" caption=\"Training of networks of different depth\" id=\"resnet_depth\" src=\"images/att_00042.png\">"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"As the authors mention here, they are not the first people to have noticed this curious fact. But they were the 1st to make a very important leap:\n",
|
|||
|
"\n",
|
|||
|
"> : Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model.\n",
|
|||
|
"\n",
|
|||
|
"Being an academic paper, this process written in a rather inaccessible way — but it's actually saying something very simple: start with the 20 layer neural network that is trained well, and add another 36 layers that do nothing at all (for instance, they linear layer with a single weight equal to one, and bias equal to 0). This would be a 56 layer network which does exactly the same thing as the 20 layer network. This shows that there are always deep networks which should be *at least as good* as any shallow network. But for some reason, SGD does not seem able to find them.\n",
|
|||
|
"\n",
|
|||
|
"> jargon: Identity mapping: a function that just returns its input without changing it at all. Also known as *identity function*.\n",
|
|||
|
"\n",
|
|||
|
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter which does a 2nd convolution, then relu, then batchnorm. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` for every one of these batchnorm layers to zero? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
|
|||
|
"\n",
|
|||
|
"What has that gained us, then? The key thing is that those 36 extra layers, as they stand, are an *identity mapping*, but they have *parameters*, which means they are *trainable*. So, we can start with our best 20 layer model, add these 36 extra layers which initially do nothing at all, and then *fine tune the whole 56 layer model*. If those extra 36 layers can be useful, then they can learn parameters to do so!\n",
|
|||
|
"\n",
|
|||
|
"The ResNet paper actually proposed a variant of this, which is to instead \"skip over\" every 2nd convolution, so effectively we get `x+conv2(conv1(x))`. Or In diagram form (from the paper):"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<img alt=\"A simple ResNet block\" width=\"331\" caption=\"A simple ResNet block\" id=\"resnet_block\" src=\"images/att_00043.png\">"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"That arrow on the right is just the `x` part of `x+conv2(conv1(x))`, and is known as the *identity branch* or *skip connection*. The path on the left is the `conv2(conv1(x))` part. You can think of the identity path as providing a direct route from the input to the output.\n",
|
|||
|
"\n",
|
|||
|
"In a ResNet, we don't actually train it by first training a smaller number of layers, and then add new layers on the end and fine-tune. Instead, we use ResNet blocks (like the above) throughout the CNN, initialized from scratch in the usual way, and trained with SGD in the usual way. We rely on the skip connections to make the network easier to train for SGD."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"There's another (largely equivalent) way to think of these \"ResNet blocks\". This is how the paper describes it:\n",
|
|||
|
"\n",
|
|||
|
"> : Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.\n",
|
|||
|
"\n",
|
|||
|
"Again, this is rather inaccessible prose—so let's try to restate it in plain English! If the outcome of a given layer is `x`, when using a ResNet block that return `y = x+block(x)`, we're not asking the block to predict `y`, we are asking it to predict the difference between `y-x`. So the job of those blocks isn't to predict certain features anymore, but a little extra step that will minimize the error between `x` and the desired `y`. ResNet is, therefore, good at learning about slight differences between doing nothing and some other feature that the layer learns. Since we predict residuals (reminder: \"residual\" is predictions minus targets), this is why those kinds of models were named ResNets.\n",
|
|||
|
"\n",
|
|||
|
"One key concept that both of these two ways of thinking about ResNets share is the idea of \"easy to learn\". This is an important theme. Recall the universal approximation theorem, which states that a sufficiently large network *can* learn anything. This is still true. But there turns out to be a very important difference between what a network *can learn* in principle, and what it is *easy for it to learn* under realistic data and training regimes. Many of the advances in neural networks over the last decade have been like the ResNet block: the result of realizing how to make something which was always possible actually feasible.\n",
|
|||
|
"\n",
|
|||
|
"> note: The original paper didn't actually do the trick of using zero for the initial value of gamma in the batchnorm layer; that came a couple of years later. So the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to \"navigate through\" the skip connections did indeed make it train better. Adding the batchnorm gamma init trick made the models train at even higher learning rates.\n",
|
|||
|
"\n",
|
|||
|
"Here's the definition of a simple ResNet block (where `norm_type=NormType.BatchZero` causes fastai to init the `gamma` weights of that batchnorm layer to zero):"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"class ResBlock(Module):\n",
|
|||
|
" def __init__(self, ni, nf):\n",
|
|||
|
" self.convs = nn.Sequential(\n",
|
|||
|
" ConvLayer(ni,nf),\n",
|
|||
|
" ConvLayer(nf,nf, norm_type=NormType.BatchZero))\n",
|
|||
|
" \n",
|
|||
|
" def forward(self, x): return x + self.convs(x)"
|
|||
|
]
|
|||
|
},
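  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the zero-init trick (a sketch, assuming fastai's default `ConvLayer` ordering of conv, then batchnorm, then activation): because the second layer uses `NormType.BatchZero`, its batchnorm `gamma` starts at zero, so `self.convs(x)` is all zeros at initialization and a freshly created block acts as an identity mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blk = ResBlock(64, 64)\n",
    "x = torch.randn(8, 64, 16, 16)\n",
    "torch.allclose(blk(x), x)  # True at init: convs(x) is zero, so the block just returns x"
   ]
  },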
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"One problem with this, however, is that it can't handle a stride other than `1`, and it requires that `ni==nf`. Stop for a moment, to think carefully about why this is...\n",
|
|||
|
"\n",
|
|||
|
"The issue is that with a stride of, say, `2`, on one of the convolutions, the grid size of the output activations will be half the size on each axis of the input. So then we can't add that back to `x` in `forward` because `x` and the output activations have different dimensions. The same basic issue occurs if `ni!=nf`: the shapes of the input and output connections won't allow us to add them together.\n",
|
|||
|
"\n",
|
|||
|
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer which takes 2x2 patches from the input, and replaces them with their average.\n",
|
|||
|
"\n",
|
|||
|
"Changing the number of channels can be done by using a convolution. We want this skip connection to be as close to an identity map as possible, however, which means making this convolution as simple as possible. The simplest possible convolution is one where the kernel size is `1`. That means that the kernel is size `ni*nf*1*1`, so it's only doing a dot product over the channels of each input pixel--it's not combining across pixels at all. This kind of *1x1 convolution* is very widely used in modern CNNs, so take a moment to think about how it works."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"> question: Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"> jargon: 1x1 convolution: A convolution with a kernel size of one."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Here's a ResBlock using these tricks to handle changing shape in the skip connection:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def _conv_block(ni,nf,stride):\n",
|
|||
|
" return nn.Sequential(\n",
|
|||
|
" ConvLayer(ni, nf, stride=stride),\n",
|
|||
|
" ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"class ResBlock(Module):\n",
|
|||
|
" def __init__(self, ni, nf, stride=1):\n",
|
|||
|
" self.convs = _conv_block(ni,nf,stride)\n",
|
|||
|
" self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None)\n",
|
|||
|
" self.pool = noop if stride==1 else nn.AvgPool2d(2, ceil_mode=True)\n",
|
|||
|
"\n",
|
|||
|
" def forward(self, x):\n",
|
|||
|
" return F.relu(self.convs(x) + self.idconv(self.pool(x)))"
|
|||
|
]
|
|||
|
},
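  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a quick shape check (a sketch based on the definitions above): with `stride=2` and a change in the number of filters, the main path and the skip connection still produce matching shapes, with the grid halved and the channels changed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = torch.randn(8, 32, 28, 28)\n",
    "# same shape out when ni==nf and stride==1; half the grid and more channels otherwise\n",
    "ResBlock(32, 32)(x).shape, ResBlock(32, 64, stride=2)(x).shape"
   ]
  },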
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Note that we're using the `noop` function here, which simply returns its input unchanged (*noop* is a computer science term that stands for \"no operation\"). In this case, `idconv` does nothing at all if `nf==nf`, and `pool` does nothing if `stride==1`, which is what we wanted in our skip connection.\n",
|
|||
|
"\n",
|
|||
|
"Also, you'll see that we've removed relu (`act_cls=None`) from the final convolution in `convs` and from `idconv`, and moved it to *after* we add the skip connection. The thinking behind this is that the whole ResNet block is like a layer, and you want your activation to be *after* your layer.\n",
|
|||
|
"\n",
|
|||
|
"Let's replace our `block` with `ResBlock`, and try it out:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def block(ni,nf): return ResBlock(ni, nf, stride=2)\n",
|
|||
|
"learn = get_learner(get_model())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: left;\">\n",
|
|||
|
" <th>epoch</th>\n",
|
|||
|
" <th>train_loss</th>\n",
|
|||
|
" <th>valid_loss</th>\n",
|
|||
|
" <th>accuracy</th>\n",
|
|||
|
" <th>time</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1.973174</td>\n",
|
|||
|
" <td>1.845491</td>\n",
|
|||
|
" <td>0.373248</td>\n",
|
|||
|
" <td>00:08</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1.678627</td>\n",
|
|||
|
" <td>1.778713</td>\n",
|
|||
|
" <td>0.439236</td>\n",
|
|||
|
" <td>00:08</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.386163</td>\n",
|
|||
|
" <td>1.596503</td>\n",
|
|||
|
" <td>0.507261</td>\n",
|
|||
|
" <td>00:08</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.177839</td>\n",
|
|||
|
" <td>1.102993</td>\n",
|
|||
|
" <td>0.644841</td>\n",
|
|||
|
" <td>00:09</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.052435</td>\n",
|
|||
|
" <td>1.038013</td>\n",
|
|||
|
" <td>0.667771</td>\n",
|
|||
|
" <td>00:09</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"learn.fit_one_cycle(5, 3e-3)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"It's not much better. But the whole point of this was to allow us to train *deeper* models, and we're not really taking advantage of that yet. To create a deeper model that's, say, twice as deep, all we need to do is replace our `block` with two `ResBlock`s in a row:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def block(ni, nf):\n",
|
|||
|
" return nn.Sequential(ResBlock(ni, nf, stride=2), ResBlock(nf, nf))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: left;\">\n",
|
|||
|
" <th>epoch</th>\n",
|
|||
|
" <th>train_loss</th>\n",
|
|||
|
" <th>valid_loss</th>\n",
|
|||
|
" <th>accuracy</th>\n",
|
|||
|
" <th>time</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1.964076</td>\n",
|
|||
|
" <td>1.864578</td>\n",
|
|||
|
" <td>0.355159</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1.636880</td>\n",
|
|||
|
" <td>1.596789</td>\n",
|
|||
|
" <td>0.502675</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.335378</td>\n",
|
|||
|
" <td>1.304472</td>\n",
|
|||
|
" <td>0.588535</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.089160</td>\n",
|
|||
|
" <td>1.065063</td>\n",
|
|||
|
" <td>0.663185</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>0.942904</td>\n",
|
|||
|
" <td>0.963589</td>\n",
|
|||
|
" <td>0.692739</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"learn = get_learner(get_model())\n",
|
|||
|
"learn.fit_one_cycle(5, 3e-3)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Now we're making good progress!\n",
|
|||
|
"\n",
|
|||
|
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be strong practitioner, not just a theoretician.\n",
|
|||
|
"\n",
|
|||
|
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids us falling into a very sharp area. Here's a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right):"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<img alt=\"Impact of ResNet on loss landscape\" width=\"600\" caption=\"Impact of ResNet on loss landscape\" id=\"resnet_surface\" src=\"images/att_00044.png\">"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### A state-of-the-art ResNet"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"In [Bag of Tricks for Image Classification with Convolutional Neural Networks](https://arxiv.org/abs/1812.01187), the authors study different variations of the ResNet architecture that come at almost no additional cost in terms of number of parameters or computation. By using this tweaked ResNet50 architecture and Mixup they achieve 94.6% top-5 accuracy on ImageNet, instead of 92.2% with a regular ResNet50 without Mixup. This result is better than regular ResNet models that are twice as deep (and twice as slow, and much more likely to overfit)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"> jargon: top-5 accuracy: A metric testing how often the label we want is in the top 5 predictions of our model. It was used in the Imagenet competition, since many images contained multiple objects, or contained objects that could be easily confused or may even have been mislabeled with a similar label. In these situations, looking at top-1 accuracy may be inappropriate. However, recently CNNs have been getting so good that top-5 accuracy is nearly 100%, so some researchers are using top-1 accuracy for Imagenet too now."
|
|||
|
]
|
|||
|
},
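  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration, here is a minimal sketch of such a metric in plain PyTorch (the name `top5_accuracy` is ours; fastai also ships a `top_k_accuracy` metric you can use directly). It just checks whether the target appears among the five highest-scoring classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def top5_accuracy(preds, targs):\n",
    "    \"Fraction of items whose target is among the 5 highest-scoring classes\"\n",
    "    top5 = preds.topk(5, dim=1).indices           # (bs, 5) indices of the best 5 classes\n",
    "    return (top5 == targs.unsqueeze(1)).any(dim=1).float().mean()"
   ]
  },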
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our the implementation we had before in that it begins with a few convolutional layers followed by a max pooling layer, instead of just starting with ResNet blocks. This is what the first layers look like:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def _resnet_stem(*sizes):\n",
|
|||
|
" return [\n",
|
|||
|
" ConvLayer(sizes[i], sizes[i+1], 3, stride = 2 if i==0 else 1)\n",
|
|||
|
" for i in range(len(sizes)-1)\n",
|
|||
|
" ] + [nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"[ConvLayer(\n",
|
|||
|
" (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)\n",
|
|||
|
" (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
|
|||
|
" (2): ReLU()\n",
|
|||
|
" ), ConvLayer(\n",
|
|||
|
" (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)\n",
|
|||
|
" (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
|
|||
|
" (2): ReLU()\n",
|
|||
|
" ), ConvLayer(\n",
|
|||
|
" (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)\n",
|
|||
|
" (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
|
|||
|
" (2): ReLU()\n",
|
|||
|
" ), MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"#hide_output\n",
|
|||
|
"_resnet_stem(3,32,32,64)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"```\n",
|
|||
|
"[ConvLayer(\n",
|
|||
|
" (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))\n",
|
|||
|
" (1): BatchNorm2d(32, eps=1e-05, momentum=0.1)\n",
|
|||
|
" (2): ReLU()\n",
|
|||
|
" ), ConvLayer(\n",
|
|||
|
" (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
|
|||
|
" (1): BatchNorm2d(32, eps=1e-05, momentum=0.1)\n",
|
|||
|
" (2): ReLU()\n",
|
|||
|
" ), ConvLayer(\n",
|
|||
|
" (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
|
|||
|
" (1): BatchNorm2d(64, eps=1e-05, momentum=0.1)\n",
|
|||
|
" (2): ReLU()\n",
|
|||
|
" ), MaxPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=False)]\n",
|
|||
|
" ```"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"> jargon: Stem: The stem of a CNN are its first few layers. Generally, the stem has a different structure to the main body of the CNN."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"The reason that we have a stem of plain convolutional layers, instead of ResNet blocks, is based on a very important insight about all deep convolutional neural networks: the vast majority of the computation occurs in the early layers. Therefore, we should keep the early layers as fast and simple as possible.\n",
|
|||
|
"\n",
|
|||
|
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128 pixel input image. If it is a stride one convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4x4 or even 2x2. So there are far fewer kernel applications to do.\n",
|
|||
|
"\n",
|
|||
|
"On the other hand, the first layer convolution only has three input features, and 32 output features. Since it is a 3x3 kernel, this is 3×32×3×3 = 864 parameters in the weights. On the other hand, the last convolution will be 256 input features and 512 output features, which will be 1,179,648 weights! So the first layers contain vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
|
|||
|
"\n",
|
|||
|
"A ResNet block takes more computation than a plain convolutional block, since (in the stride two case) a ResNet block has three convolutions and a pooling layer. That's why we want to have plain convolutions to start off our ResNet.\n",
|
|||
|
"\n",
|
|||
|
"We're now ready to show the implementation of a modern ResNet, with the \"bag of tricks\". The ResNet use four groups of ResNet blocks, with 64, 128, 256 then 512 filters. Each groups starts with a stride 2 block, except for the first one, since it's just after a `MaxPooling` layer."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"class ResNet(nn.Sequential):\n",
|
|||
|
" def __init__(self, n_out, layers, expansion=1):\n",
|
|||
|
" stem = _resnet_stem(3,32,32,64)\n",
|
|||
|
" self.block_szs = [64, 64, 128, 256, 512]\n",
|
|||
|
" for i in range(1,5): self.block_szs[i] *= expansion\n",
|
|||
|
" blocks = [self._make_layer(*o) for o in enumerate(layers)]\n",
|
|||
|
" super().__init__(*stem, *blocks,\n",
|
|||
|
" nn.AdaptiveAvgPool2d(1), Flatten(),\n",
|
|||
|
" nn.Linear(self.block_szs[-1], n_out))\n",
|
|||
|
" \n",
|
|||
|
" def _make_layer(self, idx, n_layers):\n",
|
|||
|
" stride = 1 if idx==0 else 2\n",
|
|||
|
" ch_in,ch_out = self.block_szs[idx:idx+2]\n",
|
|||
|
" return nn.Sequential(*[\n",
|
|||
|
" ResBlock(ch_in if i==0 else ch_out, ch_out, stride if i==0 else 1)\n",
|
|||
|
" for i in range(n_layers)\n",
|
|||
|
" ])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"The `_make_layer` function is just there to create a series of `nl` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section' for now, it'll be `1`, so it doesn't do anything.)\n",
|
|||
|
"\n",
|
|||
|
"The various versions of the models (ResNet 18, 34, 50, etc) just change the number of blocks in each of those groups. This is the definition of a ResNet18:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"rn = ResNet(dls.c, [2,2,2,2])"
|
|||
|
]
|
|||
|
},
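  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sanity check (the exact figure depends on fastai's `ConvLayer` defaults), we can count the trainable parameters--we'd expect something on the order of 11 million, in the same ballpark as a standard ResNet-18:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(p.numel() for p in rn.parameters())"
   ]
  },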
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Let's train it for a little bit and see how it fares compared to the previous model:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: left;\">\n",
|
|||
|
" <th>epoch</th>\n",
|
|||
|
" <th>train_loss</th>\n",
|
|||
|
" <th>valid_loss</th>\n",
|
|||
|
" <th>accuracy</th>\n",
|
|||
|
" <th>time</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1.673882</td>\n",
|
|||
|
" <td>1.828394</td>\n",
|
|||
|
" <td>0.413758</td>\n",
|
|||
|
" <td>00:13</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1.331675</td>\n",
|
|||
|
" <td>1.572685</td>\n",
|
|||
|
" <td>0.518217</td>\n",
|
|||
|
" <td>00:13</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.087224</td>\n",
|
|||
|
" <td>1.086102</td>\n",
|
|||
|
" <td>0.650701</td>\n",
|
|||
|
" <td>00:13</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>0.900428</td>\n",
|
|||
|
" <td>0.968219</td>\n",
|
|||
|
" <td>0.684331</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>0.760280</td>\n",
|
|||
|
" <td>0.782558</td>\n",
|
|||
|
" <td>0.757197</td>\n",
|
|||
|
" <td>00:12</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"learn = get_learner(rn)\n",
|
|||
|
"learn.fit_one_cycle(5, 3e-3)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Even although we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Bottleneck layers"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Things are a tiny bit more complicated for deeper models like `resnet50` as they don't use the same resnet blocks: instead of stacking two convolutions with a kernel size of 3, they use three different convolutions: two 1x1 (at the beginning and the end) and one 3x3, as shown in the right of this image from the ResNet paper (using an example of 64 channel output, comparing to the regular ResBlock on the left):"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<img alt=\"Comparison of regular and bottleneck ResNet blocks\" width=\"550\" caption=\"Comparison of regular and bottleneck ResNet blocks\" id=\"resnet_compare\" src=\"images/att_00045.png\">"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Why? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first resnet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
|
|||
|
"\n",
|
|||
|
"Let's try replacing our ResBlock with this bottleneck design:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def _conv_block(ni,nf,stride):\n",
|
|||
|
" return nn.Sequential(\n",
|
|||
|
" ConvLayer(ni, nf//4, 1),\n",
|
|||
|
" ConvLayer(nf//4, nf//4, stride=stride), \n",
|
|||
|
" ConvLayer(nf//4, nf, 1, act_cls=None, norm_type=NormType.BatchZero))"
|
|||
|
]
|
|||
|
},
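  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why this is cheaper, here is the back-of-the-envelope arithmetic for a 256-channel block, counting convolution weights only (ignoring batchnorm parameters), based on the definitions above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plain      = 2 * (256*256*3*3)                    # two 3x3 convs: 1,179,648 weights\n",
    "bottleneck = 256*64*1*1 + 64*64*3*3 + 64*256*1*1  # 1x1, then 3x3, then 1x1: 69,632 weights\n",
    "plain, bottleneck"
   ]
  },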
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We'll use this to create a ResNet50, which uses this bottleneck block, and uses group sizes of `(3,4,6,3)`. We now need to pass `4` in to the `expansion` parameter of `ResNet`, since we need to start with four times less channels, and we'll end with four times more channels.\n",
|
|||
|
"\n",
|
|||
|
"Deeper networks like this don't generally show improvements when training for only 5 epochs, so we'll bump it up to 20 epochs this time to make the most of our bigger model. And to really get great results, let's use bigger images too:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"dls = get_data(URLs.IMAGENETTE_320, presize=320, resize=224)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We don't have to do anything to account for the larger 224 pixel images--thanks to our fully convolutional network, it just works. This is also why we were able to do *progressive resizing* earlier in the book--the models we used were fully convolutional, so we were even able to fine-tune models trained with different sizes."
|
|||
|
]
|
|||
|
},
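  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance, a fresh (untrained, purely illustrative) copy of the ResNet-18-style model defined above maps both 128-pixel and 224-pixel inputs to the same 10-class output--the adaptive pooling layer absorbs the difference in grid size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = ResNet(dls.c, [2,2,2,2])\n",
    "m(torch.randn(2, 3, 128, 128)).shape, m(torch.randn(2, 3, 224, 224)).shape"
   ]
  },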
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"rn = ResNet(dls.c, [3,4,6,3], 4)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: left;\">\n",
|
|||
|
" <th>epoch</th>\n",
|
|||
|
" <th>train_loss</th>\n",
|
|||
|
" <th>valid_loss</th>\n",
|
|||
|
" <th>accuracy</th>\n",
|
|||
|
" <th>time</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1.613448</td>\n",
|
|||
|
" <td>1.473355</td>\n",
|
|||
|
" <td>0.514140</td>\n",
|
|||
|
" <td>00:31</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1.359604</td>\n",
|
|||
|
" <td>2.050794</td>\n",
|
|||
|
" <td>0.397452</td>\n",
|
|||
|
" <td>00:31</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.253112</td>\n",
|
|||
|
" <td>4.511735</td>\n",
|
|||
|
" <td>0.387006</td>\n",
|
|||
|
" <td>00:31</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.133450</td>\n",
|
|||
|
" <td>2.575221</td>\n",
|
|||
|
" <td>0.396178</td>\n",
|
|||
|
" <td>00:31</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.054752</td>\n",
|
|||
|
" <td>1.264525</td>\n",
|
|||
|
" <td>0.613758</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" <td>0.927930</td>\n",
|
|||
|
" <td>2.670484</td>\n",
|
|||
|
" <td>0.422675</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" <td>0.838268</td>\n",
|
|||
|
" <td>1.724588</td>\n",
|
|||
|
" <td>0.528662</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>0.748289</td>\n",
|
|||
|
" <td>1.180668</td>\n",
|
|||
|
" <td>0.666497</td>\n",
|
|||
|
" <td>00:31</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>8</td>\n",
|
|||
|
" <td>0.688637</td>\n",
|
|||
|
" <td>1.245039</td>\n",
|
|||
|
" <td>0.650446</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>9</td>\n",
|
|||
|
" <td>0.645530</td>\n",
|
|||
|
" <td>1.053691</td>\n",
|
|||
|
" <td>0.674904</td>\n",
|
|||
|
" <td>00:31</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>10</td>\n",
|
|||
|
" <td>0.593401</td>\n",
|
|||
|
" <td>1.180786</td>\n",
|
|||
|
" <td>0.676433</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>0.536634</td>\n",
|
|||
|
" <td>0.879937</td>\n",
|
|||
|
" <td>0.713885</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>12</td>\n",
|
|||
|
" <td>0.479208</td>\n",
|
|||
|
" <td>0.798356</td>\n",
|
|||
|
" <td>0.741656</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>13</td>\n",
|
|||
|
" <td>0.440071</td>\n",
|
|||
|
" <td>0.600644</td>\n",
|
|||
|
" <td>0.806879</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>14</td>\n",
|
|||
|
" <td>0.402952</td>\n",
|
|||
|
" <td>0.450296</td>\n",
|
|||
|
" <td>0.858599</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>15</td>\n",
|
|||
|
" <td>0.359117</td>\n",
|
|||
|
" <td>0.486126</td>\n",
|
|||
|
" <td>0.846369</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>16</td>\n",
|
|||
|
" <td>0.313642</td>\n",
|
|||
|
" <td>0.442215</td>\n",
|
|||
|
" <td>0.861911</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>17</td>\n",
|
|||
|
" <td>0.294050</td>\n",
|
|||
|
" <td>0.485967</td>\n",
|
|||
|
" <td>0.853503</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>18</td>\n",
|
|||
|
" <td>0.270583</td>\n",
|
|||
|
" <td>0.408566</td>\n",
|
|||
|
" <td>0.875924</td>\n",
|
|||
|
" <td>00:32</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <td>19</td>\n",
|
|||
|
" <td>0.266003</td>\n",
|
|||
|
" <td>0.411752</td>\n",
|
|||
|
" <td>0.872611</td>\n",
|
|||
|
" <td>00:33</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"learn = get_learner(rn)\n",
|
|||
|
"learn.fit_one_cycle(20, 3e-3)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We're getting a great result now! Try adding Mixup, and then training this for a hundred epochs while you go get lunch. You'll have yourself a very accurate image classifier, trained from scratch.\n",
|
|||
|
"\n",
|
|||
|
"The bottleneck design we've shown here is only used in ResNet50, 101, and 152 in all official models we've seen. ResNet18 and 34 use the non-bottleneck design seen in the previous section. However, we've noticed that the bottleneck layer generally works better even for the shallower networks. This just goes to show that the little details in papers tend to stick around for years, even if they're actually not quite the best design! Questioning assumptions and \"stuff everyone knows\" is always a good idea, because this is still a new field, and there's lots of details that aren't always done well."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Questionnaire"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"1. How did we get to a single vector of activations in the convnets used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
|
|||
|
"1. What do we do for Imagenette instead?\n",
|
|||
|
"1. What is adaptive pooling?\n",
|
|||
|
"1. What is average pooling?\n",
|
|||
|
"1. Why do we need `Flatten` after an adaptive average pooling layer?\n",
|
|||
|
"1. What is a skip connection?\n",
|
|||
|
"1. Why do skip connections allow us to train deeper models?\n",
|
|||
|
"1. What does <<resnet_depth>> show? How did that lead to the idea of skip connections?\n",
|
|||
|
"1. What is an identity mapping?\n",
|
|||
|
"1. What is the basic equation for a resnet block (ignoring batchnorm and relu layers)?\n",
|
|||
|
"1. What do ResNets have to do with \"residuals\"?\n",
|
|||
|
"1. How do we deal with the skip connection when there is a stride 2 convolution? How about when the number of filters changes?\n",
|
|||
|
"1. How can we express a 1x1 convolution in terms of a vector dot product?\n",
|
|||
|
"1. What does the `noop` function return?\n",
|
|||
|
"1. Explain what is shown in <<resnet_surface>>.\n",
|
|||
|
"1. When is top-5 accuracy a better metric than top-1 accuracy?\n",
|
|||
|
"1. What is the stem of a CNN?\n",
|
|||
|
"1. Why use plain convs in the CNN stem, instead of resnet blocks?\n",
|
|||
|
"1. How does a bottleneck block differ from a plain resnet block?\n",
|
|||
|
"1. Why is a bottleneck block faster?\n",
|
|||
|
"1. How do fully convolution nets (and nets with adaptive pooling in general) allow for progressive resizing?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Further research"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride 2 layers). How does it compare to a network without such a pooling layer?\n",
|
|||
|
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1x1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
|
|||
|
"1. Write a \"top 5 accuracy\" function using plain PyTorch or plain Python.\n",
|
|||
|
"1. Train a model on Imagenette for more epochs, with and without label smoothing. Take a look at the Imagenette leaderboards and see how close you can get to the best results shown. Read the linked pages describing the leading approaches."
|
|||
|
]
|
|||
|
  }
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"jupytext": {
|
|||
|
"split_at_heading": true
|
|||
|
},
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Python 3",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.7.5"
|
|||
|
},
|
|||
|
"toc": {
|
|||
|
"base_numbering": 1,
|
|||
|
"nav_menu": {},
|
|||
|
"number_sections": false,
|
|||
|
"sideBar": true,
|
|||
|
"skip_h1_title": true,
|
|||
|
"title_cell": "Table of Contents",
|
|||
|
"title_sidebar": "Contents",
|
|||
|
"toc_cell": false,
|
|||
|
"toc_position": {},
|
|||
|
"toc_section_display": true,
|
|||
|
"toc_window_display": false
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|