"In <<chapter_mnist_basics>> we learned how to create a neural network recognising images. We were able to achieve a bit over 98% accuracy at recognising threes from sevens. But we also saw that fastai's built in classes were able to get close to 100%. Let's start trying to close the gap.\n",
"\n",
"In this chapter, we will start by digging into what convolutions are and build a CNN from scratch. We will then study a range of techniques to improve training stability and learn all the tweaks the library usually applies for us to get great results."
"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular dataset preprocessing added date features to the Bulldozers dataset. What kind of features might we be able to create from images?"
"> jargon: Feature engineering: creating new transformations of the input data in order to make it easier to model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the context of an image, a *feature* will be a visually distinctive attribute of an image. Here's an idea: the number seven is characterised by a horizontal edge near the top of the digit, and a bottom left to top right diagonal edge underneath that. On the other hand, the number three is characterised by a diagonal edge in one direction in the top left and bottom right of the digit, the opposite diagonal on the bottom left and top right, a horizontal edge in the middle of the top and the bottom, and so forth. So what if we could extract information about where the edges occur in each image, and then use that as our features, instead of raw pixels?\n",
"\n",
"It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition — two operations which are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n",
"<img src=\"images/chapter9_conv_basic.png\" id=\"basic_conv\" caption=\"Applying a kernel to one location\" alt=\"Applying a kernel to one location\" width=\"700\">"
"The 7x7 grid to the left is our *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel, to each element of a 3x3 block of the image. The results of these multiplications are then added together. The diagram above shows an example of applying a kernel to a single location in the image, the 3x3 block around cell 18.\n",
"Now we're going to take the top 3x3 pixel square of our image, and we'll multiply each of those by each item in our kernel. Then we'll add them up. Like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[-0., -0., -0.],\n",
" [0., 0., 0.],\n",
" [0., 0., 0.]])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"im3_t = tensor(im3)\n",
"im3_t[0:3,0:3] * top_edge"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor(0.)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(im3_t[0:3,0:3] * top_edge).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not very interesting so far - they are all white pixels in the top left corner. But let's pick a couple of more interesting spots:"
"<img alt=\"Top section of a digit\" width=\"490\" src=\"images/att_00059.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a top edge at cell 5,7. Let's repeat our calculation there:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor(762.)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(im3_t[4:7,6:9] * top_edge).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a right edge at cell 8,18. What does that give us?:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor(-29.)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(im3_t[7:10,17:20] * top_edge).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this little calculation is returning a high number where the 3x3 pixel square represents a top edge (i.e. where there are low values at the top of the square, and high values immediately underneath). That's because the `-1` values in our kernel have little impact in that case, but the `1` values have a lot.\n",
"\n",
"Let's look a tiny bit at the math. The filter will take any window of size 3 by 3 in our images, and if we name the pixel values like this:\n",
"it will return $a1+a2+a3-a7-a8-a9$. Now if we are in a part of the image where there $a1$, $a2$ and $a3$ are kind of the same as $a7$, $a8$ and $a9$, then the terms will cancel each other and we will get 0. However if $a1$ is greater than $a7$, $a2$ is greater than $a8$ and $a3$ is greater than $a9$, we will get a bigger number as a result. So this filter detects horizontal edges, more precisely edges where we go from bright parts of the image at the top to darker parts at the bottom.\n",
"\n",
"Changing our filter to have the row of ones at the top and the -1 at the bottom would detect horizonal edges that go from dark to light. Putting the ones and -1 in columns versus rows would give us a filter that detect vertical edges. Each set of weights will produce a different kind of outcome.\n",
"\n",
"Let's create a function to do this for one location, and check it matches our result from before:"
"We can map `apply_kernel()` across the coordinate grid. That is, we'll be taking our 3x3 kernel, and applying it to each 3x3 section of our image. For instance, here are the positions a 3x3 kernel can be applied to in the first row of a 5x5 image:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/chapter9_nopadconv.svg\" id=\"nopad_conv\" caption=\"Applying a kernel across a grid\" alt=\"Applying a kernel across a grid\" width=\"400\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a *grid* of coordinates we can use a *nested list comprehension*, like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[(1, 1), (1, 2), (1, 3), (1, 4)],\n",
" [(2, 1), (2, 2), (2, 3), (2, 4)],\n",
" [(3, 1), (3, 2), (3, 3), (3, 4)],\n",
" [(4, 1), (4, 2), (4, 3), (4, 4)]]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[[(i,j) for j in range(1,5)] for i in range(1,5)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: Nested list comprehensions are used a lot in Python, so if you haven't seen them before, take a few minutes to make sure you understand what's happening here, and experiment with writing your own nested list comprehensions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the result of applying our kernel over a coordinate grid."
"top_edge3 = tensor([[apply_kernel(i,j,top_edge) for j in rng] for i in rng])\n",
"\n",
"show_image(top_edge3);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking good! Our top edges are black, and bottom edges are white (since they are the *opposite* of top edges). Now that our *image* contains negative numbers too, matplotlib has automatically changed our colors, so that white is the smallest number in the image, black the highest, and zeros appear as grey.\n",
"This operation of applying a kernel over a grid in this way is called *convolution*. In the paper [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285) there are many great diagrams showing how image kernels can be applied. Here's an example from the paper showing (at bottom) a light blue 4x4 image, with a dark blue 3x3 kernel being applied, creating a 2x2 green output activation map at the top. "
"<img alt=\"Result of applying a 3x3 kernel to a 4x4 image\" width=\"782\" caption=\"Result of applying a 3x3 kernel to a 4x4 image (curtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_ex_four_conv\" src=\"images/att_00028.png\">"
"Look at the shape of the result. If the original image has a height of `h` and a width of `w`, how many 3 by 3 windows can we find? As you see from the example, there are `h-2` by `w-2` windows, so the image we get as a result as a height of `h-2` and a witdh of `w-2`."
"We won't implement this convolution function from scratch, but use PyTorch's implementation instead (it is way faster than anything we could do in Python)."
"Convolution is such an important and widely-used operation that PyTorch has it builtin. It's called `F.conv2d` (Recall F is a fastai import from torch.nn.functional as recommended by PyTorch). The PyTorch docs tell us that it includes these parameters:\n",
"- **input**: input tensor of shape `(minibatch, in_channels, iH, iW)`\n",
"- **weight**: filters of shape `(out_channels, in_channels, kH, kW)`\n",
"\n",
"Here `iH,iW` is the height and width of the image (i.e. `28,28`), and `kH,kW` is the height and width of our kernel (`3,3`). But apparently PyTorch is expecting rank 4 tensors for both these arguments, but currently we only have rank 2 tensors (i.e. matrices, arrays with two axes).\n",
"\n",
"The reason for these extra axes is that PyTorch has a few tricks up its sleeve. The first trick is that PyTorch can apply a convolution to multiple images at the same time. That means we can call it on every item in a batch at once!\n",
"The second trick is that PyTorch can apply multiple kernels at the same time. So let's create the diagonal edge kernels too, and then stack all 4 of our edge kernels into a single tensor:"
"By default, fastai puts data on the GPU when using data blocks. Let's move it to the CPU for our examples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xb,yb = to_cpu(xb),to_cpu(yb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One batch contains 64 images, each of 1 channel, with 28x28 pixels. `F.conv2d` can handle multi-channel (e.g. colour) images. A *channel* is a single basic color in an image--for regular full color images there are 3 channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions channels x rows x columns.\n",
"\n",
"We'll see how to handle more than one channel later in this chapter. Kernels passed to `F.conv2d` need to be rank-4 tensors: channels_in x features_out x rows x columns. `edge_kernels` is currently missing one of these: the `1` for features_out. We need to tell PyTorch that the number of input channels in the kernel is one, by inserting an axis of size one (this is known as a *unit axis*) in the first location, since the PyTorch docs show that's where `in_channels` is expected. To insert a unit axis into a tensor, use the `unsqueeze` method:"
"This is now the correct shape for `edge_kernels`. Let's pass this all to `conv2d`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"edge_kernels = edge_kernels.unsqueeze(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([64, 4, 26, 26])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_features = F.conv2d(xb, edge_kernels)\n",
"batch_features.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shape shows our 64 images in the mini-batch, 4 kernels, and 26x26 edge maps (we started with 28x28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:"
"The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallel. That is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these one at a time, we'll often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we'd be millions of times slower!) Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time."
"It would be nice to not lose those two pixels on each axis. The way we do that is to add *padding*, which is simply additional pixels added around the outside of our image. Most commonly, pixels of zeros are added. "
"With appropriate padding, we can ensure that the output activation map is the same size as the original image, which can make things a lot simpler when we construct our architectures."
"<img alt=\"4x4 kernel with 5x5 input and 2 pixels of padding\" width=\"783\" caption=\"4x4 kernel with 5x5 input and 2 pixels of padding (curtesy of Vincent Dumoulin and Francesco Visin)\" id=\"four_by_five_conv\" src=\"images/att_00029.png\">"
"If we add a kernel of size `ks` by `ks` (with `ks` an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even number for `ks` would require a different amount of padding on the top/bottom, left/right, but in practice we almost never use an even filter size.\n",
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application as in <<three_by_five_conv>>. This is known as a *stride 2* convolution. The most common kernel size in practice is 3x3, and the most common padding is 1. As you'll see, stride 2 convolutions are useful for decreasing the size of our outputs, and stride 1 convolutions are useful for adding layers without changing the output size."
"<img alt=\"3x3 kernel with 5x5 input, stride 2 convolution, and 1 pixel of padding\" width=\"774\" caption=\"3x3 kernel with 5x5 input, stride 2 convolution, and 1 pixel of padding (curtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_by_five_conv\" src=\"images/att_00030.png\">"
"In an image of size `h` by `w` like before, using a padding of 1 and a stride of 2 will give us a result of size `(h+1)//2` by `(w+1)//2`. The general formula for each dimension is `(n + 2*pad - ks)//stride + 1` where `pad` is the padding, `ks` the size of our kernel and `stride` the stride."
"To explain the math behing convolutions, fast.ai student Matt Kleinsmith came up with the very clever idea of showing [CNNs from different viewpoints](https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c). In fact, it's so clever, and so helpful, we're going to show it here too!\n",
"Notice that the bias term, b, is the same for each section of the image. You can consider the bias as part of the filter, just like the weights (α, β, γ, δ) are part of the filter.\n",
"Here's an interesting insight -- a convolution can be represented as a special kind of matrix multiplication. The weight matrix is just like the ones from traditional neural networks. However, this weight matrix has two special properties:\n",
"\n",
"1. The zeros shown in gray are untrainable. This means that they’ll stay zero throughout the optimization process.\n",
"1. Some of the weights are equal, and while they are trainable (i.e. changeable), they must remain equal. These are called *shared weights*.\n",
"\n",
"The zeros correspond to the pixels that the filter can't touch. Each row of the weight matrix corresponds to one application of the filter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Convolution as matrix multiplication\" width=\"683\" caption=\"Convolution as matrix multiplication\" id=\"conv_matmul\" src=\"images/att_00038.png\">"
"There is no reason to believe that some particular edge filters are the most useful kernels for image recognition. Furthermore, we've seen that in later layers convolutional kernels become complex transformations of features from lower levels — we do not have a good idea of how to manually construct these.\n",
"Instead, it would be best to learn the values of the kernels. We already know how to do this — SGD! In effect, the model will learn the features that are useful for classification.\n",
"\n",
"When we use convolutions instead of (or in addition to) regular linear layers we create a *convolutional neural network*, or *CNN*."
"We now want to create a similar architecture to this linear model, but using convolutional layers instead of linear. `nn.Conv2d` is the module equivalent of `F.conv2d`. It's more convenient than `F.conv2d` when creating an architecture, because it creates the weight matrix for us automatically when we instantiate it.\n",
"One thing to note here is that we didn't need to specify \"28x28\" as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel. So it needs to know how many pixels there are. But a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels, and the kernel size, as we saw in the previous section.\n",
"Have a think about what the output shape is going to be.\n",
"\n",
"Let's try it and see:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([64, 1, 28, 28])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"broken_cnn(xb).shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is not something we can use to do classification, since we need a single output activation per image, not a 28x28 map of activations. One way to deal with this is to use enough stride-2 convolutions such that the final layer is size 1. That is, after one stride-2 convolution, the size will be 14x14, after 2 it will be 7x7, then 4x4, 2x2, and finally size 1.\n",
"\n",
"Let's try that now. First, we'll define a function with the basic parameters we'll use in each convolution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def conv(ni, nf, ks=3, act=True):\n",
" res = nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)\n",
" if act: res = nn.Sequential(res, nn.ReLU())\n",
" return res"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> important: Refactoring parts of your neural networks like this makes it much less likely you'll get errors due to inconsistencies in your architectures, and makes it more obvious to the reader which parts of your layers are actually changing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we use a stride-2 convolution, we often increase the number of features at the same time. This is because we're decreasing the number of activations in the activation map by a factor of 4; we don't want to decrease the capacity of a layer by too much at a time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
 jargon: channels">
"> jargon: channels and features: These two terms are largely used interchangeably, and refer to the size of the second axis of a weight matrix, which is, therefore, the number of activations per grid cell after a convolution. *Features* is never used to refer to the input data, but *channels* can refer to either the input data (generally channels are colors) or activations inside the network."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simple_cnn = sequential(\n",
" conv(1 ,4), #14x14\n",
" conv(4 ,8), #7x7\n",
" conv(8 ,16), #4x4\n",
" conv(16,32), #2x2\n",
" conv(32,2, act=False), #1x1\n",
" Flatten(),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: I like to add comments like the above after each convolution to show how large the activation map will be after each layer. The above comments assume that the input size is 28x28"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the network outputs two activations, which maps to the two possible levels in our labels:"
"Optimizer used: <function Adam at 0x7fbc9c258cb0>\n",
"Loss function: <function cross_entropy at 0x7fbca9ba0170>\n",
"\n",
"Callbacks:\n",
" - TrainEvalCallback\n",
" - Recorder\n",
" - ProgressCallback"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"learn.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the output of the final Conv2d layer is `64x2x1x1`. We need to remove those extra `1x1` axes; that's what `Flatten` does. It's basically the same as PyTorch's `squeeze` method, but as a module.\n",
"\n",
"Let's see if this trains! Since this is a deeper network than we've built from scratch before, we'll use a lower learning rate and more epochs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.072684</td>\n",
" <td>0.045110</td>\n",
" <td>0.990186</td>\n",
" <td>00:05</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.022580</td>\n",
" <td>0.030775</td>\n",
" <td>0.990186</td>\n",
" <td>00:05</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(2, 0.01)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Success! It's getting closer to the resnet-18 result we had, although it's not quite there yet, and it's taking more epochs, and we're needing to use a lower learning rate. So we've got a few more tricks still to learn--but we're getting closer and closer to being able to create a modern CNN from scratch."
"We can see from the summary that we have an input of size `64x1x28x28`. The axes are: `batch,channel,height,width`. This is often represented as `NCHW` (where `N` refers to batch size). Tensorflow, on the other hand, uses `NHWC` axis order. The first layer is:"
"So we have 1 channel input, 4 channel output, and a 3x3 kernel. Let's check the weights of the first convolution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([4, 1, 3, 3])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m[0].weight.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The summary shows we have 40 parameters, and `4*1*3*3` is 36. What are the other 4 parameters? Let's see what the bias contains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([4])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m[0].bias.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use this information to better understand our earlier statement in this section: \"because we're decreasing the number of activations in the activation map by a factor of 4; we don't want to decrease the capacity of a layer by too much at a time\".\n",
"\n",
"There is one bias for each channel. (Sometimes channels are called *features* or *filters* when they are not input channels.) The output shape is `64x4x14x14`, and this will therefore become the input shape to the next layer. The next layer, according to `summary`, has 296 parameters. Let's ignore the batch axis to keep things simple. So for each of `14*14=196` locations we are multiplying `296-8=288` weights (ignoring the bias for simplicity), so that's `196*288=56_448` multiplications at this layer. The next layer will have `7*7*(1168-16)=56_448` multiplications.\n",
"\n",
"So what happened here is that our stride 2 conv halved the *grid size* from `14x14` to `7x7`, and we doubled the *number of filters* from 8 to 16, resulting in no overall change in the amount of computation. If we left the number of channels the same in each stride 2 layer, the amount of computation being done in the net would get less and less as it gets deeper. But we know that the deeper layers have to compute semantically rich features (such as eyes, or fur), so we wouldn't expect that doing *less* compute would make sense."
"The \"receptive field\" is the area of an image that is involved in the calculation of a layer. On the book website, you'll find an Excel spreadsheet called `conv-example.xlsx` that shows the calculation of two stride 2 convolutional layers using an MNIST digit. Each layer has a single kernel. If we click on one of the cells in the *conv2* section, which shows the output of the second convolutional layer, and click *trace precendents*, we see this:"
"<img alt=\"Immediate precedents of conv2 layer\" width=\"308\" caption=\"Immediate precedents of conv2 layer\" id=\"preced1\" src=\"images/att_00068.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the green cell is the cell we clicked on, and the blue highlighted cells are its *precedents*--that is, the cells used to calculate its value. These cells are the corresponding 3x3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *show precedents* again, to show what cells are used to calculate these inputs, and see what happens:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Secondary precedents of conv2 layer\" width=\"601\" caption=\"Secondary precedents of conv2 layer\" id=\"preced2\" src=\"images/att_00069.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we just have two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7x7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7x7 area is the *receptive field* in the Input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.\n",
"As you see from this example, the deeper we are in the network (specifically, the more stride 2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer. So we know now that in the deeper layers of the network, we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of seeing the same thing we saw in the previous section: when we introduce a stride 2 conv in our network, we should also increase the number of channels."
"When writing this particular chapter, we had a lot of questions we needed answers for, to be able to explain to you those CNNs as best we could. Believe it or not, we found most of the answers on Twitter. "
"We are not, to say the least, big users of social networks in general. But our goal of this book is to help you become the best deep learning practitioner you can, and we would be remiss not to mention how important Twitter has been in our own deep learning journeys.\n",
"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, which is the part of Twitter where deep learning researchers and practitioners talk shop every day. As we were writing the section above, Jeremy wanted to double-check to ensure that what we were saying about stride 2 convolutions was accurate, so he asked on Twitter:"
"Christian Szegedy is the first author of [Inception](https://arxiv.org/pdf/1409.4842.pdf), the 2014 Imagenet winner and source of many key insights used in modern neural networks. Two hours later, this appeared:"
"Do you recognize that name? We saw it in <<chapter_production>>, when we were talking about the Turing Award winners who set the foundation of deep learning today!\n",
"Jeremy also asked on Twitter for help checking our description of label smoothing in <<chapter_sizing_and_tta>> was accurate, and got a response from again from directly from Christian Szegedy (label smoothing was originally introduced in the Inception paper):"
"Many of the top people in deep learning today are Twitter regulars, and are very open about interacting with the wider community. One good way to get started is to look at a list of Jeremy's [recent Twitter likes](https://twitter.com/jeremyphoward/likes), or [Sylvain's](https://twitter.com/GuggerSylvain/likes). That way, you can see a list of Twitter users that we thought had interesting and useful things to say.\n",
"\n",
"Twitter is the main way we both stay up to date with interesting papers, software releases, and other deep learning news. For making connections with the deep learning community, we recommend getting involved both in the [fast.ai forums](https://forums.fast.ai) and Twitter."
"Up until now, we have only shown you examples of pictures in black and white, with only one value per pixel. In practice, most colored images have three values per pixel to define is color."
"for bear,ax,color in zip(im,axs,('Reds','Greens','Blues')):\n",
" show_image(255-bear, ax=ax, cmap=color)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolution layer will take an image with a certain number of channels (3 for the first layer for regular RGB color images) and output an image with a different number of channels. Like our hidden size that represented the numbers of neurons in a linear layer, we can decide to have has many filters as we want, and each of them will be able to specialize, some to detect horizontal edges, other to detect vertical edges and so forth, to give something like we studied in <<chapter_production>>.\n",
"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the example given by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
"<img src=\"images/chapter9_rgbconv.svg\" id=\"rgbconv\" caption=\"Convolution over an RGB image\" alt=\"Convolution over an RGB image\" width=\"550\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, in order to apply a convolution to a colour picture we require a kernel tensor with a matching size as the first axis. At each location, the corresponding parts of the kernel and the image patch are multiplied together.\n",
"<img src=\"images/chapter9_rgb_conv_stack.svg\" id=\"rgbconv2\" caption=\"Adding the RGB filters\" alt=\"Adding the RGB filters\" width=\"500\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we have `ch_out` filters like this, so in the end, the result of our convolutional layer will be a batch of images with `ch_out` channels and a height and width given by the formula above. This gives us `ch_out` tensors of size `ch_in x ks x ks` that we represent in one big tensor of 4 dimensions. In PyTorch, the order of the dimensions for those weights is `ch_out x ch_in x ks x ks`.\n",
"\n",
"Additionally, we may want to have a bias for each filter. In the example above, the final result for our convolutional layer would be $y_{R} + y_{G} + y_{B} + b$ in that case. Like in a linear layer, there are as many biases as we have kernels, so the bias is a vector of size `ch_out`.\n",
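We can check these shapes directly on a PyTorch layer. This is a small sketch; the sizes (3 input channels, 8 filters, a batch of 16 images of 28 by 28 pixels, padding of 1) are arbitrary choices for illustration:

```python
import torch
from torch import nn

ch_in, ch_out, ks = 3, 8, 3
layer = nn.Conv2d(ch_in, ch_out, kernel_size=ks, padding=ks // 2)

# One (ch_in x ks x ks) kernel per output channel, stacked along the first axis.
print(layer.weight.shape)   # torch.Size([8, 3, 3, 3])
# One bias value per output channel.
print(layer.bias.shape)     # torch.Size([8])

x = torch.randn(16, ch_in, 28, 28)   # a batch of 16 RGB images
y = layer(x)
print(y.shape)              # torch.Size([16, 8, 28, 28])
```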
"There are lots of ways of processing color images. For instance, you can change them to black and white, or change from RGB to HSV (Hue, Saturation, and Value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
"Now you know what those pictures in <<chapter_intro>> of \"what a neural net learns\" from the Zeiler and Fergus paper mean! This is their picture of some of the layer 1 weights which we showed:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Layer 1 kernels found by Zeiler and Fergus\" width=\"120\" src=\"images/att_00031.png\">"
"This is taking the 3 slices of the convolutional kernel, for each output feature, and displaying them as images. We can see that even though the creators of the neural net never explicitly created kernels to find edges, for instance, the neural net automatically discovered these features using SGD.\n",
"\n",
"Now let's see how we can train those CNNs, and show you all the techniques fastai uses under the hood for efficient training."
"Since we are so good at recognizing threes from sevens, let's move on to something harder: recognizing all 10 digits. That means we'll need to use `MNIST` instead of `MNIST_SAMPLE`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = untar_data(URLs.MNIST)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"Path.BASE_PATH = path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(#2) [Path('testing'),Path('training')]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"path.ls()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data is in two folders named `training` and `testing`, so we have to tell `GrandparentSplitter` about that (it defaults to `train` and `valid`). We define a function `get_dls` to make it easy to change our batch size later:"
"Let's start with a basic CNN as a baseline. We'll use the same one as we had in the last section, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
"As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer; then every layer after that will end up twice as big as in the previous version as well.\n",
"\n",
"But there is a subtle problem with this. Consider the kernel which is being applied to each pixel. By default, we use a 3x3 pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we would be using nine pixels to calculate eight numbers. That means the layer isn't really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so, that is, if the number of outputs from an operation is significantly smaller than the number of inputs.\n",
"\n",
"To fix this, we can use a larger kernel in the first layer. If we use a kernel of 5x5 pixels then there are 25 pixels being used at each kernel application. Creating eight filters from this will mean the neural net will have to find some useful features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def simple_cnn():\n",
"    return sequential(\n",
"        conv(1, 8, ks=5),        # 14x14\n",
"        conv(8, 16),             # 7x7\n",
"        conv(16, 32),            # 4x4\n",
"        conv(32, 64),            # 2x2\n",
"        conv(64, 10, act=False), # 1x1\n",
"        Flatten(),\n",
"    )"
]
},
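The size comments in `simple_cnn` can be checked with the usual output size formula for a convolution. This is a sketch that assumes each `conv` layer uses a stride of 2 and a padding of `ks // 2`, which is consistent with the sizes in the comments; the helper `conv_out_size` is ours and is not part of fastai. It also prints how many input values feed each kernel application, which is the quantity discussed above:

```python
def conv_out_size(n, ks, stride=2, pad=None):
    """Output height/width of a convolution over an n x n input."""
    if pad is None:
        pad = ks // 2          # assumed padding for the `conv` helper
    return (n + 2 * pad - ks) // stride + 1

# (ch_in, ch_out, ks) for each layer of simple_cnn
layers = [(1, 8, 5), (8, 16, 3), (16, 32, 3), (32, 64, 3), (64, 10, 3)]

size = 28                      # MNIST images are 28x28
sizes = []
for ch_in, ch_out, ks in layers:
    size = conv_out_size(size, ks)
    sizes.append(size)
    inputs_per_window = ch_in * ks * ks
    print(f"{ch_in:>2} -> {ch_out:>2}  ks={ks}: {size}x{size} output, "
          f"{inputs_per_window} inputs -> {ch_out} outputs per location")
```

For the first layer this gives 25 inputs for 8 outputs at each location, and the grid shrinks from 28 to 14, 7, 4, 2, and finally 1.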
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you'll see in a moment, we're going to look inside our models while they're training in order to try to find ways to make them train better. To do this, we use the `ActivationStats` callback, which records the mean, standard deviation, and histogram of activations of every trainable layer (as we've seen, callbacks are used to add behavior to the training loop; we'll see how they work in <<chapter_accel_sgd>>)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fastai2.callback.hook import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to train quickly, so that means training at a high learning rate. Let's see how we go at 0.06:"
"This didn't train at all well! Let's find out why.\n",
"\n",
"One handy feature of the callbacks passed to `Learner` is that they are made available automatically, with the same name as the callback class, except in `snake_case`. So our `ActivationStats` callback can be accessed through `activation_stats`. In fact, I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
"\n",
"`ActivationStats` includes some handy utilities for plotting the activations during training. `plot_layer_stats(idx)` plots the mean and standard deviation of the activations of layer number `idx`, along with the percent of activations near zero. Here's the first layer's plot:"
"Generally our model should have a consistent, or at least smooth, mean and standard deviation of layer activations during training. Activations near zero are particularly problematic, because it means we have computation in the model that's doing nothing at all (since multiplying by zero gives zero). When you have some zeros in one layer, they will therefore generally carry over to the next layer... which will then create more zeros. Here's the penultimate layer of our network:"
"As expected, the problems get worse towards the end of the network, as the instability and zero activations compound over layers. The first thing we can do to make training more stable is to increase the batch size."
"One way to make training more stable is to *increase the batch size*. Larger batches have gradients that are more accurate, since they're calculated from more data. On the downside though, a larger batch size means fewer batches per epoch, which means fewer opportunities for your model to update weights. Let's see if a batch size of 512 helps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dls = get_dls(512)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>2.309385</td>\n",
" <td>2.302744</td>\n",
" <td>0.113500</td>\n",
" <td>00:08</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn = fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see what the penultimate layer looks like:"
"Our initial weights are not well suited to the task we're trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly, as we've seen above. We probably don't want to end training with a high learning rate either, so that we don't skip over a minimum. But we want to train at a high learning rate for the rest of training, because we'll be able to train more quickly. Therefore, we should change the learning rate during training, from low, to high, and then back to low again.\n",
"\n",
"Leslie Smith (yes, the same guy that invented the learning rate finder!) developed this idea in his article [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120) by designing a learning rate schedule separated into two phases: one where the learning rate grows from the minimum value to the maximum value (*warm-up*), and then one where it decreases back to the minimum value (*annealing*). Smith called this combination of approaches *1cycle training*.\n",
"\n",
"1cycle training allows us to use a much higher maximum learning rate than other types of training, which gives two benefits:\n",
"\n",
"- By training with higher learning rates, we train faster, a phenomenon Leslie N. Smith named *super-convergence*\n",
"- By training with higher learning rates, we overfit less because we skip over the sharp local minima to end up in a smoother (and therefore more generalizable) part of the loss.\n",
"\n",
"The second point is an interesting and subtle idea; it is based on the observation that a model that generalises well is one whose loss would not change very much if you change the input by a small amount. If a model trains at a large learning rate for quite a while, and can find a good loss when doing so, it must have found an area that also generalises well, because it is jumping around a lot from batch to batch (that is basically the definition of a high learning rate). The problem is that, as we have discussed, just jumping to a high learning rate is more likely to result in diverging losses, rather than seeing your losses improve. So we don't just jump to a high learning rate. Instead, we start at a low learning rate, where our losses do not diverge, and we allow the optimiser to gradually find smoother and smoother areas of our parameters, by gradually going to higher and higher learning rates.\n",
"\n",
"Then, once we have found a nice smooth area for our parameters, we want to find the very best part of that area, which means we have to bring our learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models, and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentum in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests varying the momentum in the opposite direction of the learning rate: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase.\n",
"We're finally making some progress! It's giving us a reasonable accuracy now.\n",
"\n",
"We can view the learning rate and momentum throughout training by calling `plot_sched` on `learn.recorder`. `learn.recorder` (as the name suggests) records everything that happens during training, including losses, metrics, and hyperparameters such as learning rate and momentum:"
"Smith's original 1cycle paper used a linear warm-up and linear annealing. As you see above, we adapted the approach in fastai by combining it with another popular approach: cosine annealing. `fit_one_cycle` provides the following parameters you can adjust:\n",
"- `lr_max`:: The highest learning rate that will be used (this can also be a list of learning rates for each layer group, or a python `slice` object containing the first and last layer group learning rates)\n",
"- `div`:: How much to divide `lr_max` by to get the starting learning rate\n",
"- `div_final`:: How much to divide `lr_max` by to get the ending learning rate\n",
"- `pct_start`:: What % of the batches to use for the warmup\n",
"- `moms`:: A tuple `(mom1,mom2,mom3)` where mom1 is the initial momentum, mom2 is the minimum momentum, and mom3 is the final momentum.\n",
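To make these parameters concrete, here is a plain Python sketch of a 1cycle schedule with cosine warm-up and cosine annealing. It is not fastai's implementation: the functions `cos_interp` and `one_cycle` are our own, and the default values shown for `div`, `div_final`, `pct_start`, and `moms` are illustrative assumptions.

```python
import math

def cos_interp(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25,
              moms=(0.95, 0.85, 0.95)):
    """Return (learning rate, momentum) at fraction `pct` of training."""
    if pct < pct_start:                       # warm-up phase
        pos = pct / pct_start
        lr = cos_interp(lr_max / div, lr_max, pos)
        mom = cos_interp(moms[0], moms[1], pos)
    else:                                     # annealing phase
        pos = (pct - pct_start) / (1 - pct_start)
        lr = cos_interp(lr_max, lr_max / div_final, pos)
        mom = cos_interp(moms[1], moms[2], pos)
    return lr, mom

for pct in (0.0, 0.125, 0.25, 0.625, 1.0):
    lr, mom = one_cycle(pct, lr_max=0.06)
    print(f"{pct:5.3f}  lr={lr:.6f}  mom={mom:.4f}")
```

The learning rate starts at `lr_max/div`, peaks at `lr_max` after `pct_start` of the batches, and ends at `lr_max/div_final`, while the momentum moves in the opposite direction.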
"`color_dim` was developed by fast.ai in conjunction with a student, Stefano Giomo. Stefano, who refers to the idea as the *colorful dimension*, has a [detailed explanation](https://forums.fast.ai/t/the-colorful-dimension/42908) of the history and details behind the method. The basic idea is to create a histogram of the activations of a layer, which we would hope would follow a smooth pattern such as the normal distribution shown by Stefano here:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/colorful_dist.jpeg\" id=\"colorful_dist\" caption=\"Histogram in 'colorful dimension'\" alt=\"Histogram in 'colorful dimension'\" width=\"800\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create `color_dim`, we take the histogram shown on the left here, and convert it into just the colored representation shown at the bottom. Then we flip it on its side, as shown on the right. We found that the distribution is clearer if we take the `log` of the histogram values. Then, Stefano describes:\n",
"\n",
"> : The final plot for each layer is made by stacking the histogram of the activations from each batch along the horizontal axis. So each vertical slice in the visualisation represents the histogram of activations for a single batch. The color intensity corresponds to the height of the histogram, in other words the number of activations in each histogram bin.\n",
"\n",
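The idea Stefano describes can be sketched in a few lines of plain Python. This is only an illustration, not the code behind `color_dim`: the helper `log_histogram`, the number of bins, the value range, and the randomly generated stand-in activations are all assumptions.

```python
import math
import random

def log_histogram(activations, bins=10, lo=0.0, hi=10.0):
    """Histogram of `activations` over [lo, hi), returned as log(1 + count) per bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for a in activations:
        if lo <= a < hi:
            idx = min(int((a - lo) / width), bins - 1)
            counts[idx] += 1
    return [math.log1p(c) for c in counts]

random.seed(0)
columns = []
for batch in range(5):
    # Stand-in for one layer's post-ReLU activations on one batch (an assumption).
    acts = [max(0.0, random.gauss(0, 1 + batch)) for _ in range(1000)]
    columns.append(log_histogram(acts))

# columns[b][i] is the color intensity for batch b (x axis) and bin i (y axis).
for col in columns:
    print(" ".join(f"{v:4.1f}" for v in col))
```

Each printed row corresponds to one vertical slice of the final plot. A large value in the first bin means many activations are at or near zero for that batch.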
"This is Stefano's picture of how this all fits together:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/colorful_summ.png\" id=\"colorful_summ\" caption=\"Summary of 'colorful dimension'\" alt=\"Summary of 'colorful dimension'\" width=\"800\">"
"The `color_dim` plot of our model's activations shows a classic picture of \"bad training\". We start with nearly all activations at zero: that's what we see at the far left, with nearly all the left hand side dark blue; the bright yellow at the bottom is the near-zero activations. Then over the first few batches we see the number of non-zero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and then it collapses again. After repeating a few times, eventually we see a spread of activations throughout the range.\n",
"It's much better if training can be smooth from the start. The cycles of exponential increase and then collapse that we see above tend to result in a lot of near-zero activations, resulting in slow training, and poor final results. One way to solve this problem is to use batch normalization."
"To fix the slow training and poor final results we ended up with in the previous section, we need to both fix the initial large percentage of near-zero activations, and then try to maintain a good distribution of activations throughout training.\n",
"Sergey Ioffe and Christian Szegedy showed a solution to this problem in the 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). In the abstract, they describe just the problem that we've seen:\n",
"> : \"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization... We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.\"\n",
"\n",
"Their solution, they say, is:\n",
"\n",
"> : \"...making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.\"\n",
"The paper caused great excitement as soon as it was released, because they showed the chart in <<batchnorm>>, which clearly demonstrated that batch normalization could train a model that was even more accurate than the current state of the art (the *inception* architecture), around 5x faster:"
"<img alt=\"Impact of batch normalization\" width=\"553\" caption=\"Impact of batch normalization\" id=\"batchnorm\" src=\"images/att_00046.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The way batch normalization (often just called *batchnorm*) works is that it takes the mean and standard deviation of the activations of a layer over the batch, and uses those to normalize the activations. However, this can cause problems because the network might really want some activations to be really high in order to make accurate predictions. So they also add two learnable parameters (meaning they will be updated in our SGD step), usually called `gamma` and `beta`; after normalizing the activations to get some new activation vector `y`, a batchnorm layer returns `gamma*y + beta`.\n",
"\n",
"That way our activations can have any mean or variance, independent of the mean and standard deviation of the results of the previous layer. Those statistics are learned separately, making training easier on our model. The behavior is different during training and validation: during training, we use the mean and standard deviation of the batch to normalize the data. During validation, we instead use a running mean of the statistics calculated during training.\n",
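The following sketch reproduces both behaviors by hand and compares them with PyTorch's `nn.BatchNorm2d`. The input tensor is random and its shape is an arbitrary choice; `gamma` and `beta` here are simply the layer's `weight` and `bias` parameters, which start at 1 and 0.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(64, 8, 7, 7) * 3 + 2        # activations: (batch, channels, h, w)
bn = nn.BatchNorm2d(8)                      # gamma is bn.weight, beta is bn.bias

# Training mode: normalize each channel with the statistics of the current batch.
bn.train()
out_train = bn(x)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y = (x - mean) / torch.sqrt(var + bn.eps)
gamma = bn.weight.view(1, -1, 1, 1)
beta = bn.bias.view(1, -1, 1, 1)
manual_train = gamma * y + beta

# Evaluation mode: use the running statistics accumulated during training instead.
bn.eval()
out_eval = bn(x)
run_mean = bn.running_mean.view(1, -1, 1, 1)
run_var = bn.running_var.view(1, -1, 1, 1)
manual_eval = gamma * (x - run_mean) / torch.sqrt(run_var + bn.eps) + beta

print(torch.allclose(out_train, manual_train, atol=1e-5))
print(torch.allclose(out_eval, manual_eval, atol=1e-5))
```

After a single training batch the running statistics have only moved a little from their initial values, so the evaluation output is still quite different from the training output on the same data. The two converge as more batches are seen.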
"After adding batchnorm to our model, the development of activations during training is just what we hope to see: smooth, with no \"crashes\". Batchnorm has really delivered on its promise here! In fact, batchnorm has been so successful that we see it (or something very similar) today in nearly all modern neural networks.\n",
"An interesting observation about models containing batch normalisation layers is that they tend to generalise better than models that don't contain them. Although we haven't as yet seen a rigorous analysis of what's going on here, most researchers believe that the reason for this is that batch normalisation adds some extra randomness to the training process. Each mini batch will have a somewhat different mean and standard deviation to other mini batches. Therefore, the activations will be normalised by different values each time. In order for the model to make accurate predictions, it will have to learn to become robust to these variations. In general, adding additional randomisation to the training process often helps.\n",
"Since things are going so well, let's train for a few more epochs and see how it goes. In fact, let's even *increase* the learning rate, since the abstract of the batchnorm paper claimed we should be able to \"use much higher learning rates\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.191731</td>\n",
" <td>0.121738</td>\n",
" <td>0.960900</td>\n",
" <td>00:11</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.083739</td>\n",
" <td>0.055808</td>\n",
" <td>0.981800</td>\n",
" <td>00:10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.053161</td>\n",
" <td>0.044485</td>\n",
" <td>0.987100</td>\n",
" <td>00:10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.034433</td>\n",
" <td>0.030233</td>\n",
" <td>0.990200</td>\n",
" <td>00:10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>0.017646</td>\n",
" <td>0.025407</td>\n",
" <td>0.991200</td>\n",
" <td>00:10</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn = fit(5, lr=0.1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.183244</td>\n",
" <td>0.084025</td>\n",
" <td>0.975800</td>\n",
" <td>00:13</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.080774</td>\n",
" <td>0.067060</td>\n",
" <td>0.978800</td>\n",
" <td>00:12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.050215</td>\n",
" <td>0.062595</td>\n",
" <td>0.981300</td>\n",
" <td>00:12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.030020</td>\n",
" <td>0.030315</td>\n",
" <td>0.990700</td>\n",
" <td>00:12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>0.015131</td>\n",
" <td>0.025148</td>\n",
" <td>0.992100</td>\n",
" <td>00:12</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn = fit(5, lr=0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, I think it's fair to say we know how to recognize digits! It's time to move on to something harder..."
"We've seen that convolutions are just a type of matrix multiplication, with two constraints on the weight matrix: some elements are always zero, and some elements are tied (forced to always have the same value). In <<chapter_intro>> we saw the eight requirements from the 1986 book *Parallel Distributed Processing*; one of them was \"A pattern of connectivity among units\". That's exactly what these constraints do: they enforce a certain pattern of connectivity.\n",
"\n",
"These constraints allow us to use far fewer parameters in our model, without sacrificing the ability to represent complex visual features. That means we can train deeper models faster, with less overfitting. Although the universal approximation theorem shows that it should be *possible* to represent anything in a fully connected network in one hidden layer, we've seen now that in *practice* we can train much better models by being thoughtful about network architecture.\n",
"Convolutions are by far the most common pattern of connectivity we see in neural nets (along with regular linear layers, which we refer to as *fully connected*), but it's likely that many more will be discovered.\n",
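To see the two constraints concretely, here is a small sketch that builds the weight matrix equivalent to a convolution and checks it against `F.conv2d`. It assumes a stride of 1 and no padding; the function `conv_as_matrix` and the 3 by 3 image with a 2 by 2 kernel are our own choices for illustration.

```python
import torch
import torch.nn.functional as F

def conv_as_matrix(kernel, h, w):
    """Matrix M such that M @ image.flatten() equals a stride-1, unpadded convolution."""
    ks = kernel.shape[0]
    out_h, out_w = h - ks + 1, w - ks + 1
    M = torch.zeros(out_h * out_w, h * w)
    for i in range(out_h):
        for j in range(out_w):
            for a in range(ks):
                for b in range(ks):
                    # The same kernel value is reused in every row (tied weights);
                    # every other entry stays zero.
                    M[i * out_w + j, (i + a) * w + (j + b)] = kernel[a, b]
    return M

image = torch.arange(9.0).view(3, 3)
kernel = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
M = conv_as_matrix(kernel, 3, 3)

via_matmul = (M @ image.flatten()).view(2, 2)
via_conv = F.conv2d(image[None, None], kernel[None, None])[0, 0]

print(M)
print(via_matmul)
print(torch.allclose(via_matmul, via_conv))
```

The matrix has 4 rows and 9 columns, but only the 4 kernel values appear in it, each repeated once per row, and the remaining 20 entries are zero. A fully connected layer of the same shape would have 36 free parameters.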
"\n",
"We have also seen how to interpret the activations of layers in the network to see if training is going well or not, and how batchnorm helps regularize the training and makes it smoother. In the next chapter, we will use both of those layers to build the most popular architecture in computer vision: residual networks."
"1. What features other than edge detectors have been used in computer vision (especially before deep learning became popular)?\n",
"1. There are other normalization layers available in PyTorch. Try them out and see what works best. Learn about why other normalization layers have been developed, and how they differ from batch normalization.\n",
"1. Try moving the activation function after the batch normalization layer in `conv`. Does it make a difference? See what you can find out about what order is recommended, and why.\n",
"1. Batch normalization isn't defined for a batch size of one, since the standard deviation isn't defined for a single item. "