commit a9f7bc804b
parent e87f1d54e7

    proofread convolutions
@@ -47,7 +47,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular data set preprocessing added date features to the Bulldozers dataset. What kind of features might we be able to create from images?"
+"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular dataset preprocessing added date features to the Bulldozers dataset. What kind of features might we be able to create from images?"
 ]
 },
 {
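
For reference, a minimal sketch of the kind of date feature engineering `add_datepart` performs; the DataFrame and its contents below are illustrative stand-ins, not the actual Bulldozers data:

```python
# Sketch: date feature engineering with fastai's add_datepart.
# The DataFrame here is a made-up stand-in for the Bulldozers data.
import pandas as pd
from fastai.tabular.core import add_datepart

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-14', '2012-03-26']),
                   'SalePrice': [66000, 57000]})
df = add_datepart(df, 'saledate')
print(df.columns)  # now includes saleYear, saleMonth, saleDayofweek, ...
```
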
@@ -79,7 +79,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The grey grid to the left is our *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel, to each element of a 3x3 block of the image. The results of these multiplications are then added together. The diagram above shows an example of applying a kernel to a single location in the image, the 3x3 block around cell 18.\n",
+"The 7x7 grid to the left is our *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel by each element of a 3x3 block of the image. The results of these multiplications are then added together. The diagram above shows an example of applying a kernel to a single location in the image, the 3x3 block around cell 18.\n",
 "\n",
 "Let's do this with code. First, we create a little 3x3 matrix like so:"
 ]
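
To make the multiply-and-add concrete, here is a hedged sketch of one convolution step at a single location, using the `top_edge` kernel from the chapter; the image values are arbitrary stand-ins:

```python
# Sketch: one step of a convolution -- elementwise multiply a 3x3 kernel
# with the 3x3 block around one cell, then sum the products.
import torch

top_edge = torch.tensor([[-1.,-1,-1],
                         [ 0., 0, 0],
                         [ 1., 1, 1]])    # the kernel from the text
img = torch.arange(49.).view(7, 7)        # arbitrary stand-in 7x7 image
r, c = 3, 3                               # one location in the image
block = img[r-1:r+2, c-1:c+2]             # the 3x3 block around that cell
print((block * top_edge).sum())           # one value of the output
```
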
@@ -1526,7 +1526,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We won't implement this function from scratch, using PyTorch's implementation (that is way faster than anything we could do in python) instead."
+"We won't implement this convolution function from scratch, but use PyTorch's implementation instead (it is way faster than anything we could do in Python)."
 ]
 },
 {
@@ -1540,7 +1540,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Convolution is such an important and widely-used operation that PyTorch has it builtin. It's called `F.conv2d`. The PyTorch docs tell us that it includes these parameters:\n",
+"Convolution is such an important and widely used operation that PyTorch has it built in. It's called `F.conv2d` (recall that fastai imports `torch.nn.functional` as `F`, as recommended by PyTorch). The PyTorch docs tell us that it includes these parameters:\n",
 "\n",
 "- **input**: input tensor of shape `(minibatch, in_channels, iH, iW)`\n",
 "- **weight**: filters of shape `(out_channels, in_channels, kH, kW)`\n",
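
As a hedged usage sketch, this is how the documented `input` and `weight` shapes fit together when calling `F.conv2d` directly; the image tensor is a random stand-in:

```python
# Sketch: calling F.conv2d with the documented shapes.
import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 7, 7)             # (minibatch, in_channels, iH, iW)
kernel = torch.tensor([[-1.,-1,-1],
                       [ 0., 0, 0],
                       [ 1., 1, 1]]).view(1, 1, 3, 3)  # (out_ch, in_ch, kH, kW)
out = F.conv2d(img, kernel)
print(out.shape)                          # torch.Size([1, 1, 5, 5])
```
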
@@ -1764,7 +1764,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"With a 5x5 input, and 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map, as we can see in <<four_by_five_conv>>/"
+"With a 5x5 input, a 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map, as we can see in <<four_by_five_conv>>."
 ]
 },
 {
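
The arithmetic behind that 6x6 result is `out = (n + 2*pad - ks) + 1` for stride 1; a quick check with random stand-in tensors:

```python
# Sketch: activation-map size for a 5x5 input, 4x4 kernel, 2 pixels of
# padding -- out = (5 + 2*2 - 4) + 1 = 6.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)
w = torch.randn(1, 1, 4, 4)
print(F.conv2d(x, w, padding=2).shape)    # torch.Size([1, 1, 6, 6])
```
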
@@ -2027,7 +2027,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One thing to note here is that we didn't need to specify \"28x28\" as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel. So it needs to know how many pixels there are. But a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels, and the kernel size, as we say in the previous section.\n",
+"One thing to note here is that we didn't need to specify \"28x28\" as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel. So it needs to know how many pixels there are. But a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels, and the kernel size, as we saw in the previous section.\n",
 "\n",
 "Have a think about what the output shape is going to be.\n",
 "\n",
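
A small sketch of that point about the weights (layer sizes here are illustrative): the conv layer's parameter count never mentions 28x28, while the linear layer's must:

```python
# Sketch: a conv layer's weights depend only on channels and kernel size;
# a linear layer needs a weight per input pixel, so it must know 28*28.
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
lin  = nn.Linear(in_features=28*28, out_features=8)
print(sum(p.numel() for p in conv.parameters()))  # 1*8*3*3 + 8 = 80
print(sum(p.numel() for p in lin.parameters()))   # 28*28*8 + 8 = 6280
```
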
@@ -2446,14 +2446,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"When writing this particular chapter, we had a lot of questions we needed answer to, to be able to explain to you those CNNs as best we could. Believe it or not, we found most of the answers on twitter. "
+"When writing this particular chapter, we had a lot of questions we needed answers to in order to explain CNNs to you as best we could. Believe it or not, we found most of the answers on Twitter. "
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### A note about twitter"
+"### A note about Twitter"
 ]
 },
 {
@@ -2462,7 +2462,7 @@
 "source": [
 "We are not, to say the least, big users of social networks in general. But our goal in this book is to help you become the best deep learning practitioner you can be, and we would be remiss not to mention how important Twitter has been in our own deep learning journeys.\n",
 "\n",
-"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, which is the part of Twitter where deep learning researchers and practitioners talk shop every day. As we were writing the section above, Jeremy wanted to double-check to ensure that what we were saying about stride 2 convolutions was accurate, so he asked on twitter:"
+"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, where deep learning researchers and practitioners talk shop every day. As we were writing the section above, Jeremy wanted to double-check that what we were saying about stride 2 convolutions was accurate, so he asked on Twitter:"
 ]
 },
 {
@@ -2626,7 +2626,7 @@
 "source": [
 "We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolution layer will take an image with a certain number of channels (3 for the first layer for regular RGB color images) and output an image with a different number of channels. Like our hidden size that represented the number of neurons in a linear layer, we can decide to have as many filters as we want, and each of them will be able to specialize: some to detect horizontal edges, others to detect vertical edges, and so forth, to give something like what we studied in <<chapter_production>>.\n",
 "\n",
-"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the example fiven by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
+"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the corresponding filter, then sum the results (as we saw before), and sum over all the filters. In the example given by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
 ]
 },
 {
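
Here is a hedged sketch of that per-channel computation for a single window, with random stand-in values:

```python
# Sketch: one window of a 3-channel image against one ch_in x 3 x 3 filter.
# Each channel is multiplied by its own 3x3 kernel and summed, then the
# per-channel results (y_R, y_G, y_B) are summed together.
import torch

ch_in = 3
window = torch.randn(ch_in, 3, 3)         # one 3x3 window, all channels
kernel = torch.randn(ch_in, 3, 3)         # one filter: ch_in by 3 by 3
per_channel = (window * kernel).sum(dim=(1, 2))  # y_R, y_G, y_B
print(per_channel.sum())                  # red + green + blue
```
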
@@ -2660,7 +2660,7 @@
 "\n",
 "Additionally, we may want to have a bias for each filter. In the example above, the final result for our convolutional layer would be $y_{R} + y_{G} + y_{B} + b$ in that case. Like in a linear layer, there are as many biases as we have kernels, so the bias is a vector of size `ch_out`.\n",
 "\n",
-"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer as 3 inputs.\n",
+"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer has 3 inputs.\n",
 "\n",
 "There are lots of ways of processing color images. For instance, you can change them to black and white, or change from RGB to HSV (Hue, Saturation, and Value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
 "\n",
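
A short sketch of both points, with illustrative layer sizes: the bias is one value per filter (a vector of size `ch_out`), and a first layer for RGB images simply takes 3 input channels:

```python
# Sketch: nn.Conv2d with 3 input channels for RGB, and one bias per filter.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]) -- ch_out x ch_in x kH x kW
print(conv.bias.shape)    # torch.Size([16]) -- a vector of size ch_out
```
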
@@ -2765,7 +2765,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Always a good idea to look at your data before you use it:"
+"It is always a good idea to look at your data before you use it:"
 ]
 },
 {
@@ -2827,7 +2827,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last chapter, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
+"Let's start with a basic CNN as a baseline. We'll use the same one we had in the last section, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
 "\n",
 "As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer – then every layer after that will end up twice as big as the previous version as well.\n",
 "\n",
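
As a hedged sketch of that doubling scheme (not the chapter's exact model; the layer sizes are illustrative), each stride-2 block halves the grid size while the filter count doubles:

```python
# Sketch: doubling the number of filters at each stride-2 layer.
import torch.nn as nn

def conv(ni, nf):  # a stride-2 conv block: halves height and width
    return nn.Sequential(nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1),
                         nn.ReLU())

simple_cnn = nn.Sequential(
    conv(1, 16),    # 28x28 -> 14x14
    conv(16, 32),   # 14x14 -> 7x7
    conv(32, 64),   # 7x7   -> 4x4
)
```
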
@@ -3123,7 +3123,7 @@
 "\n",
 "Then, once we have found a nice smooth area for our parameters, we want to find the very best part of that area, which means we have to bring our learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models, and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
 "\n",
-"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rate, we use less momentum, and we use more again in the annealing phase.\n",
+"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase.\n",
 "\n",
 "We can use 1cycle training in fastai by calling `fit_one_cycle`:"
 ]
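
A minimal usage sketch, assuming `learn` is a fastai `Learner` already built as elsewhere in the chapter; the epoch count and learning rate are illustrative:

```python
# Sketch: 1cycle training; `learn` is assumed to be an existing fastai Learner.
learn.fit_one_cycle(5, lr_max=0.06)  # gradual warmup to lr_max, then gradual cooldown
```
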
@@ -3496,7 +3496,7 @@
 "source": [
 "This is just what we hope to see: a smooth development of activations, with no \"crashes\". Batchnorm has really delivered on its promise here! In fact, batchnorm has been so successful that we see it (or something very similar) today in nearly all modern neural networks.\n",
 "\n",
-"An interesting observation about models containing batch normalisation layers is that they tend to generalise better than models that don't contain them. Although we haven't as yet seen a rigourous analysis of what's going on here, most researchers believe that the reason for this is that batch normalisation add some extra randomness to the training process. Each mini batch will have a somewhat different mean and standard deviation to each other mini batch. Therefore, the activations will be normalised by different values each time. In order for the model to make accurate predictions, it will have to learn to become insensitive to these variations. In general, adding additional randomisation to the training process often helps.\n",
+"An interesting observation about models containing batch normalisation layers is that they tend to generalise better than models that don't contain them. Although we haven't as yet seen a rigorous analysis of what's going on here, most researchers believe that the reason for this is that batch normalisation adds some extra randomness to the training process. Each mini-batch will have a somewhat different mean and standard deviation to other mini-batches. Therefore, the activations will be normalised by different values each time. In order for the model to make accurate predictions, it will have to learn to become robust to these variations. In general, adding additional randomisation to the training process often helps.\n",
 "\n",
 "Since things are going so well, let's train for a few more epochs and see how it goes. In fact, let's even *increase* the learning rate, since the abstract of the batchnorm paper claimed we should be able to \"train at much higher learning rates\":"
 ]
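
That extra randomness is easy to see directly; in the sketch below (made-up numbers), the same activation value is normalised differently depending on which mini-batch it arrives in:

```python
# Sketch: in training mode, batchnorm uses each mini-batch's own mean and
# standard deviation, so the same input value comes out differently.
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1).train()
batch_a = torch.tensor([[1.0], [2.0], [3.0]])
batch_b = torch.tensor([[1.0], [2.0], [9.0]])
print(bn(batch_a)[0], bn(batch_b)[0])  # same 1.0 input, two different outputs
```
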
@@ -3760,5 +3760,5 @@
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }