Merge pull request #93 from Datadote/master

ch 4 recommendations
This commit is contained in:
Sylvain Gugger 2020-04-15 08:34:00 -04:00 committed by GitHub
commit 6b0199ecf0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -2057,7 +2057,7 @@
"\n",
"To get a validation set we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"\n",
"So to start with, let's create tensors for our threes and sevens from that directory. These are the tensors we will use calculate a metric measuring the quality of our first try model, which measures distance from an ideal image."
"So to start with, let's create tensors for our threes and sevens from that directory. These are the tensors we will use to calculate a metric measuring the quality of our first try model, which measures distance from an ideal image."
]
},
{
@ -2217,7 +2217,7 @@
"source": [
"We are calculating the difference between the \"ideal 3\" and each of 1,010 threes in the validation set, for each of `28x28` images, resulting in the shape `1010,28,28`.\n",
"\n",
"There's a couple of important points about how broadcasting is implemented, which make it valuable not just for expressivity but also for performance:\n",
"There are a couple of important points about how broadcasting is implemented, which make it valuable not just for expressivity but also for performance:\n",
"\n",
"- PyTorch doesn't *actually* copy `mean3` 1010 times. Instead, it just *pretends* as if it was a tensor of that shape, but doesn't actually allocate any additional memory\n",
"- It does the whole calculation in C (or, if you're using a GPU, in CUDA, the equivalent of C on the GPU), tens of thousands of times faster than pure Python (up to millions of times faster on a GPU!)\n",
@ -2508,7 +2508,7 @@
"\n",
"- **Initialize**:: we initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
"- **Loss**:: This is the thing Arthur Samuel referred to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes. The way to do this is by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes. The way to do this is by calculating *gradients*. This is just a performance optimisation. We would get exactly the same results by using the slower manual process as well\n",
"- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
]
},
@ -2854,7 +2854,7 @@
"w -= gradient(w) * lr\n",
"```\n",
"\n",
"This is known as *stepping* your parameters, using a *optimiser step*.\n",
"This is known as *stepping* your parameters, using an *optimiser step*.\n",
"\n",
"If you pick a learning rate that's too low, it can mean having to do a lot of steps. <<descent_small>> illustrates that."
]
@ -2914,7 +2914,7 @@
"source": [
"We've seen how to use gradients to find a minimum. Now it's time to look at an SGD example, and see how finding a minimum can be used to train a model to fit data better.\n",
"\n",
"Let us start with a simple, synthetic, example model. Imagine you were measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, and then get slower as it went up the hill, and then would be slowest at the top, and it would then speed up again as it goes downhill. You want to build a model of how the space changes over time. If you're measuring the speed manually every second for 20 seconds, it might look something like this:"
"Let us start with a simple, synthetic, example model. Imagine you were measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, and then get slower as it went up the hill, and then would be slowest at the top, and it would then speed up again as it goes downhill. You want to build a model of how the speed changes over time. If you're measuring the speed manually every second for 20 seconds, it might look something like this:"
]
},
{
@ -3505,7 +3505,7 @@
"source": [
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
"\n",
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) and feed them to our model. After going through our model, we compare the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
"\n",
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient*. (Actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
]
@ -5676,7 +5676,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Did you choose to skip over chapters 2 & 3, in your excitement to peak under the hood? Well, here's your reminder to head back to chapter 2 now, because you'll be needing to know that stuff very soon!"
"Did you choose to skip over chapters 2 & 3, in your excitement to peek under the hood? Well, here's your reminder to head back to chapter 2 now, because you'll be needing to know that stuff very soon!"
]
},
{
@ -5760,6 +5760,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,