proofread mnist

This commit is contained in:
Brad S 2020-03-08 23:39:08 +08:00
parent b2f1c12d4c
commit 600c4fc3da


@@ -31,7 +31,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Having seen what it looks like to actually train a variety of models in chapter 2, lets now look under the hood and see exactly what is going on. Well start with computer vision, and will use that to introduce fundamental tools and concepts of deep learning.\n", "Having seen what it looks like to actually train a variety of models in Chapter 2, lets now look under the hood and see exactly what is going on. Well start by using computer vision to introduce fundamental tools and concepts for deep learning.\n",
"\n", "\n",
"To be exact, we'll discuss the role of arrays and tensors, and of brodcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of loss function for our basic classification task, and the role of mini-batches. We'll also finally describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them working together.\n", "To be exact, we'll discuss the role of arrays and tensors, and of brodcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of loss function for our basic classification task, and the role of mini-batches. We'll also finally describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them working together.\n",
"\n", "\n",
@@ -51,7 +51,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In order to understand what happens in a computer vision model, we first have to understand how computers handle images. We'll use one of the most famous datasets in computer vision, [MNIST](https://en.wikipedia.org/wiki/MNIST_database), for our experiments. MNIST contains hand-written digits, collected by the National Institute of Standards and Technology, and collated into a machine learning dataset by Yann Lecun and his colleagues. Lecun used MNIST in 1998 to demonstrate [Lenet 5](http://yann.lecun.com/exdb/lenet/), the first computer system to demonstrate practically useful recognition of hand-written digit sequences. This was one of the most important breakthroughs in the history of AI." "In order to understand what happens in a computer vision model, we first have to understand how computers handle images. We'll use one of the most famous datasets in computer vision, [MNIST](https://en.wikipedia.org/wiki/MNIST_database), for our experiments. MNIST contains hand-written digits, collected by the National Institute of Standards and Technology, and collated into a machine learning dataset by Yann Lecun and his colleagues. Lecun used MNIST in 1998 to demonstrate [Lenet 5](https://yann.lecun.com/exdb/lenet/), the first computer system to demonstrate practically useful recognition of hand-written digit sequences. This was one of the most important breakthroughs in the history of AI."
] ]
}, },
{ {
@@ -65,9 +65,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The story of deep learning is one of tenacity and grit from a handful of dedicated researchers. After early hopes (and hype!) neural networks went out of favor in the 1990's and 2000's, and just a handful of researchers kept trying to make them work well. Three of them, Yann Lecun, Geoff Hinton, and Yoshua Bengio were awarded the highest honor in computer science, the Turing Award (generally considered the \"Nobel Prize of computer science\") after triumphing despite the deep skepticism and disinterest of the wider machine learning and statistics community.\n", "The story of deep learning is one of tenacity and grit from a handful of dedicated researchers. After early hopes (and hype!) neural networks went out of favor in the 1990's and 2000's, and just a handful of researchers kept trying to make them work well. Three of them, Yann Lecun, Yoshua Bengio and Geoffrey Hinton were awarded the highest honor in computer science, the Turing Award (generally considered the \"Nobel Prize of computer science\") after triumphing despite the deep skepticism and disinterest of the wider machine learning and statistics community.\n",
"\n", "\n",
"<img src=\"images/turing_300.jpg\" id=\"dl_fathers\" caption=\"Left to right, Yann Lecun, Geoffrey Hinton and Yoshua Bengio\" alt=\"Picture of Yann Lecun, Geoffrey Hinton and Yoshua Bengio\">\n", "<img src=\"images/turing_300.jpg\" id=\"dl_fathers\" caption=\"Left to right, Yann Lecun, Yoshua Bengio and Geoffrey Hinton\" alt=\"Picture of Yann Lecun, Yoshua Bengio and Geoffrey Hinton\">\n",
"\n", "\n",
"Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected from top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read hand-written text--something that had never been achieved before. However his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n", "Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected from top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read hand-written text--something that had never been achieved before. However his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n",
"\n", "\n",
@@ -228,7 +228,7 @@
"source": [ "source": [
"Here we are using the `Image` class from the *Python Imaging Library* (PIL), which is the most widely used Python package for opening, manipulating, and viewing images. Jupyter knows about PIL images, so it displays the image for us automatically.\n", "Here we are using the `Image` class from the *Python Imaging Library* (PIL), which is the most widely used Python package for opening, manipulating, and viewing images. Jupyter knows about PIL images, so it displays the image for us automatically.\n",
"\n", "\n",
"In a computer, everything is represented as a number. To view the numbers that make up this image, we have to convert it to a *NumPy array* or a *PyTorch tensor*. For instance, here's a few numbers from the top-left of the image, converted to a numpy array:" "In a computer, everything is represented as a number. To view the numbers that make up this image, we have to convert it to a *NumPy array* or a *PyTorch tensor*. For instance, here's a few numbers from the top-left of the image, converted to a NumPy array:"
] ]
}, },
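To make this concrete, here is a minimal sketch of the PIL-to-array conversion described above; the file path and slice indices are hypothetical, and any small grayscale image would do:

```python
from PIL import Image
import numpy as np

im = Image.open('images/three_example.png')  # hypothetical sample digit
arr = np.array(im)       # the image as a NumPy array of pixel values
print(arr.shape)         # (28, 28) for an MNIST digit
print(arr[4:10, 4:10])   # a few numbers from the top-left of the image
```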
{ {
@@ -1462,7 +1462,7 @@
"\n", "\n",
"Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we'll be needing this later, we'll also cast our stacked tensor to `float` now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.\n", "Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we'll be needing this later, we'll also cast our stacked tensor to `float` now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.\n",
"\n", "\n",
"Generally when images are floats, the pixels are expected to be be zero and one, so we will also divide by 255 here." "Generally when images are floats, the pixels are expected to be between zero and one, so we will also divide by 255 here."
] ]
}, },
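A minimal sketch of the cast-and-scale step; the image tensors here are randomly generated stand-ins, not the real digits:

```python
import torch

# Stand-in for a list of 28x28 integer image tensors
image_tensors = [torch.randint(0, 256, (28, 28)) for _ in range(10)]

# Stack into one rank-3 tensor, cast to float, and scale pixels into [0, 1]
stacked = torch.stack(image_tensors).float() / 255
print(stacked.shape, stacked.dtype)  # torch.Size([10, 28, 28]) torch.float32
```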
{ {
@@ -1586,7 +1586,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"According to this dataset, this is the ideal number three! Let's do the same thing for the sevens, but let's put all the steps together at once to save some time:" "According to this dataset, this is the ideal number three! Let's do the same thing for the sevens:"
] ]
}, },
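The "ideal" digit is just the pixel-wise mean over the stacked images. A sketch, with stand-in data in place of the real stacked sevens:

```python
import torch

stacked_sevens = torch.rand(6265, 28, 28)  # stand-in for the real stacked sevens
mean7 = stacked_sevens.mean(0)             # average over the image axis
print(mean7.shape)                         # torch.Size([28, 28])
```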
{ {
@@ -1659,7 +1659,7 @@
"- Take the mean of the *absolute value* of differences (_absolute value_ is the function that replaces negative values with positive values). This is called the *mean absolute difference* or *L1 norm*\n", "- Take the mean of the *absolute value* of differences (_absolute value_ is the function that replaces negative values with positive values). This is called the *mean absolute difference* or *L1 norm*\n",
"- Take the mean of the *square* of differences (which makes everything positive) and then take the *square root* (which *undoes* the squaring). This is called the *root mean squared error (RMSE)* or *L2 norm*.\n", "- Take the mean of the *square* of differences (which makes everything positive) and then take the *square root* (which *undoes* the squaring). This is called the *root mean squared error (RMSE)* or *L2 norm*.\n",
"\n", "\n",
"> important: in this book we generally assume that you have completed high school maths, and remember at least some of it... But everybody forgets some things! It all depends on what you happen to have had reason to practice in the meantime. Perhaps you have forgotten what a _square root_ is, or exactly how they work. No problem! Any time you come across a maths concept that is not explained fully in this book, don't just keep moving on, but instead stop and look it up. Make sure you understand the basic idea of what that maths concept is, how it works, and why we might be using it. One of the best places to refresh your understanding is Khan Academy. For instance, Khan Academy has a great [introduction to square roots](https://www.khanacademy.org/math/algebra/x2f8bb11595b61c86:rational-exponents-radicals/x2f8bb11595b61c86:radicals/v/understanding-square-roots)." "> important: in this book we generally assume that you have completed high school maths, and remember at least some of it... But everybody forgets some things! It all depends on what you happen to have had reason to practice in the meantime. Perhaps you have forgotten what a _square root_ is, or exactly how they work. No problem! Any time you come across a maths concept that is not explained fully in this book, don't just keep moving on, but instead stop and look it up. Make sure you understand the basic idea of what that the maths concept is, how it works, and why we might be using it. One of the best places to refresh your understanding is Khan Academy. For instance, Khan Academy has a great [introduction to square roots](https://www.khanacademy.org/math/algebra/x2f8bb11595b61c86:rational-exponents-radicals/x2f8bb11595b61c86:radicals/v/understanding-square-roots)."
] ]
}, },
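Both distance measures are one-liners in PyTorch; `F.l1_loss` and `F.mse_loss` are the built-in equivalents. A sketch with stand-in images:

```python
import torch
import torch.nn.functional as F

a = torch.rand(28, 28)  # a digit image (stand-in values)
b = torch.rand(28, 28)  # the "ideal" digit (stand-in values)

l1 = (a - b).abs().mean()          # mean absolute difference (L1 norm)
l2 = ((a - b) ** 2).mean().sqrt()  # root mean squared error (L2 norm)

assert torch.isclose(l1, F.l1_loss(a, b))
assert torch.isclose(l2, F.mse_loss(a, b).sqrt())
```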
{ {
@@ -1790,13 +1790,13 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"A numpy array is multidimensional table of data, with all items of the same type. Since that can be any type at all, they could even be arrays of arrays, with the innermost arrays potentially being different sizes — this is called a \"jagged array\". By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or a \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as an integer or a float then numpy will store them as a compact C data structure in memory. This is where numpy shines. Numpy has a wide variety of operators and methods which can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.\n", "Python is slow compared to many languages. Anything fast in Python, NumPy or PyTorch is likely to be a wrapper to a compiled object written (and optimised) in another language - specifically C. In fact, **NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than using pure Python.**\n",
"\n", "\n",
"In fact, **arrays and tensors can finish computations many thousands of times faster than using pure Python.**\n", "A NumPy array is multidimensional table of data, with all items of the same type. Since that can be any type at all, they could even be arrays of arrays, with the innermost arrays potentially being different sizes — this is called a \"jagged array\". By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or a \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as an integer or a float then NumPy will store them as a compact C data structure in memory. This is where NumPy shines. Numpy has a wide variety of operators and methods which can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.\n",
"\n", "\n",
"A PyTorch tensor is nearly the same thing as a numpy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all componentss. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n", "A PyTorch tensor is nearly the same thing as a NumPy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all components. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n",
"\n", "\n",
"The vast majority of methods and operators supported by numpy on these structures are also supported by PyTorch. But PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimised for the GPU, and can run much faster. In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you'll see, it would be impossible to do deep learning in practice without this capability.\n", "The vast majority of methods and operators supported by NumPy on these structures are also supported by PyTorch. But PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimised for the GPU, and can run much faster (given lots of values to work on). In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you'll see, it would be impossible to do deep learning in practice without this capability.\n",
"\n", "\n",
"> s: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n", "> s: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n",
"\n", "\n",
@@ -2338,7 +2338,7 @@
"\n", "\n",
"As we discussed, this is the key to allowing us to have something which can get better and better — to learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters (which will be the SGD part, as we will see). In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n", "As we discussed, this is the key to allowing us to have something which can get better and better — to learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters (which will be the SGD part, as we will see). In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n",
"\n", "\n",
"Instead of trying to find the similarity between an image and a \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n", "Instead of trying to find the similarity between an image and an \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n",
"\n", "\n",
"```\n", "```\n",
"def pr_eight(x,w) = (x*w).sum()\n", "def pr_eight(x,w) = (x*w).sum()\n",
@@ -2484,13 +2484,13 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"These seven steps, illustrated in <<gradient_descent>> are the key to the training of all deep learning models and we'll be using the seven terms in the above diagram throughout this book. That deep learning turns out to rely entirely on these steps is extremely surprising and counter-intuitive. It's amazing that this process can solve such complex problems. But, as you'll see, it really does!\n", "These seven steps, illustrated in <<gradient_descent>> are the key to the training of all deep learning models, and we'll be using the seven terms in the above diagram throughout this book. That deep learning turns out to rely entirely on these steps is extremely surprising and counter-intuitive. It's amazing that this process can solve such complex problems. But, as you'll see, it really does!\n",
"\n", "\n",
"There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details which make a big difference for deep learning practitioners. But it turns out that the general approach to each one generally follows some basic principles:\n", "There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details which make a big difference for deep learning practitioners. But it turns out that the general approach to each one generally follows some basic principles:\n",
"\n", "\n",
"- **Initialize**:: we initialise the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n", "- **Initialize**:: we initialise the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
"- **Loss**:: This is the thing Arthur Samuel refered to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n", "- **Loss**:: This is the thing Arthur Samuel refered to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes, by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n", "- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight via *gradients*, without having to try all these small changes. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
"- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time." "- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
] ]
}, },
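Here is a minimal end-to-end sketch of the seven steps on a toy problem (fitting a quadratic); every name and value is invented for illustration:

```python
import torch

x = torch.linspace(-2, 2, 20)
y_true = 3 * x**2 + 2 * x + 1                # synthetic targets

params = torch.randn(3, requires_grad=True)  # 1. initialize randomly

lr = 1e-2
for step in range(100):
    a, b, c = params
    y_pred = a * x**2 + b * x + c            # 2. predict
    loss = ((y_pred - y_true) ** 2).mean()   # 3. calculate the loss
    loss.backward()                          # 4. calculate the gradients
    with torch.no_grad():
        params -= lr * params.grad           # 5. step the weights
        params.grad.zero_()
    # 6. repeat; 7. stop (here, simply after a fixed number of steps)

print(params, loss.item())
```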
@@ -2587,7 +2587,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can change our weight by a little in the direction of the slop, calculate our loss and adjustment again, and repeat this a few times. Eventually, we will get to the lowest point on our curve:" "We can change our weight by a little in the direction of the slope, calculate our loss and adjustment again, and repeat this a few times. Eventually, we will get to the lowest point on our curve:"
] ]
}, },
{ {
@@ -3150,7 +3150,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can use these gradients to improve our parameters. We'll need to pick a learning rate (we'll discuss how to do that in practice in the next chapter; for now we'll just pick `1.0`):" "We can use these gradients to improve our parameters. We'll need to pick a learning rate (we'll discuss how to do that in practice in the next chapter; for now we'll just pick `1.0E-5` so `0.00001`):"
] ]
}, },
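The update itself is one line. A sketch with a stand-in loss, just to show how the gradient is scaled by the learning rate:

```python
import torch

params = torch.randn(3, requires_grad=True)
loss = (params ** 2).sum()      # stand-in loss
loss.backward()                 # populates params.grad

lr = 1e-5                       # the learning rate chosen above
with torch.no_grad():
    params -= lr * params.grad  # step each parameter against its gradient
    params.grad.zero_()         # clear gradients before the next backward pass
```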
{ {
@@ -3391,7 +3391,7 @@
"\n", "\n",
"As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n", "As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n",
"\n", "\n",
"> s: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5) so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are zero or infinite, so useless to do an update of gradient descent.\n", "> s: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5) so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are zero or infinite, which are useless to do an update of gradient descent.\n",
"\n", "\n",
"Instead, we need a loss function which, when our weights result in slightly better predictions, gives us a slightly better loss. So what does a \"slightly better prediction\" look like, exactly? Well, in this case, it means that, if the correct answer is a 3, then the score is a little higher, or if the correct answer is a 7, then the score is a little lower.\n", "Instead, we need a loss function which, when our weights result in slightly better predictions, gives us a slightly better loss. So what does a \"slightly better prediction\" look like, exactly? Well, in this case, it means that, if the correct answer is a 3, then the score is a little higher, or if the correct answer is a 7, then the score is a little lower.\n",
"\n", "\n",
@@ -3550,7 +3550,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Pytorch actually already defines this for us, so we dont really need our own version. This is an important function in deep learning, since we often want to ensure values between zero and one. This is what it looks like:" "Pytorch actually already defines this for us, so we dont really need our own version. This is an important function in deep learning, since we often want to ensure values are between zero and one. This is what it looks like:"
] ]
}, },
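A quick sketch comparing a hand-rolled version with the built-in `torch.sigmoid`; both squash any input into the open interval (0, 1):

```python
import torch

def sigmoid(x): return 1 / (1 + torch.exp(-x))  # our own version

x = torch.linspace(-6, 6, 5)  # tensor([-6., -3., 0., 3., 6.])
print(sigmoid(x))             # tensor([0.0025, 0.0474, 0.5000, 0.9526, 0.9975])
print(torch.sigmoid(x))       # PyTorch's built-in gives the same values
```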
{ {
@@ -3621,13 +3621,13 @@
"source": [ "source": [
"Now that we have a loss function which is suitable to drive SGD, we can consider some of the details involved in the next phase of the learning process, which is *step* (i.e., change or update) the weights based on the gradients. This is called an optimisation step.\n", "Now that we have a loss function which is suitable to drive SGD, we can consider some of the details involved in the next phase of the learning process, which is *step* (i.e., change or update) the weights based on the gradients. This is called an optimisation step.\n",
"\n", "\n",
"In order to take an optimiser step we need to calculate the loss over one or more data items. How many should we use? We could calculate it for the whole dataset, and take the average, or we could calculate it for a single data item. But neither of these is ideal. Calculating it for the whole dataset would take a very long time. Calculating it for a single item would not use much inofmration, and so it would result in a very imprecise and unstable gradient. That is, you'd be going to the trouble of updating the weights but taking into account only how that would improve the model's performance on that single item.\n", "In order to take an optimiser step we need to calculate the loss over one or more data items. How many should we use? We could calculate it for the whole dataset, and take the average, or we could calculate it for a single data item. But neither of these is ideal. Calculating it for the whole dataset would take a very long time. Calculating it for a single item would not use much information, and so it would result in a very imprecise and unstable gradient. That is, you'd be going to the trouble of updating the weights but taking into account only how that would improve the model's performance on that single item.\n",
"\n", "\n",
"So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your datasets gradient on the loss function, but it will take longer, and you will get less mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n", "So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your datasets gradient on the loss function, but it will take longer, and you will get less mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n",
"\n", "\n",
"Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!.\n", "Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!.\n",
"\n", "\n",
"As we've seen, in the discussion of data augmentation, we get better generalisation if we can very things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our data set in order for every epoch, instead what we normally do in practice is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n", "As we've seen, in the discussion of data augmentation, we get better generalisation if we can vary things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our data set in order for every epoch, instead what we normally do is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n",
"\n", "\n",
"A `DataLoader` can take any Python collection, and turn it into an iterator over many batches, like so:" "A `DataLoader` can take any Python collection, and turn it into an iterator over many batches, like so:"
] ]
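A sketch using PyTorch's `DataLoader` (fastai's version behaves the same way for a simple case like this):

```python
from torch.utils.data import DataLoader

coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
print(list(dl))  # three shuffled batches of five items each
```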
@@ -4741,7 +4741,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"So far we have a general procedure for optimising the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add a non-linearity between two linear classifiers, and this is what will gived us a neural network.\n", "So far we have a general procedure for optimising the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add a non-linearity between two linear classifiers, and this is what gives us a neural network.\n",
"\n", "\n",
"Here is the entire definition of a basic neural network:" "Here is the entire definition of a basic neural network:"
] ]
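As a sketch of what such a definition can look like, with hypothetical sizes (28*28 inputs, 30 hidden units, 1 output); the notebook's own version follows in the next cell:

```python
import torch

w1 = torch.randn(28 * 28, 30); b1 = torch.zeros(30)
w2 = torch.randn(30, 1);       b2 = torch.zeros(1)

def simple_net(xb):
    res = xb @ w1 + b1                # first linear layer
    res = res.max(torch.tensor(0.0))  # the non-linearity (ReLU)
    return res @ w2 + b2              # second linear layer

print(simple_net(torch.rand(4, 28 * 28)).shape)  # torch.Size([4, 1])
```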
@@ -4831,7 +4831,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"> s: Mathematically, we say the composition of two linear functions is another linear function. So we can stack as many linear classifiers on top or each other, without non-linear functions between them, it will jsut be the same as one linear classifier." "> s: Mathematically, we say the composition of two linear functions is another linear function. So we can stack as many linear classifiers on top or each other, without non-linear functions between them, it will just be the same as one linear classifier."
] ]
}, },
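The sidebar's claim is easy to check numerically: two stacked linear maps collapse into a single one, because matrix multiplication is associative.

```python
import torch

A = torch.randn(5, 4)  # first linear layer's weights
B = torch.randn(4, 3)  # second linear layer's weights
x = torch.randn(2, 5)

# Applying the two layers in sequence equals one layer with weights A @ B
assert torch.allclose((x @ A) @ B, x @ (A @ B), atol=1e-5)
```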
{ {
@@ -5357,8 +5357,8 @@
"[options=\"header\"]\n", "[options=\"header\"]\n",
"|=====\n", "|=====\n",
"| Term | Meaning\n", "| Term | Meaning\n",
"|**ReLU** | Funxtion that returns 0 for negatives numbers and doesn't change positive numbers\n", "|**ReLU** | Function that returns 0 for negative numbers and doesn't change positive numbers\n",
"|**mini-batch** | A few inputs and labels gathered together in two big arrays\n", "|**mini-batch** | A few inputs and labels gathered together in two arrays. A gradient descent step is updated on this batch (rather than a whole epoch).\n",
"|**forward pass** | Applying the model to some input and computing the predictions\n", "|**forward pass** | Applying the model to some input and computing the predictions\n",
"|**loss** | A value that represents how well (or badly) our model is doing\n", "|**loss** | A value that represents how well (or badly) our model is doing\n",
"|**gradient** | The derivative of the loss with respect to some parameter of the model\n", "|**gradient** | The derivative of the loss with respect to some parameter of the model\n",
@@ -5467,5 +5467,5 @@
} }
}, },
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 2 "nbformat_minor": 4
} }