Fixes erratas

sgugger 2021-02-17 19:19:42 -05:00
parent fb57077906
commit c3ceea7996
10 changed files with 20 additions and 57 deletions

View File

@@ -1541,7 +1541,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The Pet dataset contains 7,390 pictures of dogs and cats, consisting of 37 different breeds. Each image is labeled using its filename: for instance the file *great\\_pyrenees\\_173.jpg* is the 173rd example of an image of a Great Pyrenees breed dog in the dataset. The filenames start with an uppercase letter if the image is a cat, and a lowercase letter otherwise. We have to tell fastai how to get labels from the filenames, which we do by calling `from_name_func` (which means that labels can be extracted using a function applied to the filename), and passing `x[0].isupper()`, which evaluates to `True` if the first letter is uppercase (i.e., it's a cat).\n",
"The Pet dataset contains 7,390 pictures of dogs and cats, consisting of 37 different breeds. Each image is labeled using its filename: for instance the file *great\\_pyrenees\\_173.jpg* is the 173rd example of an image of a Great Pyrenees breed dog in the dataset. The filenames start with an uppercase letter if the image is a cat, and a lowercase letter otherwise. We have to tell fastai how to get labels from the filenames, which we do by calling `from_name_func` (which means that labels can be extracted using a function applied to the filename), and passing `is_cat`, which returns `x[0].isupper()`, which evaluates to `True` if the first letter is uppercase (i.e., it's a cat).\n",
"\n",
"The most important parameter to mention here is `valid_pct=0.2`. This tells fastai to hold out 20% of the data and *not use it for training the model at all*. This 20% of the data is called the *validation set*; the remaining 80% is called the *training set*. The validation set is used to measure the accuracy of the model. By default, the 20% that is held out is selected randomly. The parameter `seed=42` sets the *random seed* to the same value every time we run this code, which means we get the same validation set every time we run it—this way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not due to having a different random validation set.\n",
"\n",

View File

@@ -1784,7 +1784,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We just completed various mathematical operations on PyTorch tensors. If you've done some numeric programming in PyTorch before, you may recognize these as being similar to NumPy arrays. Let's have a look at those two very important data structures."
"We just completed various mathematical operations on PyTorch tensors. If you've done some numeric programming in NumPy before, you may recognize these as being similar to NumPy arrays. Let's have a look at those two very important data structures."
]
},
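A quick illustration of how closely the two structures interoperate (a minimal sketch, not taken from the notebook itself):

```python
import numpy as np
import torch

a = np.array([1., 2., 3.])
t = torch.from_numpy(a)   # tensor that shares memory with the NumPy array
b = t.numpy()             # back to NumPy, again without copying
```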
{
@@ -2194,7 +2194,7 @@
}
],
"source": [
"tensor([1,2,3]) + tensor([1,1,1])"
"tensor([1,2,3]) + tensor(1)"
]
},
{
@@ -3833,7 +3833,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check our accuracy. To decide if an output represents a 3 or a 7, we can just check whether it's greater than 0, so our accuracy for each item can be calculated (using broadcasting, so no loops!) with:"
"Let's check our accuracy. To decide if an output represents a 3 or a 7, we can just check whether it's greater than 0.5, so our accuracy for each item can be calculated (using broadcasting, so no loops!) with:"
]
},
{
@@ -3859,7 +3859,7 @@
}
],
"source": [
"corrects = (preds>0.0).float() == train_y\n",
"corrects = (preds>0.5).float() == train_y\n",
"corrects"
]
},
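To make the broadcasting comparison concrete, here is a toy version with made-up `preds` and `train_y` standing in for the chapter's tensors:

```python
import torch

preds   = torch.tensor([0.9, 0.4, 0.7])
train_y = torch.tensor([1., 0., 0.])

corrects = (preds > 0.5).float() == train_y   # elementwise comparison via broadcasting, no loop
accuracy = corrects.float().mean().item()     # 0.667 for this toy example
```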

View File

@@ -1237,7 +1237,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The function we saw in the previous section works quite well as a loss function, but we can make it a bit better. The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1. That means that our model will not care whether it predicts 0.99 or 0.999. Indeed, those numbers are so close together—but in another sense, 0.999 is 10 times more confident than 0.99. So, we want to transform our numbers between 0 and 1 to instead be between negative infinity and infinity. There is a mathematical function that does exactly this: the *logarithm* (available as `torch.log`). It is not defined for numbers less than 0, and looks like this:"
"The function we saw in the previous section works quite well as a loss function, but we can make it a bit better. The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1. That means that our model will not care whether it predicts 0.99 or 0.999. Indeed, those numbers are so close together—but in another sense, 0.999 is 10 times more confident than 0.99. So, we want to transform our numbers between 0 and 1 to instead be between negative infinity and 0. There is a mathematical function that does exactly this: the *logarithm* (available as `torch.log`). It is not defined for numbers less than 0, and looks like this:"
]
},
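To see the effect numerically (a small check, not part of the notebook): `torch.log` maps (0, 1) to (negative infinity, 0) and spreads out values that are crowded near 1:

```python
import torch

probs = torch.tensor([0.99, 0.999])
print(torch.log(probs))   # tensor([-0.0101, -0.0010]): the tenfold confidence gap is now visible
```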
{
@@ -2559,18 +2559,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,

View File

@@ -1987,7 +1987,7 @@
"source": [
"To turn our architecture into a deep learning model, the first step is to take the results of the embedding lookup and concatenate those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n",
"\n",
"Since we'll be concatenating the embedding matrices, rather than taking their dot product, the two embedding matrices can have different sizes (i.e., different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:"
"Since we'll be concatenating the embeddings, rather than taking their dot product, the two embedding matrices can have different sizes (i.e., different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:"
]
},
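A minimal sketch of the concatenation idea in plain PyTorch (the embedding sizes below are hypothetical placeholders; in the chapter they come from `get_emb_sz`):

```python
import torch
from torch import nn

class CollabNN(nn.Module):
    def __init__(self, user_sz=(944, 74), item_sz=(1665, 102), n_act=100):
        super().__init__()
        # two embedding tables of different widths, since we concatenate rather than take a dot product
        self.user_factors = nn.Embedding(*user_sz)
        self.item_factors = nn.Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act), nn.ReLU(),
            nn.Linear(n_act, 1))

    def forward(self, x):
        # x: batch of (user index, item index) pairs
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        return self.layers(torch.cat(embs, dim=1))
```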
{
@@ -2346,31 +2346,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@@ -315,7 +315,7 @@
"\n",
"- `xxbos`:: Indicates the beginning of a text (here, a review)\n",
"- `xxmaj`:: Indicates the next word begins with a capital (since we lowercased everything)\n",
"- `xxunk`:: Indicates the next word is unknown\n",
"- `xxunk`:: Indicates the word is unknown\n",
"\n",
"To see the rules that were used, you can check the default rules:"
]
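For the curious, inspecting the default rules looks roughly like this (the notebook's next cell presumably does the same):

```python
from fastai.text.all import *

defaults.text_proc_rules   # the preprocessing rules that insert special tokens such as xxbos and xxmaj
```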

View File

@@ -2263,8 +2263,8 @@
"source": [
"You have now seen everything that is inside the AWD-LSTM architecture we used in text classification in <<chapter_nlp>>. It uses dropout in a lot more places:\n",
"\n",
"- Embedding dropout (just after the embedding layer)\n",
"- Input dropout (after the embedding layer)\n",
"- Embedding dropout (inside the embedding layer, drops some random lines of embeddings)\n",
"- Input dropout (applied after the embedding layer)\n",
"- Weight dropout (applied to the weights of the LSTM at each training step)\n",
"- Hidden dropout (applied to the hidden state between two layers)\n",
"\n",
@@ -2316,7 +2316,7 @@
"1. Why can we use a higher learning rate for `LMModel6`?\n",
"1. What are the three regularization techniques used in an AWD-LSTM model?\n",
"1. What is \"dropout\"?\n",
"1. Why do we scale the weights with dropout? Is this applied during training, inference, or both?\n",
"1. Why do we scale the acitvations with dropout? Is this applied during training, inference, or both?\n",
"1. What is the purpose of this line from `Dropout`: `if not self.training: return x`\n",
"1. Experiment with `bernoulli_` to understand how it works.\n",
"1. How do you set your model in training mode in PyTorch? In evaluation mode?\n",

View File

@@ -2943,7 +2943,7 @@
"source": [
"This didn't train at all well! Let's find out why.\n",
"\n",
"One handy feature of the callbacks passed to `Learner` is that they are made available automatically, with the same name as the callback class, except in `camel_case`. So, our `ActivationStats` callback can be accessed through `activation_stats`. I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
"One handy feature of the callbacks passed to `Learner` is that they are made available automatically, with the same name as the callback class, except in `snake_case`. So, our `ActivationStats` callback can be accessed through `activation_stats`. I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
"\n",
"`ActivationStats` includes some handy utilities for plotting the activations during training. `plot_layer_stats(idx)` plots the mean and standard deviation of the activations of layer number *`idx`*, along with the percentage of activations near zero. Here's the first layer's plot:"
]
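The snake_case name is derived from the class name; to my knowledge fastai does this with fastcore's `camel2snake` helper, so you can check the mapping directly:

```python
from fastcore.basics import camel2snake

camel2snake('ActivationStats')   # -> 'activation_stats'
camel2snake('Recorder')          # -> 'recorder'
```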
@@ -3261,7 +3261,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The percentage of nonzero weights is getting much better, although it's still quite high.\n",
"The percentage of near-zero weights is getting much better, although it's still quite high.\n",
"\n",
"We can see even more about what's going on in our training using `color_dim`, passing it a layer index:"
]
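A short usage sketch, assuming `learn` was created with `cbs=ActivationStats(with_hist=True)` as earlier in the chapter:

```python
learn.activation_stats.plot_layer_stats(0)   # mean, std, and % of near-zero activations for the first layer
learn.activation_stats.color_dim(-2)         # the "colorful dimension" histogram plot for the penultimate layer
```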

View File

@@ -378,7 +378,7 @@
"\n",
"> jargon: Identity mapping: Returning the input without changing it at all. This process is performed by an _identity function_.\n",
"\n",
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter that adds a second convolution, then a ReLU, then a batchnorm layer. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` to zero for every one of those final batchnorm layers? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter that adds a second convolution, then a batchnorm layer, then a ReLU. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` to zero for every one of those final batchnorm layers? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
"\n",
"What has that gained us? The key thing is that those 36 extra layers, as they stand, are an *identity mapping*, but they have *parameters*, which means they are *trainable*. So, we can start with our best 20-layer model, add these 36 extra layers which initially do nothing at all, and then *fine-tune the whole 56-layer model*. Those extra 36 layers can then learn the parameters that make them most useful.\n",
"\n",

View File

@@ -670,7 +670,7 @@
" x = torch.cat(x, 1)\n",
"```\n",
"\n",
"Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value:\n",
"Then dropout is applied. You can pass `embd_p` to `__init__` to change this value:\n",
"\n",
"```python\n",
" x = self.emb_drop(x)\n",

View File

@@ -1105,9 +1105,9 @@
"\n",
"The rules of Einstein summation notation are as follows:\n",
"\n",
"1. Repeated indices are implicitly summed over.\n",
"1. Each index can appear at most twice in any term.\n",
"1. Each term must contain identical nonrepeated indices.\n",
"1. Repeated indices on the left side are implicitly summed over if they are not on the right side.\n",
"2. Each index can appear at most twice on the left side.\n",
"3. The unrepeated indices on the left side must appear on the right side.\n",
"\n",
"So in our example, since `k` is repeated, we sum over that index. In the end the formula represents the matrix obtained when we put in `(i,j)` the sum of all the coefficients `(i,k)` in the first tensor multiplied by the coefficients `(k,j)` in the second tensor... which is the matrix product! Here is how we can code this in PyTorch:"
]
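The notebook's next cell presumably uses `torch.einsum`; a small self-contained check of the rule described above:

```python
import torch

a, b = torch.randn(3, 4), torch.randn(4, 5)
c = torch.einsum('ik,kj->ij', a, b)   # k appears twice on the left and not on the right, so it is summed over
assert torch.allclose(c, a @ b)       # identical to the ordinary matrix product
```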
@@ -1348,7 +1348,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is `nan`s everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights? But if we use too small weights, we will have the opposite problem—the scale of our activations will go from 1 to 0.1, and after 100 layers we'll be left with zeros everywhere:"
"The result is `nan`s everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights? But if we use too small weights, we will have the opposite problem—the scale of our activations will go from 1 to 0.1, and after 50 layers we'll be left with zeros everywhere:"
]
},
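The experiment referred to looks roughly like this (a sketch with plausible shapes; the exact sizes in the notebook may differ, and the values change run to run):

```python
import torch

x = torch.randn(200, 100)
for i in range(50):
    x = x @ (torch.randn(100, 100) * 0.01)   # weights scaled down too far
x.mean(), x.std()   # both effectively zero: the activations have vanished
```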
{
@@ -1825,7 +1825,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For the gradients of the ReLU and our linear layer, we use the gradients of the loss with respect to the output (in `out.g`) and apply the chain rule to compute the gradients of the loss with respect to the output (in `inp.g`). The chain rule tells us that `inp.g = relu'(inp) * out.g`. The derivative of `relu` is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:"
"For the gradients of the ReLU and our linear layer, we use the gradients of the loss with respect to the output (in `out.g`) and apply the chain rule to compute the gradients of the loss with respect to the input (in `inp.g`). The chain rule tells us that `inp.g = relu'(inp) * out.g`. The derivative of `relu` is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:"
]
},
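Concretely, following the chapter's convention of stashing gradients in a `.g` attribute, the ReLU backward step can be sketched as a standalone snippet:

```python
import torch

def relu_grad(inp, out):
    # chain rule: inp.g = relu'(inp) * out.g, and relu'(x) is 1 where x > 0, else 0
    inp.g = (inp > 0).float() * out.g

inp = torch.tensor([-1., 2., -3., 4.])
out = inp.clamp_min(0.)          # forward pass of ReLU
out.g = torch.ones_like(out)     # pretend upstream gradient from the loss
relu_grad(inp, out)
print(inp.g)                     # tensor([0., 1., 0., 1.])
```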
{