Merge branch 'master' of github.com:fastai/fastbook

This commit is contained in:
Jeremy Howard 2020-04-14 12:52:52 -07:00
commit 2e431d1ef7
16 changed files with 183 additions and 57 deletions

View File

@ -923,7 +923,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There a number of powerful concepts embedded in this short statement: \n",
"There are a number of powerful concepts embedded in this short statement: \n",
"\n",
"- the idea of a \"weight assignment\" \n",
"- the fact that every weight assignment has some \"actual performance\"\n",
@ -1526,7 +1526,7 @@
" label_func=is_cat, item_tfms=Resize(224))\n",
"```\n",
"\n",
"The fourth line tells fastai what kind of dataset we have, and how it is structured. There are various different classes for different kinds of deep learning dataset and problem--here we're using `ImageDataLoaders`. The first part of the class name will generally be the type of data you have, such as image, or text. The second part will generally be the type of problem you are solving, such as classification, or regression.\n",
"The fourth line tells fastai what kind of dataset we have, and how it is structured. There are various different classes for different kinds of deep learning datasets and problems--here we're using `ImageDataLoaders`. The first part of the class name will generally be the type of data you have, such as image, or text. The second part will generally be the type of problem you are solving, such as classification, or regression.\n",
"\n",
"The other important piece of information that we have to tell fastai is how to get the labels from the dataset. Computer vision datasets are normally structured in such a way that the label for an image is part of the file name, or path, most commonly the parent folder name. Fastai comes with a number of standardized labelling methods, and ways to write your own. Here we define a function on the third line: `is_cat` which labels cats based on a filename rule provided by the dataset creators.\n",
"\n",
@ -2914,6 +2914,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -121,11 +121,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many domains in which deep learning has not been used to analyse images yet, but those where it has been tried have nearly universally shown that computers can recognise what items are in an image at least as well as people can — even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing whereabouts objects in an image are, and can highlight their location and name each found object. This is known as *object detection* (there is also a variant of this we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part of--this is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. If the training data did not contain hand-drawn images then the model will probably do poorly on hand-drawn images. There is no general way to check what types of image are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out of domain* data).\n",
"There are many domains in which deep learning has not been used to analyse images yet, but those where it has been tried have nearly universally shown that computers can recognise what items are in an image at least as well as people can — even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing whereabouts objects in an image are, and can highlight their location and name each found object. This is known as *object detection* (there is also a variant of this we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part of--this is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. If the training data did not contain hand-drawn images then the model will probably do poorly on hand-drawn images. There is no general way to check what types of images are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out of domain* data).\n",
"\n",
"One major challenge for object detection systems is that image labelling can be slow and expensive. There is a lot of work at the moment going into tools to try to make this labelling faster and easier, and require less handcrafted labels to train accurate object detection models. One approach which is particularly helpful is to synthetically generate variations of input images, such as by rotating them, or changing their brightness and contrast; this is called *data augmentation* and also works well for text and other types of model. We will be discussing it in detail in this chapter.\n",
"\n",
"Another point to consider is that although your problem might not look like a computer vision problem, it might be possible with a little imagination to turn it into one. For instance, if what you are trying to classify is sounds, you might try converting the sounds into images of their acoustic waveforms and then training a model on those images."
"Another point to consider is that although your problem might not look like a computer vision problem, it might be possible with a little imagination to turn it into one. For instance, if what you are trying to classify are sounds, you might try converting the sounds into images of their acoustic waveforms and then training a model on those images."
]
},
{
@ -139,7 +139,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like in computer vision, computers are very good at categorising both short and long documents based on categories such as spam, sentiment (e.g. is the review positive or negative), author, source website, and so forth. We are not aware of any rigourous work done in this area to compare to human performance, but anecdotally it seems to us that deep learning performance is similar to human performance here. Deep learning is also very good at generating context-appropriate text, such as generating replies to social media posts, and imitating a particular author's style. It is also good at making this content compelling to humans, and has been shown to be even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information, along with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content which appears to a layman to be compelling, but actually is entirely incorrect.\n",
"Just like in computer vision, computers are very good at categorising both short and long documents based on categories such as spam, sentiment (e.g. is the review positive or negative), author, source website, and so forth. We are not aware of any rigorous work done in this area to compare to human performance, but anecdotally it seems to us that deep learning performance is similar to human performance here. Deep learning is also very good at generating context-appropriate text, such as generating replies to social media posts, and imitating a particular author's style. It is also good at making this content compelling to humans, and has been shown to be even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information, along with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content which appears to a layman to be compelling, but actually is entirely incorrect.\n",
"\n",
"Another concern is that context-appropriate, highly compelling responses on social media can be used at massive scale — thousands of times greater than any troll farm previously seen — to spread disinformation, create unrest, and encourage conflict. As a rule of thumb, text generation will always be technologically a bit ahead of the ability of models to recognize automatically generated text. For instance, it is possible to use a model that can recognize artificially generated content to actually improve the generator that creates that content, until the classification model is no longer able to complete its task.\n",
"\n",
@ -274,7 +274,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> important: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use one particular provider, _Bing Image Search_, using the service they have as this book as written. We'll be providing more options and more up to date information on the http://book.fast.ai[book website], so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
"> important: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use one particular provider, _Bing Image Search_, using the service they have as this book was written. We'll be providing more options and more up to date information on the http://book.fast.ai[book website], so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
]
},
{
@ -804,7 +804,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"All of these approaches seem somewhat wasteful, or problematic. If we squished or stretch the images then they end up unrealistic shapes, leading to a model that learns that things look different to how they actually are, which we would expect to result in lower accuracy. If we crop the images then we remove some of the features that allow us to recognize them. For instance, if we were trying to recognise the breed of dog or cat, we may end up cropping out a key part of the body or the face necessary to distinguish between similar breeds. If we pad the images then we have a whole lot of empty space, which is just wasted computation for our model, and results in a lower effective resolution for the part of the image we actually use.\n",
"All of these approaches seem somewhat wasteful, or problematic. If we squished or stretched the images then they end up as unrealistic shapes, leading to a model that learns that things look different to how they actually are, which we would expect to result in lower accuracy. If we crop the images then we remove some of the features that allow us to recognize them. For instance, if we were trying to recognise the breed of dog or cat, we may end up cropping out a key part of the body or the face necessary to distinguish between similar breeds. If we pad the images then we have a whole lot of empty space, which is just wasted computation for our model, and results in a lower effective resolution for the part of the image we actually use.\n",
"\n",
"Instead, what we normally do in practice is to randomly select part of the image, and crop to just that part. On each epoch (which is one complete pass through all of our images in the dataset) we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world; different photos of the same thing may be framed in slightly different ways.\n",
"\n",
@ -855,7 +855,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms to a batch, we use the `batch_tfms` parameter. (Note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
"Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms on a batch, we use the `batch_tfms` parameter. (Note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
]
},
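A minimal sketch of what that looks like, again assuming the `bears` `DataBlock` and `path` from earlier (`mult=2` doubles the default augmentation strength, as described):

```python
bears = bears.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)  # one image shown with several augmentations
```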
{
@ -1066,7 +1066,7 @@
"\n",
"It's helpful to see where exactly our errors are occurring, to see whether it's due to a dataset problem (e.g. images that aren't bears at all, or are labelled incorrectly, etc.), or a model problem (e.g. perhaps it isn't handling images taken with unusual lighting, or from a different angle, etc.). To do this, we can sort out images by their *loss*.\n",
"\n",
"The *loss* is a number that is higher if the model is incorrect (and especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer. In a couple chapters we'll learn in depth how loss is calculated and used in training process. For now, `plot_top_losses` shows us the images with the highest loss in our dataset. As the title of the output says, each image is labeled with four things: prediction, actual (target label), loss, and probability. The *probability* here is the confidence level, from zero to one, that the model has assigned to its prediction."
"The *loss* is a number that is higher if the model is incorrect (and especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer. In a couple chapters we'll learn in depth how loss is calculated and used in the training process. For now, `plot_top_losses` shows us the images with the highest loss in our dataset. As the title of the output says, each image is labeled with four things: prediction, actual (target label), loss, and probability. The *probability* here is the confidence level, from zero to one, that the model has assigned to its prediction."
]
},
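For reference, a short sketch of how this is typically called, assuming `learn` is the classifier trained above:

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(5, nrows=1)  # shows prediction, actual, loss, and probability for each image
```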
{
@ -1293,7 +1293,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we're doing inference, we're generally just getting predicitions for one image at a time. To do this, pass a filename to `predict`:"
"When we're doing inference, we're generally just getting predictions for one image at a time. To do this, pass a filename to `predict`:"
]
},
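A minimal sketch, assuming the model was exported earlier with `learn.export()` and that `images/grizzly.jpg` is an illustrative filename:

```python
learn_inf = load_learner(path/'export.pkl')
pred_class, pred_idx, probs = learn_inf.predict('images/grizzly.jpg')
print(pred_class, probs[pred_idx])  # predicted label and the model's confidence in it
```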
{
@ -1706,7 +1706,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we now know, you need a GPU to train nearly any useful deep learning model. So, do you need a GPU to use that model in production? No! You almost certainly **do not need a GPU to serve your model in production**. There's a few reasons for this:\n",
"As we now know, you need a GPU to train nearly any useful deep learning model. So, do you need a GPU to use that model in production? No! You almost certainly **do not need a GPU to serve your model in production**. There are a few reasons for this:\n",
"\n",
"- As we've seen, GPUs are only useful when they do lots of identical work in parallel. If you're doing (say) image classification, then you'll normally be classifying just one user's image at a time, and there isn't normally enough work to do in a single image to keep a GPU busy for long enough for it to be very efficient. So a CPU will often be more cost effective.\n",
"- An alternative could be to wait for a few users to submit their images, and then batch them up, and do them all at once on a GPU. But then you're asking your users to wait, rather than getting answers straight away! And you need a high volume site for this to be workable. If you do need this functionality, you can use a tool such as Microsoft's [ONNX Runtime](https://github.com/microsoft/onnxruntime), or [AWS Sagemaker](https://aws.amazon.com/sagemaker/)\n",
@ -1754,7 +1754,7 @@
"source": [
"You may well want to deploy your application onto mobile devices, or edge devices such as a Raspberry Pi. There are a lot of libraries and frameworks to allow you to integrate a model directly into a mobile application. However these approaches tend to require a lot of extra steps and boilerplate, and do not always support all the PyTorch and fastai layers that your model might use. In addition, the work you do will depend on what kind of mobile devices you are targeting for deployment. So you might need to do some work to run on iOS devices, different work to run on newer Android devices, different work for older Android devices, etc. Instead, we recommend wherever possible that you deploy the model itself to a server, and have your mobile or edge application connect to it as a web service.\n",
"\n",
"There is quite a few upsides to this approach. The initial installation is easier, because you only have to deploy a small GUI application, which connects to the server to do all the heavy lifting. More importantly perhaps, upgrades of that core logic can happen on your server, rather than needing to be distributed to all of your users. Your server can have a lot more memory and processing capacity than most edge devices, and it is far easier to scale those resources if your model becomes more demanding. The hardware that you will have on a server is going to be more standard and more easily supported by fastai and PyTorch, so you don't have to compile your model into a different form.\n",
"There are quite a few upsides to this approach. The initial installation is easier, because you only have to deploy a small GUI application, which connects to the server to do all the heavy lifting. More importantly perhaps, upgrades of that core logic can happen on your server, rather than needing to be distributed to all of your users. Your server can have a lot more memory and processing capacity than most edge devices, and it is far easier to scale those resources if your model becomes more demanding. The hardware that you will have on a server is going to be more standard and more easily supported by fastai and PyTorch, so you don't have to compile your model into a different form.\n",
"\n",
"There are downsides too, of course. Your application will require a network connection, and there will be some latency each time the model is called. It takes a while for a neural network model to run anyway, so this additional network latency may not make a big difference to your users in practice. In fact, since you can use better hardware on the server, the overall latency may even be less! If your application uses sensitive data then your users may be concerned about an approach which sends that data to a remote server, so sometimes privacy considerations will mean that you need to run the model on the edge device. Sometimes this can be avoided by having an *on premise* server, such as inside a company's firewall. Managing the complexity and scaling the server can create additional overhead, whereas if your model runs on the edge devices then each user is bringing their own compute resources, which leads to easier scaling with an increasing number of users (also known as *horizontal scaling*)."
]
@ -1763,7 +1763,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> A: I've had a chance to see up close how the mobile ML landscape is changing in my work. We offer an iPhone app that depends on computer vision and for years we ran our own computer vision models in the cloud. This was the only way to do it then since those models needed significant memory and compute resources and took minutes to process. This approach required building not only the models (fun!) but infrastructure to ensure a certain number of \"compute worker machines\" was absolutely always running (scary), that more machines would automatically come online if traffic increased, that there was stable storage for large inputs and outputs, that the iOS app could know and tell the user how their job was doing, etc... Nowadays, Apple provides APIs for converting models to run efficiently on device and most iOS devices have dedicated ML hardware, and we run our newer models on device. So, in a few years that strategy has gone from impossible to possible but it's still not easy. In our case it's worth it, for a faster user experience and to worry less about servers. What works for you will depend, realistically, on the user experience you're trying to create and what you personally find it easy to do. If you really know how to run servers, do it. If you really know how to build native mobile apps, do that. There are many roads up the hill.\n",
"> A: I've had a chance to see up close how the mobile ML landscape is changing in my work. We offer an iPhone app that depends on computer vision and for years we ran our own computer vision models in the cloud. This was the only way to do it then since those models needed significant memory and compute resources and took minutes to process. This approach required building not only the models (fun!) but also the infrastructure to ensure a certain number of \"compute worker machines\" was absolutely always running (scary), that more machines would automatically come online if traffic increased, that there was stable storage for large inputs and outputs, that the iOS app could know and tell the user how their job was doing, etc... Nowadays, Apple provides APIs for converting models to run efficiently on device and most iOS devices have dedicated ML hardware, and we run our newer models on device. So, in a few years that strategy has gone from impossible to possible but it's still not easy. In our case it's worth it, for a faster user experience and to worry less about servers. What works for you will depend, realistically, on the user experience you're trying to create and what you personally find is easy to do. If you really know how to run servers, do it. If you really know how to build native mobile apps, do that. There are many roads up the hill.\n",
"\n",
"Overall, we'd recommend using a simple CPU-based server approach where possible, for as long as you can get away with it. If you're lucky enough to have a very successful application, then you'll be able to justify the investment in more complex deployment approaches at that time.\n",
"\n",
@ -1957,6 +1957,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -196,7 +196,7 @@
"source": [
"Of course, the project managers and engineers and technicians involved were just living their ordinary lives. Caring for their families, going to the church on Sunday, doing their jobs as best as they could. Following orders. The marketers were just doing what they could to meet their business development goals. Edwin Black, author of \"IBM and the Holocaust\", said: \"To the blind technocrat, the means were more important than the ends. The destruction of the Jewish people became even less important because the invigorating nature of IBM's technical achievement was only heightened by the fantastical profits to be made at a time when bread lines stretched across the world.\"\n",
"\n",
"Step back for a moment and consider: how would you feel if you discovered that you had been part of a system that ending up hurting society? Would you even know? Would you be open to finding out? How can you help make sure this doesn't happen? We have described the most extreme situation here in Nazi Germany, but there are many negative societal consequences happening due to AI and machine learning right now, some of which we'll describe in this chapter.\n",
"Step back for a moment and consider: how would you feel if you discovered that you had been part of a system that ended up hurting society? Would you even know? Would you be open to finding out? How can you help make sure this doesn't happen? We have described the most extreme situation here in Nazi Germany, but there are many negative societal consequences happening due to AI and machine learning right now, some of which we'll describe in this chapter.\n",
"\n",
"It's not just a moral burden either. Sometimes, technologists pay very directly for their actions. For instance, the first person who was jailed as a result of the Volkswagen scandal, where the car company cheated on their diesel emissions tests, was not the manager that oversaw the project, or an executive at the helm of the company. It was one of the engineers, James Liang, who just did what he was told.\n",
"\n",
@ -204,7 +204,7 @@
"\n",
"Okay, so hopefully we have convinced you that you ought to care. But what should you do? As data scientists, we're naturally inclined to focus on making our model better at optimizing some metric. But optimizing that metric may not actually lead to better outcomes. And even if optimizing that metric *does* help create better outcomes, it almost certainly won't be the only thing that matters. Consider the pipeline of steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. This entire pipeline needs to be considered *as a whole* if we're to have a hope of getting the kinds of outcomes we want.\n",
"\n",
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to inform everyone involved in this chain about the capabilities, constraints, and details of your work than you are. Although there's no \"silver bullet\" that can ensure your work is used the right way, by getting involved in the process, and asking the right questions, you can at the very least ensured that the right issues are being considered.\n",
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to inform everyone involved in this chain about the capabilities, constraints, and details of your work than you are. Although there's no \"silver bullet\" that can ensure your work is used the right way, by getting involved in the process, and asking the right questions, you can at the very least ensure that the right issues are being considered.\n",
"\n",
"Sometimes, the right response to being asked to do a piece of work is to just say \"no\". Often, however, the response we hear is \"if I dont do it, someone else will\". But consider this: if youve been picked for the job, youre the best person theyve found; so if you dont do it, the best person isnt working on that project. If the first 5 they ask all say no too, then even better!"
]
@ -700,7 +700,7 @@
"- support good policy\n",
"- increase diversity\n",
"\n",
"Let's walk through each step next, staring with analyzing a project you are working on."
"Let's walk through each step next, starting with analyzing a project you are working on."
]
},
{
@ -726,7 +726,7 @@
"\n",
"These questions may be able to help you identify outstanding issues, and possible alternatives that are easier to understand and control. In addition to asking the right questions, it's also important to consider practices and processes to implement.\n",
"\n",
"One thing to consider at this stage is what data you are collecting and storing. Data often ends up being used for different purposes than why it was originally collected. For instance, IBM began selling to Nazi Germany well before the Holocaust, including helping with Germanys 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
"One thing to consider at this stage is what data you are collecting and storing. Data often ends up being used for different purposes than why it was originally collected for. For instance, IBM began selling to Nazi Germany well before the Holocaust, including helping with Germanys 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
]
},
{
@ -1034,10 +1034,25 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -191,7 +191,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"No, its not actually a panda! *Pandas* is a Python library that is used to manipulate and analysis tabular and timeseries data. The main class is `DataFrame`, which represents a table of rows and columns. You can get a DataFrame from a CSV file, a database table, python dictionaries, and many other sources. In Jupyter, a DataFrame is output as a formatted table, as you see above.\n",
"No, its not actually a panda! *Pandas* is a Python library that is used to manipulate and analyse tabular and timeseries data. The main class is `DataFrame`, which represents a table of rows and columns. You can get a DataFrame from a CSV file, a database table, python dictionaries, and many other sources. In Jupyter, a DataFrame is output as a formatted table, as you see above.\n",
"\n",
"You can access rows and columns of a DataFrame with the `iloc` property, which lets you access rows and columns as if it is a matrix:"
]
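A short sketch of the `iloc` indexing just described, assuming `df` is the `DataFrame` loaded from the chapter's CSV file:

```python
import pandas as pd

df = pd.read_csv(path/'train.csv')  # path is assumed from the chapter's dataset setup
first_column = df.iloc[:, 0]        # all rows, first column
first_row    = df.iloc[0]           # first row, all columns
```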
@ -1192,7 +1192,7 @@
"source": [
"In this case, we're using the validation set to pick a hyperparameter (the threshold), which is the purpose of the validation set. But sometimes students have expressed their concern that we might be *overfitting* to the validation set, since we're trying lots of values to see which is the best. However, as you see in the plot, changing the threshold in this case results in a smooth curve, so we're clearly not picking some inappropriate outlier. This is a good example of where you have to be careful of the difference between theory (don't try lots of hyperparameter values or you might overfit the validation set) versus practice (if the relationship is smooth, then it's fine to do this).\n",
"\n",
"This concludes the part of thic chapter dedicated to multi-label classification. Let's have a look at a regression problem now."
"This concludes the part of this chapter dedicated to multi-label classification. Let's have a look at a regression problem now."
]
},
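A hedged sketch of that threshold sweep, following the pattern used in this chapter (since `get_preds` already applies the sigmoid, we pass `sigmoid=False`; the fastai star-import is assumed):

```python
preds, targs = learn.get_preds()
xs = torch.linspace(0.05, 0.95, 29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs, accs)  # a smooth curve suggests we are not overfitting the validation set
```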
{
@ -1459,7 +1459,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> important: We're not aware of other libraries (except for fastai) that automatically and correctly apply data augmentation to coordinates. So if you're working with another library, you may need to disable data augmentation for these kinda of problems."
"> important: We're not aware of other libraries (except for fastai) that automatically and correctly apply data augmentation to coordinates. So if you're working with another library, you may need to disable data augmentation for these kinds of problems."
]
},
{
@ -1664,7 +1664,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This makes sense, since when coordinates are used a dependent variable, most of the time we're likely to be trying to predict something as close as possible; that's basically what `MSELoss` (mean-squared error loss) does. If you want to use a different loss function, you can pass it to `cnn_learner` using the `loss_func` parameter.\n",
"This makes sense, since when coordinates are used as dependent variable, most of the time we're likely to be trying to predict something as close as possible; that's basically what `MSELoss` (mean-squared error loss) does. If you want to use a different loss function, you can pass it to `cnn_learner` using the `loss_func` parameter.\n",
"\n",
"Note also that we didn't specify any metrics. That's because the MSE is already a useful metric for this task (although it's probably more interpretable after we take the square root). \n",
"\n",
@ -1858,7 +1858,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In problems that are at first glance completely different (single-label classification, multi-label classification and regression) we end up using the same model with just different numbers of outputs. The different directions of those trainings is determined by the loss function, which is the one thing that changes. That's why it simportant to double-check your are using the right loss function for your problem.\n",
"In problems that are at first glance completely different (single-label classification, multi-label classification and regression) we end up using the same model with just different numbers of outputs. The different directions of those trainings is determined by the loss function, which is the one thing that changes. That's why it's important to double-check your are using the right loss function for your problem.\n",
"\n",
"In fastai, the library will automatically try to pick the right one from the data you built, but if you are using pure PyTorch to build your `DataLoader`s, make sure you think hard when you have to decide on your loss function, and remember that you most probably want\n",
"\n",

View File

@ -389,7 +389,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen, the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image — early layers find things like edges and gradients, and later layers may find things like noses and sunsets. So, when we change image size in the middle of training, it doesn't mean that we have two find totally different parameters for our model.\n",
"As we have seen, the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image — early layers find things like edges and gradients, and later layers may find things like noses and sunsets. So, when we change image size in the middle of training, it doesn't mean that we have to find totally different parameters for our model.\n",
"\n",
"But clearly there are some differences between small images and big ones, so we shouldn't expect our model to continue working exactly as well, with no changes at all. Does this remind you of something? When we developed this idea, it reminded us of transfer learning! We are trying to get our model to learn to do something a little bit different to what it has learned to do before. Therefore, we should be able to use the `fine_tune` method after we resize our images.\n",
"\n",
@ -803,7 +803,7 @@
"source": [
"#hide_input\n",
"#id mixup_example\n",
"#caption Mixing a chruch and a gas station\n",
"#caption Mixing a church and a gas station\n",
"#alt An image of a church, a gas station and the two mixed up.\n",
"church = PILImage.create(get_image_files_sorted(path/'train'/'n03028079')[0])\n",
"gas = PILImage.create(get_image_files_sorted(path/'train'/'n03425413')[0])\n",
@ -910,7 +910,7 @@
"source": [
"Let's practice our paper reading skills to try to interpret this. \"This maximum\" is refering to the previous section of the paper, which talked about the fact that `1` is the value of the label for the positive class. So any value (except infinity) can't result in `1` after sigmoid or softmax. In a paper, you won't normally see \"any value\" written, but instead it would get a symbol; in this case, it's $z_k$. This is helpful in a paper, because it can be refered to again later, and the reader knows what value is being discussed.\n",
"\n",
"The it says: $z_y\\gg z_k$ for all $k\\neq y$. In this case, the paper immediately follows with \"that is...\", which is handy, because you can just read the English instead of the math. In the math, the $y$ is refering to the target ($y$ is defined earlier in the paper; sometimes it's hard to find where symbols are defined, but nearly all papers will define all their symbols somewhere), and $z_y$ is the activation corresponding to the target. So to get close to `1`, this activation needs to be much higher than all the others for that prediction.\n",
"Then it says: $z_y\\gg z_k$ for all $k\\neq y$. In this case, the paper immediately follows with \"that is...\", which is handy, because you can just read the English instead of the math. In the math, the $y$ is refering to the target ($y$ is defined earlier in the paper; sometimes it's hard to find where symbols are defined, but nearly all papers will define all their symbols somewhere), and $z_y$ is the activation corresponding to the target. So to get close to `1`, this activation needs to be much higher than all the others for that prediction.\n",
"\n",
"Next up is \"if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize\". This is saying that making $z_y$ really big means we'll need large weights and large activations throughout our model. Large weights lead to \"bumpy\" functions, where a small change in input results in a big change to predictions. This is really bad for generalization, because it means just one pixel changing a bit could change our prediction entirely!\n",
"\n",

View File

@ -41,7 +41,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's get some data suitable for a collaboratie filtering model."
"First, let's get some data suitable for a collaborative filtering model."
]
},
{
@ -327,7 +327,7 @@
"source": [
"There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
"\n",
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialise values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialised values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
]
},
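A minimal sketch of what that spreadsheet is computing (the sizes are illustrative; in the chapter the factors are learned rather than left random):

```python
import torch

n_users, n_movies, n_factors = 10, 8, 5
user_factors  = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
predicted = user_factors @ movie_factors.t()   # one dot product per (user, movie) cell of the crosstab
```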
{
@ -366,7 +366,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When showing the data we would rather see movie titles than their ids. The table `u.item` contains the coorespondance id to title:"
"When showing the data we would rather see movie titles than their ids. The table `u.item` contains the correspondence id to title:"
]
},
{
@ -681,7 +681,7 @@
"source": [
"To calculate the result for a particular movie and use a combination we have two look up the index of the movie in our movie latent factors matrix, and the index of the user in our user latent factors matrix, and then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation which our deep learning models know how to do. They know how to do matrix products, and activation functions.\n",
"\n",
"It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. He is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:"
"It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. Here is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:"
]
},
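A short runnable sketch of that claim:

```python
import torch
import torch.nn.functional as F

vectors  = torch.randn(5, 3)                               # e.g. 5 latent-factor vectors of length 3
one_hot3 = F.one_hot(torch.tensor(3), num_classes=5).float()
assert torch.allclose(one_hot3 @ vectors, vectors[3])      # the product picks out row 3
```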
{
@ -736,7 +736,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had of done a matrix multiplication with a one hot encoded vector. This is called an *embedding*."
"If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one hot encoded vector. This is called an *embedding*."
]
},
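A minimal PyTorch sketch of that special layer:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5, embedding_dim=3)  # 5 items, 3 latent factors each
idx = torch.tensor([3])
print(emb(idx))  # same values as emb.weight[3], with gradients as if it were the one-hot matmul
```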
{
@ -752,7 +752,7 @@
"source": [
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n",
"\n",
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one use might particularly like. \n",
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one user might particularly like. \n",
"\n",
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n",
"\n",
@ -794,7 +794,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method is parameters. Note that the first parameter to any methods defined inside a class is `self`, so you can use this to set and get any attributes that you will need."
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method as parameters. Note that the first parameter to any method defined inside a class is `self`, so you can use this to set and get any attributes that you will need."
]
},
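A tiny illustrative example of the mechanics just described:

```python
class Example:
    def __init__(self, a):      # called automatically when Example(...) is constructed
        self.a = a              # 'self' is the new object, so we can store state on it

    def say(self, x):
        return f'Hello {self.a}, {x}.'

ex = Example('Sylvain')
print(ex.say('nice to meet you'))
```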
{
@ -1055,7 +1055,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations and others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n",
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n",
"\n",
"That's because at this point we only have weights; we do not have biases. If we have a single number for each user which we add to our scores, and ditto for each movie, then this will handle this missing piece very nicely. So first of all, let's adjust our model architecture:"
]
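A hedged sketch of the adjusted architecture, in the spirit of the model developed in this chapter (using fastai's `Module` and `Embedding`, with `sigmoid_range` squashing predictions into the rating range):

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.user_bias     = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])  # per-user and per-movie bias
        return sigmoid_range(res, *self.y_range)
```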
@ -1174,7 +1174,7 @@
"source": [
"Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n",
"\n",
"Why would it prevent overfitting? The idea is that the larger the coefficient are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is."
"Why would it prevent overfitting? The idea is that the larger the coefficients are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is."
]
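In code, the idea is just this (a hedged, self-contained sketch; in fastai you would instead pass something like `wd=0.1` to the fit methods):

```python
import torch

wd = 0.1                                        # illustrative weight-decay coefficient
parameters = torch.randn(10, requires_grad=True)
loss = parameters.pow(2).mean()                 # stand-in for the real model loss
loss_with_wd = loss + wd * (parameters ** 2).sum()  # add the sum of squared weights
loss_with_wd.backward()                         # gradients now include 2 * wd * parameters
```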
},
{
@ -1592,7 +1592,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Have a think about what this means. What this is saying is, that for these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth) they still generally don't like it. We could have simply sorted movies directly by the average rating, but looking at their learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people type like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:"
"Have a think about what this means. What this is saying is, that for these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth) they still generally don't like it. We could have simply sorted movies directly by the average rating, but looking at their learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people type do not like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:"
]
},
{
@ -1863,7 +1863,7 @@
"source": [
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
"\n",
"If there were two movies that were nearly identical, then there embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
"If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
]
},
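A minimal sketch of that nearest-neighbour idea with made-up numbers (the chapter does this with the learned movie embeddings):

```python
import torch

movie_emb = torch.randn(100, 50)                        # illustrative: 100 movies, 50 factors each
target = movie_emb[3]                                   # the movie we want neighbours for
dists = ((movie_emb - target) ** 2).sum(dim=1).sqrt()   # the 50-dimensional Pythagoras distance
closest = dists.argsort()[1]                            # index 0 is the movie itself
```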
{
@ -1894,7 +1894,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have succesfully trained a model, let's see how to deal whwn we have no data for a new user, to be able to make recommandations to them."
"Now that we have succesfully trained a model, let's see how to deal when we have no data for a new user, to be able to make recommendations to them."
]
},
{
@ -1912,7 +1912,7 @@
"\n",
"But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of the form *use your common sense*. You can start your new users such that they have the mean of all of the embedding vectors of your other users — although this has the problem that that particular combination of latent factors may be not at all common (for instance the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n",
"\n",
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix that they tend to ask you a few questions about what genres of movie or music that you like; this is how they come up with your initial collaborative filtering recommendations."
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations."
]
},
{

View File

@ -152,7 +152,7 @@
"\n",
"In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights and continuous activation values, which are updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n",
"\n",
"Is is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
"It is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
"\n",
"An example using this concatenation approach is how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in this figure from their paper:"
]
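To make the concatenation step concrete, here is a minimal PyTorch sketch of a tabular model that embeds each categorical column and concatenates the result with the raw continuous columns before the first dense layer. It illustrates the idea only; it is not fastai's actual `TabularModel`.

```python
import torch
from torch import nn

class ConcatTabularModel(nn.Module):
    "Embed categoricals, concatenate with continuous inputs, then apply dense layers."
    def __init__(self, cardinalities, emb_szs, n_cont, n_hidden, n_out):
        super().__init__()
        self.embeds = nn.ModuleList(
            nn.Embedding(card, sz) for card, sz in zip(cardinalities, emb_szs))
        self.layers = nn.Sequential(
            nn.Linear(sum(emb_szs) + n_cont, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_out))

    def forward(self, x_cat, x_cont):
        # one embedding lookup per categorical column, then a single concatenation
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        return self.layers(torch.cat(embs + [x_cont], dim=1))
```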
@ -9737,6 +9737,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1278,7 +1278,7 @@
"source": [
"As we have seen at the beginning of this chapter to train a state-of-the-art text classifier using transfer learning will take two steps: first we need to fine-tune our langauge model pretrained on Wikipedia to the corpus of IMDb reviews, then we can use that model to train a classifier.\n",
"\n",
"As usual, let's start with assemblng our data."
"As usual, let's start with assembling our data."
]
},
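For orientation, the two steps can be compressed into the following hedged sketch using the fastai high-level API; the exact DataLoaders construction and training schedules used later in the chapter differ, so treat this as an outline rather than the chapter's code.

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1: fine-tune the Wikipedia-pretrained language model on the IMDb corpus
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(1, 2e-2)
learn_lm.save_encoder('finetuned_enc')

# Step 2: train a classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned_enc')
learn_clas.fine_tune(1, 2e-2)
```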
{
@ -2157,7 +2157,7 @@
"source": [
"Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
"\n",
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create and compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create a compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
]
},
{
@ -2207,7 +2207,7 @@
"1. What is a language model?\n",
"1. Why is a language model considered self-supervised learning?\n",
"1. What are self-supervised models usually used for?\n",
"1. What do we fine-tune language models?\n",
"1. Why do we fine-tune language models?\n",
"1. What are the three steps to create a state-of-the-art text classifier?\n",
"1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?\n",
"1. What are the three steps to prepare your data for a language model?\n",

View File

@ -941,7 +941,7 @@
"source": [
"A *Siamese model* takes two images and has to determine if they are of the same class or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. We will explain here how to prepare the data for such a model, then we will train that model in <<chapter_arch_details>>.\n",
"\n",
"Firs things first, let's get the images in our dataset."
"First things first, let's get the images in our dataset."
]
},
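As a preview of what "preparing the data" means here, a Siamese sample is just a pair of images plus a Boolean label saying whether the two images share a class. A hypothetical helper (the filename-based labeling rule is an assumption about the pets dataset, not the book's actual pipeline) might look like:

```python
import re
from fastai.vision.all import *

def breed(fname):
    "Extract the breed from a pets filename such as 'great_pyrenees_173.jpg'."
    return re.match(r'^(.*)_\d+\.jpg$', fname.name).group(1)

def siamese_sample(f1, f2):
    "Return (image1, image2, same_breed?) for two image file paths."
    return PILImage.create(f1), PILImage.create(f2), breed(f1) == breed(f2)
```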
{

View File

@ -44,7 +44,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 words written out in English."
"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 numbers written out in English."
]
},
{
@ -717,7 +717,7 @@
"source": [
"Looking at the code for our RNN, one thing that seems problematic is that we are initialising our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order those samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. \n",
"\n",
"Another thing we can look at is havin more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
"Another thing we can look at is having more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
"\n",
"We'll see how we can implement those changes, starting with adding some state."
]
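A minimal sketch of both changes, under the assumption of a fixed batch size `bs`, is below: the hidden state is detached (rather than re-zeroed) between batches, and the linear output layer is applied to every timestep rather than only the last one.

```python
import torch
from torch import nn

class StatefulRNN(nn.Module):
    "Keep hidden state across batches and predict a word at every position."
    def __init__(self, vocab_sz, n_hidden, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(1, bs, n_hidden)    # persists across forward calls

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()                      # truncated backprop: keep values, drop history
        return self.h_o(res)                     # one prediction per input position
```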
@ -1572,7 +1572,7 @@
"\n",
"$$\\tanh(x) = \\frac{e^{x} + e^{-x}}{e^{x}-e^{-x}} = 2 \\sigma(2x) - 1$$\n",
"\n",
"where $\\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the left, ready for our next input. The new hidden state is also use as output, which is why the arrow splits to go up.\n",
"where $\\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the right, ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.\n",
"\n",
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram, but before this, notice how very little the cell state (on the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
"\n",
@ -1580,7 +1580,7 @@
"\n",
"The first gate (looking from the left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will have scalars between 0 and 1. We multiply this result by the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the ability to the LSTM to forget things about its longterm state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
"\n",
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it jsut decides which element of the cell state to update (valeus close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it just decides which element of the cell state to update (values close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
"\n",
"The last gate is the *output gate*. It will decides which information take in the cell state to generate the output. The cell state goes through a tanh before this and the output gate combined with the sigmoid decides which values to take inside it.\n",
"\n",
@ -1723,7 +1723,7 @@
" self.i_h = nn.Embedding(vocab_sz, n_hidden)\n",
" self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)\n",
" self.h_o = nn.Linear(n_hidden, vocab_sz)\n",
" self.h = [torch.zeros(2, bs, n_hidden) for _ in range(n_layers)]\n",
" self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(1)]\n",
" \n",
" def forward(self, x):\n",
" res,h = self.rnn(self.i_h(x), self.h)\n",
@ -1894,7 +1894,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTMs (or GRUs) cell make training easier than vanilla RNNs, but there are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases, it requires another model to generate random augmentation (by translating in another language and back to the language used for instance). Overall, data augmentation for text data is currently not a well explored space.\n",
"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTMs (or GRUs) cells make training easier than vanilla RNNs, but there are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases, it requires another model to generate random augmentation (by translating in another language and back to the language used for instance). Overall, data augmentation for text data is currently not a well explored space.\n",
"\n",
"However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182). This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state of the art results that previously required much more complicated models. They called an LSTM using these techniques an *AWD LSTM*. We'll look at each of these techniques in turn."
]
@ -2039,7 +2039,7 @@
" self.drop = nn.Dropout(p)\n",
" self.h_o = nn.Linear(n_hidden, vocab_sz)\n",
" self.h_o.weight = self.i_h.weight\n",
" self.h = [torch.zeros(2, bs, n_hidden) for _ in range(n_layers)]\n",
" self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(1)]\n",
" \n",
" def forward(self, x):\n",
" raw,h = self.rnn(self.i_h(x), self.h)\n",
@ -2090,7 +2090,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can the train the model, and add additional regularization by increasing the weight decay to `0.1`:"
"We can then train the model, and add additional regularization by increasing the weight decay to `0.1`:"
]
},
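A hedged sketch of that training call (the module name `LMModel7`, the `dls` and `vocab` objects, and the hyperparameters are assumptions based on the surrounding chapter, not shown in this diff) would be:

```python
from fastai.text.all import *

# Assumed names: `dls` (human numbers DataLoaders), `vocab`, and `LMModel7`
# (the dropout-regularized LSTM defined in the cell above).
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(15, 1e-2, wd=0.1)   # wd=0.1 adds the extra weight decay
```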
{
@ -2257,7 +2257,7 @@
"- weight dropout (applied to the weights of the LSTM at each training step)\n",
"- hidden dropout (applied to the hidden state between two layers)\n",
"\n",
"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, so we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
"\n",
"Another architecture that is very powerful, especially in \"sequence to sequence\" problems (that is, problems where the dependent variable is itself a variable length sequence, such as language translation), is the Transformers architecture. You can find it in an online bonus chapter on the book website."
]
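In practice, that scaling looks like the following one-liner (a sketch; `dls_lm` is assumed to be a language-model DataLoaders as built earlier in the book):

```python
from fastai.text.all import *

# drop_mult multiplies all five AWD-LSTM dropout probabilities at once
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=Perplexity())
```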
@ -2274,7 +2274,7 @@
"metadata": {},
"source": [
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
"1. Why do we concatenating the documents in our dataset before creating a language model?\n",
"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
@ -2347,6 +2347,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1567,6 +1567,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1080,6 +1080,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -304,10 +304,25 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -8340,6 +8340,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1504,7 +1504,7 @@
"1. What is a language model?\n",
"1. Why is a language model considered self-supervised learning?\n",
"1. What are self-supervised models usually used for?\n",
"1. What do we fine-tune language models?\n",
"1. Why do we fine-tune language models?\n",
"1. What are the three steps to create a state-of-the-art text classifier?\n",
"1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?\n",
"1. What are the three steps to prepare your data for a language model?\n",

View File

@ -1154,7 +1154,7 @@
" self.i_h = nn.Embedding(vocab_sz, n_hidden)\n",
" self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)\n",
" self.h_o = nn.Linear(n_hidden, vocab_sz)\n",
" self.h = [torch.zeros(2, bs, n_hidden) for _ in range(n_layers)]\n",
" self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(1)]\n",
" \n",
" def forward(self, x):\n",
" res,h = self.rnn(self.i_h(x), self.h)\n",
@ -1362,7 +1362,7 @@
" self.drop = nn.Dropout(p)\n",
" self.h_o = nn.Linear(n_hidden, vocab_sz)\n",
" self.h_o.weight = self.i_h.weight\n",
" self.h = [torch.zeros(2, bs, n_hidden) for _ in range(n_layers)]\n",
" self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(1)]\n",
" \n",
" def forward(self, x):\n",
" raw,h = self.rnn(self.i_h(x), self.h)\n",
@ -1553,7 +1553,7 @@
"metadata": {},
"source": [
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
"1. Why do we concatenating the documents in our dataset before creating a language model?\n",
"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
@ -1626,6 +1626,18 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,