From c9441118d865dc37323d92755a293c8db6b264f5 Mon Sep 17 00:00:00 2001 From: Joe Bender Date: Thu, 16 Apr 2020 16:50:18 -0400 Subject: [PATCH 1/6] Writing and formatting fixes --- 01_intro.ipynb | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/01_intro.ipynb b/01_intro.ipynb index 3cdd86b..1971547 100644 --- a/01_intro.ipynb +++ b/01_intro.ipynb @@ -1575,7 +1575,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> important: When you train a model, you must **always** have both a training set, and a validation set, and must measure the accuracy of your model only on the validation set. If you train for too long, with not enough data, you will see the accuracy of your model start to get worse; this is called **over-fitting**. fastai defaults `valid_pct` to `0.2`, so even if you forget, fastai will create a validation set for you!" + "> important: When you train a model, you must **always** have both a training set and a validation set, and must measure the accuracy of your model only on the validation set. If you train for too long, with not enough data, you will see the accuracy of your model start to get worse; this is called **over-fitting**. fastai defaults `valid_pct` to `0.2`, so even if you forget, fastai will create a validation set for you!" ] }, { @@ -1715,7 +1715,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As you can see by looking at the right-hand side of this picture, the features are now able to identify and match with higher levels semantic components, such as car wheels, text, and flower petals. Using these components, layers four and five can identify even higher-level concepts, as shown in <>." + "As you can see by looking at the right-hand side of this picture, the features are now able to identify and match with higher-level semantic components, such as car wheels, text, and flower petals. Using these components, layers four and five can identify even higher-level concepts, as shown in <>." ] }, { @@ -1775,7 +1775,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another interesting fast.ai student project example comes from Gleb Esman. He was working on fraud detection at Splunk, and was working with a dataset of users' mouse movements and mouse clicks. He turned these into pictures by drawing an image where the position, speed and acceleration of the mouse was displayed using coloured lines, and the clicks were displayed using [small coloured circles](https://www.splunk.com/en_us/blog/security/deep-learning-with-splunk-and-tensorflow-for-security-catching-the-fraudster-in-neural-networks-with-behavioral-biometrics.html) as shown in <>. He then fed this into an image recognition model just like the one we've shown in this chapter, and it worked so well that had led to a patent for this approach to fraud analytics!" + "Another interesting fast.ai student project example comes from Gleb Esman. He was working on fraud detection at Splunk, and was working with a dataset of users' mouse movements and mouse clicks. He turned these into pictures by drawing an image where the position, speed and acceleration of the mouse was displayed using coloured lines, and the clicks were displayed using [small coloured circles](https://www.splunk.com/en_us/blog/security/deep-learning-with-splunk-and-tensorflow-for-security-catching-the-fraudster-in-neural-networks-with-behavioral-biometrics.html) as shown in <>. 
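Esman's own code isn't shown in the write-up, but the recipe is easy to sketch. The snippet below is only an assumed illustration of the idea (the input names, colour map, and image size are assumptions, not Splunk's pipeline): plot the trajectory coloured by speed, overlay the clicks as small circles, and save the figure as an ordinary image a CNN can consume.

```python
# Minimal sketch (assumed inputs, not Esman's actual code).
# xs, ys, speeds: per-sample mouse coordinates and speeds; clicks: list of (x, y) points.
import matplotlib.pyplot as plt

def session_to_image(xs, ys, speeds, clicks, out="session.png"):
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # roughly a 224x224 pixel image
    ax.axis("off")
    ax.scatter(xs, ys, c=speeds, cmap="viridis", s=1)       # trajectory, coloured by speed
    if clicks:
        cx, cy = zip(*clicks)
        ax.scatter(cx, cy, color="red", s=15)                # clicks as small coloured circles
    fig.savefig(out, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```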
He then fed this into an image recognition model just like the one we've shown in this chapter, and it worked so well that it led to a patent for this approach to fraud analytics!" ] }, { @@ -1817,7 +1817,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As you can see, the different types of malware look very distinctive to the human eye. The model they train based on this image representation was more accurate at malware classification than any previous approach shown in the academic literature. This suggests a good rule of thumb for converting a dataset into an image representation: if the human eye can recognize categories from the images, then a deep learning model should be able to do so too.\n", + "As you can see, the different types of malware look very distinctive to the human eye. The model they trained based on this image representation was more accurate at malware classification than any previous approach shown in the academic literature. This suggests a good rule of thumb for converting a dataset into an image representation: if the human eye can recognize categories from the images, then a deep learning model should be able to do so too.\n", "\n", "In general, you'll find that a small number of general approaches in deep learning can go a long way, if you're a bit creative in how you represent your data! You shouldn't think of approaches like the above as \"hacky workarounds\", since actually they often (as here) beat previously state of the art results. These really are the right way to think about these problem domains." ] @@ -1872,7 +1872,7 @@ "\n", "In order to make the training process go faster, we might start with a *pretrained model*, a model which has already been trained on someone else's data. We then adapt it to our data by training it a bit more on our data, a process called *fine tuning*.\n", "\n", - "When we train a model, a key concern is to ensure that our model *generalizes* -- that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called *overfitting*. In order to avoid this, we always divide our data into two parts, the *training set* and the *validation set*. We train the model by showing it only the *training set* and then we evaluate how well the model is doing by seeing how well it predicts on items from the *validation set* . In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order for a person to assess how well the model is doing on the validation set overall, we define a *metric* . During the training process, when the model has seen every item in the training set, we call that an *epoch*.\n", + "When we train a model, a key concern is to ensure that our model *generalizes* -- that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called *overfitting*. 
In order to avoid this, we always divide our data into two parts, the *training set* and the *validation set*. We train the model by showing it only the *training set* and then we evaluate how well the model is doing by seeing how well it predicts on items from the *validation set*. In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order for a person to assess how well the model is doing on the validation set overall, we define a *metric*. During the training process, when the model has seen every item in the training set, we call that an *epoch*.\n", "\n", "All these concepts apply to machine learning in general. That is, they apply to all sorts of schemes for defining a model by training it with data. What makes deep learning distinctive is a particular class of architectures, the architectures based on *neural networks*. In particular, tasks like image classification rely heavily on *convolutional neural networks*, which we will discuss shortly." ] @@ -2181,14 +2181,14 @@ "learn.fine_tune(4, 1e-2)\n", "```\n", "\n", - "This reduces the batch size to 32 (we will explain this later). If you keep hitting the same error, change 32 by 16." + "This reduces the batch size to 32 (we will explain this later). If you keep hitting the same error, change 32 to 16." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This model is using the IMDb dataset from the paper [Learning Word Vectors for Sentiment Analysis]((https://ai.stanford.edu/~amaas/data/sentiment/)). It works well with movie reviews of many thousands of words. But let's test it out on a very short one, to see it does its thing:" + "This model is using the IMDb dataset from the paper [Learning Word Vectors for Sentiment Analysis](https://ai.stanford.edu/~amaas/data/sentiment/). It works well with movie reviews of many thousands of words. But let's test it out on a very short one, to see it do its thing:" ] }, { @@ -2234,7 +2234,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Sidebar: The order matter" + "### Sidebar: The order matters" ] }, { @@ -2276,7 +2276,7 @@ "doc(learn.predict)\n", "```\n", "\n", - "This will make a small window pop with content like this:\n", + "This will make a small window pop up with content like this:\n", "\n", "" ] @@ -2656,7 +2656,7 @@ "\n", "Most datasets used in this book took the creators a lot of work to build. For instance, later in the book we’ll be showing you how to create a model that can translate between French and English. The key input to this is a French/English parallel text corpus prepared back in 2009 by Professor Chris Callison-Burch of the University of Pennsylvania. This dataset contains over 20 million sentence pairs in French and English. He built the dataset in a really clever way: by crawling millions of Canadian web pages (which are often multi-lingual) and then using a set of simple heuristics to transform French URLs onto English URLs.\n", "\n", - "As you look at datasets throughout this book, think about where they might have come from, and how they might have been curated. Then, think about what kinds of interesting dataset you could create for your own projects. (We’ll even take you step by step through the process of creating your own image dataset soon.)\n", + "As you look at datasets throughout this book, think about where they might have come from, and how they might have been curated. 
Then, think about what kinds of interesting datasets you could create for your own projects. (We’ll even take you step by step through the process of creating your own image dataset soon.)\n", "\n", "fast.ai has spent a lot of time creating cutdown versions of popular datasets that are specially designed to support rapid prototyping and experimentation, and to be easier to learn with. In this book we will often start by using one of the cutdown versions, and we later on scale up to the full-size version (just as we're doing in this chapter!) In fact, this is how the world’s top practitioners do their modelling projects in practice; they do most of their experimentation and prototyping with subsets of their data, and only use the full dataset when they have a good understanding of what they have to do." ] @@ -2672,7 +2672,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Each of the models we trained showed a training and validation loss. A good validation set is one of the most important pieces of your training, let's see why and learn how to create one." + "Each of the models we trained showed a training and validation loss. A good validation set is one of the most important pieces of your training. Let's see why and learn how to create one." ] }, { @@ -2690,7 +2690,7 @@ "\n", "It is in order to avoid this that our first step was to split our dataset into two sets, the *training set* (which our model sees in training) and the *validation set*, also known as the *development set* (which is used only for evaluation). This lets us test that the model learns lessons from the training data which generalize to new data, the validation data.\n", "\n", - "One way to understand this situation is that, in a sense, we don't want our model to get good results by \"cheating\". If it predicts well on a data item, that should be because it has learned principles that govern that kind of item, and not because the model has been shaped by *actually having seeing that particular item*.\n", + "One way to understand this situation is that, in a sense, we don't want our model to get good results by \"cheating\". If it predicts well on a data item, that should be because it has learned principles that govern that kind of item, and not because the model has been shaped by *actually having seen that particular item*.\n", "\n", "Splitting off our validation data means our model never sees it in training, and so is completely untainted by it, and is not cheating in any way. Right?\n", "\n", From e93cdeb32805c5044a39051fe5202354989d1750 Mon Sep 17 00:00:00 2001 From: Ben Mainye Date: Fri, 17 Apr 2020 06:47:13 +0300 Subject: [PATCH 2/6] Update 01_intro.ipynb remove "s" in example. --- 01_intro.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/01_intro.ipynb b/01_intro.ipynb index 8be9f17..0b36b98 100644 --- a/01_intro.ipynb +++ b/01_intro.ipynb @@ -1789,7 +1789,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another examples comes from the paper [Malware Classification with Deep Convolutional Neural Networks](https://ieeexplore.ieee.org/abstract/document/8328749) which explains that \"the malware binary file is divided into 8-bit sequences which are then converted to equivalent decimal values. This decimal vector is reshaped and gray-scale image is generated that represent the malware sample\", like in <>." 
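The paper's preprocessing code isn't reproduced here, but the transformation it describes can be sketched in a few lines. This is an assumed illustration (the image width and choice of libraries are assumptions, not the paper's): read the binary as unsigned 8-bit integers, reshape the resulting decimal vector into rows, and save it as a grayscale image.

```python
# Assumed sketch of the bytes-to-grayscale idea described above (not the paper's code)
import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)  # 8-bit sequences -> 0..255
    rows = len(data) // width
    pixels = data[: rows * width].reshape(rows, width)             # reshape the decimal vector
    return Image.fromarray(pixels, mode="L")                       # "L" = 8-bit grayscale
```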
+ "Another example comes from the paper [Malware Classification with Deep Convolutional Neural Networks](https://ieeexplore.ieee.org/abstract/document/8328749) which explains that \"the malware binary file is divided into 8-bit sequences which are then converted to equivalent decimal values. This decimal vector is reshaped and gray-scale image is generated that represent the malware sample\", like in <>." ] }, { From 810f6075dbbbe08ed9d4b4f7c54d0a605fe2f855 Mon Sep 17 00:00:00 2001 From: holynec Date: Thu, 16 Apr 2020 20:53:12 -0700 Subject: [PATCH 3/6] grammar changes --- 05_pet_breeds.ipynb | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/05_pet_breeds.ipynb b/05_pet_breeds.ipynb index 495edcf..ff25ea9 100644 --- a/05_pet_breeds.ipynb +++ b/05_pet_breeds.ipynb @@ -77,7 +77,7 @@ "- Individual files representing items of data, such as text documents or images, possibly organised into folders or with filenames representing information about those items, or\n", "- A table of data, such as in CSV format, where each row is an item, each row which may include filenames providing a connection between the data in the table and data in other formats such as text documents and images.\n", "\n", - "There are exceptions to these rules, particularly in domains such as genomics, where there can be binary database formats or even network streams, but overall the vast majority of the datasets your work with use some combination of the above two formats.\n", + "There are exceptions to these rules, particularly in domains such as genomics, where there can be binary database formats or even network streams, but overall the vast majority of the datasets you work with use some combination of the above two formats.\n", "\n", "To see what is in our dataset we can use the ls method:" ] @@ -1264,7 +1264,7 @@ "\n", " log(a*b) = log(a)+log(b)\n", "\n", - "When we see it in that format looks a bit boring; but have a think about what this really means. It means that logarithms increase linearly when the underlying signal increases exponentially or multiplicatively. This is used for instance in the Richter scale of earthquake severity, and the dB scale of noise levels. It's also often used on financial charts, where we want to show compound growth rates more clearly. Computer scientists love using logarithms, because it means that modification, which can create really really large and really really small numbers, can be replaced by addition, which is much less likely to result in scales which are difficult for our computer to handle." + "When we see it in that format, it looks a bit boring; but have a think about what this really means. It means that logarithms increase linearly when the underlying signal increases exponentially or multiplicatively. This is used for instance in the Richter scale of earthquake severity, and the dB scale of noise levels. It's also often used on financial charts, where we want to show compound growth rates more clearly. Computer scientists love using logarithms, because it means that modification, which can create really really large and really really small numbers, can be replaced by addition, which is much less likely to result in scales which are difficult for our computer to handle." ] }, { @@ -2017,7 +2017,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This has improved our model a bit, but there's more we can do. 
The deepest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for those, something called discriminative learning rates." + "This has improved our model a bit, but there's more we can do. The shallowest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for those, something called discriminative learning rates." ] }, { @@ -2532,6 +2532,18 @@ "display_name": "Python 3", "language": "python", "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" } }, "nbformat": 4, From e2828ed61db8e39e6122dcd424892142727e1c9c Mon Sep 17 00:00:00 2001 From: SOVIETIC-BOSS88 Date: Fri, 17 Apr 2020 11:03:35 +0200 Subject: [PATCH 4/6] Update 05_pet_breeds.ipynb Grammar change: that we working with -->> that we are working with --- 05_pet_breeds.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/05_pet_breeds.ipynb b/05_pet_breeds.ipynb index 495edcf..d6c9bc3 100644 --- a/05_pet_breeds.ipynb +++ b/05_pet_breeds.ipynb @@ -53,7 +53,7 @@ "source": [ "In our very first model we learnt how to classify dogs versus cats. Just a few years ago this was considered a very challenging task. But today, it is far too easy! We will not be able to show you the nuances of training models with this problem, because we get the nearly perfect result without worrying about any of the details. But it turns out that the same dataset also allows us to work on a much more challenging problem: figuring out what breed of pet is shown in each image.\n", "\n", - "In the first chapter we presented the applications as already solved problems. But this is not how things work in real life. We start with some dataset which we know nothing about. We have to understand how it is put together, how to extract the data we need from it, and what that data looks like. For the rest of this book we will be showing you how to solve these problems in practice, including all of these intermediate steps necessary to understand the data that we working with and test our modelling as we go.\n", + "In the first chapter we presented the applications as already solved problems. But this is not how things work in real life. We start with some dataset which we know nothing about. We have to understand how it is put together, how to extract the data we need from it, and what that data looks like. For the rest of this book we will be showing you how to solve these problems in practice, including all of these intermediate steps necessary to understand the data that we are working with and test our modelling as we go.\n", "\n", "We have already downloaded the pets dataset. 
We can get a path to this dataset using the same code we saw in <>:" ] From 3ba44079f7f2da669df1b0c202f79a92a9bced49 Mon Sep 17 00:00:00 2001 From: alvarotap <61787129+alvarotap@users.noreply.github.com> Date: Fri, 17 Apr 2020 12:20:29 +0200 Subject: [PATCH 5/6] Some minor typos in chapter 11 --- 11_midlevel_data.ipynb | 31 ++++--------------------------- 1 file changed, 4 insertions(+), 27 deletions(-) diff --git a/11_midlevel_data.ipynb b/11_midlevel_data.ipynb index ffba7ce..4302786 100644 --- a/11_midlevel_data.ipynb +++ b/11_midlevel_data.ipynb @@ -61,7 +61,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The factory method `TextDataLoaders.from_folder` is very convenient when your data is arranged the exact same way as the IMDb dataset, but in practice, that often won't be the case. The data block API offers more flexibility. As we saw in the last chapter, we can ge the same result with:" + "The factory method `TextDataLoaders.from_folder` is very convenient when your data is arranged the exact same way as the IMDb dataset, but in practice, that often won't be the case. The data block API offers more flexibility. As we saw in the last chapter, we can get the same result with:" ] }, { @@ -147,29 +147,6 @@ "toks[0]" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "tensor([ 2, 8, 20, 27, 11, 88, 18, 53, 3286, 45])" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "num = Numericalize()\n", - "num.setup(toks)\n", - "nums = toks.map(num)\n", - "nums[0][:10]" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1079,7 +1056,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here the resize transform is applied to each of the two images, but not the boolean flag. Even if we have a custom type, we can thus benefit form all the data augmentation transforms inside the library.\n", + "Here the resize transform is applied to each of the two images, but not the boolean flag. Even if we have a custom type, we can thus benefit from all the data augmentation transforms inside the library.\n", "\n", "We are now ready to build the `Transform` that we will use to get our data ready for a Siamese model. First, we will need a function to determine the class of all our images:" ] @@ -1098,7 +1075,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Then here is our main transform. For each image, il will, with a probability of 0.5, draw an image from the same class and return a `SiameseImage` with a true label, or draw an image from another class and a return a `SiameseImage` with a false label. This is all done in the private `_draw` function. There is one difference between the training and validation set, which is why the transform needs to be initialized with the splits: on the training set, we will make that random pick each time we read an image, whereas on the validation set, we make this random pick once and for all at initialization. This way, we get more varied samples during training, but always the same validation set." + "Then here is our main transform. For each image, il will, with a probability of 0.5, draw an image from the same class and return a `SiameseImage` with a true label, or draw an image from another class and return a `SiameseImage` with a false label. This is all done in the private `_draw` function. 
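To make that pairing step concrete, here is a stripped-down sketch of the idea behind `_draw`, written as a plain function rather than a fastai `Transform`. The argument names (`files`, `label_func`) are stand-ins for the notebook's own variables, so treat this as an outline of the logic, not the notebook's implementation.

```python
# Condensed sketch of the pairing logic (assumed helper, not the notebook's exact code)
import random

def draw_pair(fname, files, label_func):
    "Pick a second image: same class with probability 0.5, otherwise a different class."
    same = random.random() < 0.5
    cls = label_func(fname)
    candidates = [f for f in files if (label_func(f) == cls) == same and f != fname]
    return random.choice(candidates), same   # (other image, boolean label)
```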
There is one difference between the training and validation set, which is why the transform needs to be initialized with the splits: on the training set, we will make that random pick each time we read an image, whereas on the validation set, we make this random pick once and for all at initialization. This way, we get more varied samples during training, but always the same validation set." ] }, { @@ -1220,7 +1197,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "And we have can now train a model using those `DataLoaders`. It needs a bit more customization than the usual model provided by `cnn_learner` since it has to take two images instead of one. We will see how to create such a model and train it in <>." + "And we can now train a model using those `DataLoaders`. It needs a bit more customization than the usual model provided by `cnn_learner` since it has to take two images instead of one. We will see how to create such a model and train it in <>." ] }, { From 70215cb29bbfff8ccda6747dab6dd1f29d47aad8 Mon Sep 17 00:00:00 2001 From: SOVIETIC-BOSS88 Date: Fri, 17 Apr 2020 21:17:44 +0200 Subject: [PATCH 6/6] Update 05_pet_breeds.ipynb Changed: - and then an_character -->> and then an _ character - same linguist Noam Chomskey -->> same linguist Noam Chomsky --- 05_pet_breeds.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/05_pet_breeds.ipynb b/05_pet_breeds.ipynb index 495edcf..0f9dbc0 100644 --- a/05_pet_breeds.ipynb +++ b/05_pet_breeds.ipynb @@ -145,7 +145,7 @@ "source": [ "Most functions and methods in fastai which return a collection use a class called `L`. `L` can be thought of as an enhanced version of the ordinary Python `list` type, with added conveniences for common operations. For instance, when we display an object of this class in a notebook it appears in the format you see above. The first thing that is shown is the number of items in the collection, prefixed with a `#`. You'll also see in the above output that the list is suffixed with a \"…\". This means that only the first few items are displayed — which is a good thing, because we would not want more than 7000 filenames on our screen!\n", "\n", - "By examining these filenames, we see how they appear to be structured. Each file name contains the pet breed, and then an_character, a number, and finally the file extension. We need to create a piece of code that extracts the breed from a single `Path`. Jupyter notebook makes this easy, because we can gradually build up something that works, and then use it for the entire dataset. We do have to be careful to not make too many assumptions at this point. For instance, if you look carefully you may notice that some of the pet breeds contain multiple words, so we cannot simply break at the first `_` character that we find. To allow us to test our code, let's pick out one of these filenames:" + "By examining these filenames, we see how they appear to be structured. Each file name contains the pet breed, and then an _ character, a number, and finally the file extension. We need to create a piece of code that extracts the breed from a single `Path`. Jupyter notebook makes this easy, because we can gradually build up something that works, and then use it for the entire dataset. We do have to be careful to not make too many assumptions at this point. For instance, if you look carefully you may notice that some of the pet breeds contain multiple words, so we cannot simply break at the first `_` character that we find. 
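A one-line experiment shows why (the filename is an assumed but typical example from this dataset): splitting on the first `_` keeps only the first word of a multi-word breed.

```python
# Assumed example filename; a naive split loses most of a multi-word breed name
"american_pit_bull_terrier_31.jpg".split("_")[0]   # -> 'american'
```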
To allow us to test our code, let's pick out one of these filenames:" ] }, { @@ -167,7 +167,7 @@ "\n", "We do not have the space to give you a complete regular expression tutorial here, particularly because there are so many excellent ones online. And we know that many of you will already be familiar with this wonderful tool. If you're not, that is totally fine — this is a great opportunity for you to rectify that! We find that regular expressions are one of the most useful tools in our programming toolkit, and many of our students tell us that it is one of the things they are most excited to learn about. So head over to Google and search for *regular expressions tutorial* now, and then come back here after you've had a good look around. The book website also provides a list of our favorites.\n", "\n", - "> a: Not only are regular expressions dead handy, they also have interesting roots. They are \"regular\" because they were originally examples of a \"regular\" language, the lowest rung within the \"Chomsky hierarchy\", a grammar classification due to the same linguist Noam Chomskey who wrote _Syntactic Structures_, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a space ship.\n", + "> a: Not only are regular expressions dead handy, they also have interesting roots. They are \"regular\" because they were originally examples of a \"regular\" language, the lowest rung within the \"Chomsky hierarchy\", a grammar classification due to the same linguist Noam Chomsky who wrote _Syntactic Structures_, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a space ship.\n", "\n", "When you are writing a regular expression, the best way to start is just to try it against one example at first. Let's use the `findall` method to try a regular expression against the filename of the `fname` object:" ]
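As a preview, with an assumed example filename (the actual `fname` comes from the listing above, and the book's exact pattern may differ), such a call looks like this:

```python
# Sketch of the findall check with an assumed filename
import re
re.findall(r"(.+)_\d+.jpg$", "great_pyrenees_173.jpg")   # -> ['great_pyrenees']
```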