"In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify those reviews. One thing is a bit different from the transfer learning we have in computer vision: the pretrained model was not trained on the same task as the model we used to classify reviews.\n",
"\n",
"What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to get an understanding of the English-- or other--language. Self-supervised learning can also be used in other domains; for instance, see [Self-supervised learning and computer vision](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pre-training a model used for transfer learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For instance, training a model to predict the next word in a text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better: the Wikipedia English is slightly different from the IMDb English. So instead of jumping directly to the classifier, we could finetune our pretrained language model to the IMDb corpus and *then* use that as the base for our classifier.\n",
"\n",
"Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targetting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of IMDb, there will be lots of names of movie directors and actors, and often a less formal style of language that seen in Wikipedia.\n",
"\n",
"We saw that with fastai, we can download a pre-trained language model for English, and use it to get state-of-the-art results for NLP classification. (We expect pre-trained models in many more languages to be available soon — they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?\n",
"\n",
"One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine tune the (sequence-based) language model prior to fine tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. So that is 100,000 movie reviews altogether (since there are also 25,000 labelled reviews in the training set, and 25,000 in the validation set). We can use all 100,000 of these reviews to fine tune the pretrained language model — this will result in a language model that is particularly good at predicting the next word of a movie review. In contrast, the pretrained model was trained only on Wikipedia articles.\n",
"The [ULMFiT paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of language model fine tuning, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarised in <<ulmfit_process>>."
"Now have a think about how you would turn this language modelling problem into a neural network, given what you have learned so far. We'll be able to use concepts that we've seen in the last two chapters."
"It's not at all obvious how we're going to use what we've learned so far to build a language model. Sentences can be different lengths, and documents can be very long. So, how can we predict the next word of a sentence using a neural network? Let's find out!\n",
"\n",
"We've already seen how categorical variables can be used as independent variables for a neural network. The approach we took for a single categorical variable was to:\n",
"\n",
"1. Make a list of all possible levels of that categorical variable (let us call this list the *vocab*)\n",
"1. Replace each level with its index in the vocab\n",
"1. Create an embedding matrix for this containing a row for each level (i..e, for each item of the vocab)\n",
"1. Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step two; this is equivalent to, but faster and more efficient, than a matrix which takes as input one-hot encoded vectors representing the indexes)\n",
"\n",
"We can do nearly the same thing with text! What is new is the idea of a sequence. First we concatenate all of the documents in our dataset into one big long string and split it into words, giving us a very long list of words. Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second last, and our dependent variable would be the sequence of words starting with the second word and ending with the last word. \n",
"\n",
"When creating our vocab, we will have very common words that will probably be in the vocabulary of our pretrained model, but we will also have new words specific to our corpus (cinematographic terms, or actor names for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of this pretrained model; but for new words, we won't have anything, so we will just initialize the corresponding row with a random vector."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:\n",
"- **Tokenization**:: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
"- **Numericalization**:: make a list of all of the unique words which appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
"- **Language model data loader** creation:: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable which is offset from the independent variable buy one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
"- **Language model** creation:: we need a special kind of model which does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network*. We will get to the details of this in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
"When we said, *convert the text into a list of words*, we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like \"don't\"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Poland where we can create really long words from many, many pieces? What about languages like Japanese and Chinese which don't use bases at all, and don't really have a well-defined idea of *word*?\n",
"\n",
"Because there is no one correct answer to these questions, there is no one approach to tokenization. Each element of the list created by the tokenisation process is called a *token*. There are three main approaches:\n",
"- **Word-based**:: split a sentence on spaces, as well as applying language specific rules to try to separate parts of meaning, even when there are no spaces, such as turning \"don't\" into \"do n't\". Generally, punctuation marks are also split into separate tokens\n",
"- **Subword based**:: split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokeniser as \"o c ca sion\"\n",
"- **Character-based**:: split a sentence into its individual characters.\n",
"We'll be looking at word and subword tokenization here, and we'll leave character-based tokenization for you to implement in the questionnaire at the end of this chapter."
"Rather than providing its own tokenizers, fastai instead provides a consistent interface to a range of tokenisers in external libraries. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time, so the defaults that fastai uses change too. However, the API and options shouldn't change too much, since fastai tries to maintain a consistent API even as the underlying technology changes.\n",
"\n",
"Let's try it out with the IMDb dataset that we used in <<chapter_intro>>:"
"We'll need to grab the text files in order to try out a tokenizer. Just like `get_image_files`, which we've used many times already, gets all the image files in a path, `get_text_files` gets all the text files in a path. We can also optionally pass `folders` to restrict the search to a particular list of subfolders:"
"As we write this book, the default *English word tokenizer* for fastai uses a library called *spaCy*. This uses a sophisticated rules engine that has special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not always be Spacy, depending when you're reading this).\n",
"\n",
"Let's try it out. We'll use fastai's `coll_repr(collection,n)` function to display the results; this displays the first `n` items of `collection`, along with the full size--it's what `L` uses by default. Not that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
"As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense--these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. spaCy handles these for us, for instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
"There are now some tokens added that start with the characters \"xx\", which is not a common word prefix in English. These are *special tokens*.\n",
"\n",
"For example, the first item in the list, \"xxbos\", is a special token that indicates the start of a new text (\"BOS\" is a standard NLP acronym which means \"beginning of stream\"). By recognizing this start token, the model will be able to learn it needs to \"forget\" what was said previously and focus on upcoming words.\n",
"\n",
"These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognise the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenised language, a language which is designed to be easy for a model to learn.\n",
"\n",
"For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalised word will be replaced with a special capitalisation token, followed by the lower case version of the word. This way, the embedding matrix only needs the lower case version of the words, saving compute and memory, but can still learn the concept of capitalisation.\n",
"\n",
"Here are some of the main special tokens you'll see:\n",
"\n",
"- xxbos:: indicates the beginning of a text (here a review)\n",
"- xxmaj:: indicates the next word begins with a capital (since we lower-cased everything)\n",
"- xxunk:: indicates the next word is unknown\n",
"\n",
"To see the rules that were used, you can check the default rules:"
"- `fix_html`:: replace special HTML characters by a readable version (IMDb reviwes have quite a few of them for instance) ;\n",
"- `replace_rep`:: replace any character repeated three times or more by a special token for repetition (xxrep), the number of times it's repeated, then the character ;\n",
"- `replace_wrep`:: replace any word repeated three times or more by a special token for word repetition (xxwrep), the number of times it's repeated, then the word ;\n",
"- `spec_add_spaces`:: add spaces around / and # ;\n",
"- `rm_useless_spaces`:: remove all repetitions of the space character ;\n",
"- `replace_all_caps`:: lowercase a word written in all caps and adds a special token for all caps (xxcap) in front of it ;\n",
"- `replace_maj`:: lowercase a capitalized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
"- `lowercase`:: lowercase all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)."
"In addition to the *word tokenization* approach seen in the last section, another popular tokenization method is *subword tokenization*. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 (which means \"My name is Jeremy Howard\" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a \"word\". There are also languages, like Turkish and Hungarian, which can add many bits together without spaces, to create very long words which include a lot of separate pieces of information.\n",
"We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to \"train\" it. That is, we need to have it read our documents, and find the common sequences of characters, to create the vocab. This is done with `setup`. As we'll see shortly, `setup` is a special fastai method that is called automatically in our usual data processing pipelines. Since we're doing everything manually at the moment, however, we have to call it ourselves. Here's a function that does these steps for a given vocab size, and shows an example output:"
"'▁This ▁movie , ▁which ▁I ▁just ▁dis c over ed ▁at ▁the ▁video ▁st or e , ▁has ▁a p par ent ly ▁s it ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁dis t ri but or . ▁It'"
"On the other hand, if we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many to represent a sentence:"
"Picking a subword vocab size represents a compromise: a larger vocab means more fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.\n",
"\n",
"Overall, subword tokenization provides a way to easily scale between character tokenization (i.e. use a small subword vocab) and word tokenization (i.e. use a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other \"languages\" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)"
"Numericalization is the process of mapping tokens to integers. It's basically identical to the steps necessary to create a `Category` variable, such as the dependent variable of digits in MNIST:\n",
"\n",
"1. Make a list of all possible levels of that categorical variable (the *vocab*)\n",
"1. Replace each level with its index in the vocab\n",
"\n",
"We'll take a look at this in action on the word-tokenized text we saw earlier:"
"Just like `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the `vocab`. That means we'll need our tokenized corpus first. Since tokenization takes a while, it's done in parallel by fastai; but for this manual walk-thru, we'll use a small subset:"
"Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults to `Numericalize` are `min_freq=3,max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the most common 60000 with a special *unknown word* token `xxunk`. This is useful to avoid having an overly large embedding matrix, since that can slow down training, use up too much memory, and can also mean that there isn't enough data to train useful representations for rare words. However, this last issue is better handled by setting `min_freq`; the default `min_freq=3` means that any word appearing less than three times is replaced with `xxunk`.\n",
"\n",
"Fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the `vocab` parameter.\n",
"\n",
"Once we've created our `Numericalize` object, we can use it as if it's a function:"
"### Putting our texts into batches for a language model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When dealing with images, we needed to resize them all to the same height and width before grouping them together in a mini-batch so they could stack together efficiently in a single tensor. Here it's going to be a little different, because one cannot simply resize text to a desired length. Also, we want our language model to read text in order, so that it can efficiently predict what the next word is. All the difficulty of a language model loader is that each new batch should begin precisely where the previous left off.\n",
"\n",
"Let's start with an example and imagine our text is the following:\n",
"\n",
"> : In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\n",
"\n",
"The tokenization process will add special tokens and deal with punctuation to return this text:\n",
"\n",
"> : xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \\n xxmaj then we will study how we build a language model and train it for a while .\n",
"\n",
"We have separated the 90 tokens by spaces. Let's say we want a batch size of 6, then we need to break this text in 6 contiguous parts of length 15:"
"stream = \"In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\"\n",
"tokens = tfm(stream)\n",
"bs,seq_len = 6,15\n",
"d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])\n",
"In a perfect world, we could then give this one batch to our model. But that doesn't work, because this would very likely not fit in our GPU memory (here we have 90 tokens, but all the IMDb reviews together give several millions of tokens).\n",
"\n",
"So in fact we will need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains state in order so that it remembers what it read previously when predicting what comes next. \n",
"\n",
"Going back to our previous example with 6 batches of length 15, if we chose sequence length of 5, that would mean we first feed the following array:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <tbody>\n",
" <tr>\n",
" <td>xxbos</td>\n",
" <td>xxmaj</td>\n",
" <td>in</td>\n",
" <td>this</td>\n",
" <td>chapter</td>\n",
" </tr>\n",
" <tr>\n",
" <td>movie</td>\n",
" <td>reviews</td>\n",
" <td>we</td>\n",
" <td>studied</td>\n",
" <td>in</td>\n",
" </tr>\n",
" <tr>\n",
" <td>first</td>\n",
" <td>we</td>\n",
" <td>will</td>\n",
" <td>look</td>\n",
" <td>at</td>\n",
" </tr>\n",
" <tr>\n",
" <td>how</td>\n",
" <td>to</td>\n",
" <td>customize</td>\n",
" <td>it</td>\n",
" <td>.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>of</td>\n",
" <td>the</td>\n",
" <td>preprocessor</td>\n",
" <td>used</td>\n",
" <td>in</td>\n",
" </tr>\n",
" <tr>\n",
" <td>will</td>\n",
" <td>study</td>\n",
" <td>how</td>\n",
" <td>we</td>\n",
" <td>build</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#hide_input\n",
"bs,seq_len = 6,5\n",
"d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])\n",
"Going back to our dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order in which the inputs come, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside, otherwise the text would not make sense anymore).\n",
"We will then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...) because we want the model to read continuous rows of text (as in our example above). This is why each text has been added a `xxbos` token during preprocessing, so that the model knows when it reads the stream we are beginning a new entry.\n",
"So to recap, at every epoch we shuffle our collection of documents to pick one document, and then we transform that one into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
"This is all done behind the scenes by the fastai library when we create a `LMDataLoader`. We can create one by first applying our `Numericalize` object to the tokenized texts:"
"As we have seen at the beginning of this chapter to train a state-of-the-art text classifier using transfer learning will take two steps: first we need to fine-tune our langauge model pretrained on Wikipedia to the corpus of IMDb reviews, then we can use that model to train a classifier.\n",
"fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenize` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debugging--but you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.\n",
"\n",
"Here's how we use `TextBlock` to create a language model, using fastai's defaults:"
"One thing that's different to previous types used in `DataBlock` is that we're not just using the class directly (i.e. `TextBlock(...)`, but instead are calling a *class method*. A class method is a Python method which, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read every document and tokenize it to get the vocab); to be as efficient as possible fastai does things such as: \n",
"\n",
"- Save the tokenized documents in a temporary folder, so fastai doesn't have to tokenize more than once\n",
"- Runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs.\n",
"\n",
"Therefore we need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing--that's what `from_folder` does.\n",
" <td>xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard</td>\n",
" <td>xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard xxunk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>what xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \\n\\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this</td>\n",
" <td>xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \\n\\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this is</td>\n",
"For converting the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modelling. Then those embeddings are fed in a *Recurrent Neural Network* (RNN), using an architecture called *AWD_LSTM* (we will show how to write such a model from scratch in <<chapter_nlp_dive>>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:"
"The loss function used by default is cross entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). A metric often used in NLP for language models is called *perplexity*. It is the exponential of the loss (i.e. `torch.exp(cross_entropy)`). We will also add accuracy, to see how many times our model is right when trying to predict the next word, since cross entropy (as we've seen) is both hard to interpret, and also tells you more about the model's confidence, rather than just its accuracy\n",
"\n",
"The grey first arrow in our overall picture has been done for us and made available as a pretrained model in fastai; we've now built the `DataLoaders` and `Learner` for the second stage, and we're ready to fine-tune it!"
"It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll just use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (which is the only part of the model that contains randomly initialized weights--i.e. embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
"It will create a file in `learn.path/models/` named \"1epoch.pth\". If you want to load your model in another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = learn.load('1epoch')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can them finetune the model after unfreezing:"
"Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the *encoder*. We can save it with `save_encoder`:"
 jargon: Encoder:">
"> jargon: Encoder: The model not including the task-specific final layer(s). It means much the same thing as *body* when applied to vision CNNs, but tends to be used more for NLP and generative models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This completes the second stage of the text classification process: fine-tuning the language model. We can now fine tune this language model using the IMDb sentiment labels."
"Before using this to fine-tune a classifier on the review, we can use our model to generate random reviews: since it's trained to guess what the next word of the sentence is, we can use it to write new reviews:"
"i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story\n",
"i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the \" evil \" machine has to be used to protect\n"
"As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so you don't get exactly the same review twice. Our model doesn't have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalized properly (I is just transformed to i with our rules -- they require two characters or more to consider a word is capitalized -- so it's normal to see it lowercased), and is using consistent tense. The general review make sense at first glance, and it's only if you read carefully you can notice something is a bit off. Not bad for a model trained in a couple of hours! \n",
"\n",
"Our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating the classifier DataLoaders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now moving from language model fine tuning, to classifier fine tuning. To re-cap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label--in the case of IMDb, it's the sentiment of a document.\n",
"\n",
"This means that the structure of our `DataBlock` for NLP classification will look very familiar; it's actually nearly the same as we've seen for the many image classification datasets we've worked with:"
"Just like with image classification, `show_batch` shows the dependent variable (sentiment, in this case) with each independent variable (movie review text):"
" <td>xxbos i rate this movie with 3 skulls , only coz the girls knew how to scream , this could 've been a better movie , if actors were better , the twins were xxup ok , i believed they were evil , but the eldest and youngest brother , they sucked really bad , it seemed like they were reading the scripts instead of acting them … . spoiler : if they 're vampire 's why do they freeze the blood ? vampires ca n't drink frozen blood , the sister in the movie says let 's drink her while she is alive … .but then when they 're moving to another house , they take on a cooler they 're frozen blood . end of spoiler \\n\\n it was a huge waste of time , and that made me mad coz i read all the reviews of how</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>xxbos i have read all of the xxmaj love xxmaj come xxmaj softly books . xxmaj knowing full well that movies can not use all aspects of the book , but generally they at least have the main point of the book . i was highly disappointed in this movie . xxmaj the only thing that they have in this movie that is in the book is that xxmaj missy 's father comes to xxunk in the book both parents come ) . xxmaj that is all . xxmaj the story line was so twisted and far fetch and yes , sad , from the book , that i just could n't enjoy it . xxmaj even if i did n't read the book it was too sad . i do know that xxmaj pioneer life was rough , but the whole movie was a downer . xxmaj the rating</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>xxbos xxmaj this , for lack of a better term , movie is lousy . xxmaj where do i start … … \\n\\n xxmaj cinemaphotography - xxmaj this was , perhaps , the worst xxmaj i 've seen this year . xxmaj it looked like the camera was being tossed from camera man to camera man . xxmaj maybe they only had one camera . xxmaj it gives you the sensation of being a volleyball . \\n\\n xxmaj there are a bunch of scenes , haphazardly , thrown in with no continuity at all . xxmaj when they did the ' split screen ' , it was absurd . xxmaj everything was squished flat , it looked ridiculous . \\n\\n xxmaj the color tones were way off . xxmaj these people need to learn how to balance a camera . xxmaj this ' movie ' is poorly made , and</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dls_clas.show_batch(max_n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the `DataBlock` definition above, every piece is familiar from previous data blocks we've built, with two important exceptions:\n",
"\n",
"- `TextBlock.from_folder` no longer has the `is_lm=True` parameter, and\n",
"- We pass the `vocab` we created for the language model fine-tuning.\n",
"\n",
"The reason that we pass the vocab of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.\n",
"\n",
"By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a minibatch. Let's see with an example, by trying to create a minibatch containing the first 10 documents. First we'll numericalize them:"
"Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and that a single tensor has a fixed shape (i.e. it has some particular length on every axis, and all items must be consistent). This should look a bit familiar: we had the same issue with images. In that case, we use cropping, padding, and/or squishing to make everything the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!) You can't really \"squish\" a document. So that leaves padding!\n",
"\n",
"We will expand the shortest texts to make them all the same size. To do this, we use a special token that will be ignored by our model. This is called *padding* (just like in vision). Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend of be of similar lengths. We won't make every batch, therefore, the same size, but will instead use the size of the largest document in each batch. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, although as we write these words, no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon however, so have a look on the book website, where we'll add information about this if and when it's working well.)\n",
"\n",
"The padding and sorting is automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)\n",
"\n",
"We can now create a model to classify our texts:"
"The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = learn.load_encoder('finetuned')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine tuning the classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.347427</td>\n",
" <td>0.184480</td>\n",
" <td>0.929320</td>\n",
" <td>00:33</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(1, 2e-2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In just one epoch we get the same result as our training in <<chapter_intro>>, not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:"
"We reach 94.3% accuracy, which was state-of-the-art just three years ago. By training a model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, fine-tuning a much bigger model and using expensive data augmentation (translating sentences in another language and back, using another model for translation).\n",
"\n",
"Using a pretrained model let us build a fine-tuned language model that was pretty powerful, to either generate fake reviews or help classify them. It is good to remember that this technology can also be used for malign purposes."
"Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analysed the comments that were sent to the FCC in the USA regarding a 2017 proposal to repeal net neutrality. In his article [More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6)\", he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Madlibs-style mail merge. In <<disinformation>>, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature."
"Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create and compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
"<img src=\"images/ethics/image14.png\" id=\"ethics_reddit\" caption=\"An algorithm talking to itself on Reddit\" alt=\"An algorithm talking to itself on Reddit\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, the use of the algorithm is being done explicitly. But imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithms to gradually develop followings and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine it getting to a point where the vast majority of discourse online was from bots, and nobody would have any idea that it was happening.\n",
"We are already starting to see examples of machine learning being used to generate identities. For example, <<katie_jones>> shows us a LinkedIn profile for Katie Jones."
"Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see is auto generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Centre for Strategic and International Studies.\n",
"\n",
"Many people assume or hope that algorithms will come to our defence here. The hope is that we will develop classification algorithms which can automatically recognise auto generated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
"We have seen what `Tokenizer` or a `Numericalize` do to a collection of texts, and how they're used inside the data block API, which handles those transforms for us directly using the `TextBlock`. But what if we want to only apply one of those transforms, either to see intermediate results or because we have already tokenized texts. More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case? For this, we need to use fastai's *mid-level API* for processing data. The data block API is built on top of that layer, so it will allow you to do everything the data block API does, and much much more."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Going deeper into fastai's layered API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fastai library is built on a *layered API*. At the very top layer, there are *applications* that allow us to train a model in five lines of codes, as we saw in <<chapter_intro>>. In the case of creating `DataLoaders` for a text classifier, for instance, we used the line:"
"The factory method `TextDataLoaders.from_folder` is very convenient when your data is arranged the exact same way as the IMDb dataset, but in practice, that often won't be the case. The data block API offers more flexibility. As we saw in the last chapter, we can ge the same result with:"
"But it's sometimes not flexible enough. For debugging purposes for instance, we might need to apply just parts of the transforms that come with this data block. Or, we might want to create `DataLoaders` for some application that isn't directly supported by fastai. In this section, we'll dig into the pieces that are used inside fastai to implement the data block API. By understanding these pieces, you'll be able to leverage the power and flexibility of this mid-tier API."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: The mid-level API in general does not only contain functionality for creating `DataLoaders`. It also has the *callback* system that we will study in <<chapter_callbacks>>, which allows us to customize the training loop any way we like, and the *general optimizer* that we will cover in <<chapter_accel_sgd>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transforms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we studied tokenization and numericalization in the last chapter, we started by grabbing a bunch of texts:"
"...and `Tokenizer.decode` turns this back into a single string (it may not, however, be exactly the same as the original string; this depends on whether the tokenizer is *reversible*, which the default word tokenizer is not at the time we're writing this book):"
"`decode` is used by fastai's `show_batch` and `show_results`, as well as some other inference methods, to convert predictions and mini-batches into a human-understandable representation.\n",
"\n",
"For each of `tok` or `num` above, we created an object, called the setup method (which trains the tokenizer if needed for `tok` and creates the vocab for `num`), applied it to our raw texts (by calling the object as a function), and then finally decoded it back to an understandable representation. These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them. This is the `Transform` class. Both `Tokenize` and `Numericalize` are `Transform`s.\n",
"\n",
"In general, a `Transform` is an object that behaves like a function, has an optional *setup* that will initialize some inner state (like the vocab inside `num` for instance), and has an optional *decode* that will reverse the function (this reversal may not be perfect, as we saw above for `tok`).\n",
"\n",
"A good example of `decode` is found in the `Normalize` transform that we saw in <<chapter_sizing_and_tta>>: to be able to plot the images its `decode` method undoes the normalization (i.e. it multiplies by the std and adds back the mean). On the other hand, data augmentation transforms do not have a `decode` method, since we want to show the effects on images, to make sure the data augmentation is working as we want."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second special behavior of `Transform`s is that they always get applied over tuples: in general, our data is always a tuple `(input,target)` (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as `Resize`, we don't want to resize the tuple, but resize the input (if applicable) and the target (if applicable). It's the same for the batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.\n",
"\n",
"We can see this behavior if we pass a tuple of texts to `tok`:"
"`tfm` will automatically convert `f` to a `Transform` with no setup and no decode method. If you need either of those, you will need to subclass `Transform`. When writing this subclass, you need to implement the actual function in `encodes`, then (optionally), the setup behavior in `setups` and the decoding behavior in `decodes`:"
"Here `NormalizeMean` will initialize some state during the setup (the mean of all elements passed), then the transformation is to subtract that mean. For decoding purposes, we implement the reverse of that transformation by adding the mean. Here is an example of `NormalizeMean` in action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3.0, 5.0, 2.0)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfm = NormalizeMean()\n",
"tfm.setup([1,2,3,4,5])\n",
"start = 2\n",
"y = tfm(start)\n",
"z = tfm.decode(y)\n",
"tfm.mean,y,z"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To learn more about `Transform`s and how you can use them to have different behavior depending on the type of the input, be sure to check our tutorial in the docs online."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compose several transforms together, fastai provides `Pipeline`. We define a `Pipeline` by passing it a list of `Transform`s; it will then compose the transforms inside it. When you call a `Pipeline` on an object, it will automatically call the transforms inside, in order:"
"And you can call decode on the result of your encoding, to get back something you can display and analyze:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo \\'s first movie , was one of the most interesti'"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfms.decode(t)[:100]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only part that doesn't work the same way as in `Transform` is the setup. To properly setup a `Pipeline` of `Transform`s on some data, you need to use a `TfmdLists`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TfmdLists and Datasets: Transformed collections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your data is usually a set of raw items (like filenames, or rows in a dataframe) to which you want to apply a succession of transformations. We just saw that the succession of transformations was represented by a `Pipeline` in fastai. The class that groups together this pipeline with your raw items is called `TfmdLists`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TfmdLists"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the short way of doing the transformation we saw in the previous section:"
"At initialization, the `TfmdLists` will automatically call the setup method of each transform in order, providing them not with the raw items but the items transformed by all the previous `Transform`s in order. We can get the result of our pipeline on any raw element just by indexing into the `TfmdLists`:"
"And the `TfmdLists` knows how to decode for showing purposing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo \\'s first movie , was one of the most interesti'"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tls.decode(t)[:100]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In fact, it even has a `show` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo 's first movie , was one of the most interesting and tricky ideas that xxmaj i 've ever seen when talking about movies . xxmaj they had just one scenery , a bunch of actors and a plot . xxmaj so , what made it so special were all the effective direction , great dialogs and a bizarre condition that characters had to deal like rats in a labyrinth . xxmaj his second movie , \" cypher \" ( 2002 ) , was all about its story , but it was n't so good as \" cube \" but here are the characters being tested like rats again . \n",
"\n",
" \" nothing \" is something very interesting and gets xxmaj vincenzo coming back to his ' cube days ' , locking the characters once again in a very different space with no time once more playing with the characters like playing with rats in an experience room . xxmaj but instead of a thriller sci - fi ( even some of the promotional teasers and trailers erroneous seemed like that ) , \" nothing \" is a loose and light comedy that for sure can be called a modern satire about our society and also about the intolerant world we 're living . xxmaj once again xxmaj xxunk amaze us with a great idea into a so small kind of thing . 2 actors and a blinding white scenario , that 's all you got most part of time and you do n't need more than that . xxmaj while \" cube \" is a claustrophobic experience and \" cypher \" confusing , \" nothing \" is completely the opposite but at the same time also desperate . \n",
"\n",
" xxmaj this movie proves once again that a smart idea means much more than just a millionaire budget . xxmaj of course that the movie fails sometimes , but its prime idea means a lot and offsets any flaws . xxmaj there 's nothing more to be said about this movie because everything is a brilliant surprise and a totally different experience that i had in movies since \" cube \" .\n"
]
}
],
"source": [
"tls.show(t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `TfmdLists` is named with an \"s\" because it can handle a training and validation set with a splits argument. You just need to pass the indices of which elemets are in the training set, and which are in the validation set:"
"If you have manually written a `Transform` that returns your whole data (input and target) from the raw items you had, then `TfmdLists` is the class you need. You can directly convert it to a `DataLoaders` object with the `dataloaders` method. This is what we will do in our Siamese example further in this chapter.\n",
"\n",
"In general though, you have two (or more) parallel pipelines of transforms: one for processing your raw items into inputs and one to process your raw items into targets. For instance, here, the pipeline we defined only processes the input. If we want to do text classification, we have to process the labels as well. \n",
"\n",
"Here we need to do two things: first take the label name from the parent folder. There is a function `parent_label` for this:"
"Then we need a `Transform` that will grab the unique items and build a vocab with it during setup, then will transform the string labels into integers when called. fastai provides this transform, it's called `Categorize`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((#2) ['neg','pos'], TensorCategory(1))"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat = Categorize()\n",
"cat.setup(lbls)\n",
"cat.vocab, cat(lbls[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To do the whole setup automatically on our list of files, we can create a `TfmdLists` as before:"
"But then we end up with two separate objects for our inputs and targets, which is not what we want. This is where `Datasets` comes to the rescue."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Datasets` will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result. Like `TfmdLists`, it will automatically do the setup for us, and when we index into a `Datasets`, it will return us a tuple with the results of each pipeline:"
"It can also decode any processed tuple or show it directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('xxbos xxmaj this movie had horrible lighting and terrible camera movements . xxmaj this movie is a jumpy horror flick with no meaning at all . xxmaj the slashes are totally fake looking . xxmaj it looks like some 17 year - old idiot wrote this movie and a 10 year old kid shot it . xxmaj with the worst acting you can ever find . xxmaj people are tired of knives . xxmaj at least move on to guns or fire . xxmaj it has almost exact lines from \" when a xxmaj stranger xxmaj calls \" . xxmaj with gruesome killings , only crazy people would enjoy this movie . xxmaj it is obvious the writer does n\\'t have kids or even care for them . i mean at show some mercy . xxmaj just to sum it up , this movie is a \" b \" movie and it sucked . xxmaj just for your own sake , do n\\'t even think about wasting your time watching this crappy movie .',\n",
" 'neg')"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t = dsets.valid[0]\n",
"dsets.decode(t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to convert your `Datasets` object to a `DataLoaders`, which can be done with the `dataloaders` method. Here we need to pass along special arguments to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to `before_batch`: "
"`dataloaders` directly calls `DataLoader` on each subset of our `Datasets`. fastai's `DataLoader` expands the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has a lot of points of customization but the most important you should know are:\n",
"\n",
"- `after_item`: applied on each item after grabbing it inside the dataset. This is the equivalent of the `item_tfms` in `DataBlock`.\n",
"- `before_batch`: applied on the list of items before they are collated. This is the ideal place to pad items to the same size.\n",
"- `after_batch`: applied on the batch as a whole after its construction. This is the equivalent of the `batch_tfms` in `DataBlock`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a conclusion, here is the full code necessary to prepare the data for text classification:"
"The two differences with what we had above is the use of `GrandParentSplitter` to split our training and validation data, and the `dl_type` argument. This is to tell `dataloaders` to use the `SortedDL` class of `DataLoader`, and not the usual one. This is the class that will handle the construction of batches by putting samples of roughly the same lengths into batches.\n",
"\n",
"This does the exact same thing as our `DataBlock` from above:"
"...except that now, you know how to customize every single piece of it!\n",
"\n",
"Let's practice what we just learned on this mid-level API for data preprocessing on a computer vision example now, with a Siamese Model input pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Applying the mid-tier data API: SiamesePair"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Siamese model takes two images and has to determine if they are of the same classe or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. TK see if we train that model later in the book. "
"1. Why is a language model considered self-supervised learning?\n",
"1. What are self-supervised models usually used for?\n",
"1. What do we fine-tune language models?\n",
"1. What are the three steps to create a state-of-the-art text classifier?\n",
"1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?\n",
"1. What are the three steps to prepare your data for a language model?\n",
"1. What is tokenization? Why do we need it?\n",
"1. Name three different approaches to tokenization.\n",
"1. What is 'xxbos'?\n",
"1. List 4 rules that fastai applies to text during tokenization.\n",
"1. Why are repeated characters replaced with a token showing the number of repetitions, and the character that's repeated?\n",
"1. What is numericalization?\n",
"1. Why might there be words that are replaced with the \"unknown word\" token?\n",
"1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)\n",
"1. Why do we need padding for text classification? Why don't we need it for language modeling?\n",
"1. What does an embedding matrix for NLP contain? What is its shape?\n",
"1. What is perplexity?\n",
"1. Why do we have to pass the vocabulary of the language model to the classifier data block?\n",
"1. What is gradual unfreezing?\n",
"1. Why is text generation always likely to be ahead of automatic identification of machine generated texts?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. See what you can learn about language models and disinformation. What are the best language models today? Have a look at some of their outputs. Do you find them convincing? How could a bad actor best use this to create conflict and uncertainty?\n",
"1. Given the limitation that models are unlikely to be able to consistently recognise machine generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leveraged deep learning?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Becoming a deep learning practitioner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations — you've completed all of the chapters in this book which cover the key practical parts of training and using deep learning! You know how to use all of fastai's built in applications, and how to customise them using the data blocks API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to help make sure your creations help improve society too.)\n",
"\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network application. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system which best handles these capabilities and limitations.\n",
"\n",
"In the rest of this book we will be pulling apart these applications, piece by piece, to understand all of the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is the knowledge which allows you to inspect and debug models that you build, and to create new applications which are customised for your particular projects."