fastbook/10_nlp.ipynb

{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from utils import *\n",
"from IPython.display import display,HTML"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[[chapter_nlp]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NLP deep dive: RNNs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify those reviews. One thing is a bit different from the transfer learning we have in computer vision: the pretrained model was not trained on the same task as the model we used to classify reviews.\n",
"\n",
"What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to get an understanding of the English-- or other--language. Self-supervised learning can also be used in other domains; for instance, see [Self-supervised learning and computer vision](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pre-training a model used for transfer learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For instance, training a model to predict the next word in a text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better: the Wikipedia English is slightly different from the IMDb English. So instead of jumping directly to the classifier, we could finetune our pretrained language model to the IMDb corpus and *then* use that as the base for our classifier.\n",
"\n",
"Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targetting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of IMDb, there will be lots of names of movie directors and actors, and often a less formal style of language that seen in Wikipedia.\n",
"\n",
"We saw that with fastai, we can download a pre-trained language model for English, and use it to get state-of-the-art results for NLP classification. (We expect pre-trained models in many more languages to be available soon — they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?\n",
"\n",
"One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine tune the (sequence-based) language model prior to fine tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. So that is 100,000 movie reviews altogether (since there are also 25,000 labelled reviews in the training set, and 25,000 in the validation set). We can use all 100,000 of these reviews to fine tune the pretrained language model — this will result in a language model that is particularly good at predicting the next word of a movie review. In contrast, the pretrained model was trained only on Wikipedia articles.\n",
"\n",
"The [ULMFiT paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of language model fine tuning, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarised in this figure:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Diagram of the ULMFiT process\" width=\"700\" caption=\"The ULMFiT process\" id=\"ulmfit_process\" src=\"images/att_00027.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Language model fine tuning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A *language model* is a model that learns to predict the next word of a sentence. Have a think about how you would turn this language modelling problem into a neural network, given what you have learned so far. We'll be able to use concepts that we've seen in the last two chapters.\n",
"\n",
"It's not at all obvious how we're going to use what we've learned so far to build a language model. Sentences can be different lengths, and documents can be very long. So, how can we predict the next word of a sentence using a neural network? Let's find out!\n",
"\n",
"We've already seen how categorical variables can be used as independent variables for a neural network. The approach we took for a single categorical variable was to:\n",
"\n",
"1. Make a list of all possible levels of that categorical variable (let us call this list the *vocab*)\n",
"1. Replace each level with its index in the vocab\n",
"1. Create an embedding matrix for this containing a row for each level (i..e, for each item of the vocab)\n",
"1. Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step two; this is equivalent to, but faster and more efficient, than a matrix which takes as input one-hot encoded vectors representing the indexes)\n",
"\n",
"We can do nearly the same thing with text! What is new is the idea of a sequence. First we concatenate all of the documents in our dataset into one big long string and split it into words, giving us a very long list of words. Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second last, and our dependent variable would be the sequence of words starting with the second word and ending with the last word. \n",
"\n",
"When creating our vocab, we will have very common words that will probably be in the vocabulary of our pretrained model, but we will also have new words specific to our corpus (cinematographic terms, or actor names for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of this pretrained model; but for new words, we won't have anything, so we will just initialize the corresponding row with a random vector."
]
},
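{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that independent/dependent offset concrete, here is a minimal sketch in plain Python, using a toy word list (just an illustration, not the fastai implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of the language model setup: the dependent variable is\n",
"# simply the independent variable shifted by one token.\n",
"words = 'the quick brown fox jumps over the lazy dog'.split()\n",
"x = words[:-1]  # first word .. second-last word\n",
"y = words[1:]   # second word .. last word\n",
"list(zip(x, y))[:3]  # each input word is paired with the word to predict next"
]
},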
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:\n",
"\n",
"- **Tokenization**: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
"- **Numericalization**: make a list of all of the unique words which appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
"- **Language model data loader** creation: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable which is offset from the independent variable buy one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
"- **Language model** creation: we need a special kind of model which does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network*. We will get to the details of this in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
"\n",
"Let's take a look at how each step works in detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing text with fastai"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"### Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"When we said, *convert the text into a list of words*, we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like \"don't\"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Poland where we can create really long words from many, many pieces? What about languages like Japanese and Chinese which don't use bases at all, and don't really have a well-defined idea of *word*?\n",
"\n",
"Because there is no one correct answer to these questions, there is no one approach to tokenization. Each element of the list created by the tokenisation process is called a *token*. There are three main approaches:\n",
"\n",
"- **Word-based**: split a sentence on spaces, as well as applying language specific rules to try to separate parts of meaning, even when there are no spaces, such as turning \"don't\" into \"do n't\". Generally, punctuation marks are also split into separate tokens\n",
"- **Subword based**: split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokeniser as \"o c ca sion\"\n",
"- **Character-based**: split a sentence into its individual characters.\n",
"\n",
"We'll be looking at word and subword tokenization here, and we'll leave character-based tokenization for you to implement in the questionnaire at the end of this chapter."
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"> jargon: token: one element of a list created by the tokenisation process. It could be a word, part of a word (a _subword_), or a single character."
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"### Word tokenization with fastai"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Rather than providing its own tokenizers, fastai instead provides a consistent interface to a range of tokenisers in external libraries. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time, so the defaults that fastai uses change too. However, the API and options shouldn't change too much, since fastai tries to maintain a consistent API even as the underlying technology changes.\n",
"\n",
"Let's try it out with the IMDb dataset that we used in <<chapter_intro>>:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"from fastai2.text.all import *\n",
"path = untar_data(URLs.IMDB)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"We'll need to grab the text files in order to try out a tokenizer. Just like `get_image_files`, which we've used many times already, gets all the image files in a path, `get_text_files` gets all the text files in a path. We can also optionally pass `folders` to restrict the search to a particular list of subfolders:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"files = get_text_files(path, folders = ['train', 'test', 'unsup'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Here's a review that we'll tokenize (we'll just print the start of it here to save space):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/plain": [
"'This movie, which I just discovered at the video store, has apparently sit '"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"txt = files[0].open().read(); txt[:75]"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"As we write this book, the default *English word tokenizer* for fastai uses a library called *spaCy*. This uses a sophisticated rules engine that has special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not always be Spacy, depending when you're reading this).\n",
"\n",
"Let's try it out. We'll use fastai's `coll_repr(collection,n)` function to display the results; this displays the first `n` items of `collection`, along with the full size--it's what `L` uses by default. Not that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"hidden": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(#201) ['This','movie',',','which','I','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','It',\"'s\",'easy','to','see'...]\n"
]
}
],
"source": [
"spacy = WordTokenizer()\n",
"toks = first(spacy([txt]))\n",
"print(coll_repr(toks, 30))"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense--these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. spaCy handles these for us, for instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/plain": [
"(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"first(spacy(['The U.S. dollar $1 is $1.00.']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"fastai then adds some additional functionality to the tokenization process with the `Tokenizer` class:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"hidden": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','xxmaj','it',\"'s\",'easy'...]\n"
]
}
],
"source": [
"tkn = Tokenizer(spacy)\n",
"print(coll_repr(tkn(txt), 31))"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"There are now some tokens added that start with the characters \"xx\", which is not a common word prefix in English. These are *special tokens*.\n",
"\n",
"For example, the first item in the list, \"xxbos\", is a special token that indicates the start of a new text (\"BOS\" is a standard NLP acronym which means \"beginning of stream\"). By recognizing this start token, the model will be able to learn it needs to \"forget\" what was said previously and focus on upcoming words.\n",
"\n",
"These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognise the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenised language, a language which is designed to be easy for a model to learn.\n",
"\n",
"For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalised word will be replaced with a special capitalisation token, followed by the lower case version of the word. This way, the embedding matrix only needs the lower case version of the words, saving compute and memory, but can still learn the concept of capitalisation.\n",
"\n",
"Here are some of the main special tokens you'll see:\n",
"\n",
"- xxbos:: indicates the beginning of a text (here a review)\n",
"- xxmaj:: indicates the next word begins with a capital (since we lower-cased everything)\n",
"- xxunk:: indicates the next word is unknown\n",
"\n",
"To see the rules that were used, you can check the default rules:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/plain": [
"[<function fastai2.text.core.fix_html(x)>,\n",
" <function fastai2.text.core.replace_rep(t)>,\n",
" <function fastai2.text.core.replace_wrep(t)>,\n",
" <function fastai2.text.core.spec_add_spaces(t)>,\n",
" <function fastai2.text.core.rm_useless_spaces(t)>,\n",
" <function fastai2.text.core.replace_all_caps(t)>,\n",
" <function fastai2.text.core.replace_maj(t)>,\n",
" <function fastai2.text.core.lowercase(t, add_bos=True, add_eos=False)>]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"defaults.text_proc_rules"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"As always, you can look at the source code of each of them in a notebook by typing\n",
"\n",
"```\n",
"??replace_rep\n",
"```\n",
"\n",
"Here is a brief summary of what each does:\n",
"\n",
"- `fix_html`: replace special HTML characters by a readable version (IMDb reviwes have quite a few of them for instance) ;\n",
"- `replace_rep`: replace any character repeated three times or more by a special token for repetition (xxrep), the number of times it's repeated, then the character ;\n",
"- `replace_wrep`: replace any word repeated three times or more by a special token for word repetition (xxwrep), the number of times it's repeated, then the word ;\n",
"- `spec_add_spaces`: add spaces around / and # ;\n",
"- `rm_useless_spaces`: remove all repetitions of the space character ;\n",
"- `replace_all_caps`: lowercase a word written in all caps and adds a special token for all caps (xxcap) in front of it ;\n",
"- `replace_maj`: lowercase a capilaized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
"- `lowercase`: lowercase all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)."
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Let's take a look at a few of them in action:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/plain": [
"\"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index'...]\""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coll_repr(tkn('&copy; Fast.ai www.fast.ai/INDEX'), 31)"
]
},
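{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"We can also call an individual rule directly on a string to see exactly what it does in isolation. For instance, here is `replace_rep` on its own (a quick sanity check; the exact spacing of the result may vary between fastai versions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# e.g. 'great xxrep 4 ! ' -- the token, then the count, then the character\n",
"replace_rep('great!!!!')"
]
},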
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"### Subword tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"In addition to the *word tokenization* approach seen in the last section, another popular tokenization method is *subword tokenization*. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 (which means \"My name is Jeremy Howard\" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a \"word\". There are also \"agglutinative languages\", like Polish, which can add many morphemes together to create very long \"words\" which include a lot of separate pieces of information.\n",
"\n",
"To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:\n",
"\n",
"1. Analyze a corpus of documents to find the most commonly occuring groups of letters. These become the vocab.\n",
"2. Tokenize the corpus using this vocab of *subword units*.\n",
"\n",
"Let's look at an example. For our corpus, we'll use the first 2000 movie reviews:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"txts = L(o.open().read() for o in files[:2000])"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to \"train\" it. That is, we need to have it read our documents, and find the common sequences of characters, to create the vocab. This is done with `setup`. As we'll see shortly, `setup` is a special fastai method that is called automatically in our usual data processing pipelines. Since we're doing everything manually at the moment, however, we have to call it ourselves. Here's a function that does these steps for a given vocab size, and shows an example output:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"def subword(sz):\n",
" sp = SubwordTokenizer(vocab_sz=sz)\n",
" sp.setup(txts)\n",
" return ' '.join(first(sp([txt]))[:40])"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Let's try it out:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'▁This ▁movie , ▁which ▁I ▁just ▁dis c over ed ▁at ▁the ▁video ▁st or e , ▁has ▁a p par ent ly ▁s it ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁dis t ri but or . ▁It'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"subword(1000)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"When using fastai's subword tokenizer, the special character `▁` represents a space character in the original text.\n",
"\n",
"If we use a smaller vocab, then each token will represent fewer characters, and it will take more tokens to represent a sentence:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'▁ T h i s ▁movie , ▁w h i ch ▁I ▁ j us t ▁ d i s c o ver ed ▁a t ▁the ▁ v id e o ▁ st or e , ▁h a s'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"subword(200)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"On the other hand, if we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many to represent a sentence:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\"▁This ▁movie , ▁which ▁I ▁just ▁discover ed ▁at ▁the ▁video ▁store , ▁has ▁apparently ▁sit ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁distributor . ▁It ' s ▁easy ▁to ▁see ▁why . ▁The ▁story ▁of ▁two ▁friends ▁living\""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"subword(10000)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Picking a subword vocab size represents a compromise: a larger vocab means more fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.\n",
"\n",
"Overall, subword tokenization provides a way to easily scale between character tokenization (i.e. use a small subword vocab) and word tokenization (i.e. use a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other \"languages\" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)"
]
},
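{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"To see that trade-off numerically, we can count how many tokens the same review needs at each vocab size (a rough sketch; training each tokenizer takes a little while, and the exact counts will vary with the corpus sample):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Bigger vocab -> fewer tokens needed for the same review\n",
"for sz in 200,1000,10000:\n",
"    sp = SubwordTokenizer(vocab_sz=sz)\n",
"    sp.setup(txts)\n",
"    print(sz, len(first(sp([txt]))))"
]
},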
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Numericalization with fastai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numericalization is the process of mapping tokens to integers. It's basically identical to the steps necessary to create a `Category` variable, such as the dependent variable of digits in MNIST:\n",
"\n",
"1. Make a list of all possible levels of that categorical variable (the *vocab*)\n",
"1. Replace each level with its index in the vocab\n",
"\n",
"We'll take a look at this in action on the word-tokenized text we saw earlier:"
]
},
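{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy sketch of numericalization (not fastai code): build a vocab, then\n",
"# replace each token with its index in that vocab.\n",
"tiny_toks = ['the','cat','sat','on','the','mat']\n",
"vocab = sorted(set(tiny_toks))                # step 1: all possible levels\n",
"word2idx = {w:i for i,w in enumerate(vocab)}\n",
"[word2idx[w] for w in tiny_toks]              # step 2: look up each token"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how fastai does this, using the word-tokenized text we saw earlier:"
]
},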
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','xxmaj','it',\"'s\",'easy'...]\n"
]
}
],
"source": [
"toks = tkn(txt)\n",
"print(coll_repr(tkn(txt), 31))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the `vocab`. That means we'll need our tokenized corpus first. Since tokenization takes a while, it's done in parallel by fastai; but for this manual walk-thru, we'll use a small subset:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"toks200 = txts[:200].map(tkn)\n",
"toks200[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can pass this to `setup` to create our `vocab`:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"(#2000) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','i','it'...]\""
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num = Numericalize()\n",
"num.setup(toks200)\n",
"coll_repr(num.vocab,20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults to `Numericalize` are `min_freq=3,max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the most common 60000 with a special *unknown word* token `xxunk`. This is useful to avoid having an overly large embedding matrix, since that can slow down training, use up too much memory, and can also mean that there isn't enough data to train useful representations for rare words. However, this last issue is better handled by setting `min_freq`; the default `min_freq=3` means that any word appearing less than three times is replaced with `xxunk`.\n",
"\n",
"Fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the `vocab` parameter.\n",
"\n",
"Once we've created our `Numericalize` object, we can use it as if it's a function:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([ 2, 8, 21, 28, 11, 90, 18, 59, 0, 45, 9, 351, 499, 11, 72, 533, 584, 146, 29, 12])"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nums = num(toks)[:20]; nums"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, our tokens have been converted to a tensor of integers that our model can receive. We can check that they map back to the original text:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"' '.join(num.vocab[o] for o in nums)"
]
},
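{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of the effect of `min_freq` (the exact numbers depend on which reviews ended up in our sample), we can set up a second `Numericalize` that keeps every word, and compare vocab sizes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keeping every word (min_freq=1) gives a noticeably larger vocab\n",
"num_all = Numericalize(min_freq=1)\n",
"num_all.setup(toks200)\n",
"len(num.vocab),len(num_all.vocab)"
]
},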
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Putting our texts into batches for a language model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When dealing with images, we needed to resize them all to the same height and width before grouping them together in a mini-batch so they could stack together efficiently in a single tensor. Here it's going to be a little different, because one cannot simply resize text to a desired length. Also, we want our language model to read text in order, so that it can efficiently predict what the next word is. All the difficulty of a language model loader is that each new batch should begin precisely where the previous left off.\n",
"\n",
"Let's start with an example and imagine our text is the following:\n",
"\n",
"> : In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\n",
"\n",
"The tokenization process will add special tokens and deal with punctuation to return this text:\n",
"\n",
"> : xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \\n xxmaj then we will study how we build a language model and train it for a while .\n",
"\n",
"We have separated the 90 tokens by spaces. Let's say we want a batch size of 6, then we need to break this text in 6 contiguous parts of length 15:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <tbody>\n",
" <tr>\n",
" <td>xxbos</td>\n",
" <td>xxmaj</td>\n",
" <td>in</td>\n",
" <td>this</td>\n",
" <td>chapter</td>\n",
" <td>,</td>\n",
" <td>we</td>\n",
" <td>will</td>\n",
" <td>go</td>\n",
" <td>back</td>\n",
" <td>over</td>\n",
" <td>the</td>\n",
" <td>example</td>\n",
" <td>of</td>\n",
" <td>classifying</td>\n",
" </tr>\n",
" <tr>\n",
" <td>movie</td>\n",
" <td>reviews</td>\n",
" <td>we</td>\n",
" <td>studied</td>\n",
" <td>in</td>\n",
" <td>chapter</td>\n",
" <td>1</td>\n",
" <td>and</td>\n",
" <td>dig</td>\n",
" <td>deeper</td>\n",
" <td>under</td>\n",
" <td>the</td>\n",
" <td>surface</td>\n",
" <td>.</td>\n",
" <td>xxmaj</td>\n",
" </tr>\n",
" <tr>\n",
" <td>first</td>\n",
" <td>we</td>\n",
" <td>will</td>\n",
" <td>look</td>\n",
" <td>at</td>\n",
" <td>the</td>\n",
" <td>processing</td>\n",
" <td>steps</td>\n",
" <td>necessary</td>\n",
" <td>to</td>\n",
" <td>convert</td>\n",
" <td>text</td>\n",
" <td>into</td>\n",
" <td>numbers</td>\n",
" <td>and</td>\n",
" </tr>\n",
" <tr>\n",
" <td>how</td>\n",
" <td>to</td>\n",
" <td>customize</td>\n",
" <td>it</td>\n",
" <td>.</td>\n",
" <td>xxmaj</td>\n",
" <td>by</td>\n",
" <td>doing</td>\n",
" <td>this</td>\n",
" <td>,</td>\n",
" <td>we</td>\n",
" <td>'ll</td>\n",
" <td>have</td>\n",
" <td>another</td>\n",
" <td>example</td>\n",
" </tr>\n",
" <tr>\n",
" <td>of</td>\n",
" <td>the</td>\n",
" <td>preprocessor</td>\n",
" <td>used</td>\n",
" <td>in</td>\n",
" <td>the</td>\n",
" <td>data</td>\n",
" <td>block</td>\n",
" <td>xxup</td>\n",
" <td>api</td>\n",
" <td>.</td>\n",
" <td>\\n</td>\n",
" <td>xxmaj</td>\n",
" <td>then</td>\n",
" <td>we</td>\n",
" </tr>\n",
" <tr>\n",
" <td>will</td>\n",
" <td>study</td>\n",
" <td>how</td>\n",
" <td>we</td>\n",
" <td>build</td>\n",
" <td>a</td>\n",
" <td>language</td>\n",
" <td>model</td>\n",
" <td>and</td>\n",
" <td>train</td>\n",
" <td>it</td>\n",
" <td>for</td>\n",
" <td>a</td>\n",
" <td>while</td>\n",
" <td>.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#hide\n",
"stream = \"In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\"\n",
"tokens = tfm(stream)\n",
"bs,seq_len = 6,15\n",
"d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])\n",
"df = pd.DataFrame(d_tokens)\n",
"display(HTML(df.to_html(index=False,header=None)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"TK: add title\" width=\"800\" caption=\"TK: add title\" id=\"TK: add it\" src=\"images/att_00071.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a perfect world, we could then give this one batch to our model. But that doesn't work, because this would very likely not fit in our GPU memory (here we have 90 tokens, but all the IMDb reviews together give several millions of tokens).\n",
"\n",
"So in fact we will need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains state in order so that it remembers what it read previously when predicting what comes next. \n",
"\n",
"Going back to our previous example with 6 batches of length 15, if we chose sequence length of 5, that would mean we first feed the following array:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <tbody>\n",
" <tr>\n",
" <td>xxbos</td>\n",
" <td>xxmaj</td>\n",
" <td>in</td>\n",
" <td>this</td>\n",
" <td>chapter</td>\n",
" </tr>\n",
" <tr>\n",
" <td>movie</td>\n",
" <td>reviews</td>\n",
" <td>we</td>\n",
" <td>studied</td>\n",
" <td>in</td>\n",
" </tr>\n",
" <tr>\n",
" <td>first</td>\n",
" <td>we</td>\n",
" <td>will</td>\n",
" <td>look</td>\n",
" <td>at</td>\n",
" </tr>\n",
" <tr>\n",
" <td>how</td>\n",
" <td>to</td>\n",
" <td>customize</td>\n",
" <td>it</td>\n",
" <td>.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>of</td>\n",
" <td>the</td>\n",
" <td>preprocessor</td>\n",
" <td>used</td>\n",
" <td>in</td>\n",
" </tr>\n",
" <tr>\n",
" <td>will</td>\n",
" <td>study</td>\n",
" <td>how</td>\n",
" <td>we</td>\n",
" <td>build</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#hide_input\n",
"bs,seq_len = 6,5\n",
"d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])\n",
"df = pd.DataFrame(d_tokens)\n",
"display(HTML(df.to_html(index=False,header=None)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <tbody>\n",
" <tr>\n",
" <td>,</td>\n",
" <td>we</td>\n",
" <td>will</td>\n",
" <td>go</td>\n",
" <td>back</td>\n",
" </tr>\n",
" <tr>\n",
" <td>chapter</td>\n",
" <td>1</td>\n",
" <td>and</td>\n",
" <td>dig</td>\n",
" <td>deeper</td>\n",
" </tr>\n",
" <tr>\n",
" <td>the</td>\n",
" <td>processing</td>\n",
" <td>steps</td>\n",
" <td>necessary</td>\n",
" <td>to</td>\n",
" </tr>\n",
" <tr>\n",
" <td>xxmaj</td>\n",
" <td>by</td>\n",
" <td>doing</td>\n",
" <td>this</td>\n",
" <td>,</td>\n",
" </tr>\n",
" <tr>\n",
" <td>the</td>\n",
" <td>data</td>\n",
" <td>block</td>\n",
" <td>xxup</td>\n",
" <td>api</td>\n",
" </tr>\n",
" <tr>\n",
" <td>a</td>\n",
" <td>language</td>\n",
" <td>model</td>\n",
" <td>and</td>\n",
" <td>train</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#hide_input\n",
"bs,seq_len = 6,5\n",
"d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])\n",
"df = pd.DataFrame(d_tokens)\n",
"display(HTML(df.to_html(index=False,header=None)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <tbody>\n",
" <tr>\n",
" <td>over</td>\n",
" <td>the</td>\n",
" <td>example</td>\n",
" <td>of</td>\n",
" <td>classifying</td>\n",
" </tr>\n",
" <tr>\n",
" <td>under</td>\n",
" <td>the</td>\n",
" <td>surface</td>\n",
" <td>.</td>\n",
" <td>xxmaj</td>\n",
" </tr>\n",
" <tr>\n",
" <td>convert</td>\n",
" <td>text</td>\n",
" <td>into</td>\n",
" <td>numbers</td>\n",
" <td>and</td>\n",
" </tr>\n",
" <tr>\n",
" <td>we</td>\n",
" <td>'ll</td>\n",
" <td>have</td>\n",
" <td>another</td>\n",
" <td>example</td>\n",
" </tr>\n",
" <tr>\n",
" <td>.</td>\n",
" <td>\\n</td>\n",
" <td>xxmaj</td>\n",
" <td>then</td>\n",
" <td>we</td>\n",
" </tr>\n",
" <tr>\n",
" <td>it</td>\n",
" <td>for</td>\n",
" <td>a</td>\n",
" <td>while</td>\n",
" <td>.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#hide_input\n",
"bs,seq_len = 6,5\n",
"d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])\n",
"df = pd.DataFrame(d_tokens)\n",
"display(HTML(df.to_html(index=False,header=None)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Going back to our dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order in which the inputs come, so at the beginnaing of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside, otherwise the text would not make sense anymore).\n",
"\n",
"We will then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...) because we want the model to read continuous rows of text (as in our example above). This is why each text has been added a `xxbos` token during preprocessing, so that the model knows when it reads the stream we are beginning a new entry.\n",
"\n",
"So to recap, at every epoch we shuffle our collection of documents to pick one docment, and then we transform that one into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
"\n",
"This is all done behind the scenes by the fastai library when we create a `LMDataLoader`. We can create one by first applying our `Numericalize` object to the tokenized texts:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"nums200 = toks200.map(num)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...and then passing that to `LMDataLoader`:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"dl = LMDataLoader(nums200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's confirm that this gives the expected results, by grabbing the first batch:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(torch.Size([64, 72]), torch.Size([64, 72]))"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x,y = first(dl)\n",
"x.shape,y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...and then looking at the first row of the independent variable, which should be the start of the first text:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"' '.join(num.vocab[o] for o in x[0][:20])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...and the first row of the dependent variable, which is the same thing offset by one token:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a couple'"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"' '.join(num.vocab[o] for o in y[0][:20])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training a text classifier"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"### Language model using DataBlock"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenize` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debugging--but you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.\n",
"\n",
"Here's how we use `TextBlock` to create a language model, using fastai's defaults:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])\n",
"\n",
"dls_lm = DataBlock(\n",
" blocks=TextBlock.from_folder(path, is_lm=True),\n",
" get_items=get_imdb, splitter=RandomSplitter(0.1)\n",
").dataloaders(path, path=path, bs=128, seq_len=80)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"One thing that's different to previous types used in `DataBlock` is that we're not just using the class directly (i.e. `TextBlock(...)`, but instead are calling a *class method*. A class method is a Python method which, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read every document and tokenize it to get the vocab); to be as efficient as possible fastai does things such as: \n",
"\n",
"- Save the tokenized documents in a temporary folder, so fastai doesn't have to tokenize more than once\n",
"- Runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs.\n",
"\n",
"Therefore we need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing--that's what `from_folder` does.\n",
"\n",
"`show_batch` then works in the usual way:"
]
},
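{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Here's that minimal sketch of a class method (a toy class, nothing to do with fastai):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"class Counter:\n",
"    def __init__(self, start): self.n = start\n",
"    @classmethod\n",
"    def from_zero(cls):\n",
"        # called on the class itself, e.g. Counter.from_zero(), not on an instance\n",
"        return cls(0)\n",
"\n",
"Counter.from_zero().n"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"`show_batch` then works in the usual way:"
]
},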
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>text_</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard</td>\n",
" <td>xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard xxunk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>what xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \\n\\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this</td>\n",
" <td>xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \\n\\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this is</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dls_lm.show_batch(max_n=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"### Fine tuning the language model"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"For converting the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modelling. Then those embeddings are fed in a *Recurrent Neural Network* (RNN), using an architecture called *AWD_LSTM* (we will show how to write such a model from scratch in <<chapter_nlp_dive>>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"learn = language_model_learner(\n",
" dls_lm, AWD_LSTM, drop_mult=0.3, \n",
" metrics=[accuracy, Perplexity()]).to_fp16()"
]
},
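{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"That merging amounts to something like the following sketch (toy names and sizes, not fastai's actual implementation): rows for words found in the pretrained vocab are copied over, and the remaining rows keep their random initialization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Sketch of merging pretrained embeddings with new random rows (toy example)\n",
"old_vocab = {'the':0, 'movie':1}\n",
"old_emb = torch.randn(2, 400)                   # pretend pretrained weights\n",
"new_vocab = ['the', 'movie', 'tarantino']\n",
"new_emb = torch.randn(len(new_vocab), 400)*0.1  # random init for every word\n",
"for i,w in enumerate(new_vocab):\n",
"    if w in old_vocab: new_emb[i] = old_emb[old_vocab[w]]"
]
},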
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"The loss function used by default is cross entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). A metric often used in NLP for language models is called *perplexity*. It is the exponential of the loss (i.e. `torch.exp(cross_entropy)`). We will also add accuracy, to see how many times our model is right when trying to predict the next word, since cross entropy (as we've seen) is both hard to interpret, and also tells you more about the model's confidence, rather than just its accuracy\n",
"\n",
"The grey first arrow in our overall picture has been done for us and made available as a pretrained model in fastai; we've now built the `DataLoaders` and `Learner` for the second stage, and we're ready to fine-tune it!"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"<img alt=\"Diagram of the ULMFiT process\" width=\"450\" src=\"images/att_00027.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll just use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (which is the only part of the model that contains randomly initialized weights--i.e. embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>perplexity</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>4.120048</td>\n",
" <td>3.912788</td>\n",
" <td>0.299565</td>\n",
" <td>50.038246</td>\n",
" <td>11:39</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(1, 2e-2)"
]
},
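{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"As a quick check of the relationship mentioned earlier, we can confirm that the reported perplexity is just the exponential of the validation loss:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# exp(3.912788) is roughly 50.04, matching the perplexity in the table above\n",
"torch.exp(tensor(3.912788))"
]
},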
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving and loading models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model takes a while to train, so it's a good opportunity to talk about saving intermediary results. You can easily save the state of your model like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn.save('1epoch')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It will create a file in `learn.path/models/` named \"1epoch.pth\". If you want to load your model in another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = learn.load('1epoch')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can them finetune the model after unfreezing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>perplexity</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>3.893486</td>\n",
" <td>3.772820</td>\n",
" <td>0.317104</td>\n",
" <td>43.502548</td>\n",
" <td>12:37</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>3.820479</td>\n",
" <td>3.717197</td>\n",
" <td>0.323790</td>\n",
" <td>41.148880</td>\n",
" <td>12:30</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>3.735622</td>\n",
" <td>3.659760</td>\n",
" <td>0.330321</td>\n",
" <td>38.851997</td>\n",
" <td>12:09</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>3.677086</td>\n",
" <td>3.624794</td>\n",
" <td>0.333960</td>\n",
" <td>37.516987</td>\n",
" <td>12:12</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>3.636646</td>\n",
" <td>3.601300</td>\n",
" <td>0.337017</td>\n",
" <td>36.645859</td>\n",
" <td>12:05</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>3.553636</td>\n",
" <td>3.584241</td>\n",
" <td>0.339355</td>\n",
" <td>36.026001</td>\n",
" <td>12:04</td>\n",
" </tr>\n",
" <tr>\n",
" <td>6</td>\n",
" <td>3.507634</td>\n",
" <td>3.571892</td>\n",
" <td>0.341353</td>\n",
" <td>35.583862</td>\n",
" <td>12:08</td>\n",
" </tr>\n",
" <tr>\n",
" <td>7</td>\n",
" <td>3.444101</td>\n",
" <td>3.565988</td>\n",
" <td>0.342194</td>\n",
" <td>35.374371</td>\n",
" <td>12:08</td>\n",
" </tr>\n",
" <tr>\n",
" <td>8</td>\n",
" <td>3.398597</td>\n",
" <td>3.566283</td>\n",
" <td>0.342647</td>\n",
" <td>35.384815</td>\n",
" <td>12:11</td>\n",
" </tr>\n",
" <tr>\n",
" <td>9</td>\n",
" <td>3.375563</td>\n",
" <td>3.568166</td>\n",
" <td>0.342528</td>\n",
" <td>35.451500</td>\n",
" <td>12:05</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.unfreeze()\n",
"learn.fit_one_cycle(10, 2e-3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not without the final layer is called the *encoder*. We can save it with `save_encoder`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn.save_encoder('finetuned')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Encoder: The model not including the task-specific final layer(s). It means much the same thing as *body* when applied to vision CNNs, but tends to be more used for NLP and generative models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This completes the second stage of the text classification process: fine-tuning the language model. We can now fine tune this language model using the IMDb sentiment labels."
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"### Text generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Before using this to fine-tune a classifier on the review, we can use our model to generate random reviews: since it's trained to guess what the next word of the sentence is, we can use it to write new reviews:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"TEXT = \"I liked this movie because\"\n",
"N_WORDS = 40\n",
"N_SENTENCES = 2\n",
"preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story\n",
"i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the \" evil \" machine has to be used to protect\n"
]
}
],
"source": [
"print(\"\\n\".join(preds))"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so you don't get exactly the same review twice. Our model doesn't have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalized properly (I is just transformed to i with our rules -- they require two characters or more to consider a word is capitalized -- so it's normal to see it lowercased), and is using consistent tense. The general review make sense at first glance, and it's only if you read carefully you can notice something is a bit off. Not bad for a model trained in a couple of hours! \n",
"\n",
"Our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that."
]
},
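  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Here is a minimal sketch of what sampling with a temperature can look like. This is not fastai's internal implementation; the scores and the tiny vocabulary are made up for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# Hypothetical next-word scores over a tiny vocabulary (illustration only)\n",
    "toy_vocab  = ['the', 'movie', 'was', 'great', 'terrible']\n",
    "toy_scores = torch.tensor([1.0, 2.0, 0.5, 3.0, 2.5])\n",
    "\n",
    "def sample_next_word(scores, temperature=0.75):\n",
    "    # Lower temperature sharpens the distribution (safer, more repetitive text);\n",
    "    # higher temperature flattens it (more surprising, less coherent text)\n",
    "    probs = torch.softmax(scores/temperature, dim=-1)\n",
    "    return toy_vocab[torch.multinomial(probs, 1).item()]\n",
    "\n",
    "sample_next_word(toy_scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that."
   ]
  },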
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating the classifier DataLoaders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now moving from language model fine tuning, to classifier fine tuning. To re-cap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label--in the case of IMDb, it's the sentiment of a document.\n",
"\n",
"This means that the structure of our `DataBlock` for NLP classification will look very familiar; it's actually nearly the same as we've seen for the many image classification datasets we've worked with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dls_clas = DataBlock(\n",
" blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),\n",
" get_y = parent_label,\n",
" get_items=partial(get_text_files, folders=['train', 'test']),\n",
" splitter=GrandparentSplitter(valid_name='test')\n",
").dataloaders(path, path=path, bs=128, seq_len=72)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like with image classification, `show_batch` shows the dependent variable (sentiment, in this case) with each independent variable (movie review text):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>xxbos i rate this movie with 3 skulls , only coz the girls knew how to scream , this could 've been a better movie , if actors were better , the twins were xxup ok , i believed they were evil , but the eldest and youngest brother , they sucked really bad , it seemed like they were reading the scripts instead of acting them … . spoiler : if they 're vampire 's why do they freeze the blood ? vampires ca n't drink frozen blood , the sister in the movie says let 's drink her while she is alive … .but then when they 're moving to another house , they take on a cooler they 're frozen blood . end of spoiler \\n\\n it was a huge waste of time , and that made me mad coz i read all the reviews of how</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>xxbos i have read all of the xxmaj love xxmaj come xxmaj softly books . xxmaj knowing full well that movies can not use all aspects of the book , but generally they at least have the main point of the book . i was highly disappointed in this movie . xxmaj the only thing that they have in this movie that is in the book is that xxmaj missy 's father comes to xxunk in the book both parents come ) . xxmaj that is all . xxmaj the story line was so twisted and far fetch and yes , sad , from the book , that i just could n't enjoy it . xxmaj even if i did n't read the book it was too sad . i do know that xxmaj pioneer life was rough , but the whole movie was a downer . xxmaj the rating</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>xxbos xxmaj this , for lack of a better term , movie is lousy . xxmaj where do i start … … \\n\\n xxmaj cinemaphotography - xxmaj this was , perhaps , the worst xxmaj i 've seen this year . xxmaj it looked like the camera was being tossed from camera man to camera man . xxmaj maybe they only had one camera . xxmaj it gives you the sensation of being a volleyball . \\n\\n xxmaj there are a bunch of scenes , haphazardly , thrown in with no continuity at all . xxmaj when they did the ' split screen ' , it was absurd . xxmaj everything was squished flat , it looked ridiculous . \\n\\n xxmaj the color tones were way off . xxmaj these people need to learn how to balance a camera . xxmaj this ' movie ' is poorly made , and</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dls_clas.show_batch(max_n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the `DataBlock` definition above, every piece is familiar from previous data blocks we've built, with two important exceptions:\n",
"\n",
"- `TextBlock.from_folder` no longer has the `is_lm=True` parameter, and\n",
"- We pass the `vocab` we created for the language model fine-tuning.\n",
"\n",
"The reason that we pass the vocab of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.\n",
"\n",
"By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a minibatch. Let's see with an example, by trying to create a minibatch containing the first 10 documents. First we'll numericalize them:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"nums_samp = toks200[:10].map(num)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now look at how many tokens each of these 10 movie reviews have:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(#10) [228,238,121,290,196,194,533,124,581,155]"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nums_samp.map(len)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and that a single tensor has a fixed shape (i.e. it has some particular length on every axis, and all items must be consistent). This should look a bit familiar: we had the same issue with images. In that case, we use cropping, padding, and/or squishing to make everything the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!) You can't really \"squish\" a document. So that leaves padding!\n",
"\n",
"We will expand the shortest texts to make them all the same size. To do this, we use a special token that will be ignored by our model. This is called *padding* (just like in vision). Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend of be of similar lengths. We won't make every batch, therefore, the same size, but will instead use the size of the largest document in each batch. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, although as we write these words, no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon however, so have a look on the book website, where we'll add information about this if and when it's working well.)\n",
"\n",
"The padding and sorting is automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)\n",
"\n",
"We can now create a model to classify our texts:"
]
},
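  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a minimal sketch of the sort-and-pad idea in plain PyTorch. This is not fastai's actual implementation; the token lists and the padding index are made up for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# Hypothetical numericalized documents of different lengths (illustration only)\n",
    "docs = [torch.tensor([4, 9, 2, 7]), torch.tensor([5, 3]), torch.tensor([8, 1, 6])]\n",
    "pad_idx = 0  # index of the special padding token\n",
    "\n",
    "# Sort by length, so documents collated together have similar sizes\n",
    "docs = sorted(docs, key=len, reverse=True)\n",
    "\n",
    "# Pad every document up to the length of the longest one in the batch\n",
    "max_len = max(len(d) for d in docs)\n",
    "batch = torch.stack([torch.cat([d, d.new_full((max_len-len(d),), pad_idx)])\n",
    "                     for d in docs])\n",
    "batch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now create a model to classify our texts:"
   ]
  },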
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = learn.load_encoder('finetuned')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine tuning the classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference."
]
},
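  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate what freezing means underneath, here is a rough sketch with a toy plain-PyTorch model. fastai manages this for us through parameter groups; this is not its internal implementation, and the toy groups are made up for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch.nn as nn\n",
    "\n",
    "# A toy model split into three parameter groups (illustration only)\n",
    "groups = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)]\n",
    "toy_model = nn.Sequential(*groups)\n",
    "\n",
    "def toy_freeze_to(n):\n",
    "    # Groups before index n are frozen; groups from n onward stay trainable\n",
    "    for g in groups[:n]:\n",
    "        for p in g.parameters(): p.requires_grad_(False)\n",
    "    for g in groups[n:]:\n",
    "        for p in g.parameters(): p.requires_grad_(True)\n",
    "\n",
    "toy_freeze_to(-2)  # only the last two groups will have their weights updated\n",
    "[p.requires_grad for p in toy_model.parameters()]"
   ]
  },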
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.347427</td>\n",
" <td>0.184480</td>\n",
" <td>0.929320</td>\n",
" <td>00:33</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(1, 2e-2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In just one epoch we get the same result as our training in <<chapter_intro>>, not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.247763</td>\n",
" <td>0.171683</td>\n",
" <td>0.934640</td>\n",
" <td>00:37</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.freeze_to(-2)\n",
"learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can unfreeze a bit more, and continue training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.193377</td>\n",
" <td>0.156696</td>\n",
" <td>0.941200</td>\n",
" <td>00:45</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.freeze_to(-3)\n",
"learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, the whole model!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>epoch</th>\n",
" <th>train_loss</th>\n",
" <th>valid_loss</th>\n",
" <th>accuracy</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.172888</td>\n",
" <td>0.153770</td>\n",
" <td>0.943120</td>\n",
" <td>01:01</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.161492</td>\n",
" <td>0.155567</td>\n",
" <td>0.942640</td>\n",
" <td>00:57</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.unfreeze()\n",
"learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We reach 94.3% accuracy, which was state-of-the-art just three years ago. By training a model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, fine-tuning a much bigger model and using expensive data augmentation (translating sentences in another language and back, using another model for translation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Disinformation and language models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analysed the comments that were sent to the FCC in the USA regarding a 2017 proposal to repeal net neutrality. In his article [More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6)\", he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Madlibs-style mail merge. Below, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image16.png\" width=\"700\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
"\n",
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create and compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image14.png\" id=\"ethics_reddit\" caption=\"An algorithm talking to itself on Reddit\" alt=\"An algorithm talking to itself on Reddit\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, the use of the algorithm is being done explicitly. But imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithms to gradually develop followings and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine it getting to a point where the vast majority of discourse online was from bots, and nobody would have any idea that it was happening.\n",
"\n",
"We are already starting to see examples of machine learning being used to generate identities. For example, here is the LinkedIn profile for Katie Jones:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image15.jpeg\" width=\"400\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see is auto generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Centre for Strategic and International Studies.\n",
"\n",
"Many people assume or hope that algorithms will come to our defence here. The hope is that we will develop classification algorithms which can automatically recognise auto generated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is self-supervised learning?\n",
"1. What is a language model?\n",
"1. Why is a language model considered self-supervised learning?\n",
"1. What are self-supervised models usually used for?\n",
"1. What do we fine-tune language models?\n",
"1. What are the three steps to create a state-of-the-art text classifier?\n",
"1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?\n",
"1. What are the three steps to prepare your data for a language model?\n",
"1. What is tokenization? Why do we need it?\n",
"1. Name three different approaches to tokenization.\n",
"1. What is 'xxbos'?\n",
"1. List 4 rules that fastai applies to text during tokenization.\n",
"1. Why are repeated characters replaced with a token showing the number of repetitions, and the character that's repeated?\n",
"1. What is numericalization?\n",
"1. Why might there be words that are replaced with the \"unknown word\" token?\n",
"1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)\n",
"1. Why do we need padding for text classification? Why don't we need it for language modeling?\n",
"1. What does an embedding matrix for NLP contain? What is its shape?\n",
"1. What is perplexity?\n",
"1. Why do we have to pass the vocabulary of the language model to the classifier data block?\n",
"1. What is gradual unfreezing?\n",
"1. Why is text generation always likely to be ahead of automatic identification of machine generated texts?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. See what you can learn about language models and disinformation. What are the best language models today? Have a look at some of their outputs. Do you find them convincing? How could a bad actor best use this to create conflict and uncertainty?\n",
"1. Given the limitation that models are unlikely to be able to consistently recognise machine generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leveraged deep learning?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Becoming a deep learning practitioner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations — you've completed all of the chapters in this book which cover the key practical parts of training and using deep learning! You know how to use all of fastai's built in applications, and how to customise them using the data blocks API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to help make sure your creations help improve society too.)\n",
"\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network application. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system which best handles these capabilities and limitations.\n",
"\n",
"In the rest of this book we will be pulling apart these applications, piece by piece, to understand all of the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is the knowledge which allows you to inspect and debug models that you build, and to create new applications which are customised for your particular projects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {
"height": "367.997px",
"width": "278.999px"
},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}