From 08746b427effd3eafc524d44eda544326e6de9d2 Mon Sep 17 00:00:00 2001
From: Joe Bender
Date: Fri, 11 Dec 2020 16:36:11 -0500
Subject: [PATCH] Tokenizer description fixes in 10_nlp.ipynb (#350)

* Fix description of tokenizing repeating chars

The description of the tokenization of "!!!!" was out of order of the actual functionality.

* Update "xxcap" to "xxup"

On line 387 it shows that "xxup" is used, not "xxcap"
---
 10_nlp.ipynb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/10_nlp.ipynb b/10_nlp.ipynb
index 1be5189..4bb34a0 100644
--- a/10_nlp.ipynb
+++ b/10_nlp.ipynb
@@ -309,7 +309,7 @@
  "\n",
  "These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized language—a language that is designed to be easy for a model to learn.\n",
  "\n",
- "For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
+ "For instance, the rules will replace a sequence of four exclamation points with a special *repeated character* token, followed by the number four, and then a single exclamation point. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
  "\n",
  "Here are some of the main special tokens you'll see:\n",
  "\n",
@@ -364,7 +364,7 @@
  "- `replace_wrep`:: Replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word\n",
  "- `spec_add_spaces`:: Adds spaces around / and #\n",
  "- `rm_useless_spaces`:: Removes all repetitions of the space character\n",
- "- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxcap`) in front of it\n",
+ "- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxup`) in front of it\n",
  "- `replace_maj`:: Lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it\n",
  "- `lowercase`:: Lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)"
  ]
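
Reviewer note: the two corrected descriptions can be sanity-checked with a small standalone sketch. This is not fastai's implementation. It is a simplified re-creation using only Python's `re` module, written to mirror the token order the patch describes (`xxrep`, then the count, then the character) and the `xxup` token name. The regular expressions, the three-or-more repetition threshold, and the `tokenize` helper are assumptions made for illustration.

```python
import re

# Token names as they appear in the patched notebook text.
TK_REP, TK_UP = "xxrep", "xxup"

# Assumption for this sketch: a non-space character repeated 3+ times counts
# as a repetition (mirrors the "three times or more" wording of replace_wrep).
_re_rep = re.compile(r"(\S)\1{2,}")

# Assumption for this sketch: an "all caps word" is 2+ ASCII uppercase letters.
_re_caps = re.compile(r"\b[A-Z]{2,}\b")


def replace_rep(text):
    """Replace e.g. '!!!!' with ' xxrep 4 ! ' (token, count, character)."""
    def _sub(match):
        char = match.group(1)
        count = len(match.group(0))
        return f" {TK_REP} {count} {char} "
    return _re_rep.sub(_sub, text)


def replace_all_caps(text):
    """Lowercase an all-caps word and put the xxup token in front of it."""
    return _re_caps.sub(lambda m: f"{TK_UP} {m.group(0).lower()}", text)


def tokenize(text):
    """Hypothetical helper: apply both rules, then split on whitespace."""
    return replace_all_caps(replace_rep(text)).split()


print(tokenize("WOW that was great!!!!"))
```

With this sketch, the four exclamation points come out as `xxrep`, `4`, `!` in that order, matching the wording on the `+` side of the first hunk, and `WOW` becomes `xxup` followed by `wow`, matching the second hunk.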