fastbook/clean/11_midlevel_data.ipynb

{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"! [ -e /content ] && pip install -Uqq fastai # upgrade fastai on colab\n",
"import fastbook\n",
"fastbook.setup_book()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"from fastbook import *\n",
"from IPython.display import display,HTML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Munging with fastai's Mid-Level API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Going Deeper into fastai's Layered API"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fastai.text.all import *\n",
"\n",
"dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = untar_data(URLs.IMDB)\n",
"dls = DataBlock(\n",
"    blocks=(TextBlock.from_folder(path),CategoryBlock),\n",
"    get_y = parent_label,\n",
"    get_items=partial(get_text_files, folders=['train', 'test']),\n",
"    splitter=GrandparentSplitter(valid_name='test')\n",
").dataloaders(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transforms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"files = get_text_files(path, folders = ['train', 'test'])\n",
"txts = L(o.open().read() for o in files[:2000])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tok = Tokenizer.from_folder(path)\n",
"tok.setup(txts)\n",
"toks = txts.map(tok)\n",
"toks[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num = Numericalize()\n",
"num.setup(toks)\n",
"nums = toks.map(num)\n",
"nums[0][:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nums_dec = num.decode(nums[0][:10]); nums_dec"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tok.decode(nums_dec)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tok((txts[0], txts[1]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Writing Your Own Transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def f(x:int): return x+1\n",
"tfm = Transform(f)\n",
"tfm(2),tfm(2.0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@Transform\n",
"def f(x:int): return x+1\n",
"f(2),f(2.0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class NormalizeMean(Transform):\n",
"    def setups(self, items): self.mean = sum(items)/len(items)\n",
"    def encodes(self, x): return x-self.mean\n",
"    def decodes(self, x): return x+self.mean"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tfm = NormalizeMean()\n",
"tfm.setup([1,2,3,4,5])\n",
"start = 2\n",
"y = tfm(start)\n",
"z = tfm.decode(y)\n",
"tfm.mean,y,z"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tfms = Pipeline([tok, num])\n",
"t = tfms(txts[0]); t[:20]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tfms.decode(t)[:100]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TfmdLists and Datasets: Transformed Collections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TfmdLists"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = tls[0]; t[:20]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tls.decode(t)[:100]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tls.show(t)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cut = int(len(files)*0.8)\n",
"splits = [list(range(cut)), list(range(cut,len(files)))]\n",
"tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize], \n",
"                splits=splits)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tls.valid[0][:20]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lbls = files.map(parent_label)\n",
"lbls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat = Categorize()\n",
"cat.setup(lbls)\n",
"cat.vocab, cat(lbls[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tls_y = TfmdLists(files, [parent_label, Categorize()])\n",
"tls_y[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_tfms = [Tokenizer.from_folder(path), Numericalize]\n",
"y_tfms = [parent_label, Categorize()]\n",
"dsets = Datasets(files, [x_tfms, y_tfms])\n",
"x,y = dsets[0]\n",
"x[:20],y"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_tfms = [Tokenizer.from_folder(path), Numericalize]\n",
"y_tfms = [parent_label, Categorize()]\n",
"dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)\n",
"x,y = dsets.valid[0]\n",
"x[:20],y"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = dsets.valid[0]\n",
"dsets.decode(t)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dls = dsets.dataloaders(bs=64, before_batch=pad_input)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]\n",
"files = get_text_files(path, folders = ['train', 'test'])\n",
"splits = GrandparentSplitter(valid_name='test')(files)\n",
"dsets = Datasets(files, tfms, splits=splits)\n",
"dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = untar_data(URLs.IMDB)\n",
"dls = DataBlock(\n",
"    blocks=(TextBlock.from_folder(path),CategoryBlock),\n",
"    get_y = parent_label,\n",
"    get_items=partial(get_text_files, folders=['train', 'test']),\n",
"    splitter=GrandparentSplitter(valid_name='test')\n",
").dataloaders(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying the Mid-Level Data API: SiamesePair"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fastai.vision.all import *\n",
"path = untar_data(URLs.PETS)\n",
"files = get_image_files(path/\"images\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class SiameseImage(fastuple):\n",
"    def show(self, ctx=None, **kwargs): \n",
"        img1,img2,same_breed = self\n",
"        if not isinstance(img1, Tensor):\n",
"            if img2.size != img1.size: img2 = img2.resize(img1.size)\n",
"            t1,t2 = tensor(img1),tensor(img2)\n",
"            t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)\n",
"        else: t1,t2 = img1,img2\n",
"        line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)\n",
"        return show_image(torch.cat([t1,line,t2], dim=2), \n",
"                          title=same_breed, ctx=ctx)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"img = PILImage.create(files[0])\n",
"s = SiameseImage(img, img, True)\n",
"s.show();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"img1 = PILImage.create(files[1])\n",
"s1 = SiameseImage(img, img1, False)\n",
"s1.show();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s2 = Resize(224)(s1)\n",
"s2.show();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def label_func(fname):\n",
"    return re.match(r'^(.*)_\\d+\\.jpg$', fname.name).groups()[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class SiameseTransform(Transform):\n",
"    def __init__(self, files, label_func, splits):\n",
"        self.labels = files.map(label_func).unique()\n",
"        self.lbl2files = {l: L(f for f in files if label_func(f) == l) \n",
"                          for l in self.labels}\n",
"        self.label_func = label_func\n",
"        self.valid = {f: self._draw(f) for f in files[splits[1]]}\n",
"        \n",
"    def encodes(self, f):\n",
"        f2,t = self.valid.get(f, self._draw(f))\n",
"        img1,img2 = PILImage.create(f),PILImage.create(f2)\n",
"        return SiameseImage(img1, img2, t)\n",
"    \n",
"    def _draw(self, f):\n",
"        same = random.random() < 0.5\n",
"        cls = self.label_func(f)\n",
"        if not same: \n",
"            cls = random.choice(L(l for l in self.labels if l != cls)) \n",
"        return random.choice(self.lbl2files[cls]),same"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"splits = RandomSplitter()(files)\n",
"tfm = SiameseTransform(files, label_func, splits)\n",
"tfm(files[0]).show();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tls = TfmdLists(files, tfm, splits=splits)\n",
"show_at(tls.valid, 0);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dls = tls.dataloaders(after_item=[Resize(224), ToTensor], \n",
"                      after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questionnaire"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Why do we say that fastai has a \"layered\" API? What does it mean?\n",
"1. Why does a `Transform` have a `decode` method? What does it do?\n",
"1. Why does a `Transform` have a `setup` method? What does it do?\n",
"1. How does a `Transform` work when called on a tuple?\n",
"1. Which methods do you need to implement when writing your own `Transform`?\n",
"1. Write a `Normalize` transform that fully normalizes items (subtract the mean and divide by the standard deviation of the dataset), and that can decode that behavior. Try not to peek!\n",
"1. Write a `Transform` that does the numericalization of tokenized texts (it should set its vocab automatically from the dataset seen and have a `decode` method). Look at the source code of fastai if you need help.\n",
"1. What is a `Pipeline`?\n",
"1. What is a `TfmdLists`? \n",
"1. What is a `Datasets`? How is it different from a `TfmdLists`?\n",
"1. Why are `TfmdLists` and `Datasets` named with an \"s\"?\n",
"1. How can you build a `DataLoaders` from a `TfmdLists` or a `Datasets`?\n",
"1. How do you pass `item_tfms` and `batch_tfms` when building a `DataLoaders` from a `TfmdLists` or a `Datasets`?\n",
"1. What do you need to do when you want to have your custom items work with methods like `show_batch` or `show_results`?\n",
"1. Why can we easily apply fastai data augmentation transforms to the `SiameseImage` we built?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further Research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Use the mid-level API to prepare the data in `DataLoaders` on your own datasets. Try this with the Pet dataset and the Adult dataset from Chapter 1.\n",
"1. Look at the Siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new types of items. Implement it in your own project."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding fastai's Applications: Wrap Up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations—you've completed all of the chapters in this book that cover the key practical parts of training models and using deep learning! You know how to use all of fastai's built-in applications, and how to customize them using the data block API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to make sure your creations help improve society too.)\n",
"\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network applications. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system that's well adapted to them.\n",
"\n",
"In the rest of this book we will be pulling apart those applications, piece by piece, to understand the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is what allows you to inspect and debug models that you build and create new applications that are customized for your particular projects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}