{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"#hide\n",
|
||
|
"from utils import *\n",
|
||
|
"from IPython.display import display,HTML"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Data munging with fastai"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"In the previous chapter, we showed on example texts what `Tokenizer` and `Numericalize` do to a collection of texts, to explain the usual preprocessing steps in NLP. We then switched to the data block API, which handles those transforms for us directly through `TextBlock`. But what if we want to apply only one of those transforms, either to see intermediate results or because we have already tokenized our texts? More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case?\n",
|
||
|
"\n",
|
||
"In this chapter, we will see how to use what we call the mid-level API for processing data. The data block API is built on top of that layer, so it will allow you to do everything the data block API does, and much, much more! After looking at all the pieces that compose it, using the same text preprocessing example as in the last chapter, we will show you how to prepare the data for a Siamese network, a model that takes two images as inputs and has to predict whether they are of the same class or not."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## The mid-level API for data collection"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"The fastai library is built on a layered API. At the very top, you have functions that allow you to train a model in five lines of code, as we saw in <<chapter_intro>>. For instance, to collect the data for a text classifier, we used this line:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from fastai2.text.all import *\n",
|
||
|
"\n",
|
||
|
"dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"The factory method `TextDataLoaders.from_folder` is very convenient when your data is arranged in exactly the same way as the IMDb dataset, but in practice that often won't be the case. The data block API is the high-level API for data collection and offers more flexibility. As we saw in the last chapter, we can get the same result with:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"path = untar_data(URLs.IMDB)\n",
|
||
|
"dls = DataBlock(\n",
|
||
|
" blocks=(TextBlock.from_folder(path),CategoryBlock),\n",
|
||
|
" get_y = parent_label,\n",
|
||
|
" get_items=partial(get_text_files, folders=['train', 'test']),\n",
|
||
|
" splitter=GrandparentSplitter(valid_name='test')\n",
|
||
|
").dataloaders(path)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"But it's sometimes not flexible enough. For debugging purposes, for instance, we might need to apply only some of the transforms that come with this data block. So let's dig into the pieces the data block API is built from."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"> note: The mid-level API in general does not only contain functionality for data collection. It also has the callback system that we will study in <<chapter_callbacks>>, which allows us to customize the training loop any way we like, and the general optimizer that we will cover in <<chapter_accel_sgd>>."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### Transforms and Pipelines"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"When we studied tokenization and numericalization in the last chapter, we started by grabbing a bunch of texts:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"files = get_text_files(path, folders = ['train', 'test'])\n",
|
||
|
"txts = L(o.open().read() for o in files[:2000])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"We then showed how to tokenize them with a `Tokenizer`:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(#374) ['xxbos','xxmaj','well',',','\"','cube','\"','(','1997',')'...]"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tok = Tokenizer.from_folder(path)\n",
|
||
|
"tok.setup(txts)\n",
|
||
|
"toks = txts.map(tok)\n",
|
||
|
"toks[0]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"And how to numericalize them, automatically creating the vocab for our corpus in the process:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"tensor([ 2, 8, 76, 10, 23, 3112, 23, 34, 3113, 33])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"num = Numericalize()\n",
|
||
|
"num.setup(toks)\n",
|
||
|
"nums = toks.map(num)\n",
|
||
|
"nums[0][:10]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"For each of those steps, we created an object (`tok` or `num`), called its setup method (which trains the tokenizer if needed for `tok` and creates the vocab for `num`), and then applied it to our raw texts. \n",
"\n",
"There is a general pattern here that applies to all data preprocessing, and we captured it in a class called `Transform`. Both `tok` and `num` are `Transform`s. In general, a transform is a function that is applied to your data *lazily*, that is, only when you ask for an item and not all at once (in computer vision, for instance, we don't want to load every image into memory up front), with an optional *setup* that initializes some inner state (like the vocab inside `num`).\n",
"\n",
"fastai's `Transform`s have two more pieces of functionality. The first is that they can optionally reverse their transformation with a *decode* method. This is what fastai uses internally for all the show methods we have seen. For instance, `num` has a decode method that will give us back the tokenized text: "
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo \\'s first movie , was one of the most interesti'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"num.decode(nums[0])[:100]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"On the other hand, when looking at our data we want to see the result of the tokenization, to make sure none of the rules damaged the texts, so `tok` does not have a decode method (in practice, it has one that does nothing). It's the same for the data augmentation transforms: since we want to show the effects on the images to make sure we didn't apply too much (or too little) augmentation, we don't decode those transforms. However, we do need to undo the effect of the `Normalize` transform we saw in <<chapter_sizing_and_tta>> to be able to plot the images, so that one has a decode method."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"The second special behavior of `Transform`s is that they are always applied over tuples: in general, our data is a tuple `(input,target)` (sometimes with more than one input or more than one target). When applying a transform such as `Resize` to an item like this, we don't want to resize the tuple as a whole, but rather resize the input (if applicable) and the target (if applicable). It's the same for the batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (in the same way) to the input and the target.\n",
|
||
|
"\n",
|
||
|
"We can see this behavior if we pass a tuple of texts to `tok`:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"((#374) ['xxbos','xxmaj','well',',','\"','cube','\"','(','1997',')'...],\n",
|
||
|
" (#207) ['xxbos','xxmaj','conrad','xxmaj','hall','went','out','with','a','bang'...])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tok((txts[0], txts[1]))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### Sidebar: Writing your own Transform"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"If you want to write a custom transform to apply to your data, the easiest way is to write a function:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def f(x): return x+1\n",
|
||
|
"tfm = Transform(f)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"`tfm` will automatically convert `f` to a `Transform` with no setup and no decode method. If you need either of those, you will need to subclass `Transform`. When writing this subclass, you implement the actual function in `encodes`, then (optionally) the setup behavior in `setups` and the decoding behavior in `decodes`:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"class MyTfm(Transform):\n",
|
||
|
" def setups(self, items): self.mean = sum(items)/len(items)\n",
|
||
|
" def encodes(self, x): return x+self.mean\n",
|
||
|
" def decodes(self, x): return x-self.mean"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"Here `MyTfm` initializes some state during setup (the mean of all the elements passed), and the transformation is then to add that mean. For decoding purposes, we implement the reverse of that transformation by subtracting the mean. Here is an example of `MyTfm` in action:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(3.0, 5.0, 2.0)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tfm = MyTfm()\n",
|
||
|
"tfm.setup([1,2,3,4,5])\n",
|
||
|
"start = 2\n",
|
||
|
"y = tfm(start)\n",
|
||
|
"z = tfm.decode(y)\n",
|
||
|
"tfm.mean,y,z"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"To learn more about `Transform`s and how you can use them to have different behavior depending on the type of the input, be sure to check our tutorial in the docs online."
|
||
|
]
|
||
|
},
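{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small taste of that type-dispatch behavior, here is a minimal sketch (`TypedTfm` is a made-up name for illustration, not a fastai class): the version of `encodes` that runs is picked based on the type annotation of its argument, and items of any other type pass through unchanged. If this exact pattern doesn't work with your version of the library, the online tutorial shows the supported ways of registering several `encodes`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class TypedTfm(Transform):\n",
"    def encodes(self, x:int): return x+1\n",
"    def encodes(self, x:str): return x + '!'\n",
"\n",
"tfm2 = TypedTfm()\n",
"tfm2(3), tfm2('hello'), tfm2(3.5)  # the float has no matching encodes, so it passes through unchanged"
]
},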
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### End sidebar"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"To compose several transforms together, fastai uses `Pipeline`. You define a `Pipeline` by passing it a list of `Transform`s; it then composes the transforms inside it, so when you call the `Pipeline` on an object, each transform is applied in order:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"tensor([ 2, 8, 76, 10, 23, 3112, 23, 34, 3113, 33, 10, 8, 4477, 22, 88, 32, 10, 27, 42, 14])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tfms = Pipeline([tok, num])\n",
|
||
|
"t = tfms(txts[0]); t[:20]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"And you can call decode on the result of your encoding, to get back something you can display and analyze:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo \\'s first movie , was one of the most interesti'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tfms.decode(t)[:100]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"The only part that doesn't work the same way as in `Transform` is the setup. To properly set up a `Pipeline` of `Transform`s on some data, you need to use a `TfmdLists`."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### TfmdLists and Datasets"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"Your data is usually a set of raw items (like filenames, or rows in a dataframe) to which you want to apply a succession of transformations. We just saw that such a succession of transformations is represented by a `Pipeline` in fastai. The class that groups this pipeline together with your raw items is called `TfmdLists`. Here is the short way of doing the transformation we saw in the previous section:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"At initialization, the `TfmdLists` will automatically call the setup method of each transform in order, providing each one not with the raw items but with the items transformed by all the previous `Transform`s. We can get the result of our pipeline on any raw element just by indexing into the `TfmdLists`:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"tensor([ 2, 8, 91, 11, 22, 5793, 22, 37, 4910, 34, 11, 8, 13042, 23, 107, 30, 11, 25, 44, 14])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"t = tls[0]; t[:20]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"And the `TfmdLists` knows how to decode for show purposes:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo \\'s first movie , was one of the most interesti'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tls.decode(t)[:100]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"In fact, it even has a `show` method:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"xxbos xxmaj well , \" cube \" ( 1997 ) , xxmaj vincenzo 's first movie , was one of the most interesting and tricky ideas that xxmaj i 've ever seen when talking about movies . xxmaj they had just one scenery , a bunch of actors and a plot . xxmaj so , what made it so special were all the effective direction , great dialogs and a bizarre condition that characters had to deal like rats in a labyrinth . xxmaj his second movie , \" cypher \" ( 2002 ) , was all about its story , but it was n't so good as \" cube \" but here are the characters being tested like rats again . \n",
|
||
|
"\n",
|
||
|
" \" nothing \" is something very interesting and gets xxmaj vincenzo coming back to his ' cube days ' , locking the characters once again in a very different space with no time once more playing with the characters like playing with rats in an experience room . xxmaj but instead of a thriller sci - fi ( even some of the promotional teasers and trailers erroneous seemed like that ) , \" nothing \" is a loose and light comedy that for sure can be called a modern satire about our society and also about the intolerant world we 're living . xxmaj once again xxmaj xxunk amaze us with a great idea into a so small kind of thing . 2 actors and a blinding white scenario , that 's all you got most part of time and you do n't need more than that . xxmaj while \" cube \" is a claustrophobic experience and \" cypher \" confusing , \" nothing \" is completely the opposite but at the same time also desperate . \n",
|
||
|
"\n",
|
||
|
" xxmaj this movie proves once again that a smart idea means much more than just a millionaire budget . xxmaj of course that the movie fails sometimes , but its prime idea means a lot and offsets any flaws . xxmaj there 's nothing more to be said about this movie because everything is a brilliant surprise and a totally different experience that i had in movies since \" cube \" .\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tls.show(t)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"The `TfmdLists` is named with an \"s\" because it can handle a training and a validation set with a `splits` argument. You just need to pass the indices of the elements that are in the training set and of those that are in the validation set:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"cut = int(len(files)*0.8)\n",
|
||
|
"splits = [list(range(cut)), list(range(cut,len(files)))]\n",
|
||
|
"tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize], splits=splits)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"You can then access them through the `train` and `valid` attributes:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"tensor([ 2, 8, 20, 30, 87, 510, 1570, 12, 408, 379, 4196, 10, 8, 20, 30, 16, 13, 12216, 202, 509])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tls.valid[0][:20]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"If you have manually written a `Transform` that returns your whole data (input and target) from the raw items, then `TfmdLists` is the class you need. You can directly convert it to a `DataLoaders` object with the `dataloaders` method. This is what we will do in our Siamese example later in this chapter.\n",
|
||
|
"\n",
|
||
|
"In general though, you have two (or more) parallel pipelines of transforms: one for processing your raw items into inputs and one to process your raw items into targets. For instance, here, the pipeline we defined only processes the input. If we want to do text classification, we have to process the labels as well. \n",
|
||
|
"\n",
|
||
"Here we need to do two things: first, take the label name from the parent folder, which we can do with the `parent_label` function:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(#50000) ['pos','pos','pos','pos','pos','pos','pos','pos','pos','pos'...]"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"lbls = files.map(parent_label)\n",
|
||
|
"lbls"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"Then we need a `Transform` that will grab the unique items and build a vocab with them during setup, then transform the string labels into integers when called. fastai provides this transform; it's called `Categorize`:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"((#2) ['neg','pos'], TensorCategory(1))"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"cat = Categorize()\n",
|
||
|
"cat.setup(lbls)\n",
|
||
|
"cat.vocab, cat(lbls[0])"
|
||
|
]
|
||
|
},
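{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `Categorize` is a regular `Transform`, it can also reverse itself with `decode`. As a quick sanity check (not part of the original pipeline), we can round-trip a label:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat.decode(cat(lbls[0]))"
]
},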
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"To do the whole setup automatically on our list of files, we can create a `TfmdLists` as before:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"TensorCategory(1)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tls_y = TfmdLists(files, [parent_label, Categorize()])\n",
|
||
|
"tls_y[0]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"But then we end up with two separate objects for our inputs and targets, which is not what we want. This is where `Datasets` comes to the rescue. `Datasets` will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the results. Like `TfmdLists`, it will automatically do the setup for us, and when we index into a `Datasets`, it returns a tuple with the result of each pipeline:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"x_tfms = [Tokenizer.from_folder(path), Numericalize]\n",
|
||
|
"y_tfms = [parent_label, Categorize()]\n",
|
||
|
"dsets = Datasets(files, [x_tfms, y_tfms])\n",
|
||
|
"x,y = dsets[0]\n",
|
||
|
"x[:20],y"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Like a `TfmdLists`, we can pass along `splits` to a `Datasets` to split our data between training and validation:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(tensor([ 2, 8, 20, 30, 87, 510, 1570, 12, 408, 379, 4196, 10, 8, 20, 30, 16, 13, 12216, 202, 509]),\n",
|
||
|
" TensorCategory(0))"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"x_tfms = [Tokenizer.from_folder(path), Numericalize]\n",
|
||
|
"y_tfms = [parent_label, Categorize()]\n",
|
||
|
"dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)\n",
|
||
|
"x,y = dsets.valid[0]\n",
|
||
|
"x[:20],y"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"It can also decode any processed tuple or show it directly:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"('xxbos xxmaj this movie had horrible lighting and terrible camera movements . xxmaj this movie is a jumpy horror flick with no meaning at all . xxmaj the slashes are totally fake looking . xxmaj it looks like some 17 year - old idiot wrote this movie and a 10 year old kid shot it . xxmaj with the worst acting you can ever find . xxmaj people are tired of knives . xxmaj at least move on to guns or fire . xxmaj it has almost exact lines from \" when a xxmaj stranger xxmaj calls \" . xxmaj with gruesome killings , only crazy people would enjoy this movie . xxmaj it is obvious the writer does n\\'t have kids or even care for them . i mean at show some mercy . xxmaj just to sum it up , this movie is a \" b \" movie and it sucked . xxmaj just for your own sake , do n\\'t even think about wasting your time watching this crappy movie .',\n",
|
||
|
" 'neg')"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"t = dsets.valid[0]\n",
|
||
|
"dsets.decode(t)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"The last step is to convert your `Datasets` object to a `DataLoaders`, which can be done with the `dataloaders` method. Here we need to pass along special arguments to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to `before_batch`: "
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"dls = dsets.dataloaders(bs=64, before_batch=pad_input)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"`dataloaders` directly calls `DataLoader` on each subset of our `Datasets`. fastai's `DataLoader` expands the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has a lot of points of customization, but the most important ones you should know are the following (a short sketch combining them comes right after the list):\n",
|
||
|
"\n",
|
||
|
"- `after_item`: applied on each item after grabbing it inside the dataset. This is the equivalent of the `item_tfms` in `DataBlock`.\n",
|
||
|
"- `before_batch`: applied on the list of items before they are collated. This is the ideal place to pad items to the same size.\n",
|
||
|
"- `after_batch`: applied on the batch as a whole after its construction. This is the equivalent of the `batch_tfms` in `DataBlock`."
|
||
|
]
|
||
|
},
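{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch showing where each of those three hooks plugs in on our `dsets`; the `noop`s are only placeholders to mark the slots, since `pad_input` is the only one doing real work for this text example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dls = dsets.dataloaders(\n",
"    bs=64,\n",
"    after_item=noop,         # runs on each item, like item_tfms in DataBlock\n",
"    before_batch=pad_input,  # runs on the list of items, the right place to pad\n",
"    after_batch=noop,        # runs on the collated batch, like batch_tfms\n",
")"
]
},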
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"As a conclusion, here is the full code necessary to prepare the data for text classification:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]\n",
|
||
|
"files = get_text_files(path, folders = ['train', 'test'])\n",
|
||
|
"splits = GrandparentSplitter(valid_name='test')(files)\n",
|
||
|
"dsets = Datasets(files, tfms, splits=splits)\n",
|
||
|
"dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"The two differences from what we had before are the use of `GrandparentSplitter` to split our training and validation data, and the `dl_type` argument. The latter tells `dataloaders` to use the `SortedDL` class of `DataLoader` rather than the usual one: `SortedDL` constructs batches by putting samples of roughly the same length together.\n",
|
||
|
"\n",
|
||
|
"This does the exact same thing as our `DataBlock` from above:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"path = untar_data(URLs.IMDB)\n",
|
||
|
"dls = DataBlock(\n",
|
||
|
" blocks=(TextBlock.from_folder(path),CategoryBlock),\n",
|
||
|
" get_y = parent_label,\n",
|
||
|
" get_items=partial(get_text_files, folders=['train', 'test']),\n",
|
||
|
" splitter=GrandparentSplitter(valid_name='test')\n",
|
||
|
").dataloaders(path)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"...except that now, you know how to customize every single piece of it!"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"Let's now practice what we just learned about this mid-level API for data preprocessing on a computer vision example, with the input pipeline for a Siamese model."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Siamese model"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"A Siamese model takes two images and has to determine whether they are of the same class or not. For this example, we will use the Pet dataset again and prepare the data for a model that will have to predict whether two images of pets are of the same breed. TK see if we train that model later in the book."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from fastai2.vision.all import *\n",
|
||
|
"path = untar_data(URLs.PETS)\n",
|
||
|
"files = get_image_files(path/\"images\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"class SiameseImage(Tuple):\n",
|
||
|
" def show(self, ctx=None, **kwargs): \n",
|
||
|
" img1,img2,same_breed = self\n",
|
||
|
" dim = 2 if isinstance(img1, Tensor) else 1\n",
|
||
|
" return show_image(torch.cat([tensor(img1),tensor(img2)], dim=dim), \n",
|
||
|
" title=same_breed, ctx=ctx)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"<matplotlib.axes._subplots.AxesSubplot at 0x7f966fc3c750>"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAASUAAAB8CAYAAAAvkab8AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOy9a7Bt2VXf9xtjzrX23udxX31vP24/1C2puyVZAgkhsEBEDsK8MRhQTGEwJk4c48qDqlRwKkBQQtlViSuVD64kxuUPKapiXE5SFWJXFAoIAov3Q+j9QK1+t7r73tv3ntd+rDXnGPkw5trnilitiiDQxGdU3b639zln77XmGnOM//iP/5hH3J0zO7MzO7NXiumf9gWc2Zmd2ZndbmdB6czO7MxeUXYWlM7szM7sFWVnQenMzuzMXlF2FpTO7MzO7BVlZ0HpzM7szF5RdhaUzuzMzuwVZWdB6cwAEBH/An+e+NO+xjP718Pyn/YFnNkrxu657d9fAfxs+/vp9lr9V/2QiPTuPvx/fG1n9q+RnSGlMwPA3Z+f/gAvtZev3fb6NQAReV5EfkJE/pGIvAT8oojMG5r67tvfU0TeLyL/8Lb/70Xk74rIkyKyEpGPiMgP/ond5Jn9mbAzpHRmX4z9x8B/BXwl/+986KeBR4B/G/gM8Hbgp0RkcPf/6Y/9Ks/sz6SdBaUz+2LsX7r7353+R0TmX+gHROR1wF8BXu3uj7eXHxeRNwL/AXAWlM4MOAtKZ/bF2W99ET/ztvb3h0Xk9tczcPJHvqIz+/+NnQWlM/ti7A8HEWt/yx96vbvt3wo4EZzGz/PzZ3ZmZ0HpzP7o5u6DiBwAV6fXRGQHeBT4QHvpd4igda+7/8Kf/FWe2Z8VOwtKZ/bHZb8A/Psi8hvAEvgJbuvuuvtHReSfAP+jiPwI8JvAPvDlwHl3/2/+FK75zF6BdiYJOLM/Lvth4NNEcPoXwHuBD/2h7/kB4H8A3gN8HPh54K8Cj/2JXeWZveJNzk6ePLMzO7NXkp0hpTM7szN7RdlZUDqzMzuzV5SdBaUzO7Mze0XZWVA6szM7s1eUnQWlMzuzM3tF2cvqlHZl4TWvmUtGspOSAIr0wMaopmw8IVpQT3TiCEYSxV2wDNkLJhkRwd1xd4SEophVqoKYkxSoStUBtYyqYgLqBek6cEVsxBAMJ7lgotSxoDlhZqQsqBmGbD9vMqtNbqweP0/G1VBPOJWcla5LRJw23IWUEmjCyggqeDVUQVUZx8p81qFZweOzzCyu2wzNiW6xQz/fY7aYM5vNOH/hIjkpVRJ7+/u4CmKCacHGQi2ODRu6nTlSjOLG+mRJSh1936NdjntqYxrDuAYTVEFMWB6fsHPuPK4FzKnVKaUwrJaIJHZ2dig4fdexXq9RN1wTqoqXynoo0HWwWjKOFc1tDb2yHEZsLBwcHJC6BWk2Z33zGn7jeXKfePZGwcRwHelzT9aKiEBSdDSqOKua2nOp7KSMM6IkkihFFaUgIiiJUh3R8JfkjingHcUrGUc81r1qgSrknCE57pXOOkyMKpVworgPMaW64e6oKklBHMwF0YohYPG1MhqdQEHQBGAIXXifwKxv+VwSIUhXVBVQilQ6lKGMzGcZqwJemc1muApWKi5sPwsgLxZ0/ZxutqDvey7feQVyj1lh0c/QxYKcM2aGlYEyGm4FT5lM+HkdRkopdN0MV2c+32G9XtMtdijrE8qmIIuMFGNYDYgIi/09qoXAvqxGxnFDKYWce/q+J/UddRgZ6oC6xvNUpWwGqmTMCmW9QklIgjJUqhvDsOb46AA8412HzuesPvEhLMX+eeza8R9W/2/tZZGSzwtd7ti4sCnOcgPr0VgdOcvR2Vih1DVmzsYGhuKMLpwUZ4mhrphkMMdKRU1RydRacQqkSp4GDFyxVBHLFI15BHfHU8JKRWxk4y0giWI41IJnxashItjomGpsJPPYANVwExCjaHyOtFjlBUSd5hckV6oXigiWBBfCCVow7rIiCkkys/40nkewFVQEASQLbpUyVKwMqDh9PwM3nAgi63HgeH2M5QyDsKkgObG7u0uSTMUBJQF9nxmGNSerYzxlFvM5pJ4yjHQuZI+NLSJsVis6ncfGq4JaJmtHXR/z/FOfpjis12uqGZhg7kjbTOuTA4aTFWM1HEVniapGl3ehglejLNcMyxWp69Cy3s6HeC5odlwyYzFWg7AahPXaWFZYj7GWtY6MxVgXY1UTqwLHY6FUwy0hHkFdMGpxcKXU8F9xJ7tQWtIpqaKekAylFNwEkUSR2BhuiZTS9nmrRaDU5iOG45oQwF3CT4Dq8cyLKq5xHYbgVNAIcmLxpi5QVLcJrEpsOrdEl3rEjZQiaLo4VGuf53SioAnHKJs1wzAwlg3agk+tlVQNT5my3rAaNyjGUJSchH62oJ9NiTQ+N+cMGCfrI1bjQLfYQZKSTBFzeulIJBTh+OQALNGnBVIU9RlZOxLw4jOf5uD4gHGorMu4PU1LRBBxzODGtRfxCnWM/ec6xzOIWwSpAsdHNxmHyp5nzEfEnOELyJBeVqc067OrxwKaEA89gZrgfwiJOJWUM9qyeCmFLLFIxQtUY9b3zFUoRKAQF6o66qAI1ePmKk5qY1RZFDJI9W2QEAkHcipObMbJ8WqtqCraCVKd6gaueAYtMKjToTgVIZFTBImUpoyXUY0MmZvz1lpRhNwnpF1jBKEIfJITXo1a42GZGeO4od9dcOH8Jfb29pidO8/uufN4GXnp+WdIB9eZqXM8P8+lVz1M3/fklOj6ni71bMY1w/KEW7duMVz7LDvrm4CxkQ45d5l8/jJpNme+WJBzj9WRkxc+y+qpTzGcv8KFex5gZ+8CYCyPDykvPIUdXudkVVhJ4c43vZ1OlINbL7G7c57l+oTxxedYHr3E7M57yfsXqJuCpcrq2kucHN1kfv4KHFxHzVglxUenp5KS8uTBCgBzJzm4nm7yrWkgEdEIhNKcvNaKmTHLHdplymaNamaRFFHHXFAUd8OIn1cLJN0+NAJF84NINEqxeG4QSBkJXzA8vi9FQtOcUDfEnCKB5GmBA8A1nvnkLyIN2UsgPzTQnYgwWEXGisyFzgUkrgsgqWAikVRroVQFHxlGx6xw8c7L7Cz22L14kb0LF7brcvj0E+zUY7wI3PMQ+5ev0HUdSZXcLTAbcDOObt3k4Prz6MFNFnZCIbHJc/TiPVy8dBdFjXnu0NyzOrzJ8VOfYCxGPXeFi3deZWdnhzqOHDz/NPObzzEcD1yzDXrxMnc/+AaWx4d4NVKXOT64SX3xGY6HgSuPfAXr1U2SwrqMHDz9FHmxj+SOtDrAaqGmBayOyKqszXn61vLzIqWXDUp9l1w9ELAQJRJiJBL1toMIVfJ2kwOkHA6pKaBxr1FGuCZ2Zx2UEec2hJME9QalcRDDULJEUMgCnoTkgV68GmUKZLVu4XmRCDjK6WebR2lZ1eirUjR+zs0Q1YDm6mQVVHMEGhJFlTIMjNVhNNI8sdPPQKNEyKKoCifLk
flOR+4SRy8tGYYBb87Y7864fOcVFnu7XLh0hXPnzjEOA/WJj7OXYoON6xUvWc/i4S9j99yC8XiNiHCyvMX1zzzOostclCUjwrqOyOAMQ+UWlcuPvJmDp59htMrdr341drLkpWc+xeZ44Ggs3Pnga7l019288NRnWN94gWqGO3jK0M9ZjQN+csxs9xzrUujqgDiUfoaIUldHUBPdPNNZ4nhYs7u3oKuFmjpSrWQpJO144mAT5RYwRYsoNSMQiDouhpeA/yKRhLImRKMkFVe6JNQ6UjzKpHPdDCcQUfXAZQmhKHQuUaLV8KHs8bmaItiZ2TbZWY3XXG5Lsg5ZhCxQxakFcoqva/N3b4FTAERIOYLSaSKLexmrUcfCYI5qYm8e6BNNZDFUhbHEnum6js2msl6uGMZIoqLOxSuX2Tu3z/mLl7lwxyUkJ248+xT3rQ8pmzXMEscHa06uvoZ7HniIzbhmM6xQ73jhsT9gc3yTO7rCrOs5Wa9apaBcPzlk9+ojpNTx/LXPcscdlzl/6Q6OHv8
|
||
|
"text/plain": [
|
||
|
"<Figure size 360x360 with 1 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {
|
||
|
"needs_background": "light"
|
||
|
},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"img = PILImage.create(files[0])\n",
|
||
|
"s = SiameseImage(img, img, True)\n",
|
||
|
"s.show()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"tst = ToTensor()(s)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"<matplotlib.axes._subplots.AxesSubplot at 0x7f966cc5add0>"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAASUAAAB8CAYAAAAvkab8AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOy9a7Bt2VXf9xtjzrX23udxX31vP24/1C2puyVZAgkhsEBEDsK8MRhQTGEwJk4c48qDqlRwKkBQQtlViSuVD64kxuUPKapiXE5SFWJXFAoIAov3Q+j9QK1+t7r73tv3ntd+rDXnGPkw5trnilitiiDQxGdU3b639zln77XmGnOM//iP/5hH3J0zO7MzO7NXiumf9gWc2Zmd2ZndbmdB6czO7MxeUXYWlM7szM7sFWVnQenMzuzMXlF2FpTO7MzO7BVlZ0HpzM7szF5RdhaUzuzMzuwVZWdB6cwAEBH/An+e+NO+xjP718Pyn/YFnNkrxu657d9fAfxs+/vp9lr9V/2QiPTuPvx/fG1n9q+RnSGlMwPA3Z+f/gAvtZev3fb6NQAReV5EfkJE/pGIvAT8oojMG5r67tvfU0TeLyL/8Lb/70Xk74rIkyKyEpGPiMgP/ond5Jn9mbAzpHRmX4z9x8B/BXwl/+986KeBR4B/G/gM8Hbgp0RkcPf/6Y/9Ks/sz6SdBaUz+2LsX7r7353+R0TmX+gHROR1wF8BXu3uj7eXHxeRNwL/AXAWlM4MOAtKZ/bF2W99ET/ztvb3h0Xk9tczcPJHvqIz+/+NnQWlM/ti7A8HEWt/yx96vbvt3wo4EZzGz/PzZ3ZmZ0HpzP7o5u6DiBwAV6fXRGQHeBT4QHvpd4igda+7/8Kf/FWe2Z8VOwtKZ/bHZb8A/Psi8hvAEvgJbuvuuvtHReSfAP+jiPwI8JvAPvDlwHl3/2/+FK75zF6BdiYJOLM/Lvth4NNEcPoXwHuBD/2h7/kB4H8A3gN8HPh54K8Cj/2JXeWZveJNzk6ePLMzO7NXkp0hpTM7szN7RdlZUDqzMzuzV5SdBaUzO7Mze0XZWVA6szM7s1eUnQWlMzuzM3tF2cvqlHZl4TWvmUtGspOSAIr0wMaopmw8IVpQT3TiCEYSxV2wDNkLJhkRwd1xd4SEophVqoKYkxSoStUBtYyqYgLqBek6cEVsxBAMJ7lgotSxoDlhZqQsqBmGbD9vMqtNbqweP0/G1VBPOJWcla5LRJw23IWUEmjCyggqeDVUQVUZx8p81qFZweOzzCyu2wzNiW6xQz/fY7aYM5vNOH/hIjkpVRJ7+/u4CmKCacHGQi2ODRu6nTlSjOLG+mRJSh1936NdjntqYxrDuAYTVEFMWB6fsHPuPK4FzKnVKaUwrJaIJHZ2dig4fdexXq9RN1wTqoqXynoo0HWwWjKOFc1tDb2yHEZsLBwcHJC6BWk2Z33zGn7jeXKfePZGwcRwHelzT9aKiEBSdDSqOKua2nOp7KSMM6IkkihFFaUgIiiJUh3R8JfkjingHcUrGUc81r1qgSrknCE57pXOOkyMKpVworgPMaW64e6oKklBHMwF0YohYPG1MhqdQEHQBGAIXXifwKxv+VwSIUhXVBVQilQ6lKGMzGcZqwJemc1muApWKi5sPwsgLxZ0/ZxutqDvey7feQVyj1lh0c/QxYKcM2aGlYEyGm4FT5lM+HkdRkopdN0MV2c+32G9XtMtdijrE8qmIIuMFGNYDYgIi/09qoXAvqxGxnFDKYWce/q+J/UddRgZ6oC6xvNUpWwGqmTMCmW9QklIgjJUqhvDsOb46AA8412HzuesPvEhLMX+eeza8R9W/2/tZZGSzwtd7ti4sCnOcgPr0VgdOcvR2Vih1DVmzsYGhuKMLpwUZ4mhrphkMMdKRU1RydRacQqkSp4GDFyxVBHLFI15BHfHU8JKRWxk4y0giWI41IJnxashItjomGpsJPPYANVwExCjaHyOtFjlBUSd5hckV6oXigiWBBfCCVow7rIiCkkys/40nkewFVQEASQLbpUyVKwMqDh9PwM3nAgi63HgeH2M5QyDsKkgObG7u0uSTMUBJQF9nxmGNSerYzxlFvM5pJ4yjHQuZI+NLSJsVis6ncfGq4JaJmtHXR/z/FOfpjis12uqGZhg7kjbTOuTA4aTFWM1HEVniapGl3ehglejLNcMyxWp69Cy3s6HeC5odlwyYzFWg7AahPXaWFZYj7GWtY6MxVgXY1UTqwLHY6FUwy0hHkFdMGpxcKXU8F9xJ7tQWtIpqaKekAylFNwEkUSR2BhuiZTS9nmrRaDU5iOG45oQwF3CT4Dq8cyLKq5xHYbgVNAIcmLxpi5QVLcJrEpsOrdEl3rEjZQiaLo4VGuf53SioAnHKJs1wzAwlg3agk+tlVQNT5my3rAaNyjGUJSchH62oJ9NiTQ+N+cMGCfrI1bjQLfYQZKSTBFzeulIJBTh+OQALNGnBVIU9RlZOxLw4jOf5uD4gHGorMu4PU1LRBBxzODGtRfxCnWM/ec6xzOIWwSpAsdHNxmHyp5nzEfEnOELyJBeVqc067OrxwKaEA89gZrgfwiJOJWUM9qyeCmFLLFIxQtUY9b3zFUoRKAQF6o66qAI1ePmKk5qY1RZFDJI9W2QEAkHcipObMbJ8WqtqCraCVKd6gaueAYtMKjToTgVIZFTBImUpoyXUY0MmZvz1lpRhNwnpF1jBKEIfJITXo1a42GZGeO4od9dcOH8Jfb29pidO8/uufN4GXnp+WdIB9eZqXM8P8+lVz1M3/fklOj6ni71bMY1w/KEW7duMVz7LDvrm4CxkQ45d5l8/jJpNme+WJBzj9WRkxc+y+qpTzGcv8KFex5gZ+8CYCyPDykvPIUdXudkVVhJ4c43vZ1OlINbL7G7c57l+oTxxedYHr3E7M57yfsXqJuCpcrq2kucHN1kfv4KHFxHzVglxUenp5KS8uTBCgBzJzm4nm7yrWkgEdEIhNKcvNaKmTHLHdplymaNamaRFFHHXFAUd8OIn1cLJN0+NAJF84NINEqxeG4QSBkJXzA8vi9FQtOcUDfEnCKB5GmBA8A1nvnkLyIN2UsgPzTQnYgwWEXGisyFzgUkrgsgqWAikVRroVQFHxlGx6xw8c7L7Cz22L14kb0LF7brcvj0E+zUY7wI3PMQ+5ev0HUdSZXcLTAbcDOObt3k4Prz6MFNFnZCIbHJc/TiPVy8dBdFjXnu0NyzOrzJ8VOfYCxGPXeFi3deZWdnhzqOHDz/NPObzzEcD1yzDXrxMnc/+AaWx4d4NVKXOT64SX3xGY6HgSuPfAXr1U2SwrqMHDz9FHmxj+SOtDrAaqGmBayOyKqszXn61vLzIqWXDUp9l1w9ELAQJRJiJBL1toMIVfJ2kwOkHA6pKaBxr1FGuCZ2Zx2UEec2hJME9QalcRDDULJEUMgCnoTkgV68GmUKZLVu4XmRCDjK6WebR2lZ1eirUjR+zs0Q1YDm6mQVVHMEGhJFlTIMjNVhNNI8sdPPQKNEyKKoCifLk
flOR+4SRy8tGYYBb87Y7864fOcVFnu7XLh0hXPnzjEOA/WJj7OXYoON6xUvWc/i4S9j99yC8XiNiHCyvMX1zzzOostclCUjwrqOyOAMQ+UWlcuPvJmDp59htMrdr341drLkpWc+xeZ44Ggs3Pnga7l019288NRnWN94gWqGO3jK0M9ZjQN+csxs9xzrUujqgDiUfoaIUldHUBPdPNNZ4nhYs7u3oKuFmjpSrWQpJO144mAT5RYwRYsoNSMQiDouhpeA/yKRhLImRKMkFVe6JNQ6UjzKpHPdDCcQUfXAZQmhKHQuUaLV8KHs8bmaItiZ2TbZWY3XXG5Lsg5ZhCxQxakFcoqva/N3b4FTAERIOYLSaSKLexmrUcfCYI5qYm8e6BNNZDFUhbHEnum6js2msl6uGMZIoqLOxSuX2Tu3z/mLl7lwxyUkJ248+xT3rQ8pmzXMEscHa06uvoZ7HniIzbhmM6xQ73jhsT9gc3yTO7rCrOs5Wa9apaBcPzlk9+ojpNTx/LXPcscdlzl/6Q6OHv8
|
||
|
"text/plain": [
|
||
|
"<Figure size 360x360 with 1 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {
|
||
|
"needs_background": "light"
|
||
|
},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tst.show()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
"Finally, here is everything wrapped up in one transform that, for a given image, picks a second image at random (drawing from the same breed half of the time) and builds the label on the fly:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"class SiamesePair(Transform):\n",
|
||
|
" def __init__(self,items,labels):\n",
|
||
|
" self.items,self.labels,self.assoc = items,labels,self\n",
|
||
|
" sortlbl = sorted(enumerate(labels), key=itemgetter(1))\n",
|
||
|
" # dict of (each unique label) -- (list of indices with that label)\n",
|
||
|
" self.clsmap = {k:L(v).itemgot(0) for k,v in itertools.groupby(sortlbl, key=itemgetter(1))}\n",
|
||
|
" self.idxs = range_of(self.items)\n",
|
||
|
" \n",
|
||
|
" def encodes(self,i):\n",
|
||
|
" \"x: tuple of `i`th image and a random image from same or different class; y: True if same class\"\n",
|
||
|
" othercls = self.clsmap[self.labels[i]] if random.random()>0.5 else self.idxs\n",
|
||
|
" otherit = random.choice(othercls)\n",
|
||
|
" return SiameseImage(self.items[i], self.items[otherit], self.labels[otherit]==self.labels[i])"
|
||
|
]
|
||
|
},
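{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see `SiamesePair` in action, here is a hedged sketch (the filename-based labelling is an assumption made for illustration, not the book's final pipeline): we derive a breed label from each filename, build the pair transform over the file paths, then call it on an index to get back a `SiameseImage` of two paths and a boolean. You would still add transforms to open and resize the images before batching."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = files.map(lambda f: f.name.rsplit('_', 1)[0])  # e.g. 'great_pyrenees_12.jpg' -> 'great_pyrenees' (assumed naming)\n",
"sp = SiamesePair(files, labels)\n",
"sp(0)"
]
},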
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"jupytext": {
|
||
|
"split_at_heading": true
|
||
|
},
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 2
|
||
|
}
|