Update 02_production.ipynb

Replaced Bing Image Search with DuckDuckGo Image Search, which is also free, is the option Jeremy Howard recommends in his recorded lecture, and requires no advanced setup or Azure trial account to use.
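The shape of the call-site change can be sketched as follows. This is an illustrative stand-in, not real search output: `search_images_bing` returned result records whose image URLs sat under a `contentUrl` field (hence `results.attrgot('contentUrl')` in the notebook), while `search_images_ddg` returns the list of URLs directly, so that extraction step disappears.

```python
# Sketch of the data-shape change behind this commit (fake data, not real
# search results). Old path: pull 'contentUrl' out of each result record.
# New path: the search already returns a plain list of URL strings.

def urls_from_bing_results(results):
    # Old shape: a list of records, each carrying a 'contentUrl' field.
    return [r['contentUrl'] for r in results]

# Hypothetical stand-ins for what each API might return for one query:
fake_bing_results = [{'contentUrl': 'https://example.com/grizzly1.jpg'},
                     {'contentUrl': 'https://example.com/grizzly2.jpg'}]
fake_ddg_results = ['https://example.com/grizzly1.jpg',
                    'https://example.com/grizzly2.jpg']

# Both paths end in the same thing: a plain list of image URLs, ready to
# hand to download_images(dest, urls=...).
assert urls_from_bing_results(fake_bing_results) == fake_ddg_results
```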
Dean Mauro 2023-07-02 22:00:47 -04:00 committed by GitHub
parent 823b69e00a
commit 9473e1d983


@@ -289,14 +289,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"At the time of writing, Bing Image Search is the best option we know of for finding and downloading images. It's free for up to 1,000 queries per month, and each query can download up to 150 images. However, something better might have come along between when we wrote this and when you're reading the book, so be sure to check out the [book's website](https://book.fast.ai/) for our current recommendation."
"At the time of writing, DuckDuckGo Image Search is the best option we know of for finding and downloading images. It's free and each query can download up to 200 images. However, something better might have come along between when we wrote this and when you're reading the book, so be sure to check out the [book's website](https://book.fast.ai/) for our current recommendation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"> important: Keeping in Touch With the Latest Services: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use the Bing Image Search API available at the time this book was written. We'll be providing more options and more up to date information on the [book's website](https://book.fast.ai/), so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
+"> important: Keeping in Touch With the Latest Services: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use the DuckDuckGo Image Search API available at the time this book was written. We'll be providing more options and more up to date information on the [book's website](https://book.fast.ai/), so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
]
},
{
@@ -304,29 +304,7 @@
"metadata": {},
"source": [
"# clean\n",
"To download images with Bing Image Search, sign up at [Microsoft Azure](https://azure.microsoft.com/en-us/services/cognitive-services/bing-web-search-api/) for a free account. You will be given a key, which you can copy and enter in a cell as follows (replacing 'XXX' with your key and executing it):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"key = os.environ.get('AZURE_SEARCH_KEY', 'XXX')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, if you're comfortable at the command line, you can set it in your terminal with:\n",
"\n",
" export AZURE_SEARCH_KEY=your_key_here\n",
"\n",
"and then restart Jupyter Notebook, and use the above line without editing it.\n",
"\n",
"Once you've set `key`, you can use `search_images_bing`. This function is provided by the small `utils` class included with the notebooks online. If you're not sure where a function is defined, you can just type it in your notebook to find out:"
"To download images with DuckDuckGo Image Search, you can use `search_images_ddg`. This function is provided by the small `utils` class included with the notebooks online. If you're not sure where a function is defined, you can just type it in your notebook to find out:"
]
},
{
@@ -337,7 +315,7 @@
{
"data": {
"text/plain": [
"<function fastbook.search_images_bing(key, term, min_sz=128, max_images=150)>"
"<function fastbook.search_images_ddg(term, min_sz=128, max_images=150)>"
]
},
"execution_count": null,
@@ -346,7 +324,7 @@
}
],
"source": [
"search_images_bing"
"search_images_ddg"
]
},
{
@@ -366,8 +344,7 @@
}
],
"source": [
"results = search_images_bing(key, 'grizzly bear')\n",
"ims = results.attrgot('contentUrl')\n",
"ims = search_images_ddg('grizzly bear')\n",
"len(ims)"
]
},
@@ -375,7 +352,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've successfully downloaded the URLs of 150 grizzly bears (or, at least, images that Bing Image Search finds for that search term).\n",
"We've successfully downloaded the URLs of 200 grizzly bears (or, at least, images that DuckDuckGo Image Search finds for that search term).\n",
"\n",
"**NB**: there's no way to be sure exactly what images a search like this will find. The results can change over time. We've heard of at least one case of a community member who found some unpleasant pictures of dead bears in their search results. You'll receive whatever images are found by the web search engine. If you're running this at work, or with kids, etc, then be cautious before you display the downloaded images.\n",
"\n",
@@ -485,8 +462,8 @@
"for o in bear_types:\n",
" dest = (path/o)\n",
" dest.mkdir(exist_ok=True)\n",
" results = search_images_bing(key, f'{o} bear')\n",
" download_images(dest, urls=results.attrgot('contentUrl'))"
" results = search_images_ddg(f'{o} bear')\n",
" download_images(dest, urls=results)"
]
},
{
@@ -627,7 +604,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing to be aware of in this process: as we discussed in <<chapter_intro>>, models can only reflect the data used to train them. And the world is full of biased data, which ends up reflected in, for example, Bing Image Search (which we used to create our dataset). For instance, let's say you were interested in creating an app that could help users figure out whether they had healthy skin, so you trained a model on the results of searches for (say) \"healthy skin.\" <<healthy_skin>> shows you the kinds of results you would get."
"One thing to be aware of in this process: as we discussed in <<chapter_intro>>, models can only reflect the data used to train them. And the world is full of biased data, which ends up reflected in, for example, Bing Image Search. For instance, let's say you were interested in creating an app that could help users figure out whether they had healthy skin, so you trained a model on the results of searches for (say) \"healthy skin.\" <<healthy_skin>> shows you the kinds of results you would get."
]
},
{
@@ -1144,7 +1121,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This output shows that the image with the highest loss is one that has been predicted as \"grizzly\" with high confidence. However, it's labeled (based on our Bing image search) as \"black.\" We're not bear experts, but it sure looks to us like this label is incorrect! We should probably change its label to \"grizzly.\"\n",
"This output shows that the image with the highest loss is one that has been predicted as \"grizzly\" with high confidence. However, it's labeled (based on our DuckDuckGo image search) as \"black.\" We're not bear experts, but it sure looks to us like this label is incorrect! We should probably change its label to \"grizzly.\"\n",
"\n",
"The intuitive approach to doing data cleaning is to do it *before* you train a model. But as you've seen in this case, a model can actually help you find data issues more quickly and easily. So, we normally prefer to train a quick and simple model first, and then use it to help us with data cleaning.\n",
"\n",