merge

2025-04-09 20:30:42 +00:00 · 2020-03-16 11:39:02 -07:00 · 2020-03-16 11:39:02 -07:00 · 8efff15114
commit 8efff15114
parent 0bbc1ceea0
3 changed files with 79 additions and 54 deletions
--- a/02_production.ipynb
+++ b/02_production.ipynb
@ -107,7 +107,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Let's start by considering whether deep learning can be any good at the problem you are looking to work on. In general, here is a summary of the state of deep learning is at the start of 2020. However, things move very fast, and by the time you read this some of these constraints may no longer exist. We will try to keep the book website up-to-date; in addition, a Google search for \"what can AI do now\" there is likely to provide some up-to-date information."
+    "Let's start by considering whether deep learning can be any good at the problem you are looking to work on. In general, here is a summary of the state of deep learning at the start of 2020. However, things move very fast, and by the time you read this some of these constraints may no longer exist. We will try to keep the book website up-to-date; in addition, a Google search for \"what can AI do now\" there is likely to provide some up-to-date information."
   ]
  },
  {
@ -159,7 +159,7 @@
   "source": [
    "The ability of deep learning to combine text and images into a single model is, generally, far better than most people intuitively expect. For example, a deep learning model can be trained on input images, and output captions written in English, and can learn to generate surprisingly appropriate captions automatically for new images! But again, we have the same warning that we discussed in the previous section: there is no guarantee that these captions will actually be correct.\n",
    "\n",
-    "Because of this serious issue we generally recommend that deep learning be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely. This can potentially make humans orders of magnitude more productive than they would be with entirely manual methods, and actually result in more accurate processes than using a human alone. For instance, an automatic system can be used to identify potential strokes directly from CT scans and send a high priority alert to have potential/scans looked at quickly. There is only a three-hour window to treat strokes, so this fast feedback loop could save lives. At the same time, however, all scans could continue to be sent to radiologists in the usual way, so there would be no reduction in human input. Other deep learning models could automatically measure items seen on the scan, and insert those measurements into reports, warning the radiologist about findings that they may have missed, and tell the radiologist about other cases which might be relevant."
+    "Because of this serious issue we generally recommend that deep learning be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely. This can potentially make humans orders of magnitude more productive than they would be with entirely manual methods, and actually result in more accurate processes than using a human alone. For instance, an automatic system can be used to identify potential strokes directly from CT scans, and send a high priority alert to have those scans looked at quickly. There is only a three-hour window to treat strokes, so this fast feedback loop could save lives. At the same time, however, all scans could continue to be sent to radiologists in the usual way, so there would be no reduction in human input. Other deep learning models could automatically measure items seen on the scan, and insert those measurements into reports, warning the radiologist about findings that they may have missed, and tell the radiologist about other cases which might be relevant."
   ]
  },
  {
@ -855,7 +855,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms to a batch, we use the `batch_tfms` parameter. (Note that's we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
+    "Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms to a batch, we use the `batch_tfms` parameter. (Note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
   ]
  },
  {
@ -1172,7 +1172,7 @@
    "for idx,cat in cleaner.change(): shutil.move(cleaner.fns[idx], path/cat)\n",
    "```\n",
    "\n",
-    "> s: Cleaning the data or getting it ready for your model are two of the biggest challenges for data scientists, one they say takes 90% of their time. The fastai library aims at providing tools to make it as easy as possible.\n",
+    "> s: Cleaning the data or getting it ready for your model are two of the biggest challenges for data scientists; they say it takes 90% of their time. The fastai library aims at providing tools to make it as easy as possible.\n",
    "\n",
    "We'll be seeing more examples of model-driven data cleaning throughout this book. Once we've cleaned up our data, we can retrain our model. Try it yourself, and see if your accuracy improves!"
   ]
@ -1377,7 +1377,7 @@
    "\n",
    "*IPython widgets* are GUI components that bring together JavaScript and Python functionality in a web browser, and can be created and used within a Jupyter notebook. For instance, the image cleaner that we saw earlier in this chapter is entirely written with IPython widgets. However, we don't want to require users of our application to have to run Jupyter themselves.\n",
    "\n",
-    "That is why *Voilà* exists. It is a system for making applications consisting of IPython widgets available to end-users, without them having to use Jupyter at all. Voilà is taking advantage of the fact that a notebook _already is_ a kind of web application, just a rather complex one that depends on another web application Jupyter itself. Essentially, it helps us automatically convert the complex web application which we've already implicitly made (the notebook) into a simpler, easier-to-deploy web application, which functions like a normal web application rather than like a notebook.\n",
+    "That is why *Voilà* exists. It is a system for making applications consisting of IPython widgets available to end-users, without them having to use Jupyter at all. Voilà is taking advantage of the fact that a notebook _already is_ a kind of web application, just a rather complex one that depends on another web application, Jupyter itself. Essentially, it helps us automatically convert the complex web application which we've already implicitly made (the notebook) into a simpler, easier-to-deploy web application, which functions like a normal web application rather than like a notebook.\n",
    "\n",
    "But we still have the advantage of developing in a notebook. So with ipywidgets, we can build up our GUI step by step. We will use this approach to create a simple image classifier. First, we need a file upload widget:"
   ]
@ -1692,7 +1692,7 @@
    "- The complexities of dealing with GPU inference are significant. In particular, the GPU's memory will need careful manual management, and you'll need some careful queueing system to ensure you only do one batch at a time\n",
    "- There's a lot more market competition in CPU servers than GPU, as a result of which there's much cheaper options available for CPU servers.\n",
    "\n",
-    "Because of the complexity of GPU serving, many systems have sprung up to try to automate this. However, managing and running these systems is themselves complex, and generally requires compiling your model into a different form that's specialized for that system. It doesn't make sense to deal with this complexity until/unless your app gets popular enough that it makes clear financial sense for you to do so."
+    "Because of the complexity of GPU serving, many systems have sprung up to try to automate this. However, managing and running these systems is also complex, and generally requires compiling your model into a different form that's specialized for that system. It doesn't make sense to deal with this complexity until/unless your app gets popular enough that it makes clear financial sense for you to do so."
   ]
  },
  {
@ -1735,7 +1735,7 @@
    "\n",
    "There is quite a few upsides to this approach. The initial installation is easier, because you only have to deploy a small GUI application, which connects to the server to do all the heavy lifting. More importantly perhaps, upgrades of that core logic can happen on your server, rather than needing to be distributed to all of your users. Your server can have a lot more memory and processing capacity than most edge devices, and it is far easier to scale those resources if your model becomes more demanding. The hardware that you will have on a server is going to be more standard and more easily supported by fastai and PyTorch, so you don't have to compile your model into a different form.\n",
    "\n",
-    "There are downsides too, of course. Your application will require a network connection, and there will be some latency each time the model is called. It takes a while for a neural network model to run anyway, so this additional network latency may not make a big difference to your users in practice. In fact, since you can use better hardware on the server, the overall latency may even be less! If your application uses sensitive data then your users may be concerned about an approach which sends that data to a remote server, so sometimes privacy considerations will mean that you need to run the model on the edge device. Sometimes this can be avoided by having an *on premise* server, such as inside a company's firewall. Managing the complexity and scaling the server can create additional overhead, whereas if your model runs on the edge devices then each user is bringing their own compute resources, which leads to easier scaling with an increasing number of users (also known as _horizontal scaling_)."
+    "There are downsides too, of course. Your application will require a network connection, and there will be some latency each time the model is called. It takes a while for a neural network model to run anyway, so this additional network latency may not make a big difference to your users in practice. In fact, since you can use better hardware on the server, the overall latency may even be less! If your application uses sensitive data then your users may be concerned about an approach which sends that data to a remote server, so sometimes privacy considerations will mean that you need to run the model on the edge device. Sometimes this can be avoided by having an *on premise* server, such as inside a company's firewall. Managing the complexity and scaling the server can create additional overhead, whereas if your model runs on the edge devices then each user is bringing their own compute resources, which leads to easier scaling with an increasing number of users (also known as *horizontal scaling*)."
   ]
  },
  {
@ -1808,7 +1808,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> j: I started a company 20 years ago called _Optimal Decisions_ which used machine learning and optimisation to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described above to manage the potential downsides of something that might go wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end to end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms in production, but every rollout was successful."
+    "> J: I started a company 20 years ago called _Optimal Decisions_ which used machine learning and optimisation to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described above to manage the potential downsides of something that might go wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end to end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms in production, but every rollout was successful."
   ]
  },
  {
@ -1936,6 +1936,31 @@
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": false,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
  }
 },
 "nbformat": 4,
--- a/03_ethics.ipynb
+++ b/03_ethics.ipynb
@ -50,7 +50,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> j: At university, philosophy of ethics was my main thing (it would have been the topic of my thesis, if I'd finished it, instead of dropping out to join the real-world). Based on the years I spent studying ethics, I can tell you this: no one really agrees on what right and wrong are, whether they exist, how to spot them, which people are good, and which bad, or pretty much anything else. So don't expect too much from the theory! We're going to focus on examples and thought starters here, not theory."
+    "> J: At university, philosophy of ethics was my main thing (it would have been the topic of my thesis, if I'd finished it, instead of dropping out to join the real-world). Based on the years I spent studying ethics, I can tell you this: no one really agrees on what right and wrong are, whether they exist, how to spot them, which people are good, and which bad, or pretty much anything else. So don't expect too much from the theory! We're going to focus on examples and thought starters here, not theory."
   ]
  },
  {
@ -62,7 +62,7 @@
    "- Well-founded standards of right and wrong that prescribe what humans ought to do, and\n",
    "- The study and development of one's ethical standards.\n",
    "\n",
-    "There is no list of right answers for ethics. There is no list of dos and don'ts. Ethics is complicated, and context-dependent. It involves the perspectives of many stakeholders. Ethics is a muscle that you have to develop and practice. In this chapter, our goal is to provide some signposts to help you on that journey.\n",
+    "There is no list of right answers for ethics. There is no list of do's and dont's. Ethics is complicated, and context-dependent. It involves the perspectives of many stakeholders. Ethics is a muscle that you have to develop and practice. In this chapter, our goal is to provide some signposts to help you on that journey.\n",
    "\n",
    "Spotting ethical issues is best to do as part of a collaborative team. This is the only way you can really incorporate different perspectives. Different people's backgrounds will help them to see things which may not be obvious to you. Working with a team is helpful for many \"muscle building\" activities, including this one.\n",
    "\n",
@ -84,7 +84,7 @@
    "\n",
    "1.  **Recourse processes**: Arkansas's buggy healthcare algorithms left patients stranded\n",
    "2.  **Feedback loops**: YouTube's recommendation system helped unleash a conspiracy theory boom\n",
-    "3.  **Bias**: When a traditionally African-American name is searched for on Google, it displays ads for criminal background checks.\n",
+    "3.  **Bias**: When a traditionally African-American name is searched for on Google, it displays ads for criminal background checks\n",
    "\n",
    "In fact, for every concept that we introduce in this chapter, we are going to provide at least one specific example. For each one, have a think about what you could have done in this situation, and think about what kinds of obstructions there might have been to you getting that done. How would you deal with them? What would you look out for?"
   ]
@ -242,7 +242,7 @@
    "\n",
    "The modern workplace is a very specialised place. Everybody tends to have very well-defined jobs to perform. Especially in large companies, it can be very hard to know what all the pieces of the puzzle are. Sometimes companies even intentionally obscure the overall project goals that are being worked on, if they know that their employees are not going to like the answers. This is sometimes done by compartmentalising pieces as much as possible\n",
    "\n",
-    "In other words, we're not saying that any of this is easy. It's hard. It's really hard. We all have to do our best. And we have often seen that the people who do get involved in the higher-level context of these projects, and attempt to develop cross disciplinary capabilities and teams, become some of the most important and well rewarded parts of their organisations. It's the kind of work that tends to be highly appreciated by senior executives, even if it is considered, sometimes, rather uncomfortable by middle management."
+    "In other words, we're not saying that any of this is easy. It's hard. It's really hard. We all have to do our best. And we have often seen that the people who do get involved in the higher-level context of these projects, and attempt to develop cross-disciplinary capabilities and teams, become some of the most important and well rewarded members of their organisations. It's the kind of work that tends to be highly appreciated by senior executives, even if it is considered, sometimes, rather uncomfortable by middle management."
   ]
  },
  {
@ -284,7 +284,7 @@
   "source": [
    "In a complex system, it is easy for no one person to feel responsible for outcomes. While this is understandable, it does not lead to good results. In the earlier example of the Arkansas healthcare system in which a bug led to people with cerebral palsy losing access to needed care, the creator of the algorithm blamed government officials, and government officials could blame those who implemented the software. NYU professor Danah Boyd described this phenomenon: \"bureaucracy has often been used to evade responsibility, and today's algorithmic systems are extending bureaucracy.\"\n",
    "\n",
-    "An additional reason why recourse is so necessary, is because data often contains errors. Mechanisms for audits and error-correction are crucial. A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). In this case, there was no process in place for correcting mistakes or removing people once they’ve been added. Another example is the US credit report system; in a large-scale study of credit reports by the FTC (Federal Trade Commission) in 2012, it was found that 26% of consumers had at least one mistake in their files, and 5% had errors that could be devastating.  Yet, the process of getting such errors corrected is incredibly slow and opaque. When public-radio reporter Bobby Allyn discovered that he was erroneously listed as having a firearms conviction, it took him \"more than a dozen phone calls, the handiwork of a county court clerk and six weeks to solve the problem. And that was only after I contacted the company’s communications department as a journalist.\" (as covered in the article [How the careless errors of credit reporting agencies are ruining people’s lives](https://www.washingtonpost.com/posteverything/wp/2016/09/08/how-the-careless-errors-of-credit-reporting-agencies-are-ruining-peoples-lives/))\n",
+    "An additional reason why recourse is so necessary is because data often contains errors. Mechanisms for audits and error-correction are crucial. A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). In this case, there was no process in place for correcting mistakes or removing people once they’d been added. Another example is the US credit report system: in a large-scale study of credit reports by the FTC (Federal Trade Commission) in 2012, it was found that 26% of consumers had at least one mistake in their files, and 5% had errors that could be devastating.  Yet, the process of getting such errors corrected is incredibly slow and opaque. When public-radio reporter Bobby Allyn discovered that he was erroneously listed as having a firearms conviction, it took him \"more than a dozen phone calls, the handiwork of a county court clerk and six weeks to solve the problem. And that was only after I contacted the company’s communications department as a journalist.\" (as covered in the article [How the careless errors of credit reporting agencies are ruining people’s lives](https://www.washingtonpost.com/posteverything/wp/2016/09/08/how-the-careless-errors-of-credit-reporting-agencies-are-ruining-peoples-lives/))\n",
    "\n",
    "As machine learning practitioners, we do not always think of it as our responsibility to understand how our algorithms end up being implemented in practice. But we need to."
   ]
@ -347,7 +347,7 @@
    "\n",
    "> : \"One important signal to classify the main topic of a video is the channel it comes from. For example, a video uploaded to a cooking channel is very likely to be a cooking video. But how do we know what topic a channel is about? Well… in part by looking at the topics of the videos it contains! Do you see the loop? For example, many videos have a description which indicates what camera was used to shoot the video. As a result, some of these videos might get classified as videos about “photography”. If a channel has such as misclassified video, it might be classified as a “photography” channel, making it even more likely for future videos on this channel to be wrongly classified as “photography”. This could even lead to runaway virus-like classifications! One way to break this feedback loop is to classify videos with and without the channel signal. Then when classifying the channels, you can only use the classes obtained without the channel signal. This way, the feedback loop is broken.\"\n",
    "\n",
-    "There are positive examples of people and organizations attempting to combat these problems. Evan Estola, lead machine learning engineer at Meetup, [discussed the example](https://www.youtube.com/watch?v=MqoRzNhrTnQ) of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop, but explicitly not using gender for that part of their model. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact. \"You need to decide which feature not to use in your algorithm… the most optimal algorithm is perhaps not the best one to launch into production\", he said.\n",
+    "There are positive examples of people and organizations attempting to combat these problems. Evan Estola, lead machine learning engineer at Meetup, [discussed the example](https://www.youtube.com/watch?v=MqoRzNhrTnQ) of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop, by explicitly not using gender for that part of their model. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact. \"You need to decide which feature not to use in your algorithm… the most optimal algorithm is perhaps not the best one to launch into production\", he said.\n",
    "\n",
    "While Meetup chose to avoid such an outcome, Facebook provides an example of allowing a runaway feedback loop to run wild. Facebook radicalizes users interested in one conspiracy theory by introducing them to more. As [Renee DiResta, a researcher on proliferation of disinformation, writes](https://www.fastcompany.com/3059742/social-network-algorithms-are-distorting-reality-by-boosting-conspiracy-theories):"
   ]
@ -414,7 +414,7 @@
    "  - When doctors were shown identical files, they were much less likely to recommend cardiac catheterization (a helpful procedure) to Black patients\n",
    "  - When bargaining for a used car, Black people were offered initial prices $700 higher and received far smaller concessions\n",
    "  - Responding to apartment-rental ads on Craigslist with a Black name elicited fewer responses than with a white name\n",
-    "  - An all-white jury was 16 percentage points more likely to convict a Black defendant than a white one, but when a jury had 1 Black member, it convicted both at same rate.\n",
+    "  - An all-white jury was 16 percentage points more likely to convict a Black defendant than a white one, but when a jury had 1 Black member, it convicted both at the same rate\n",
    "\n",
    "The COMPAS algorithm, widely used for sentencing and bail decisions in the US, is an example of an important algorithm which, when tested by ProPublica, showed clear racial bias in practice:"
   ]
@ -497,7 +497,7 @@
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {},  
   "source": [
    "<img src=\"images/ethics/image17.png\" id=\"object_detect\" caption=\"Object detection in action\" alt=\"Figure showing an object detection algorithm performing better on western products\" width=\"500\">"
   ]
@ -635,7 +635,7 @@
    "  - People are more likely to assume algorithms are objective or error-free (even if they’re given the option of a human override)\n",
    "  - Algorithms are more likely to be implemented with no appeals process in place\n",
    "  - Algorithms are often used at scale\n",
-    "  - Algorithmic systems are cheap.\n",
+    "  - Algorithmic systems are cheap\n",
    "\n",
    "Even in the absence of bias, algorithms (and deep learning especially, since it is such an effective and scalable algorithm) can lead to negative societal problems, such as when used for *disinformation*."
   ]
@ -714,7 +714,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "It's easy to miss important issues when considered ethical implications of your work. One thing that helps enormously is simply asking the right questions. Rachel Thomas recommends considering the following questions throughout the development of a data project:\n",
+    "It's easy to miss important issues when considering ethical implications of your work. One thing that helps enormously is simply asking the right questions. Rachel Thomas recommends considering the following questions throughout the development of a data project:\n",
    "\n",
    "  - Should we even be doing this?\n",
    "  - What bias is in the data?\n",
@ -775,7 +775,7 @@
    "  - Will the effects in aggregate likely create more good than harm, and what types of good and harm?\n",
    "  - Are we thinking about all relevant types of harm/benefit (psychological, political, environmental, moral, cognitive, emotional, institutional, cultural)?\n",
    "  - How might future generations be affected by this project?\n",
-    "  - Do the risks of harm from this project fall disproportionately on the least powerful in society? Will the benefits go disproportionately the well-off?\n",
+    "  - Do the risks of harm from this project fall disproportionately on the least powerful in society? Will the benefits go disproportionately to the well-off?\n",
    "  - Have we adequately considered ‘dual-use?\n",
    "\n",
    "The alternative lens to this is the *deontological* perspective, which focuses on basic *right* and *wrong*:\n",
@ -814,7 +814,7 @@
    "\n",
    "Receiving mentorship has been statistically shown to help men advance, but not women. The reason behind this is that when women receive mentorship, it’s advice on how they should change and gain more self-knowledge. When men receive mentorship, it’s public endorsement of their authority. Guess which is more useful in getting promoted?\n",
    "\n",
-    "As long as qualified women keep dropping out of tech, teaching more girls to code will not solve the diversity issues plaguing the field. Diversity initiatives often end up focusing primarily on white women, even although women of colour face many additional barriers. In interviews with 60 women of color who work in STEM research, 100% had experienced discrimination."
+    "As long as qualified women keep dropping out of tech, teaching more girls to code will not solve the diversity issues plaguing the field. Diversity initiatives often end up focusing primarily on white women, even though women of colour face many additional barriers. In interviews with 60 women of color who work in STEM research, 100% had experienced discrimination."
   ]
  },
  {
@ -843,7 +843,7 @@
    "\n",
    "FAccT is another lens that you may find useful in considering ethical issues. One useful resource for this is the free online book [Fairness and machine learning; Limitations and Opportunities](https://fairmlbook.org/), which \"gives a perspective on machine learning that treats fairness as a central concern rather than an afterthought.\" It also warns, however, that it \"is intentionally narrow in scope... A narrow framing of machine learning ethics might be tempting to technologists and businesses as a way to focus on technical interventions while sidestepping deeper questions about power and accountability. We caution against this temptation.\" Rather than provide an overview of the FAccT approach to ethics (which is better done in books such as the one linked above), our focus here will be on the limitations of this kind of narrow framing.\n",
    "\n",
-    "One great way to consider whether an ethical lens is complete, is to try to come up with an example where the lens and our own ethical intuitions give diverging results. Os Keyes explored this in a graphic way in their paper [A Mulching Proposal\n",
+    "One great way to consider whether an ethical lens is complete, is to try to come up with an example where the lens and our own ethical intuitions give diverging results. Os Keyes et al. explored this in a graphic way in their paper [A Mulching Proposal\n",
    "Analysing and Improving an Algorithmic System for Turning the Elderly into High-Nutrient Slurry](https://arxiv.org/abs/1908.06166). The paper's abstract says:"
   ]
  },
@ -860,7 +860,7 @@
   "source": [
    "In this paper, the rather controversial proposal (\"Turning the Elderly into High-Nutrient Slurry\") and the results (\"drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system\") are at odds... to say the least!\n",
    "\n",
-    "In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc, which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution.\n",
+    "In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc., which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution.\n",
    "\n",
    "So far, we've focused on things that you and your organization can do. But sometimes individual or organizational action is not enough. Sometimes, governments also need to consider policy implications."
   ]
@ -876,7 +876,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We often talk to people who are eager for technical or design fixes to be full solution to the kinds of problems that we've been discussing; for instance, a technical approach to debias data, or design guidelines for making technology less addictive. While such measures can be useful, they will not be sufficient to address the underlying problems that have led to our current state. For example, as long as it is incredibly profitable to create addictive technology, companies will continue to do so, regardless of whether this has the side effect of promoting conspiracy theories and polluting our information ecosystem. While individual designers may try to tweak product designs, we will not see substantial changes until the underlying profit incentives changes."
+    "We often talk to people who are eager for technical or design fixes to be a full solution to the kinds of problems that we've been discussing; for instance, a technical approach to debias data, or design guidelines for making technology less addictive. While such measures can be useful, they will not be sufficient to address the underlying problems that have led to our current state. For example, as long as it is incredibly profitable to create addictive technology, companies will continue to do so, regardless of whether this has the side effect of promoting conspiracy theories and polluting our information ecosystem. While individual designers may try to tweak product designs, we will not see substantial changes until the underlying profit incentives changes."
   ]
  },
  {
@ -908,9 +908,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Clean air and clean drinking water are public goods which are nearly impossible to protect through individual market decisions, but rather require coordinated regulatory action. Similarly, many of the harms resulting from unintended consequences of misuses of technology involve public goods, such as a polluted information environment or deteriorated ambient privacy. Too often privacy is framed as an individual right, yet there are societal impacts to widespread surveillance (which would still be the case even if it was possible for a few individuals to opt out)\n",
+    "Clean air and clean drinking water are public goods which are nearly impossible to protect through individual market decisions, but rather require coordinated regulatory action. Similarly, many of the harms resulting from unintended consequences of misuses of technology involve public goods, such as a polluted information environment or deteriorated ambient privacy. Too often privacy is framed as an individual right, yet there are societal impacts to widespread surveillance (which would still be the case even if it was possible for a few individuals to opt out).\n",
    "\n",
-    "Many of the issues we are seeing in tech are actually human rights issues, such as when a biased algorithm recommends that Black defendants to have longer prison sentences, when particular job ads are only shown to young people, or when police use facial recognition to identify protesters. The appropriate venue to address human rights issues is typically through the law.\n",
+    "Many of the issues we are seeing in tech are actually human rights issues, such as when a biased algorithm recommends that Black defendants have longer prison sentences, when particular job ads are only shown to young people, or when police use facial recognition to identify protesters. The appropriate venue to address human rights issues is typically through the law.\n",
    "\n",
    "We need both regulatory and legal changes, *and* the ethical behavior of individuals. Individual behavior change can’t address misaligned profit incentives, externalities (where corporations reap large profits while off-loading their costs & harms to the broader society), or systemic failures. However, the law will never cover all edge cases, and it is important that individual software developers and data scientists are equipped to make ethical decisions in practice."
   ]
@ -940,7 +940,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Coming from a background of working with binary logic, the lack of clear answers in ethics can be frustrating at first.  Yet, the implications of how our work impacts the world, including unintended consequences and the work becoming weaponization by bad actors, are some of the most important questions we can (and should!) consider.  Even though there aren't any easy answers, there are definite pitfalls to avoid and practices to move towards more ethical behavior.\n",
+    "Coming from a background of working with binary logic, the lack of clear answers in ethics can be frustrating at first.  Yet, the implications of how our work impacts the world, including unintended consequences and the work becoming weaponized by bad actors, are some of the most important questions we can (and should!) consider.  Even though there aren't any easy answers, there are definite pitfalls to avoid and practices to move towards more ethical behavior.\n",
    "\n",
    "Many people (including us!) are looking for more satisfying, solid answers of how to address harmful impacts of technology. However, given the complex, far-reaching, and interdisciplinary nature of the problems we are facing, there are no simple solutions. Julia Angwin, former senior reporter at ProPublica who focuses on issues of algorithmic bias and surveillance (and one of the 2016 investigators of the COMPAS recidivism algorithm that helped spark the field of Fairness Accountability and Transparency) said in [a 2019 interview](https://www.fastcompany.com/90337954/who-cares-about-liberty-julia-angwin-and-trevor-paglen-on-privacy-surveillance-and-the-mess-were-in), “I strongly believe that in order to solve a problem, you have to diagnose it, and that we’re still in the diagnosis phase of this. If you think about the turn of the century and industrialization, we had, I don’t know, 30 years of child labor, unlimited work hours, terrible working conditions, and it took a lot of journalist muckraking and advocacy to diagnose the problem and have some understanding of what it was, and then the activism to get laws changed. I feel like we’re in a second industrialization of data information... I see my role as trying to make as clear as possible what the downsides are, and diagnosing them really accurately so that they can be solvable. That’s hard work, and lots more people need to be doing it.” It's reassuring that Angwin thinks we are largely still in the diagnosis phase: if your understanding of these problems feels incomplete, that is normal and natural. Nobody has a “cure” yet, although it is vital that we continue working to better understand and address the problems we are facing.\n",
    "\n",
--- a/04_mnist_basics.ipynb
+++ b/04_mnist_basics.ipynb
@ -33,7 +33,7 @@
   "source": [
    "Having seen what it looks like to actually train a variety of models in chapter 2, let’s now look under the hood and see exactly what is going on. We’ll start with computer vision, and will use that to introduce fundamental tools and concepts of deep learning.\n",
    "\n",
-    "To be exact, we'll discuss the role of arrays and tensors, and of brodcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of loss function for our basic classification task, and the role of mini-batches. We'll also finally describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them working together.\n",
+    "To be exact, we'll discuss the role of arrays and tensors, and of broadcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of a loss function for our basic classification task, and the role of mini-batches. We'll also describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them at work.\n",
    "\n",
    "In future chapters we’ll do deep dives into other applications as well, and see how these concepts and tools generalize. But this chapter is about laying foundation stones. To be frank, that also makes this one of the harder chapters, because of how these concepts all depend on each other. Like an arch, all the stones need to be in place for the structure to stay up. Also like an arch, once that happens, it's a powerful structure that can support other things. But it requires some patience to assemble.\n",
    "\n",
@ -71,7 +71,7 @@
    "\n",
    "Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected from top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read hand-written text--something that had never been achieved before. However his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n",
    "\n",
-    "In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working on the *LSTM* architecture with his student Sepp Hochreiter (widely used for speech recognition and other text modeling tasks, and used in the IMDb example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is the most important foundation of modern AI.\n",
+    "In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working on the *LSTM* architecture with his student Sepp Hochreiter (widely used for speech recognition and other text modeling tasks, and used in the IMDB example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is the most important foundation of modern AI.\n",
    "\n",
    "There is a lesson here for all of us! On your deep learning journey you will face many obstacles, both technical, and (even more difficult) people around you who don't believe you'll be successful. There's one *guaranteed* way to fail, and that's to stop trying. We've seen that the only consistent trait amongst every fast.ai student that's gone on to be a world-class practitioner is that they are all very tenacious."
   ]
@ -113,7 +113,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can see what's in this directory by using `ls()`, a method added by fastai. This method returns an object of a special fastai class called `L`, which has all the same functionality of Python's builtin `list`, plus a lot more. One of its handy features is that, when printed, it displays the count of items, before listing the items themselves (if there's more than 10 items, it just shows the first few)."
+    "We can see what's in this directory by using `ls()`, a method added by fastai. This method returns an object of a special fastai class called `L`, which has all the same functionality of Python's built-in `list`, plus a lot more. One of its handy features is that, when printed, it displays the count of items, before listing the items themselves (if there's more than 10 items, it just shows the first few)."
   ]
  },
  {
@ -1462,7 +1462,7 @@
    "\n",
    "Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we'll be needing this later, we'll also cast our stacked tensor to `float` now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.\n",
    "\n",
-    "Generally when images are floats, the pixels are expected to be be zero and one, so we will also divide by 255 here."
+    "Generally when images are floats, the pixels are expected to be between zero and one, so we will also divide by 255 here."
   ]
  },
  {
@ -1726,7 +1726,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> s: Intuitively, the difference between L1 norm and mean squared error (_MSE_) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes)."
+    "> S: Intuitively, the difference between L1 norm and mean squared error (*MSE*) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes)."
   ]
  },
  {
@ -1760,14 +1760,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> j: When I first came across this \"L1\" thingie, I looked it up to see what on Earth it meant, found on Google that it is a _vector norm_ using _absolute value_, so looked up _vector norm_ and started reading: _Given a vector space V over a field F of the real or complex numbers, a norm on V is a nonnegative-valued any function p: V → \\[0,+∞) with the following properties: For all a ∈ F and all u, v ∈ V, p(u + v) ≤ p(u) + p(v)..._ Then I stopped reading. \"Ugh, I'll never understand math!\" I thought, for the thousandth time. Since then I've learned that every time these complex mathy bits of jargon come up in practice, it turns out I can replace them with a tiny bit of code! Like the _L1 loss_ is just equal to `(a-b).abs().mean()`, where `a` and `b` are tensors. I guess mathy folks just think differently to me... I'll make sure, in this book, every time some mathy jargon comes up, I'll give you the little bit of code it's equal to as well, and explain in common sense terms what's going on."
+    "> J: When I first came across this \"L1\" thingie, I looked it up to see what on Earth it meant, found on Google that it is a *vector norm* using *absolute value*, so looked up *vector norm* and started reading: *Given a vector space V over a field F of the real or complex numbers, a norm on V is a nonnegative-valued any function p: V → \\[0,+∞) with the following properties: For all a ∈ F and all u, v ∈ V, p(u + v) ≤ p(u) + p(v)...* Then I stopped reading. \"Ugh, I'll never understand math!\" I thought, for the thousandth time. Since then I've learned that every time these complex mathy bits of jargon come up in practice, it turns out I can replace them with a tiny bit of code! Like the _L1 loss_ is just equal to `(a-b).abs().mean()`, where `a` and `b` are tensors. I guess mathy folks just think differently to me... I'll make sure, in this book, every time some mathy jargon comes up, I'll give you the little bit of code it's equal to as well, and explain in common sense terms what's going on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "In the above code we completed various mathematical operations on *PyTorch tensors*. If you've done some numeric programming in Pytorch before, you may recognize these as being similar to *Numpy arrays*. Let's have a look at those two very important classes."
+    "In the above code we completed various mathematical operations on *PyTorch tensors*. If you've done some numeric programming in PyTorch before, you may recognize these as being similar to *Numpy arrays*. Let's have a look at those two very important classes."
   ]
  },
  {
@ -1796,11 +1796,11 @@
    "\n",
    "In fact, **arrays and tensors can finish computations many thousands of times faster than using pure Python.**\n",
    "\n",
-    "A PyTorch tensor is nearly the same thing as a numpy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all componentss. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n",
+    "A PyTorch tensor is nearly the same thing as a numpy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all components. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n",
    "\n",
    "The vast majority of methods and operators supported by numpy on these structures are also supported by PyTorch. But PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimised for the GPU, and can run much faster. In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you'll see, it would be impossible to do deep learning in practice without this capability.\n",
    "\n",
-    "> s: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level  (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n",
+    "> S: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level  (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n",
    "\n",
    "Perhaps the most important new coding skill for a Python programmer to learn is how to effectively use the array/tensor APIs. We will be showing lots more tricks later in this book, but here's a summary of the key things you need to know for now."
   ]
@ -1869,7 +1869,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "All the operations below are shown on tensors - the syntax and results for NumPy arrays is idential.\n",
+    "All the operations below are shown on tensors - the syntax and results for NumPy arrays is identical.\n",
    "\n",
    "You can select a row:"
   ]
@ -2339,7 +2339,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Stochastic Gradient descent (SGD)"
+    "## Stochastic Gradient Descent (SGD)"
   ]
  },
  {
@ -2348,11 +2348,11 @@
   "source": [
    "Do you remember the way that Arthur Samuel described machine learning, which we quoted in <<chapter_intro>>:\n",
    "\n",
-    "> : _Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programed would \"learn\" from its experience._\n",
+    "> : _Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would \"learn\" from its experience._\n",
    "\n",
    "As we discussed, this is the key to allowing us to have something which can get better and better — to learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters (which will be the SGD part, as we will see). In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n",
    "\n",
-    "Instead of trying to find the similarity between an image and a \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n",
+    "Instead of trying to find the similarity between an image and an \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n",
    "\n",
    "```\n",
    "def pr_eight(x,w) = (x*w).sum()\n",
@ -2498,12 +2498,12 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "These seven steps, illustrated in <<gradient_descent>> are the key to the training of all deep learning models， and we'll be using the seven terms in the above diagram throughout this book. That deep learning turns out to rely entirely on these steps is extremely surprising and counter-intuitive. It's amazing that this process can solve such complex problems. But, as you'll see, it really does!\n",
+    "These seven steps, illustrated in <<gradient_descent>> are the key to the training of all deep learning models. That deep learning turns out to rely entirely on these steps is extremely surprising and counter-intuitive. It's amazing that this process can solve such complex problems. But, as you'll see, it really does!\n",
    "\n",
    "There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details which make a big difference for deep learning practitioners. But it turns out that the general approach to each one generally follows some basic principles:\n",
    "\n",
-    "- **Initialize**:: we initialise the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
-    "- **Loss**:: This is the thing Arthur Samuel refered to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
+    "- **Initialize**:: we initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
+    "- **Loss**:: This is the thing Arthur Samuel referred to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
    "- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes, by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
    "- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
   ]
@ -2662,7 +2662,7 @@
   "source": [
    "Notice the special method `requires_grad_`? That's the magical incantation we use to tell PyTorch that we want to calculate gradients with respect to that variable at that value. It is essentially tagging the variable, so PyTorch will remember to keep track of how to compute gradients of the other, direct calculations on it which you will ask for.\n",
    "\n",
-    "> a: This API might throw you if you're coming from math or physics. In those contexts the \"gradient\" of a function is just another function (i.e., its derivative), so you might expect gradient-related API to give you a new function. But in deep learning, \"gradients\" usually means the _value_ of a function's derivative at a particular argument value. PyTorch API also puts the focus on that argument, not the function you're actually computing gradients of. It may feel backwards at first but it's just a different perspective.\n",
+    "> A: This API might throw you if you're coming from math or physics. In those contexts the \"gradient\" of a function is just another function (i.e., its derivative), so you might expect gradient-related APIs to give you a new function. But in deep learning, \"gradients\" usually means the _value_ of a function's derivative at a particular argument value. PyTorch API also puts the focus on that argument, not the function you're actually computing gradients of. It may feel backwards at first but it's just a different perspective.\n",
    "\n",
    "Now we calculate our function with that value. Notice how PyTorch prints not just the value calculated, but also a note that it has a gradient function it'll be using to calculate our gradient when needed:"
   ]
@ -3381,7 +3381,7 @@
   "source": [
    "To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
    "\n",
-    "To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare to the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
+    "To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
    "\n",
    "To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient*. (Actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
   ]
@ -3405,7 +3405,7 @@
    "\n",
    "As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n",
    "\n",
-    "> s: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5) so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are zero or infinite, so useless to do an update of gradient descent.\n",
+    "> S: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5) so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are zero or infinite, so, useless to do an update of gradient descent.\n",
    "\n",
    "Instead, we need a loss function which, when our weights result in slightly better predictions, gives us a slightly better loss. So what does a \"slightly better prediction\" look like, exactly? Well, in this case, it means that, if the correct answer is a 3, then the score is a little higher, or if the correct answer is a 7, then the score is a little lower.\n",
    "\n",
@ -3415,7 +3415,7 @@
    "\n",
    "The purpose of the loss function is to measure the difference between predicted values and the true values -- that is, the targets (aka, the labels). So let's make another argument `targets`, a vector (i.e., another rank-1 tensor), indexed over the images, with a value of 0 or 1 which tells whether that image actually is a 3.\n",
    "\n",
-    "So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence that the first was a 3, with slight confidence that the second was a 7, and with fair confidence (and incorrectly!) that the last was a 7. This would mean our loss function would take receive values as its inputs:"
+    "So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence that the first was a 3, with slight confidence that the second was a 7, and with fair confidence (and incorrectly!) that the last was a 7. This would mean our loss function would take values as its inputs:"
   ]
  },
  {
@ -3449,7 +3449,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We're using a new function, `torch.where(a,b,c)`. This the same as running the list comprehension `[b[i] if a[i] else c[i] for i in range(len(a))]`, except it works on tensors, at C/CUDA speed. In plain English, this function will measure how distant each prediction is from 1 if it should be 1, and how distant it is from from 0 if it should be 0, and then it will take the mean of all those distances.\n",
+    "We're using a new function, `torch.where(a,b,c)`. This is the same as running the list comprehension `[b[i] if a[i] else c[i] for i in range(len(a))]`, except it works on tensors, at C/CUDA speed. In plain English, this function will measure how distant each prediction is from 1 if it should be 1, and how distant it is from from 0 if it should be 0, and then it will take the mean of all those distances.\n",
    "\n",
    "> note: It's important to learn about PyTorch functions like this, because looping over tensors in Python performs at Python speed, not C/CUDA speed!\n",
    "\n",
@ -3617,7 +3617,7 @@
    "\n",
    "Having defined a loss function, now is a good moment to recapitulate why we did this. After all, we already had a *metric*, which was overall accuracy. So why did we define a *loss*?\n",
    "\n",
-    "The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automted learning, the loss must be a function which has a meaningful derivative. It can't have big flat sections, and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. The requirements on loss sometimes it does not really reflect exactly what we are trying to achieve, but is something that is a compromise between our real goal, and a function that can be optimised using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch these are all averaged, and the overall mean is reported for the epoch.\n",
+    "The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automated learning, the loss must be a function which has a meaningful derivative. It can't have big flat sections, and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. The requirements on loss sometimes do not really reflect exactly what we are trying to achieve, but are rather a compromise between our real goal, and a function that can be optimised using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch these are all averaged, and the overall mean is reported for the epoch.\n",
    "\n",
    "Metrics, on the other hand, are the numbers that we really care about. These are the things which are printed at the end of each epoch, and tell us how our model is really doing. It is important that we learn to focus on these metrics, rather than the loss, when judging the performance of a model."
   ]
@ -3639,7 +3639,7 @@
    "\n",
    "So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your datasets gradient on the loss function, but it will take longer, and you will get less mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n",
    "\n",
-    "Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!.\n",
+    "Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!\n",
    "\n",
    "As we've seen, in the discussion of data augmentation, we get better generalisation if we can vary things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our data set in order for every epoch, instead what we normally do in practice is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n",
    "\n",
@ -3702,7 +3702,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "When we pass a Dataset to a DataLoader we will get back many batches which are themselves tuples of independent and dependent variable many batches:"
+    "When we pass a Dataset to a DataLoader we will get back many batches which are themselves tuples of independent and dependent variables:"
   ]
  },
  {
@ -3991,7 +3991,7 @@
   "source": [
    "Whilst we could use a python for loop to calculate the prediction for each image, that would be very slow. Because Python loops don't run on the GPU, and because Python is a slow language for loops in general, we need to represent as much of the computation in a model as possible using higher-level functions.\n",
    "\n",
-    "In this case, there's an extremely convenient mathematical operation that calculates `w*x` for every row of a matrix--it's called *matrix multiplication*. <<matmul>> show what matrix multiplication looks like (diagram from Wikipedia)."
+    "In this case, there's an extremely convenient mathematical operation that calculates `w*x` for every row of a matrix--it's called *matrix multiplication*. <<matmul>> shows what matrix multiplication looks like (diagram from Wikipedia)."
   ]
  },
  {
@ -4829,7 +4829,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> j: There is an enormous amount of jargon in deep learning, such as: _rectified linear unit_. The vast vast majority of this jargon is no more complicated than can be implemented in a short line of code and Python, as we saw in this example. The reality is that for academics to get their papers published they need to make them sound as impressive and sophisticated as possible. One of the ways that they do that is to introduce jargon. Unfortunately, this has the result that the field ends up becoming far more intimidating and difficult to get into than it should be. You do have to learn the jargon, because otherwise papers and tutorials are not going to mean much to you. But that doesn't mean you have to find the jargon intimidating. Just remember, when you come across a word or phrase that you haven't seen before, it will almost certainly turn out that it is a very simple concept that it is referring to."
+    "> J: There is an enormous amount of jargon in deep learning, such as: _rectified linear unit_. The vast vast majority of this jargon is no more complicated than can be implemented in a short line of code and Python, as we saw in this example. The reality is that for academics to get their papers published they need to make them sound as impressive and sophisticated as possible. One of the ways that they do that is to introduce jargon. Unfortunately, this has the result that the field ends up becoming far more intimidating and difficult to get into than it should be. You do have to learn the jargon, because otherwise papers and tutorials are not going to mean much to you. But that doesn't mean you have to find the jargon intimidating. Just remember, when you come across a word or phrase that you haven't seen before, it will almost certainly turn out that it is a very simple concept that it is referring to."
   ]
  },
  {
@ -4845,7 +4845,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> s: Mathematically, we say the composition of two linear functions is another linear function. So we can stack as many linear classifiers on top or each other, without non-linear functions between them, it will jsut be the same as one linear classifier."
+    "> S: Mathematically, we say the composition of two linear functions is another linear function. So we can stack as many linear classifiers on top of each other, without non-linear functions between them, it will just be the same as one linear classifier."
   ]
  },
  {
@ -5444,7 +5444,7 @@
    "1. What is \"ReLU\"? Draw a plot of it for values from `-2` to `+2`.\n",
    "1. What is an \"activation function\"?\n",
    "1. What's the difference between `F.relu` and `nn.ReLU`?\n",
-    "1. The universal approximation theorem shows that any function can be approximately as closely as needed using just one nonlinearity. So why do we normally use more?"
+    "1. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?"
   ]
  },
  {