This commit is contained in:
Jeremy Howard 2020-03-05 21:11:18 -08:00
parent 7de2389531
commit e87f1d54e7
6 changed files with 1243 additions and 1004 deletions

View File

@ -2894,6 +2894,31 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@ -25,7 +25,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter was co-authored by Dr Rachel Thomas, the co-founder of fast.ai, and founding director of the Center for Applied Data Ethics at the University of San Francisco. It largely follows a subset of the syllabus she developed for the \"Introduction to Data Ethics\" course."
"This chapter was co-authored by Dr Rachel Thomas, the co-founder of fast.ai, and founding director of the Center for Applied Data Ethics at the University of San Francisco. It largely follows a subset of the syllabus she developed for the [Introduction to Data Ethics](https://ethics.fast.ai) course."
]
},
{
@ -204,14 +204,9 @@
"\n",
"Okay, so hopefully we have convinced you that you ought to care. But what should you do? As data scientists, we're naturally inclined to focus on making our model better at optimizing some metric. But optimizing that metric may not actually lead to better outcomes. And even if optimizing that metric *does* help create better outcomes, it almost certainly won't be the only thing that matters. Consider the pipeline of steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. This entire pipeline needs to be considered *as a whole* if we're to have a hope of getting the kinds of outcomes we want.\n",
"\n",
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data often ends up being used for different purposes than why it was originally collected. IBM began selling to Nazi Germany well before the Holocaust, including helping with Germanys 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to inform everyone involved in this chain about the capabilities, constraints, and details of your work than you are. Although there's no \"silver bullet\" that can ensure your work is used the right way, by getting involved in the process, and asking the right questions, you can at the very least ensure that the right issues are being considered.\n",
"\n",
"Sometimes, the right response to being asked to do a piece of work is to just say \"no\". Often, however, the response we hear is \"if I don't do it, someone else will\". But consider this: if you've been picked for the job, you're the best person they've found; so if you don't do it, the best person isn't working on that project. If the first five people they ask all say no too, then even better!"
]
},
{
@ -391,7 +386,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image5.png\" id=\"bias\" caption=\"Bias in machine learning can come from multiple sources\" alt=\"A diagram showing all sources where bias can appear in machine learning\" width=\"650\">"
"<img src=\"images/ethics/pipeline_diagram.svg\" id=\"bias\" caption=\"Bias in machine learning can come from multiple sources\" alt=\"A diagram showing all sources where bias can appear in machine learning\" width=\"700\">"
]
},
{
@ -543,7 +538,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the paper [Does Machine Learning Automate Moral Hazard and Error](https://scholar.harvard.edu/files/sendhil/files/aer.p20171084.pdf) in *American Economic Review*, the authors look at a model that tries to answer the question: using historical EHR data, what factors are most predictive of stroke? These are the top predictors from the model:\n",
"In the paper [Does Machine Learning Automate Moral Hazard and Error](https://scholar.harvard.edu/files/sendhil/files/aer.p20171084.pdf) in *American Economic Review*, the authors look at a model that tries to answer the question: using historical electronic health record (EHR) data, what factors are most predictive of stroke? These are the top predictors from the model:\n",
"\n",
" - Prior Stroke\n",
" - Cardiovascular disease\n",
@ -703,13 +698,8 @@
"- analyze a project you are working on\n",
"- implement processes at your company to find and address ethical risks\n",
"- support good policy\n",
"- increase diversity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- increase diversity\n",
"\n",
"Let's walk through each step next, starting with analyzing a project you are working on."
]
},
@ -732,14 +722,11 @@
" - What are error rates for different sub-groups?\n",
" - What is the accuracy of a simple rule-based alternative?\n",
" - What processes are in place to handle appeals or mistakes?\n",
" - How diverse is the team that built it?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These questions may be able to help you identify outstanding issues, and possible alternatives that are easier to understand and control. In addition to asking the right questions, it's also important to consider practices and processes to implement."
" - How diverse is the team that built it?\n",
"\n",
"These questions may be able to help you identify outstanding issues, and possible alternatives that are easier to understand and control. In addition to asking the right questions, it's also important to consider practices and processes to implement.\n",
"\n",
"One thing to consider at this stage is what data you are collecting and storing. Data often ends up being used for different purposes than those for which it was originally collected. For instance, IBM began selling to Nazi Germany well before the Holocaust, including helping with Germany's 1933 census conducted under Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
]
},
{
@ -882,33 +869,71 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Role of Policy"
"## Role of Policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ethical issues that arise in the use of automated decision systems, such as machine learning, can be complex and far-reaching. To better address them, we will need thoughtful policy, in addition to the ethical efforts of those in industry. Neither is sufficient on its own.\n",
"\n",
"Policy is the appropriate tool for addressing:\n",
"\n",
"- Negative externalities\n",
"- Misaligned economic incentives\n",
"- “Race to the bottom” situations\n",
"- Enforcing accountability.\n",
"\n",
"Ethical behavior in industry is necessary as well, since:\n",
"\n",
"- Law will not always keep up\n",
"- Edge cases will arise in which practitioners must use their best judgement."
"We often talk to people who are eager for technical or design fixes to be the full solution to the kinds of problems that we've been discussing; for instance, a technical approach to debiasing data, or design guidelines for making technology less addictive. While such measures can be useful, they will not be sufficient to address the underlying problems that have led to our current state. For example, as long as it is incredibly profitable to create addictive technology, companies will continue to do so, regardless of whether this has the side effect of promoting conspiracy theories and polluting our information ecosystem. While individual designers may try to tweak product designs, we will not see substantial changes until the underlying profit incentives change."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion"
"### The effectiveness of regulation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To look at what can cause companies to take concrete action, consider the following two examples of how Facebook has behaved. In 2018, a UN investigation found that Facebook had played a “determining role” in the ongoing genocide of the Rohingya, an ethnic minority in Myanmar that was described by UN Secretary-General Antonio Guterres as \"one of, if not the, most discriminated people in the world\". Local activists had been warning Facebook executives that their platform was being used to spread hate speech and incite violence since as early as 2013. In 2015, they warned that Facebook could play the same role in Myanmar that radio broadcasts had played during the Rwandan genocide (where a million people were killed). Yet, by the end of 2015, Facebook employed only four contractors who spoke Burmese. As one person close to the matter said, \"That's not 20/20 hindsight. The scale of this problem was significant and it was already apparent.\" Zuckerberg promised during the congressional hearings to hire \"dozens\" to address the genocide in Myanmar (in 2018, years after the genocide had begun, including the destruction by fire of at least 288 villages in northern Rakhine state after August 2017).\n",
"\n",
"This stands in stark contrast to Facebook quickly [hiring 1,200 people in Germany](http://thehill.com/policy/technology/361722-facebook-opens-second-german-office-to-comply-with-hate-speech-law) to try to avoid expensive penalties (of up to 50 million euros) under a new German law against hate speech. Clearly, in this case, Facebook was more reactive to the threat of a financial penalty than to the systematic destruction of an ethnic minority.\n",
"\n",
"In an [article on privacy issues](https://idlewords.com/2019/06/the_new_wilderness.htm), Maciej Ceglowski draws parallels with the environmental movement: \"This regulatory project has been so successful in the First World that we risk forgetting what life was like before it. Choking smog of the kind that today kills thousands in Jakarta and Delhi was [once emblematic of London](https://en.wikipedia.org/wiki/Pea_soup_fog). The Cuyahoga River in Ohio used to [reliably catch fire](http://www.ohiohistorycentral.org/w/Cuyahoga_River_Fire). In a particularly horrific example of unforeseen consequences, tetraethyl lead added to gasoline [raised violent crime rates](https://en.wikipedia.org/wiki/Lead%E2%80%93crime_hypothesis) worldwide for fifty years. None of these harms could have been fixed by telling people to vote with their wallet, or carefully review the environmental policies of every company they gave their business to, or to stop using the technologies in question. It took coordinated, and sometimes highly technical, regulation across jurisdictional boundaries to fix them. In some cases, like the [ban on commercial refrigerants](https://en.wikipedia.org/wiki/Montreal_Protocol) that depleted the ozone layer, that regulation required a worldwide consensus. We're at the point where we need a similar shift in perspective in our privacy law.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rights and policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clean air and clean drinking water are public goods which are nearly impossible to protect through individual market decisions, but rather require coordinated regulatory action. Similarly, many of the harms resulting from unintended consequences of misuses of technology involve public goods, such as a polluted information environment or deteriorated ambient privacy. Too often privacy is framed as an individual right, yet there are societal impacts to widespread surveillance (which would still be the case even if it were possible for a few individuals to opt out).\n",
"\n",
"Many of the issues we are seeing in tech are actually human rights issues, such as when a biased algorithm recommends that Black defendants receive longer prison sentences, when particular job ads are shown only to young people, or when police use facial recognition to identify protesters. The appropriate venue to address human rights issues is typically through the law.\n",
"\n",
"We need both regulatory and legal changes, *and* the ethical behavior of individuals. Individual behavior change can't address misaligned profit incentives, externalities (where corporations reap large profits while off-loading their costs and harms to the broader society), or systemic failures. However, the law will never cover all edge cases, and it is important that individual software developers and data scientists are equipped to make ethical decisions in practice."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cars: a historical precedent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problems we are facing are complex and there are no simple solutions. This can be discouraging, but we find hope in considering other large challenges that people have tackled throughout history. One example is the movement to increase car safety, covered as a case study in [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and in the design podcast [99% Invisible](https://99percentinvisible.org/episode/nut-behind-wheel/). Early cars had no seatbelts, metal knobs on the dashboard that could lodge in people's skulls during a crash, regular plate glass windows that shattered in dangerous ways, and non-collapsible steering columns that impaled drivers. However, car companies were incredibly resistant to even discussing the idea of safety as something they could help address, and the widespread belief was that cars are just the way they are, and that it was the people using them who caused problems. It took consumer safety activists and advocates decades of work to even change the national conversation to consider that perhaps car companies had some responsibility which should be addressed through regulation. When the collapsible steering column was invented, it was not implemented for several years as there was no financial incentive to do so. The major car company General Motors hired private detectives to try to dig up dirt on consumer safety advocate Ralph Nader. The requirements for seatbelts, crash test dummies, and collapsible steering columns were major victories. It was only in 2011 that car companies were required to start using crash test dummies that represent the average woman's body, and not just the average man's; prior to this, women were 40% more likely than men to be injured in a car crash of the same impact. This is a vivid example of the ways that bias, policy, and technology have important consequences."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
@ -917,6 +942,8 @@
"source": [
"Coming from a background of working with binary logic, the lack of clear answers in ethics can be frustrating at first. Yet, the implications of how our work impacts the world, including unintended consequences and our work being weaponized by bad actors, are some of the most important questions we can (and should!) consider. Even though there aren't any easy answers, there are definite pitfalls to avoid and practices that move us toward more ethical behavior.\n",
"\n",
"Many people (including us!) are looking for more satisfying, solid answers for how to address harmful impacts of technology. However, given the complex, far-reaching, and interdisciplinary nature of the problems we are facing, there are no simple solutions. Julia Angwin, former senior reporter at ProPublica who focuses on issues of algorithmic bias and surveillance (and one of the 2016 investigators of the COMPAS recidivism algorithm that helped spark the field of Fairness, Accountability, and Transparency) said in [a 2019 interview](https://www.fastcompany.com/90337954/who-cares-about-liberty-julia-angwin-and-trevor-paglen-on-privacy-surveillance-and-the-mess-were-in), “I strongly believe that in order to solve a problem, you have to diagnose it, and that we're still in the diagnosis phase of this. If you think about the turn of the century and industrialization, we had, I don't know, 30 years of child labor, unlimited work hours, terrible working conditions, and it took a lot of journalist muckraking and advocacy to diagnose the problem and have some understanding of what it was, and then the activism to get laws changed. I feel like we're in a second industrialization of data information... I see my role as trying to make as clear as possible what the downsides are, and diagnosing them really accurately so that they can be solvable. That's hard work, and lots more people need to be doing it.” It's reassuring that Angwin thinks we are largely still in the diagnosis phase: if your understanding of these problems feels incomplete, that is normal and natural. Nobody has a “cure” yet, although it is vital that we continue working to better understand and address the problems we are facing.\n",
"\n",
"One of our reviewers for this book, Fred Monroe, used to work in hedge fund trading. He told us, after reading this chapter, that many of the issues discussed here (the distribution of data being dramatically different from what the model was trained on, the impact of a model and its feedback loops once deployed and at scale, and so forth) were also key issues for building profitable trading models. The kinds of things you need to do to consider societal consequences have a lot of overlap with the things you need to do to consider organizational, market, and customer consequences, so thinking carefully about ethics can also help you think carefully about how to make your data product successful more generally!"
]
},
@ -1029,7 +1056,10 @@
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"nav_menu": {
"height": "600px",
"width": "365px"
},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,

File diff suppressed because one or more lines are too long

View File

@ -152,7 +152,7 @@
"\n",
"In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights and continuous activation values, which are updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n",
"\n",
"It is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
"It is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
"\n",
"An example using this concatenation approach is how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in this figure from their paper:"
]
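The concatenation step described in that cell can be sketched numerically. The following is a minimal NumPy illustration, not fastai's actual implementation: the sizes, names, and random untrained weights are all hypothetical. A categorical column becomes continuous vectors by indexing an embedding lookup table, those vectors are concatenated with the truly continuous columns, and the result feeds one dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one categorical column with 10 levels -> 4-dim embedding,
# plus 2 truly continuous columns, feeding a 6-unit dense layer.
n_levels, emb_dim, n_cont, n_hidden = 10, 4, 2, 6
emb_table = rng.normal(size=(n_levels, emb_dim))   # learned lookup table (random here)
W = rng.normal(size=(emb_dim + n_cont, n_hidden))  # first dense layer weights
b = np.zeros(n_hidden)

x_cat = np.array([3, 7, 0])                        # batch of 3 category indices
x_cont = rng.normal(size=(3, n_cont))              # matching continuous inputs

emb = emb_table[x_cat]                     # embedding layer: index -> continuous vector
x = np.concatenate([emb, x_cont], axis=1)  # just concatenate the variables
h = np.maximum(x @ W + b, 0)               # ReLU dense layer on the concatenation
print(h.shape)  # (3, 6)
```

In a real model the lookup table and dense weights would be trained jointly by gradient descent, which is what lets the embedding learn a useful continuous representation of each category.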
@ -9751,7 +9751,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,

View File

@ -2343,6 +2343,31 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

File diff suppressed because one or more lines are too long