From 8be580737ee0cc17746a5ed68283150d489b3dc4 Mon Sep 17 00:00:00 2001
From: Armin Berres <20811121+aberres@users.noreply.github.com>
Date: Mon, 22 Feb 2021 23:06:26 +0100
Subject: [PATCH] Update 09_tabular to fastai v2.2.7 (#413)

saleElapsed is now detected as a continuous variable right away.
---
 09_tabular.ipynb       | 36 +++++++++++++++---------------------
 clean/09_tabular.ipynb | 14 ++------------
 2 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/09_tabular.ipynb b/09_tabular.ipynb
index 34e4b26..eb6cbf0 100644
--- a/09_tabular.ipynb
+++ b/09_tabular.ipynb
@@ -9366,33 +9366,27 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this case, however, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. Therefore, we need to make this a continuous variable:"
+    "In this case, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. Let's verify that `cont_cat_split` did the correct thing."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 98,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['saleElapsed']"
+      ]
+     },
+     "execution_count": 98,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
-    "cont_nn.append('saleElapsed')\n",
-    "cat_nn.remove('saleElapsed')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Also, to use this as a continuous variable, we have to ensure it's of a numeric type:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 106,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)"
+    "cont_nn"
    ]
   },
   {
@@ -9975,7 +9969,7 @@
     "1. What's a good type of plot for showing tree interpreter results?\n",
     "1. What is the \"extrapolation problem\"?\n",
     "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
-    "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9,000 distinct values?\n",
+    "1. Why do we ensure `saleElapsed` is a continuous variable, even although it has less than 9,000 distinct values?\n",
     "1. What is \"boosting\"?\n",
     "1. How could we use embeddings with a random forest? Would we expect this to help?\n",
     "1. Why might we not always use a neural net for tabular modeling?"
diff --git a/clean/09_tabular.ipynb b/clean/09_tabular.ipynb
index 04b952a..78c60c3 100644
--- a/clean/09_tabular.ipynb
+++ b/clean/09_tabular.ipynb
@@ -1153,17 +1153,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "cont_nn.append('saleElapsed')\n",
-    "cat_nn.remove('saleElapsed')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)"
+    "cont_nn"
    ]
   },
   {
@@ -1375,7 +1365,7 @@
     "1. What's a good type of plot for showing tree interpreter results?\n",
     "1. What is the \"extrapolation problem\"?\n",
     "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
-    "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9,000 distinct values?\n",
+    "1. Why do we ensure `saleElapsed` is a continuous variable, even although it has less than 9,000 distinct values?\n",
     "1. What is \"boosting\"?\n",
     "1. How could we use embeddings with a random forest? Would we expect this to help?\n",
     "1. Why might we not always use a neural net for tabular modeling?"