diff --git a/09_tabular.ipynb b/09_tabular.ipynb
index 34e4b26..eb6cbf0 100644
--- a/09_tabular.ipynb
+++ b/09_tabular.ipynb
@@ -9366,33 +9366,27 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this case, however, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. Therefore, we need to make this a continuous variable:"
+    "In this case, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. Let's verify that `cont_cat_split` did the correct thing."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 98,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['saleElapsed']"
+      ]
+     },
+     "execution_count": 98,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
-    "cont_nn.append('saleElapsed')\n",
-    "cat_nn.remove('saleElapsed')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Also, to use this as a continuous variable, we have to ensure it's of a numeric type:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 106,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)"
+    "cont_nn"
    ]
   },
   {
@@ -9975,7 +9969,7 @@
     "1. What's a good type of plot for showing tree interpreter results?\n",
     "1. What is the \"extrapolation problem\"?\n",
     "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
-    "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9,000 distinct values?\n",
+    "1. Why do we ensure `saleElapsed` is a continuous variable, even although it has less than 9,000 distinct values?\n",
     "1. What is \"boosting\"?\n",
     "1. How could we use embeddings with a random forest? Would we expect this to help?\n",
     "1. Why might we not always use a neural net for tabular modeling?"
diff --git a/clean/09_tabular.ipynb b/clean/09_tabular.ipynb
index 04b952a..78c60c3 100644
--- a/clean/09_tabular.ipynb
+++ b/clean/09_tabular.ipynb
@@ -1153,17 +1153,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "cont_nn.append('saleElapsed')\n",
-    "cat_nn.remove('saleElapsed')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)"
+    "cont_nn"
    ]
   },
   {
@@ -1375,7 +1365,7 @@
     "1. What's a good type of plot for showing tree interpreter results?\n",
     "1. What is the \"extrapolation problem\"?\n",
     "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
-    "1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9,000 distinct values?\n",
+    "1. Why do we ensure `saleElapsed` is a continuous variable, even although it has less than 9,000 distinct values?\n",
     "1. What is \"boosting\"?\n",
     "1. How could we use embeddings with a random forest? Would we expect this to help?\n",
     "1. Why might we not always use a neural net for tabular modeling?"