fastbook/clean/09_tabular.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "! [ -e /content ] && pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz\n",
    "import fastbook\n",
    "fastbook.setup_book()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "from fastbook import *\n",
    "from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype\n",
    "from fastai.tabular.all import *\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from dtreeviz.trees import *\n",
    "from IPython.display import Image, display_svg, SVG\n",
    "\n",
    "pd.options.display.max_rows = 20\n",
    "pd.options.display.max_columns = 8"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 3,
   "metadata": {},
   "source": [
    "# Tabular Modeling Deep Dive"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 5,
   "metadata": {},
   "source": [
    "## Categorical Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 21,
   "metadata": {},
   "source": [
    "## Beyond Deep Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 25,
   "metadata": {},
   "source": [
    "## The Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 27,
   "metadata": {},
   "source": [
    "### Kaggle Competitions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "creds = ''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "cred_path = Path('~/.kaggle/kaggle.json').expanduser()\n",
    "if not cred_path.exists():\n",
    "    cred_path.parent.mkdir(exist_ok=True)\n",
    "    cred_path.write_text(creds)\n",
    "    cred_path.chmod(0o600)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "comp = 'bluebook-for-bulldozers'\n",
    "path = URLs.path(comp)\n",
    "path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "Path.BASE_PATH = path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "from kaggle import api\n",
    "\n",
    "if not path.exists():\n",
    "    path.mkdir(parents=true)\n",
    "    api.competition_download_cli(comp, path=path)\n",
    "    shutil.unpack_archive(str(path/f'{comp}.zip'), str(path))\n",
    "\n",
    "path.ls(file_type='text')"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 38,
   "metadata": {},
   "source": [
    "### Look at the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['ProductSize'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['ProductSize'] = df['ProductSize'].astype('category')\n",
    "df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "dep_var = 'SalePrice'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[dep_var] = np.log(df[dep_var])"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 51,
   "metadata": {},
   "source": [
    "## Decision Trees"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 56,
   "metadata": {},
   "source": [
    "### Handling Dates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = add_datepart(df, 'saledate')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test = pd.read_csv(path/'Test.csv', low_memory=False)\n",
    "df_test = add_datepart(df_test, 'saledate')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "' '.join(o for o in df.columns if o.startswith('sale'))"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 64,
   "metadata": {},
   "source": [
    "### Using TabularPandas and TabularProc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "procs = [Categorify, FillMissing]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "cond = (df.saleYear<2011) | (df.saleMonth<10)\n",
    "train_idx = np.where( cond)[0]\n",
    "valid_idx = np.where(~cond)[0]\n",
    "\n",
    "splits = (list(train_idx),list(valid_idx))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "cont,cat = cont_cat_split(df, 1, dep_var=dep_var)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(to.train),len(to.valid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "to.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "to1 = TabularPandas(df, procs, ['state', 'ProductGroup', 'Drive_System', 'Enclosure'], [], y_names=dep_var, splits=splits)\n",
    "to1.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 78,
   "metadata": {},
   "outputs": [],
   "source": [
    "to.items.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "to1.items[['state', 'ProductGroup', 'Drive_System', 'Enclosure']].head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "to.classes['ProductSize']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 83,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_pickle(path/'to.pkl',to)"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 86,
   "metadata": {},
   "source": [
    "### Creating the Decision Tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "to = load_pickle(path/'to.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs,y = to.train.xs,to.train.y\n",
    "valid_xs,valid_y = to.valid.xs,to.valid.y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = DecisionTreeRegressor(max_leaf_nodes=4)\n",
    "m.fit(xs, y);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 93,
   "metadata": {},
   "outputs": [],
   "source": [
    "draw_tree(m, xs, size=10, leaves_parallel=True, precision=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "samp_idx = np.random.permutation(len(y))[:500]\n",
    "dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,\n",
    "        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,\n",
    "        orientation='LR')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950\n",
    "valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 100,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)\n",
    "\n",
    "dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,\n",
    "        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,\n",
    "        orientation='LR')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = DecisionTreeRegressor()\n",
    "m.fit(xs, y);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 104,
   "metadata": {},
   "outputs": [],
   "source": [
    "def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)\n",
    "def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 105,
   "metadata": {},
   "outputs": [],
   "source": [
    "m_rmse(m, xs, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 107,
   "metadata": {},
   "outputs": [],
   "source": [
    "m_rmse(m, valid_xs, valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 109,
   "metadata": {},
   "outputs": [],
   "source": [
    "m.get_n_leaves(), len(xs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 111,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = DecisionTreeRegressor(min_samples_leaf=25)\n",
    "m.fit(to.train.xs, to.train.y)\n",
    "m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 113,
   "metadata": {},
   "outputs": [],
   "source": [
    "m.get_n_leaves()"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 117,
   "metadata": {},
   "source": [
    "### Categorical Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 121,
   "metadata": {},
   "source": [
    "## Random Forests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 123,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "# pip install —pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn —U"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 124,
   "metadata": {},
   "source": [
    "### Creating a Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 126,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rf(xs, y, n_estimators=40, max_samples=200_000,\n",
    "       max_features=0.5, min_samples_leaf=5, **kwargs):\n",
    "    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,\n",
    "        max_samples=max_samples, max_features=max_features,\n",
    "        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = rf(xs, y);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 129,
   "metadata": {},
   "outputs": [],
   "source": [
    "m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 133,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds = np.stack([t.predict(valid_xs) for t in m.estimators_])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 135,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_mse(preds.mean(0), valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 137,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 139,
   "metadata": {},
   "source": [
    "### Out-of-Bag Error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 141,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_mse(m.oob_prediction_, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 144,
   "metadata": {},
   "source": [
    "## Model Interpretation"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 146,
   "metadata": {},
   "source": [
    "### Tree Variance for Prediction Confidence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 148,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds = np.stack([t.predict(valid_xs) for t in m.estimators_])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 149,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 151,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_std = preds.std(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 153,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_std[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 155,
   "metadata": {},
   "source": [
    "### Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 157,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rf_feat_importance(m, df):\n",
    "    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}\n",
    "                       ).sort_values('imp', ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 159,
   "metadata": {},
   "outputs": [],
   "source": [
    "fi = rf_feat_importance(m, xs)\n",
    "fi[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 161,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_fi(fi):\n",
    "    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)\n",
    "\n",
    "plot_fi(fi[:30]);"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 163,
   "metadata": {},
   "source": [
    "### Removing Low-Importance Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 165,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_keep = fi[fi.imp>0.005].cols\n",
    "len(to_keep)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 167,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs_imp = xs[to_keep]\n",
    "valid_xs_imp = valid_xs[to_keep]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 168,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = rf(xs_imp, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 170,
   "metadata": {},
   "outputs": [],
   "source": [
    "m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 172,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(xs.columns), len(xs_imp.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 174,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_fi(rf_feat_importance(m, xs_imp));"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 176,
   "metadata": {},
   "source": [
    "### Removing Redundant Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 178,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_columns(xs_imp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 180,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_oob(df):\n",
    "    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,\n",
    "        max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)\n",
    "    m.fit(df, y)\n",
    "    return m.oob_score_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 182,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_oob(xs_imp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 184,
   "metadata": {},
   "outputs": [],
   "source": [
    "{c:get_oob(xs_imp.drop(c, axis=1)) for c in (\n",
    "    'saleYear', 'saleElapsed', 'ProductGroupDesc','ProductGroup',\n",
    "    'fiModelDesc', 'fiBaseModel',\n",
    "    'Hydraulics_Flow','Grouser_Tracks', 'Coupler_System')}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 186,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']\n",
    "get_oob(xs_imp.drop(to_drop, axis=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 188,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs_final = xs_imp.drop(to_drop, axis=1)\n",
    "valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 189,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_pickle(path/'xs_final.pkl', xs_final)\n",
    "save_pickle(path/'valid_xs_final.pkl', valid_xs_final)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 191,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs_final = load_pickle(path/'xs_final.pkl')\n",
    "valid_xs_final = load_pickle(path/'valid_xs_final.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 193,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = rf(xs_final, y)\n",
    "m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 195,
   "metadata": {},
   "source": [
    "### Partial Dependence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 197,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()\n",
    "c = to.classes['ProductSize']\n",
    "plt.yticks(range(len(c)), c);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 199,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = valid_xs_final['YearMade'].hist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 201,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.inspection import plot_partial_dependence\n",
    "\n",
    "fig,ax = plt.subplots(figsize=(12, 4))\n",
    "plot_partial_dependence(m, valid_xs_final, ['YearMade','ProductSize'],\n",
    "                        grid_resolution=20, ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 203,
   "metadata": {},
   "source": [
    "### Data Leakage"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 205,
   "metadata": {},
   "source": [
    "### Tree Interpreter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 206,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "import warnings\n",
    "warnings.simplefilter('ignore', FutureWarning)\n",
    "\n",
    "from treeinterpreter import treeinterpreter\n",
    "from waterfall_chart import plot as waterfall"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 209,
   "metadata": {},
   "outputs": [],
   "source": [
    "row = valid_xs_final.iloc[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 211,
   "metadata": {},
   "outputs": [],
   "source": [
    "prediction,bias,contributions = treeinterpreter.predict(m, row.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 213,
   "metadata": {},
   "outputs": [],
   "source": [
    "prediction[0], bias[0], contributions[0].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 215,
   "metadata": {},
   "outputs": [],
   "source": [
    "waterfall(valid_xs_final.columns, contributions[0], threshold=0.08, \n",
    "          rotation_value=45,formatting='{:,.3f}');"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 218,
   "metadata": {},
   "source": [
    "## Extrapolation and Neural Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 220,
   "metadata": {},
   "source": [
    "### The Extrapolation Problem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 221,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 223,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_lin = torch.linspace(0,20, steps=40)\n",
    "y_lin = x_lin + torch.randn_like(x_lin)\n",
    "plt.scatter(x_lin, y_lin);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 225,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs_lin = x_lin.unsqueeze(1)\n",
    "x_lin.shape,xs_lin.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 227,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_lin[:,None].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 229,
   "metadata": {},
   "outputs": [],
   "source": [
    "m_lin = RandomForestRegressor().fit(xs_lin[:30],y_lin[:30])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 231,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(x_lin, y_lin, 20)\n",
    "plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 233,
   "metadata": {},
   "source": [
    "### Finding Out-of-Domain Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 235,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_dom = pd.concat([xs_final, valid_xs_final])\n",
    "is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))\n",
    "\n",
    "m = rf(df_dom, is_valid)\n",
    "rf_feat_importance(m, df_dom)[:6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 237,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = rf(xs_final, y)\n",
    "print('orig', m_rmse(m, valid_xs_final, valid_y))\n",
    "\n",
    "for c in ('SalesID','saleElapsed','MachineID'):\n",
    "    m = rf(xs_final.drop(c,axis=1), y)\n",
    "    print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 239,
   "metadata": {},
   "outputs": [],
   "source": [
    "time_vars = ['SalesID','MachineID']\n",
    "xs_final_time = xs_final.drop(time_vars, axis=1)\n",
    "valid_xs_time = valid_xs_final.drop(time_vars, axis=1)\n",
    "\n",
    "m = rf(xs_final_time, y)\n",
    "m_rmse(m, valid_xs_time, valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 241,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs['saleYear'].hist();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 243,
   "metadata": {},
   "outputs": [],
   "source": [
    "filt = xs['saleYear']>2004\n",
    "xs_filt = xs_final_time[filt]\n",
    "y_filt = y[filt]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 244,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = rf(xs_filt, y_filt)\n",
    "m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 246,
   "metadata": {},
   "source": [
    "### Using a Neural Network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 248,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)\n",
    "df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')\n",
    "df_nn['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)\n",
    "df_nn[dep_var] = np.log(df_nn[dep_var])\n",
    "df_nn = add_datepart(df_nn, 'saledate')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 250,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 252,
   "metadata": {},
   "outputs": [],
   "source": [
    "cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 254,
   "metadata": {},
   "outputs": [],
   "source": [
    "cont_nn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 256,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_nn_final[cat_nn].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 258,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)\n",
    "valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)\n",
    "m2 = rf(xs_filt2, y_filt)\n",
    "m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 260,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_nn.remove('fiModelDescriptor')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 262,
   "metadata": {},
   "outputs": [],
   "source": [
    "procs_nn = [Categorify, FillMissing, Normalize]\n",
    "to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,\n",
    "                      splits=splits, y_names=dep_var)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 264,
   "metadata": {},
   "outputs": [],
   "source": [
    "dls = to_nn.dataloaders(1024)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 266,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = to_nn.train.y\n",
    "y.min(),y.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 268,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],\n",
    "                        n_out=1, loss_func=F.mse_loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 269,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.lr_find()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 271,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.fit_one_cycle(5, 1e-2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 273,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds,targs = learn.get_preds()\n",
    "r_mse(preds,targs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 275,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.save('nn')"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 276,
   "metadata": {},
   "source": [
    "### Sidebar: fastai's Tabular Classes"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 278,
   "metadata": {},
   "source": [
    "### End sidebar"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 280,
   "metadata": {},
   "source": [
    "## Ensembling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 282,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_preds = m.predict(valid_xs_time)\n",
    "ens_preds = (to_np(preds.squeeze()) + rf_preds) /2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 284,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_mse(ens_preds,valid_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 286,
   "metadata": {},
   "source": [
    "### Boosting"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 289,
   "metadata": {},
   "source": [
    "### Combining Embeddings with Other Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 293,
   "metadata": {},
   "source": [
    "## Conclusion: Our Advice for Tabular Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 295,
   "metadata": {},
   "source": [
    "## Questionnaire"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 296,
   "metadata": {},
   "source": [
    "1. What is a continuous variable?\n",
    "1. What is a categorical variable?\n",
    "1. Provide two of the words that are used for the possible values of a categorical variable.\n",
    "1. What is a \"dense layer\"?\n",
    "1. How do entity embeddings reduce memory usage and speed up neural networks?\n",
    "1. What kinds of datasets are entity embeddings especially useful for?\n",
    "1. What are the two main families of machine learning algorithms?\n",
    "1. Why do some categorical columns need a special ordering in their classes? How do you do this in Pandas?\n",
    "1. Summarize what a decision tree algorithm does.\n",
    "1. Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?\n",
    "1. Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?\n",
    "1. What is pickle and what is it useful for?\n",
    "1. How are `mse`, `samples`, and `values` calculated in the decision tree drawn in this chapter?\n",
    "1. How do we deal with outliers, before building a decision tree?\n",
    "1. How do we handle categorical variables in a decision tree?\n",
    "1. What is bagging?\n",
    "1. What is the difference between `max_samples` and `max_features` when creating a random forest?\n",
    "1. If you increase `n_estimators` to a very high value, can that lead to overfitting? Why or why not?\n",
    "1. In the section \"Creating a Random Forest\", just after <<max_features>>, why did `preds.mean(0)` give the same result as our random forest?\n",
    "1. What is \"out-of-bag-error\"?\n",
    "1. Make a list of reasons why a model's validation set error might be worse than the OOB error. How could you test your hypotheses?\n",
    "1. Explain why random forests are well suited to answering each of the following question:\n",
    "   - How confident are we in our predictions using a particular row of data?\n",
    "   - For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n",
    "   - Which columns are the strongest predictors?\n",
    "   - How do predictions vary as we vary these columns?\n",
    "1. What's the purpose of removing unimportant variables?\n",
    "1. What's a good type of plot for showing tree interpreter results?\n",
    "1. What is the \"extrapolation problem\"?\n",
    "1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
    "1. Why do we ensure `saleElapsed` is a continuous variable, even although it has less than 9,000 distinct values?\n",
    "1. What is \"boosting\"?\n",
    "1. How could we use embeddings with a random forest? Would we expect this to help?\n",
    "1. Why might we not always use a neural net for tabular modeling?"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 297,
   "metadata": {},
   "source": [
    "### Further Research"
   ]
  },
  {
   "cell_type": "markdown",
   "idx_": 298,
   "metadata": {},
   "source": [
    "1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare your results to the private leaderboard.\n",
    "1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.\n",
    "1. Use the embeddings from the neural net in this chapter in a random forest, and see if you can improve on the random forest results we saw.\n",
    "1. Explain what each line of the source of `TabularModel` does (with the exception of the `BatchNorm1d` and `Dropout` layers)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "idx_": 299,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4,
 "path_": "09_tabular.ipynb"
}