2020-03-06 18:19:03 +00:00
{
"cells": [
2020-09-03 22:51:00 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 0,
2020-09-03 22:51:00 +00:00
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
2022-04-25 21:02:49 +00:00
"! [ -e /content ] && pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz\n",
2020-09-03 22:51:00 +00:00
"import fastbook\n",
"fastbook.setup_book()"
]
},
2020-03-06 18:19:03 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 1,
2020-09-03 22:58:27 +00:00
"metadata": {},
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"#hide\n",
2020-09-03 22:51:00 +00:00
"from fastbook import *\n",
2020-03-06 18:19:03 +00:00
"from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype\n",
2020-08-21 19:36:27 +00:00
"from fastai.tabular.all import *\n",
2020-03-06 18:19:03 +00:00
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from dtreeviz.trees import *\n",
"from IPython.display import Image, display_svg, SVG\n",
"\n",
"pd.options.display.max_rows = 20\n",
"pd.options.display.max_columns = 8"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 3,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"# Tabular Modeling Deep Dive"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 5,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Categorical Embeddings"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 21,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Beyond Deep Learning"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 25,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## The Dataset"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 27,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
"### Kaggle Competitions"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 29,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"creds = ''"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 31,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"cred_path = Path('~/.kaggle/kaggle.json').expanduser()\n",
"if not cred_path.exists():\n",
" cred_path.parent.mkdir(exist_ok=True)\n",
2020-11-29 18:40:59 +00:00
" cred_path.write_text(creds)\n",
2020-03-06 18:19:03 +00:00
" cred_path.chmod(0o600)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 33,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
2022-04-25 05:35:12 +00:00
"comp = 'bluebook-for-bulldozers'\n",
"path = URLs.path(comp)\n",
2020-03-06 18:19:03 +00:00
"path"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 34,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"Path.BASE_PATH = path"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 36,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
2022-04-25 05:35:12 +00:00
"from kaggle import api\n",
"\n",
2020-03-06 18:19:03 +00:00
"if not path.exists():\n",
2020-11-29 18:40:59 +00:00
" path.mkdir(parents=true)\n",
2022-04-25 05:35:12 +00:00
" api.competition_download_cli(comp, path=path)\n",
" shutil.unpack_archive(str(path/f'{comp}.zip'), str(path))\n",
2020-03-06 18:19:03 +00:00
"\n",
"path.ls(file_type='text')"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 38,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Look at the Data"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 40,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 41,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 43,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"df['ProductSize'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 45,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 46,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df['ProductSize'] = df['ProductSize'].astype('category')\n",
"df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 48,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"dep_var = 'SalePrice'"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 49,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df[dep_var] = np.log(df[dep_var])"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 51,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Decision Trees"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 56,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Handling Dates"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 58,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df = add_datepart(df, 'saledate')"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 60,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df_test = pd.read_csv(path/'Test.csv', low_memory=False)\n",
"df_test = add_datepart(df_test, 'saledate')"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 62,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"' '.join(o for o in df.columns if o.startswith('sale'))"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 64,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
"### Using TabularPandas and TabularProc"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 66,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"procs = [Categorify, FillMissing]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 68,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"cond = (df.saleYear<2011) | (df.saleMonth<10)\n",
"train_idx = np.where( cond)[0]\n",
"valid_idx = np.where(~cond)[0]\n",
"\n",
"splits = (list(train_idx),list(valid_idx))"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 70,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"cont,cat = cont_cat_split(df, 1, dep_var=dep_var)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 71,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 73,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"len(to.train),len(to.valid)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 75,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"to.show(3)"
]
},
2020-04-23 14:48:40 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 76,
2020-04-23 14:48:40 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-04-23 14:48:40 +00:00
"source": [
"to1 = TabularPandas(df, procs, ['state', 'ProductGroup', 'Drive_System', 'Enclosure'], [], y_names=dep_var, splits=splits)\n",
"to1.show(3)"
]
},
2020-03-06 18:19:03 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 78,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"to.items.head(3)"
]
},
2020-04-23 14:48:40 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 79,
2020-04-23 14:48:40 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-04-23 14:48:40 +00:00
"source": [
"to1.items[['state', 'ProductGroup', 'Drive_System', 'Enclosure']].head(3)"
]
},
2020-03-06 18:19:03 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 81,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"to.classes['ProductSize']"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 83,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
2020-11-29 18:40:59 +00:00
"save_pickle(path/'to.pkl',to)"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 86,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Creating the Decision Tree"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 88,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
2020-05-15 00:27:39 +00:00
"#hide\n",
2020-11-29 18:40:59 +00:00
"to = load_pickle(path/'to.pkl')"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 89,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"xs,y = to.train.xs,to.train.y\n",
"valid_xs,valid_y = to.valid.xs,to.valid.y"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 91,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"m = DecisionTreeRegressor(max_leaf_nodes=4)\n",
"m.fit(xs, y);"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 93,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
2020-11-29 18:40:59 +00:00
"draw_tree(m, xs, size=10, leaves_parallel=True, precision=2)"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 96,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"samp_idx = np.random.permutation(len(y))[:500]\n",
"dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,\n",
" fontname='DejaVu Sans', scale=1.6, label_fontsize=10,\n",
" orientation='LR')"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 98,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950\n",
"valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 100,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)\n",
"\n",
"dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,\n",
" fontname='DejaVu Sans', scale=1.6, label_fontsize=10,\n",
" orientation='LR')"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 102,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"m = DecisionTreeRegressor()\n",
"m.fit(xs, y);"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 104,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)\n",
"def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 105,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m_rmse(m, xs, y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 107,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m_rmse(m, valid_xs, valid_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 109,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m.get_n_leaves(), len(xs)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 111,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m = DecisionTreeRegressor(min_samples_leaf=25)\n",
"m.fit(to.train.xs, to.train.y)\n",
"m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 113,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m.get_n_leaves()"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 117,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Categorical Variables"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 121,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Random Forests"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 123,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
2020-05-19 01:18:45 +00:00
"# pip install —pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn —U"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 124,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Creating a Random Forest"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 126,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"def rf(xs, y, n_estimators=40, max_samples=200_000,\n",
" max_features=0.5, min_samples_leaf=5, **kwargs):\n",
" return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,\n",
" max_samples=max_samples, max_features=max_features,\n",
" min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 127,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"m = rf(xs, y);"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 129,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 133,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"preds = np.stack([t.predict(valid_xs) for t in m.estimators_])"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 135,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"r_mse(preds.mean(0), valid_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 137,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 139,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Out-of-Bag Error"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 141,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"r_mse(m.oob_prediction_, y)"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 144,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Model Interpretation"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 146,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Tree Variance for Prediction Confidence"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 148,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"preds = np.stack([t.predict(valid_xs) for t in m.estimators_])"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 149,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"preds.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 151,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"preds_std = preds.std(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 153,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"preds_std[:5]"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 155,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Feature Importance"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 157,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"def rf_feat_importance(m, df):\n",
" return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}\n",
" ).sort_values('imp', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 159,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"fi = rf_feat_importance(m, xs)\n",
"fi[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 161,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"def plot_fi(fi):\n",
" return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)\n",
"\n",
"plot_fi(fi[:30]);"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 163,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Removing Low-Importance Variables"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 165,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"to_keep = fi[fi.imp>0.005].cols\n",
"len(to_keep)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 167,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"xs_imp = xs[to_keep]\n",
"valid_xs_imp = valid_xs[to_keep]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 168,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"m = rf(xs_imp, y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 170,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 172,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"len(xs.columns), len(xs_imp.columns)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 174,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"plot_fi(rf_feat_importance(m, xs_imp));"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 176,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Removing Redundant Features"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 178,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"cluster_columns(xs_imp)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 180,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"def get_oob(df):\n",
" m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,\n",
" max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)\n",
" m.fit(df, y)\n",
" return m.oob_score_"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 182,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"get_oob(xs_imp)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 184,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"{c:get_oob(xs_imp.drop(c, axis=1)) for c in (\n",
" 'saleYear', 'saleElapsed', 'ProductGroupDesc','ProductGroup',\n",
" 'fiModelDesc', 'fiBaseModel',\n",
" 'Hydraulics_Flow','Grouser_Tracks', 'Coupler_System')}"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 186,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']\n",
"get_oob(xs_imp.drop(to_drop, axis=1))"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 188,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"xs_final = xs_imp.drop(to_drop, axis=1)\n",
"valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 189,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
2020-11-29 18:40:59 +00:00
"save_pickle(path/'xs_final.pkl', xs_final)\n",
"save_pickle(path/'valid_xs_final.pkl', valid_xs_final)"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 191,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
2020-11-29 18:40:59 +00:00
"xs_final = load_pickle(path/'xs_final.pkl')\n",
"valid_xs_final = load_pickle(path/'valid_xs_final.pkl')"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 193,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m = rf(xs_final, y)\n",
"m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 195,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Partial Dependence"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 197,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()\n",
"c = to.classes['ProductSize']\n",
"plt.yticks(range(len(c)), c);"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 199,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"ax = valid_xs_final['YearMade'].hist()"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 201,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"from sklearn.inspection import plot_partial_dependence\n",
"\n",
"fig,ax = plt.subplots(figsize=(12, 4))\n",
"plot_partial_dependence(m, valid_xs_final, ['YearMade','ProductSize'],\n",
" grid_resolution=20, ax=ax);"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 203,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Data Leakage"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 205,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Tree Interpreter"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 206,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"import warnings\n",
"warnings.simplefilter('ignore', FutureWarning)\n",
"\n",
"from treeinterpreter import treeinterpreter\n",
"from waterfall_chart import plot as waterfall"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 209,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"row = valid_xs_final.iloc[:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 211,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"prediction,bias,contributions = treeinterpreter.predict(m, row.values)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 213,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"prediction[0], bias[0], contributions[0].sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 215,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"waterfall(valid_xs_final.columns, contributions[0], threshold=0.08, \n",
" rotation_value=45,formatting='{:,.3f}');"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 218,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Extrapolation and Neural Networks"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 220,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### The Extrapolation Problem"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 221,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"np.random.seed(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 223,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"x_lin = torch.linspace(0,20, steps=40)\n",
"y_lin = x_lin + torch.randn_like(x_lin)\n",
"plt.scatter(x_lin, y_lin);"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 225,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"xs_lin = x_lin.unsqueeze(1)\n",
"x_lin.shape,xs_lin.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 227,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"x_lin[:,None].shape"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 229,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"m_lin = RandomForestRegressor().fit(xs_lin[:30],y_lin[:30])"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 231,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"plt.scatter(x_lin, y_lin, 20)\n",
"plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 233,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-15 00:27:39 +00:00
"### Finding Out-of-Domain Data"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 235,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"df_dom = pd.concat([xs_final, valid_xs_final])\n",
"is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))\n",
"\n",
"m = rf(df_dom, is_valid)\n",
"rf_feat_importance(m, df_dom)[:6]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 237,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m = rf(xs_final, y)\n",
"print('orig', m_rmse(m, valid_xs_final, valid_y))\n",
"\n",
"for c in ('SalesID','saleElapsed','MachineID'):\n",
" m = rf(xs_final.drop(c,axis=1), y)\n",
" print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 239,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"time_vars = ['SalesID','MachineID']\n",
"xs_final_time = xs_final.drop(time_vars, axis=1)\n",
"valid_xs_time = valid_xs_final.drop(time_vars, axis=1)\n",
"\n",
"m = rf(xs_final_time, y)\n",
"m_rmse(m, valid_xs_time, valid_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 241,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"xs['saleYear'].hist();"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 243,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"filt = xs['saleYear']>2004\n",
"xs_filt = xs_final_time[filt]\n",
"y_filt = y[filt]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 244,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"m = rf(xs_filt, y_filt)\n",
"m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 246,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Using a Neural Network"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 248,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)\n",
"# Make ProductSize an ordered categorical; assign the result, since `inplace=True` was removed in pandas 2.0\n",
"df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')\n",
"df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)\n",
"# Train on the log of the dependent variable\n",
"df_nn[dep_var] = np.log(df_nn[dep_var])\n",
"df_nn = add_datepart(df_nn, 'saledate')"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 250,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 252,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 254,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
2021-02-22 22:06:26 +00:00
"cont_nn"
2020-11-29 18:40:59 +00:00
]
},
2020-03-06 18:19:03 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 256,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"df_nn_final[cat_nn].nunique()"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 258,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"# Retrain the random forest without fiModelDescriptor to see how much the column affects RMSE\n",
"xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)\n",
"valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)\n",
"m2 = rf(xs_filt2, y_filt)\n",
2020-11-29 18:40:59 +00:00
"m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 260,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"cat_nn.remove('fiModelDescriptor')"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 262,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"procs_nn = [Categorify, FillMissing, Normalize]\n",
"to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,\n",
" splits=splits, y_names=dep_var)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 264,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"dls = to_nn.dataloaders(1024)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 266,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"# Inspect the range of the training targets before choosing y_range (note: this rebinds y)\n",
"y = to_nn.train.y\n",
"y.min(),y.max()"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 268,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],\n",
" n_out=1, loss_func=F.mse_loss)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 269,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"learn.lr_find()"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 271,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"learn.fit_one_cycle(5, 1e-2)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 273,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"preds,targs = learn.get_preds()\n",
"r_mse(preds,targs)"
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 275,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"learn.save('nn')"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 276,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Sidebar: fastai's Tabular Classes"
2020-04-23 14:48:40 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 278,
2020-04-23 14:48:40 +00:00
"metadata": {},
"source": [
"### End sidebar"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 280,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-15 00:27:39 +00:00
"## Ensembling"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 282,
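2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": [
"# Simple ensemble: average the neural net and random forest predictions on the validation set\n",
"rf_preds = m.predict(valid_xs_time)\n",
"ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2"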
]
},
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 284,
2020-03-06 18:19:03 +00:00
"metadata": {},
2020-09-03 22:58:27 +00:00
"outputs": [],
2020-03-06 18:19:03 +00:00
"source": [
"r_mse(ens_preds,valid_y)"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 286,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
"### Boosting"
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 289,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Combining Embeddings with Other Methods"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 293,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"## Conclusion: Our Advice for Tabular Modeling"
2020-03-06 18:19:03 +00:00
]
},
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 295,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
"## Questionnaire"
]
},
2020-03-18 00:34:07 +00:00
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 296,
2020-03-18 00:34:07 +00:00
"metadata": {},
"source": [
"1. What is a continuous variable?\n",
"1. What is a categorical variable?\n",
2020-05-15 00:27:39 +00:00
"1. Provide two of the words that are used for the possible values of a categorical variable.\n",
2020-03-18 00:34:07 +00:00
"1. What is a \"dense layer\"?\n",
"1. How do entity embeddings reduce memory usage and speed up neural networks?\n",
2020-05-15 00:27:39 +00:00
"1. What kinds of datasets are entity embeddings especially useful for?\n",
2020-03-18 00:34:07 +00:00
"1. What are the two main families of machine learning algorithms?\n",
2020-05-15 00:27:39 +00:00
"1. Why do some categorical columns need a special ordering in their classes? How do you do this in Pandas?\n",
2020-03-18 00:34:07 +00:00
"1. Summarize what a decision tree algorithm does.\n",
"1. Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?\n",
"1. Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?\n",
"1. What is pickle and what is it useful for?\n",
"1. How are `mse`, `samples`, and `values` calculated in the decision tree drawn in this chapter?\n",
"1. How do we deal with outliers, before building a decision tree?\n",
"1. How do we handle categorical variables in a decision tree?\n",
"1. What is bagging?\n",
"1. What is the difference between `max_samples` and `max_features` when creating a random forest?\n",
"1. If you increase `n_estimators` to a very high value, can that lead to overfitting? Why or why not?\n",
2020-05-15 00:27:39 +00:00
"1. In the section \"Creating a Random Forest\", just after <<max_features>>, why did `preds.mean(0)` give the same result as our random forest?\n",
"1. What is \"out-of-bag error\"?\n",
2020-03-18 00:34:07 +00:00
"1. Make a list of reasons why a model's validation set error might be worse than the OOB error. How could you test your hypotheses?\n",
2020-05-15 00:27:39 +00:00
"1. Explain why random forests are well suited to answering each of the following questions:\n",
" - How confident are we in our predictions using a particular row of data?\n",
2020-03-18 00:34:07 +00:00
" - For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n",
" - Which columns are the strongest predictors?\n",
2020-05-15 00:27:39 +00:00
" - How do predictions vary as we vary these columns?\n",
2020-03-18 00:34:07 +00:00
"1. What's the purpose of removing unimportant variables?\n",
"1. What's a good type of plot for showing tree interpreter results?\n",
2020-05-15 00:27:39 +00:00
"1. What is the \"extrapolation problem\"?\n",
"1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
2021-02-22 22:06:26 +00:00
"1. Why do we ensure `saleElapsed` is a continuous variable, even though it has fewer than 9,000 distinct values?\n",
2020-05-15 00:27:39 +00:00
"1. What is \"boosting\"?\n",
2020-03-18 00:34:07 +00:00
"1. How could we use embeddings with a random forest? Would we expect this to help?\n",
"1. Why might we not always use a neural net for tabular modeling?"
]
},
2020-03-06 18:19:03 +00:00
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 297,
2020-03-06 18:19:03 +00:00
"metadata": {},
"source": [
2020-05-14 12:18:31 +00:00
"### Further Research"
2020-03-06 18:19:03 +00:00
]
},
2020-03-18 00:34:07 +00:00
{
"cell_type": "markdown",
2023-03-29 22:56:24 +00:00
"idx_": 298,
2020-03-18 00:34:07 +00:00
"metadata": {},
"source": [
2020-05-15 00:27:39 +00:00
"1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare your results to the private leaderboard.\n",
2020-11-29 18:40:59 +00:00
"1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.\n",
2020-03-18 00:34:07 +00:00
"1. Use the embeddings from the neural net in this chapter in a random forest, and see if you can improve on the random forest results we saw.\n",
"1. Explain what each line of the source of `TabularModel` does (with the exception of the `BatchNorm1d` and `Dropout` layers)."
]
},
2020-03-06 18:19:03 +00:00
{
"cell_type": "code",
"execution_count": null,
2023-03-29 22:56:24 +00:00
"idx_": 299,
2020-03-06 18:19:03 +00:00
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
2023-03-29 22:56:24 +00:00
"display_name": "python3",
2020-03-06 18:19:03 +00:00
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
2023-03-29 22:56:24 +00:00
"nbformat_minor": 4,
"path_": "09_tabular.ipynb"
2020-03-06 18:19:03 +00:00
}