mirror of https://github.com/fastai/fastbook.git
This commit is contained in:
parent 1202b218f4
commit 96012fa26c
09_tabular.ipynb
@@ -40,7 +40,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"TK Write introduction mentioning all machine learning techniques introduced in this cahpter."
+"Tabular modelling takes data in the form of a table (like a spreadsheet or CSV--comma separated values). The objective is to predict the value in one column, based on the values in the other columns."
 ]
 },
 {
@@ -68,9 +68,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"At the end of 2015 the [Rossmann sales competition](https://www.kaggle.com/c/rossmann-store-sales) ran on Kaggle. Competitors were given a wide range of information about various stores in Germany, and were tasked with trying to predict their sales on a number of day. The goal was to help them to manage stock properly and to be able to properly satisfy the demand without holding unnecessary inventory. The official training set provided a lot of information about the stores. It was also permitted for competitors to use additional data, as long as that data was made public and available to all participants.\n",
+"At the end of 2015, the [Rossmann sales competition](https://www.kaggle.com/c/rossmann-store-sales) ran on Kaggle. Competitors were given a wide range of information about various stores in Germany, and were tasked with trying to predict sales on a number of days. The goal was to help them to manage stock properly and to be able to properly satisfy the demand without holding unnecessary inventory. The official training set provided a lot of information about the stores. It was also permitted for competitors to use additional data, as long as that data was made public and available to all participants.\n",
 "\n",
-"One of the gold medalists used deep learning, in one of the earliest known examples of a state of the art deep learning tabular model. Their method involved far less feature engineering based on domain knowledge than the other gold medalists. They wrote a paper, [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737), about their approach. In an online-only chapter on the book website we show how to replicate their approach from scratch and attain the same accuracy shown in the paper. In the abstract of the paper they say:"
+"One of the gold medalists used deep learning, in one of the earliest known examples of a state of the art deep learning tabular model. Their method involved far less feature engineering, based on domain knowledge, than the other gold medalists. They wrote a paper, [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737), about their approach. In an online-only chapter on the book website we show how to replicate their approach from scratch and attain the same accuracy shown in the paper. In the abstract of the paper they say:"
 ]
 },
 {
@@ -86,7 +86,7 @@
 "source": [
 "We have already noticed all of these points when we built our collaborative filtering model. We can clearly see that these insights go far beyond just collaborative filtering, however.\n",
 "\n",
-"The paper also points out that (as we discussed in the last chapter) an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot encoded input layer. They used the diagram in <<entity_emb>> to show this equivalence. Note that \"dense layer\" is another term with the same meaning as \"linear layer\", the one-hot encoding layers represent inputs."
+"The paper also points out (as we discussed in the last chapter) that an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot encoded input layer. They used the diagram in <<entity_emb>> to show this equivalence. Note that \"dense layer\" is another term with the same meaning as \"linear layer\"; the one-hot encoding layers represent inputs."
 ]
 },
 {
@@ -150,11 +150,11 @@
 "source": [
 "What stands out in these two examples is that we provide the model with fundamentally categorical data about discrete entities (German states or days of the week), and then the model learns an embedding for these entities which defines a continuous notion of distance between them. Because the embedding distance was learned based on real patterns in the data, that distance tends to match up with our intuitions.\n",
 "\n",
-"In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights, continuous activation values, all updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n",
+"In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights and continuous activation values, which are updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n",
 "\n",
 "It is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
 "\n",
-"This concatenation approach is, for instance, how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in <<google_recsys>> from their paper."
+"An example using this concatenation approach is how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in this figure from their paper:"
 ]
 },
 {
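For readers who want to see the concatenation idea in code, here is a minimal PyTorch sketch. The column counts, cardinalities, and layer sizes below are invented for illustration; this is not the notebook's own model.

```python
import torch
from torch import nn

class EmbedConcatModel(nn.Module):
    "Embed each categorical column, concatenate with the continuous columns, then apply dense layers."
    def __init__(self, cardinalities, emb_size, n_cont, n_hidden):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(card, emb_size) for card in cardinalities])
        self.layers = nn.Sequential(
            nn.Linear(len(cardinalities) * emb_size + n_cont, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, x_cat, x_cont):
        # one embedding lookup per categorical column
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        # concatenate the embedding outputs with the raw continuous inputs
        return self.layers(torch.cat(embs + [x_cont], dim=1))

# e.g. two categorical columns (7 weekdays, 16 states) and 3 continuous columns
model = EmbedConcatModel(cardinalities=[7, 16], emb_size=4, n_cont=3, n_hidden=32)
x_cat  = torch.stack([torch.randint(0, 7, (8,)), torch.randint(0, 16, (8,))], dim=1)
x_cont = torch.randn(8, 3)
print(model(x_cat, x_cont).shape)   # torch.Size([8, 1])
```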
@@ -186,7 +186,7 @@
 "source": [
 "Most machine learning courses will throw dozens of different algorithms at you, with a brief technical description of the math behind them and maybe a toy example. You're left confused by the enormous range of techniques shown and have little practical understanding of how to apply them.\n",
 "\n",
-"The good news is that modern machine learning can be distilled down to a couple of key techniques that are of very wide applicability. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:\n",
+"The good news is that modern machine learning can be distilled down to a couple of key techniques that are widely applicable. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:\n",
 "\n",
 "1. Ensembles of decision trees (i.e. Random Forests and Gradient Boosting Machines), mainly for structured data (such as you might find in a database table at most companies)\n",
 "1. Multi-layered neural networks learnt with SGD (i.e. shallow and/or deep learning), mainly for unstructured data (such as audio, vision, and natural language)"
@@ -207,7 +207,7 @@
 "- There are some high cardinality categorical variables that are very important (\"cardinality\" refers to the number of discrete levels representing categories, so a high cardinality categorical variable is something like a ZIP Code, which can take on thousands of possible levels)\n",
 "- There are some columns which contain data which would be best understood with a neural network, such as plaintext data.\n",
 "\n",
-"In practice, when we deal with datasets which meet these exceptional conditions, we would always try both decision tree ensembles and deep learning to see which works best. For instance, in our case of collaborative filtering you by definition have at least two high cardinality categorical variables, the users and the movies, so it is likely that deep learning would be a useful approach. But in practice things tend to be less cut and dried, and there will often be a mixture of high and low cardinality categorical variables and continuous variables.\n",
+"In practice, when we deal with datasets which meet these exceptional conditions, we would always try both decision tree ensembles and deep learning to see which works best. It is likely that deep learning would be a useful approach in our example of collaborative filtering, as you have at least two high cardinality categorical variables: the users and the movies. But in practice things tend to be less cut and dried, and there will often be a mixture of high and low cardinality categorical variables and continuous variables.\n",
 "\n",
 "Either way, it's clear that we are going to need to add decision tree ensembles to our modelling toolbox!"
 ]
@@ -387,7 +387,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Kaggle provides info about some of the fields of our dataset; on the [Kaggle Data info](https://www.kaggle.com/c/bluebook-for-bulldozers/data) page they say that the key fields in `train.csv` are:\n",
+"Kaggle provides information about some of the fields of our dataset; on the [Kaggle Data info](https://www.kaggle.com/c/bluebook-for-bulldozers/data) page they say that the key fields in `train.csv` are:\n",
 "\n",
 "- SalesID: the unique identifier of the sale\n",
 "- MachineID: the unique identifier of a machine. A machine can be sold multiple times\n",
@@ -498,7 +498,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metric, actually measures the notion of model quality which matters to you. If no variable represents that metric, you should say if you can build the metric from the variables which are available.\n",
+"The most important data column is the dependent variable, that is, the one we want to predict. Recall that a model's metric is a function that reflects how good the predictions are. It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality which matters to you. If no variable represents that metric, you should see if you can build the metric from the variables which are available.\n",
 "\n",
 "However, in this case Kaggle tells us what metric to use: RMSLE (root mean squared log error) between the actual and predicted auction prices. Here we need to do only a small amount of processing to use this: we take the log of the prices, so that the `m_rmse` of that value will give us what we ultimately need."
 ]
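A small illustration of why taking the log up front is enough: once the dependent variable is stored as log(price), plain RMSE of the log-scale predictions is the same number as RMSLE on the raw prices. The prices below are made-up values, not rows from the dataset.

```python
import numpy as np

def rmse(pred, targ):
    return np.sqrt(((pred - targ) ** 2).mean())

actual_prices    = np.array([ 9500., 66000., 21500.])   # made-up auction prices
predicted_prices = np.array([10000., 60000., 23000.])

# RMSLE computed directly on the raw prices ...
rmsle = rmse(np.log(predicted_prices), np.log(actual_prices))

# ... is identical to ordinary RMSE once the dependent variable is log(price)
log_targets = np.log(actual_prices)
assert np.isclose(rmsle, rmse(np.log(predicted_prices), log_targets))
print(rmsle)
```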
@@ -539,7 +539,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data, and on that basis makes a prediction. For instance, the first question might be \"was the equipment manufactured before 1990?\" The second question will depend on the result of the first question (which is why this is a tree); for equipment manufactured for 1990 the second question might be \"was the auction after 2005?\" And so forth…\n",
+"Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data. After each question the data at that part of the tree is split between a \"yes\" and a \"no\" branch. After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required.\n",
 "\n",
 "TK: Adding a figure here might be useful\n",
 "\n",
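To make the "series of yes/no questions" concrete, here is a tiny scikit-learn tree fitted to synthetic data. The column names echo the bulldozers dataset but the values are randomly generated, so the particular splits are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "YearMade": rng.integers(1980, 2012, 500),
    "saleYear": rng.integers(2000, 2012, 500),
})
# synthetic log-price that depends on both columns
log_price = 8 + 0.05 * (df.YearMade - 1980) + 0.02 * (df.saleYear - 2000) + rng.normal(0, 0.1, 500)

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(df, log_price)
print(export_text(tree, feature_names=list(df.columns)))   # the sequence of binary questions
```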
@@ -564,7 +564,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"> A: Here's a productive question to ponder. If you consider that the procedure for defining decision tree essentially chooses _sequence of splitting questions about variables_, you might ask yourself, how do we know this procedure chooses the _correct sequence_? The rule is to choose the splitting question which produces the best split, and then to apply the same rule to groups that split produces, and so on (this is known in computer science as a \"greedy\" approach). Can you imagine a scenario in which asking a “less powerful” splitting question would enable a better split down the road (or should I say down the trunk!) and lead to a better result overall?"
+"> A: Here's a productive question to ponder. If you consider that the procedure for defining a decision tree essentially chooses one _sequence of splitting questions about variables_, you might ask yourself, how do we know this procedure chooses the _correct sequence_? The rule is to choose the splitting question which produces the best split, and then to apply the same rule to the groups that split produces, and so on (this is known in computer science as a \"greedy\" approach). Can you imagine a scenario in which asking a “less powerful” splitting question would enable a better split down the road (or should I say down the trunk!) and lead to a better result overall?"
 ]
 },
 {
@@ -661,9 +661,9 @@
 "A second piece of preparatory processing is to be sure we can handle strings and missing data. Out of the box, sklearn cannot do either. Instead we will use fastai's class `TabularPandas`, which wraps a Pandas data frame and provides a few conveniences. To populate a `TabularPandas`, we will use two `TabularProc`s, `Categorify` and `FillMissing`. A `TabularProc` is like a regular `Transform`, except that:\n",
 "\n",
 "- It returns the exact same object that's passed to it, after modifying the object *in-place*, and\n",
-"- It runs the transform once, when data is first passed in, rather than lazily as the data is access.\n",
+"- It runs the transform once, when data is first passed in, rather than lazily as the data is accessed.\n",
 "\n",
-"`Categorify` is a `TabularProc` which replaces a column with a numeric categorical column. `FillMissing` is an `TabularProc` which replaces missing values with the median of the column, and creates a new boolean column that is set to True for any row where the value was missing. These two transforms are needed for nearly every tabular dataset you will use, so it's a good starting point for your data processing."
+"`Categorify` is a `TabularProc` which replaces a column with a numeric categorical column. `FillMissing` is a `TabularProc` which replaces missing values with the median of the column, and creates a new boolean column that is set to True for any row where the value was missing. These two transforms are needed for nearly every tabular dataset you will use, so they're a good starting point for your data processing."
 ]
 },
 {
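A self-contained sketch of the `TabularPandas` workflow described above, on a toy data frame. The columns and values are invented stand-ins for the Bluebook data.

```python
import numpy as np
import pandas as pd
from fastai.tabular.all import TabularPandas, Categorify, FillMissing

df = pd.DataFrame({
    'ProductSize': ['Large', 'Small', None, 'Large', 'Medium', 'Small'],
    'YearMade':    [2004, 1998, 2001, np.nan, 1995, 2007],
    'SalePrice':   [10.3, 9.1, 9.8, 10.0, 8.9, 10.5],    # already log-transformed
})
splits = ([0, 1, 2, 3], [4, 5])                           # (training indices, validation indices)

to = TabularPandas(df, procs=[Categorify, FillMissing],
                   cat_names=['ProductSize'], cont_names=['YearMade'],
                   y_names='SalePrice', splits=splits)

# Categorify has replaced ProductSize with integer category codes, and FillMissing has
# filled in the median YearMade and added a boolean YearMade_na column
print(to.train.xs)
print(to.train.y)
```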
@@ -687,7 +687,7 @@
 "\n",
 "We don't get to see the test set. But we do want to define our validation data so that it has the same sort of relationship to the training data as the test set will have.\n",
 "\n",
-"In some cases, just randomly choosing a subset of your data points is will do that. This is not one of those cases, because it is a time series.\n",
+"In some cases, just randomly choosing a subset of your data points will do that. This is not one of those cases, because it is a time series.\n",
 "\n",
 "If you look at the date range represented in the test set, you will discover that it covers a six-month period from May 2012, which is later in time than any date in the training set. This is a good design, because the competition sponsor will want to ensure that a model is able to predict the future. But it means that if we are going to have a useful validation set, we also want the validation set to be later in time. The Kaggle training data ends in April 2012. So we will define a narrower training dataset which consists only of the Kaggle training data from before November 2011, and we define a validation set which is from after November 2011.\n",
 "\n",
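One way such a date-based split might look in code. The tiny `saleDate` column below is invented; the real notebook builds the condition from the Bluebook dates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'saleDate': pd.to_datetime(
    ['2010-05-01', '2011-03-15', '2011-10-30', '2011-11-02', '2012-01-20'])})

# train on everything before November 2011, validate on everything from November 2011 onwards,
# mirroring the way the hidden test set sits after the training data in time
cond = df.saleDate < '2011-11-01'
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))   # the format TabularPandas expects
print(splits)                                 # ([0, 1, 2], [3, 4])
```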
@@ -1495,11 +1495,11 @@
 "\n",
 "The top node represents the *initial model* before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions. It always predicts the value to be the average value of the whole dataset. In this case, we can see it predicts a value of 10.10 for the logarithm of the sales price. It gives a mean squared error of 0.48. The square root of this is 0.69. Remember that unless you see *m_rmse* or a *root mean squared error* then the value you are looking at is before taking the square root, so it is just the average of the square of the differences. We can also see that there are 404,710 auction records in this group — that is the total size of our training set. The final piece of information shown here is the decision criterion for the very first split that was found, which is to split based on the `coupler_system` column.\n",
 "\n",
-"Moving down and too the left, this node shows us that there were 360,847 auction records for equipment where `coupler_system` was less than 0.5. The average value of our dependent variable in this group is 10.21. But moving down and to the right from the initial model would take us to the records where `coupler_system` was greater than 0.5.\n",
+"Moving down and to the left, this node shows us that there were 360,847 auction records for equipment where `coupler_system` was less than 0.5. The average value of our dependent variable in this group is 10.21. But moving down and to the right from the initial model would take us to the records where `coupler_system` was greater than 0.5.\n",
 "\n",
-"The bottom row contains our *leaf nodes*, the nodes with no answers coming out of them, because there are no more questiosn to be answered. At the far right of this row is the node for `coupler_system` greater than 0.5, and we can see that the average value is 9.21. So we can see the decision tree algorithm did find a single binary decision which separated high value from low value auction results. Looking only at `coupler_system` predicts if the average value of 9.21 vs 10.1. That's if we ask only one question.\n",
+"The bottom row contains our *leaf nodes*, the nodes with no answers coming out of them, because there are no more questions to be answered. At the far right of this row is the node for `coupler_system` greater than 0.5, and we can see that the average value is 9.21. So we can see the decision tree algorithm did find a single binary decision which separated high value from low value auction results. Asking only about `coupler_system` predicts an average value of 9.21 vs 10.1. That's if we ask only one question.\n",
 "\n",
-"Returning back to the top node after the first decision point, we can see that a second binary decision split has been made, based on asking whether `YearMade` is less than or equal to 1991.5. For the group where this is true (remember, this is now following two binary decisions, both `coupler_system`, and `YearMade`) the average value is 9.97, and there are 155,724 auction records in this group. For the group of auctions where this decision is false, the average value is 10.4, and there are 205,123 records. So again, we can see that the decision tree algorithm has successfully further split our more expensive auction records into two groups which differ in value significantly."
+"Returning to the top node after the first decision point, we can see that a second binary decision split has been made, based on asking whether `YearMade` is less than or equal to 1991.5. For the group where this is true (remember, this is now following two binary decisions, both `coupler_system` and `YearMade`) the average value is 9.97, and there are 155,724 auction records in this group. For the group of auctions where this decision is false, the average value is 10.4, and there are 205,123 records. So again, we can see that the decision tree algorithm has successfully split our more expensive auction records into two more groups which differ in value significantly."
 ]
 },
 {
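The numbers in the tree diagram are nothing more exotic than group means and squared errors. A tiny numeric check of the idea, using invented values:

```python
import numpy as np

# the root's mean squared error of 0.48 corresponds to an RMSE of about 0.69
print(np.sqrt(0.48))                                      # ~0.693

# each split just partitions the rows, and each child node predicts its own group mean
log_prices = np.array([9.2, 9.1, 10.2, 10.4, 10.1])       # invented values
mask = np.array([True, True, False, False, False])        # e.g. "coupler_system > 0.5"
print(log_prices.mean())                                  # prediction before any split
print(log_prices[mask].mean(), log_prices[~mask].mean())  # predictions of the two children
```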
@@ -7140,7 +7140,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Let's now have the decision tree algorithm build a bigger tree (i.e we are not listing any stopping criteria such as `max_leaf_nodes`):"
+"Let's now have the decision tree algorithm build a bigger tree (i.e. we are not passing in any stopping criteria such as `max_leaf_nodes`):"
 ]
 },
 {
@@ -7313,7 +7313,9 @@
 "source": [
 "Much more reasonable!\n",
 "\n",
-"One thing that may have struck you during this process is that we have not done anything special to handle categorical variables."
+"Building a decision tree is a good way to create a model of our data. It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalises (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).\n",
+"\n",
+"But how do we get the best of both worlds? We'll show you right after we address an important missing detail: how to handle categorical variables."
 ]
 },
 {
@@ -7404,7 +7406,7 @@
 "source": [
 "We can create a random forest just like we created a decision tree, except now we are also specifying parameters that indicate how many trees should be in the forest, how we should subset the data items (the rows), and how we should subset the fields (the columns).\n",
 "\n",
-"In the function definition below, `n_estimators` defines the number of trees we want, and `max_samples` defines how many rows to sample for training each tree, and `max_features` defines how many columns to sample at each split point (where `0.5` means \"take half the total number of columns\"). We can also pass parameters for choosing when to stop splitting the tree nodes, effectively limiting the depth of tree, by including the same `min_samples_leaf` parameter we used in the last section. Finally, we pass `n_jobs=-1` to tell sklearn to use all our CPUs to build the trees in parallel. By creating a little function for this, we can more quickly try different variations in the rest of this chapter."
+"In the function definition below, `n_estimators` defines the number of trees we want, and `max_samples` defines how many rows to sample for training each tree, while `max_features` defines how many columns to sample at each split point (where `0.5` means \"take half the total number of columns\"). We can also pass parameters for choosing when to stop splitting the tree nodes, effectively limiting the depth of the tree, by including the same `min_samples_leaf` parameter we used in the last section. Finally, we pass `n_jobs=-1` to tell sklearn to use all our CPUs to build the trees in parallel. By creating a little function for this, we can more quickly try different variations in the rest of this chapter."
 ]
 },
 {
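The code cell with the function definition itself is not part of this diff. A hedged sketch of what such a helper might look like, with illustrative default values that are not necessarily the notebook's:

```python
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    "Fit a random forest using the handful of parameters discussed above."
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
                                 max_samples=max_samples, max_features=max_features,
                                 min_samples_leaf=min_samples_leaf,
                                 oob_score=True, **kwargs).fit(xs, y)

# m = rf(xs, y)   # xs, y being the training data prepared with TabularPandas earlier;
#                 # note max_samples must not exceed the number of training rows
```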
@@ -7632,7 +7634,7 @@
 "source": [
 "For tabular data, model interpretation is particularly important. For a given model, the things we are most likely to be interested in are:\n",
 "\n",
-"- How confident are we in our projections using a particular row of data?\n",
+"- How confident are we in our predictions using a particular row of data?\n",
 "- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n",
 "- Which columns are the strongest predictors, and which can we ignore?\n",
 "- Which columns are effectively redundant with each other, for purposes of prediction?\n",
@@ -7652,9 +7654,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We saw how the model averages predictions across the trees to get a prediction, that is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the *relative* confidence of predictions. That is, for rows where trees give very different results, you would want to be more cautious of using those results, compared to cases where they are more consistent.\n",
+"We saw how the model averages the individual tree predictions to get an overall prediction, that is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the *relative* confidence of predictions. That is, for rows where trees give very different results, you would want to be more cautious of using those results, compared to cases where they are more consistent.\n",
 "\n",
-"In the last lesson we saw how to get predictions over the validation set, using a Python list comprehension to do this for each tree in the forest:"
+"In the earlier section on creating a random forest, we saw how to get predictions over the validation set, using a Python list comprehension to do this for each tree in the forest:"
 ]
 },
 {
@@ -7690,7 +7692,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"No we have a prediction for every tree and every auction, for 160 trees and 7988 auctions in the validation set.\n",
+"Now we have a prediction for every tree and every auction, for 160 trees and 7988 auctions in the validation set.\n",
 "\n",
 "Using this we can get the standard deviation of the predictions over all the trees, for each auction:"
 ]
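The actual computation lives in code cells that this diff does not show. A self-contained sketch of the idea on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# toy stand-in: in the notebook, the forest is fit on the bulldozer data and
# the predictions are taken over the real validation set
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X[:400], y[:400])
valid_xs = X[400:]

preds = np.stack([t.predict(valid_xs) for t in m.estimators_])   # one row per tree
print(preds.shape)            # (40, 100): n_trees x n_validation_rows
preds_std = preds.std(0)      # per-row spread across the trees = relative confidence
```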
@@ -7735,7 +7737,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As you can see, the confidence of the predictions varies widely. For some auctions, there is a low standard deviation because the trees agree. For others, it's higher, as the trees don't agree. This is information that would be useful to use in a production setting; for instance, if you were using this model to decide what items to bid on at auction, a low-confidence prediction may cause you to look more carefully into an item before you made a bid"
+"As you can see, the confidence of the predictions varies widely. For some auctions, there is a low standard deviation because the trees agree. For others, it's higher, as the trees don't agree. This is information that would be useful to use in a production setting; for instance, if you were using this model to decide what items to bid on at auction, a low-confidence prediction may cause you to look more carefully into an item before you made a bid."
 ]
 },
 {
@@ -8068,7 +8070,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings. Let's try to remove redundent features. "
+"One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings, for example `ProductGroup` and `ProductGroupDesc`. Let's try to remove redundant features."
 ]
 },
 {
@@ -8078,15 +8080,6 @@
 "### Removing redundant features"
 ]
 },
 {
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"We can do this using \"*hierarchical cluster analysis*\", which find pairs of columns that are the most similar, and replaces them with the average of those columns. It does this recursively, until there's just one column. It plots a \"*dendrogram*\", which shows which columns were combined in which order, and how far away they are from each other.\n",
-"\n",
-"Here's how it looks:"
-]
-},
-{
 "cell_type": "code",
 "execution_count": null,
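One way to produce such a column-similarity dendrogram; the notebook itself may rely on a fastbook helper, so treat this as a standalone sketch using scipy and Spearman rank correlation between columns:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_columns(df):
    "Plot a dendrogram of column similarity based on Spearman rank correlation."
    corr = spearmanr(df).correlation               # column-by-column rank correlation matrix
    corr = np.nan_to_num((corr + corr.T) / 2)      # force symmetry, replace NaNs
    np.fill_diagonal(corr, 1.0)
    condensed = squareform(1 - corr)               # turn similarity into a condensed distance matrix
    z = hierarchy.linkage(condensed, method='average')
    hierarchy.dendrogram(z, labels=list(df.columns), orientation='left')
    plt.show()

# cluster_columns(xs)   # xs: the training-set features built earlier
```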
@@ -8120,9 +8113,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In the chart above, you can see pairs of columns that were extremely similar as the ones that were merged together early, far away from the \"root\" of the tree at the left. For example, the fields `ProductGroup` and `ProductGroupDesc` were merged quite early, as were `saleYear` and `saleElapsed`, and as were `fiModelDesc` and `fiBaseModel`. These might be so closely correlated they are practically synonyms for each other.\n",
+"In the chart above, you can see pairs of columns that were extremely similar as the ones that were merged together early, far away from the \"root\" of the tree at the left. Unsurprisingly, the fields `ProductGroup` and `ProductGroupDesc` were merged quite early, as were `saleYear` and `saleElapsed`, and as were `fiModelDesc` and `fiBaseModel`. These might be so closely correlated they are practically synonyms for each other.\n",
 "\n",
-"Let's try removing some of these closely related features to see if the model can be simplified without impacting the accuracy. First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and higher `min_samples_leaf` . The *score* is a number returned by sklearn that is 1.0 for a perfect model, and 0.0 for a random model. (In statistics it's called *R^2*, although the details aren't important for this explanation). We don't need it to be very accurate--we're just going to use it to compare different models, based on removing some of the possibly redundent columns."
+"Let's try removing some of these closely related features to see if the model can be simplified without impacting the accuracy. First, we create a function that quickly trains a random forest and returns the OOB score, by using a lower `max_samples` and higher `min_samples_leaf`. The *score* is a number returned by sklearn that is 1.0 for a perfect model, and 0.0 for a random model. (In statistics it's called *R^2*, although the details aren't important for this explanation). We don't need it to be very accurate--we're just going to use it to compare different models, based on removing some of the possibly redundant columns."
 ]
 },
 {
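A hedged sketch of the kind of quick-scoring helper described above. The parameter values are illustrative, `xs` and `y` are the training data from earlier, and `max_samples` must not exceed the number of rows.

```python
from sklearn.ensemble import RandomForestRegressor

def get_oob(xs, y):
    "Quickly fit a smallish forest and return its out-of-bag R^2 score."
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              max_samples=50_000, max_features=0.5,
                              n_jobs=-1, oob_score=True)
    m.fit(xs, y)
    return m.oob_score_

# baseline score, then the score with one of the candidate redundant columns dropped:
# get_oob(xs, y)
# get_oob(xs.drop('ProductGroupDesc', axis=1), y)
```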
@@ -8454,15 +8447,15 @@
 "\n",
 "Data leakage is subtle and can take many forms. In particular, missing values often represent data leakage.\n",
 "\n",
-"For instance, Jeremy competed in a Kaggle competition designed to predict which researchers would end up receiving research grants. The information was provided by a university, and included thousands of examples of research projects, along with information about the researchers involved, along with whether or not the grant was eventually accepted. The University hopes that they would be able to use models developed in this competition to help him rank which grant applications were most likely to succeed, so that they could prioritise their processing .\n",
+"For instance, Jeremy competed in a Kaggle competition designed to predict which researchers would end up receiving research grants. The information was provided by a university, and included thousands of examples of research projects, along with information about the researchers involved, along with whether or not the grant was eventually accepted. The University hoped that they would be able to use models developed in this competition to help them rank which grant applications were most likely to succeed, so that they could prioritise their processing.\n",
 "\n",
 "Jeremy used a random forest to model the data, and then used feature importance to find out which features were most predictive. He noticed three surprising things:\n",
 "\n",
 "- The model was able to correctly predict who would receive grants over 95% of the time\n",
 "- Apparently meaningless identifier columns were the most important predictors\n",
-"- The columns day of week and day of year were also highly predictive; for instance, the vast majority of grant applications dated on a Sunday were accepted, and many grant applications were dated on January 1 and were also accepted.\n",
+"- The columns day of week and day of year were also highly predictive; for instance, the vast majority of grant applications dated on a Sunday were accepted, and many accepted grant applications were dated on January 1.\n",
 "\n",
-"For the identifier columns, a partial dependence plots showed that when the information was missing the grant was almost always rejected. It turned out that much of this information, in practice at the University, was only filled in *after* a grant application was accepted. Often, for applications that were not accepted, it was just left blank. Therefore, this information was not something that was actually available at the time that the application was received, and would therefor not be available for a predictive model — it was data leakage.\n",
+"For the identifier columns, a partial dependence plot showed that when the information was missing the grant was almost always rejected. It turned out that in practice, the University only filled out much of this information *after* a grant application was accepted. Often, for applications that were not accepted, it was just left blank. Therefore, this information was not something that was actually available at the time that the application was received, and would therefore not be available for a predictive model — it was data leakage.\n",
 "\n",
 "In the same way, the final processing of successful applications was often done automatically as a batch at the end of the week, or the end of the year. It was this final processing date which ended up in the data, so again, this information, while predictive, was not actually available at the time that the application was received.\n",
 "\n",
@@ -8797,7 +8790,7 @@
 "source": [
 "We have a big problem! Our predictions outside of the domain that our training data covered are all too low. Have a think about why this is…\n",
 "\n",
-"Remember, a random forest is just the average of the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time.. Your predictions will be systematically to low.\n",
+"Remember, a random forest is just the average of the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time. Your predictions will be systematically too low.\n",
 "\n",
 "But the problem is actually more general than just time variables. Random forests are not able to extrapolate outside of the types of data you have seen, in a more general sense. That's why we need to make sure our validation set does not contain out of domain data."
 ]
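A tiny synthetic demonstration of the extrapolation problem: a forest trained on an upward trend simply plateaus once it is asked about inputs beyond its training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = np.linspace(0, 50, 200).reshape(-1, 1)
y_train = 2 * x_train.ravel() + rng.normal(0, 2, 200)     # roughly y = 2x, so max target ~100

m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(x_train, y_train)

x_future = np.array([[45.0], [55.0], [70.0]])
print(m.predict(x_future))   # the 55 and 70 predictions stay stuck near the training maximum
```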
@@ -9492,7 +9485,7 @@
 "source": [
 "In fastai, a tabular model is simply a model which takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding, and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.\n",
 "\n",
-"The model created in `tabular_learner` is an object of class `TabularModel`. Take a look at the source for `tabular_learner` now (remember, that's `tabular_learner??` in jupyter). You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (which you can override by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel`, and passes that to `TabularLearner` (and note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).\n",
+"The model created in `tabular_learner` is an object of class `TabularModel`. Take a look at the source for `tabular_learner` now (remember, that's `tabular_learner??` in Jupyter). You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (which you can override by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel`, and passes that to `TabularLearner` (and note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).\n",
 "\n",
 "That means that really all the work is happening in `TabularModel`, so take a look at the source for that now. With the exception of the `BatchNorm1d` and `Dropout` layers (which we'll be learning about shortly) you now have the knowledge required to understand this whole class. Take a look at the discussion of `EmbeddingNN` at the end of the last chapter. Recall that it passed `n_cont=0` to `TabularModel`. We now can see why that was: because there are zero continuous variables (in fastai the `n_` prefix means \"number of\", and `cont` is an abbreviation for \"continuous\")."
 ]
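A hedged sketch of how the neural-network version might be put together with fastai. It assumes the `to` TabularPandas object from earlier; the batch size, layer sizes, `y_range`, and learning rate are illustrative, not necessarily the notebook's.

```python
import torch.nn.functional as F
from fastai.tabular.all import tabular_learner

dls = to.dataloaders(bs=1024)                  # `to` is the TabularPandas object built earlier
learn = tabular_learner(dls, y_range=(8, 12),  # log-price roughly spans this range
                        layers=[500, 250], n_out=1, loss_func=F.mse_loss)
learn.lr_find()                                # pick a learning rate from the plot
learn.fit_one_cycle(5, 1e-2)
```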
@@ -9747,8 +9740,33 @@
 "display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
+"language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.7.5"
+},
+"toc": {
+"base_numbering": 1,
+"nav_menu": {},
+"number_sections": false,
+"sideBar": true,
+"skip_h1_title": true,
+"title_cell": "Table of Contents",
+"title_sidebar": "Contents",
+"toc_cell": false,
+"toc_position": {},
+"toc_section_display": true,
+"toc_window_display": false
+}
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }