Predicting Ames housing prices: pipeline workflow

Open in Colab
View on GitHub

The purpose of this notebook is to conduct a supervised learning analysis on the Ames Housing Prices dataset: a model is built to predict a continuous value by regressing house sale prices onto different measures and ratings of the properties. The historical prices act as the response variable and supervise the learning process of the model.

Pipelines (chained transformer and estimator objects that establish a workflow) and an ensemble of pipelines have been developed to perform this regression task. Every pipeline is given the same input; however, each estimator receives a different set depending on the preprocessing objects (transformers, in scikit-learn terms) assigned to its pipeline.

The idea is to cross-validate not just the regressor objects but the entire pipelines when optimizing the regressors' parameters. All the preprocessing steps (scaling and/or imputing values with averages) are therefore fitted on each training batch separately during evaluation, which avoids using information from the entire set at every validation stage, something that would, to some extent, bias the results.

In this article

Setup

Data preprocessing

In this section I explain how the data are preprocessed right before being fed to the chains. All the classes that I use in the pipelines for preprocessing the data can be found here.

Missing data

As part of the workflow, missing values inside every pipeline are imputed by an Imputer object; however, most of the missing values can be deduced just by considering other values in the same observation:

  • In the case of qualitative variables, consider the following example: if an observation shows there is no pool (i.e., PoolArea is zero) and PoolQC is missing, we assign NA, since the data description states this is the value for pool quality when there is no pool. If there is in fact a pool, we impute None so that we do not bias the result: in the pipelines that use sets with dummy variables, such an observation is not represented by any of the PoolQC indicator columns when the indicators are created, and in the pipelines that use categorical sets, None is not taken as a category.
  • In the case of quantitative variables, following a similar example, if GarageArea is missing for a given observation and GarageQual is NA in the same row, then 0 is imputed in GarageArea.

The rest of the missing values are assigned by an Imputer object, which takes as one of its arguments either a number to fill the missing values with or the type of average to use (mean or median). The average can also be computed after grouping by another predictor, by passing that feature's name to the groupby argument.
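The Imputer class I use lives in the repository linked above. A minimal sketch of the idea, assuming a pandas DataFrame as input (the class name, arguments and details below are illustrative, not the actual implementation), could look like this:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class GroupImputer(BaseEstimator, TransformerMixin):
    """Impute one column with a constant or with a (possibly grouped) mean/median."""

    def __init__(self, column, strategy="median", fill_value=None, groupby=None):
        self.column = column
        self.strategy = strategy      # "mean" or "median"
        self.fill_value = fill_value  # a fixed number takes precedence if given
        self.groupby = groupby        # optional feature to group by

    def fit(self, X, y=None):
        if self.fill_value is not None:
            self.fill_ = self.fill_value
        elif self.groupby is not None:
            # one average per category, learned only from the data passed to fit()
            self.fill_ = X.groupby(self.groupby)[self.column].agg(self.strategy)
        else:
            self.fill_ = getattr(X[self.column], self.strategy)()
        return self

    def transform(self, X):
        X = X.copy()
        if self.groupby is not None and self.fill_value is None:
            # map each row's group to the average learned in fit()
            X[self.column] = X[self.column].fillna(X[self.groupby].map(self.fill_))
        else:
            X[self.column] = X[self.column].fillna(self.fill_)
        return X
```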

Outliers

The training set contains some observations with uncommon values in some of their predictors and also in their response. Feeding these observations to the learning model could invalidate the fit.

On the other hand, removing them could prevent the estimator from learning some peculiarities in the data and, in the end, from generalizing better.

I opted not to remove any observation from the set, but rather to transform some variables and create new ones in an attempt to alleviate the impact of these outliers on the resulting fit.

In the figure below we see four of the predictors with some points that, to different extents, do not follow the general trend of the data, or have values that are not common in the predictor space:

We can graphically see the impact of some points on the regression fit and how each fit changes just by removing one observation or a few of them.

Let's then take two of these variables we believe contain outliers and transform them. In the case of GrLivArea and LotArea, we can see the result of transforming both predictor and response below:

The same observations that were altering the regression before now have virtually no effect on the resulting fit. The points that were out of the general trend of the data now follow it, and the points whose predictor values were too large compared with the rest of the observations now take similar values, thanks to the change of scale after using a Box Cox and a log transformation.
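As an illustration, this kind of transformation could be reproduced roughly as below; the frame name train and the use of log1p for the response are assumptions rather than the exact code of the notebook:

```python
import numpy as np
from scipy import stats

# Log-transform the response and Box-Cox-transform two skewed predictors.
train["SalePrice"] = np.log1p(train["SalePrice"])

# stats.boxcox returns the transformed values together with the lambda
# that maximises the log-likelihood (no fixed lambda is passed here).
train["GrLivArea"], lam_grlivarea = stats.boxcox(train["GrLivArea"])
train["LotArea"], lam_lotarea = stats.boxcox(train["LotArea"])
```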

Feature engineering

There are other aspects that may influence the price of a house. To try to capture that information, I have created new features from the variables we already have, and I have also made some aggregations combining several variables (a few of these constructions are sketched in code after the list):

  • _Baths/Rooms (quantitative): ratio of bathrooms and rooms.
  • _Baths/SF (quantitative): number of bathrooms per square foot.
  • _BsmtUnfin_ratio (quantitative): ratio of unfinished basement square feet. Zero when there is no basement.
  • _Condition (quantitative): aggregation of both Condition1 and Condition2 to create indicator variables for every single condition. This yields almost half as many variables as getting dummies without aggregating would (9 instead of 17), and still fewer than creating dummies directly from the aggregated variable would (9 instead of 16). We also use this variable to create the following predictors and then remove it from the data set:
    • _Connected (quantitative): houses adjacent to an arterial or feeder street.
    • _Railed (quantitative): houses within 200' or adjacent to a railroad.
  • _Exterior (quantitative): the same as _Condition but with Exterior1 and Exterior2.

    A catch is that when training the pipeline with a Random Forest regressor, the frame contains the sparse Condition and Exterior indicators. One could move these aggregations into the class that creates dummies so that the pipeline with the forest does not receive them; however, that would mean more computation during the parameter optimization of the other pipelines.

  • _Date_Sold (quantitative): I combine YrSold with MoSold into this feature. There is a clear pattern in house sales across the months of the five years covered: there are more sales during the summer than during any other season, and the number of sales and the prices of the houses sold remain similar over time. We can see a box plot of it against SalePrice below:
  • _Living_Area (quantitative): total number of square feet in the house.
  • _LowQualFin_ratio (quantitative): ratio of low quality finished square feet in all floors.
  • _Rooms/SF (quantitative): number of rooms per square foot.
  • _Time2sale (quantitative): years passed between the date of building and the date of selling.
  • _Time2remod (quantitative): years passed between the date of building and the date of remodelling.
  • _Time_remod2sale (quantitative): years passed between the date of remodelling and the date of selling.
  • _TopZone (qualitative): indicates the houses located in FV (Floating Village Residential), RL (Residential Low Density) and RP (Residential Low Density Park) zones. Although there are no observations with houses in RP zones in the dataset, we include this zone in the set of zones with more expensive properties, derived from the following simple grouping:
  • _Total_Baths (quantitative): total number of bathrooms including half bathrooms above grade, and bathroom and half bathrooms in the basement.
  • _Unfinished (quantitative): indicates houses where any of the predictors in the dataset describe an unfinished part or area.
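As referenced above, the sketch below reconstructs a few of these features from the original Ames columns. The exact definitions live in the linked classes, so the formulas and column choices here are plausible approximations rather than the actual code:

```python
# `df` is assumed to be the Ames frame with the deducible missing values already filled.
df["_Total_Baths"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                      + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
df["_Baths/Rooms"] = df["_Total_Baths"] / df["TotRmsAbvGrd"]
df["_Living_Area"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["_Baths/SF"] = df["_Total_Baths"] / df["_Living_Area"]
df["_Rooms/SF"] = df["TotRmsAbvGrd"] / df["_Living_Area"]
df["_Time2sale"] = df["YrSold"] - df["YearBuilt"]
df["_Time2remod"] = df["YearRemodAdd"] - df["YearBuilt"]
df["_Time_remod2sale"] = df["YrSold"] - df["YearRemodAdd"]
df["_Date_Sold"] = df["YrSold"] + (df["MoSold"] - 1) / 12
```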

In some pipelines I add an object that creates features with averages as part of the workflow. If I were to create these features beforehand and add them to the set that feeds the pipeline, I would be training and validating models with information from the whole set during the cross-validation process.

Instead, I prevent this leakage by computing these averages only from the subset fed to the estimator, which, during the cross-validation used to optimize the parameters, is the training batch of each split.
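The class I actually use can be found in the repository; the following is only a minimal sketch of the mechanism, with illustrative names, to show where the averages are learned:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class GroupAverageFeature(BaseEstimator, TransformerMixin):
    """Add the mean of `target_col` per category of `group_col` as a new feature.

    The averages are learned in fit(), i.e. only from the training batch that the
    pipeline receives inside each cross-validation split, preventing leakage.
    """

    def __init__(self, group_col, target_col):
        self.group_col = group_col
        self.target_col = target_col

    def fit(self, X, y=None):
        self.averages_ = X.groupby(self.group_col)[self.target_col].mean()
        return self

    def transform(self, X):
        X = X.copy()
        name = f"_{self.target_col}_by_{self.group_col}"
        X[name] = X[self.group_col].map(self.averages_)
        # categories unseen in the training batch fall back to the global mean
        X[name] = X[name].fillna(self.averages_.mean())
        return X
```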

Feature transformation

I have used the Box Cox family of transformations in scipy which, if it is not given a fixed lambda, finds the lambda that maximises the log-likelihood function of the transformed data:
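For reference, for a strictly positive variable $y$ the Box Cox family is defined as

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\
\log y, & \lambda = 0,
\end{cases}
$$

and scipy selects $\lambda$ by maximum likelihood when no value is passed.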

I include another object in the pipe to transform some features, but with a list of lambdas we have already computed above using the entire set.

Encoding qualitative variables

In order to encode the data properly, no matter which observations end up in each validation set, the full set of categories is needed:

Right before scaling the features with a scikit-learn StandardScaler object, I include either a Get_Dummies or a Get_Categorical object to encode the variables using all the categories from the entire set, computed above.

The reason for passing this list of categories is to prevent the estimators from being fed different sets of encoded predictors in different folds during cross-validation: with small datasets, it is expected that the batches produced by a split will not always include every category of a given qualitative feature.
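The actual Get_Dummies and Get_Categorical classes are in the repository; the sketch below reimplements the dummy-variable variant under the assumption that it receives a dict mapping each qualitative column to its full category list:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GetDummies(BaseEstimator, TransformerMixin):
    """One-hot encode qualitative columns using a fixed, pre-computed category list,
    so that every fold produces the same set of indicator columns."""

    def __init__(self, categories):
        # categories: dict mapping column name -> list of all categories in the full set,
        # e.g. {col: sorted(frame[col].dropna().unique()) for col in qualitative_cols}
        self.categories = categories

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col, cats in self.categories.items():
            # unseen values become NaN; known-but-absent categories still get a column
            X[col] = pd.Categorical(X[col], categories=cats)
        return pd.get_dummies(X, columns=list(self.categories))
```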

Modelling

I have written three classes that are used in this section:

  • PipeTunerCV, which is used to optimize the parameters passed to the estimators in the pipelines, keeping the parameter values with the lowest cross-validated RMSE. It also includes methods to plot the coefficients of the pipelines with the Ridge, Lasso and ElasticNet regressors, and the feature importances of the Random Forest regressor.
  • BoostingTunerCV, which is used to optimize the pre-arranged parameters of the estimators outside the pipelines, and to plot the feature importances of the predictors and the learning curve of the model.
  • WeightedAverage: it is used to select the combination of pipelines that yields the best performance.

In order to assess how well the estimators generalize I use the square root of the mean squared error evaluated by cross-validation using 5 folds:
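A helper along these lines would compute it (the function name and exact signature are assumptions; the point is that the whole pipeline, not only the regressor, is cross-validated):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def rmse_cv(pipeline, X, y, folds=5):
    """Root mean squared error of a whole pipeline, evaluated by k-fold CV."""
    neg_mse = cross_val_score(pipeline, X, y,
                              scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-neg_mse)  # one RMSE per fold; its mean is the reported RMSE-CV
```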

The dataset has many dimensions compared with the total number of observations; therefore, models capable of shrinking the coefficients of the predictors (and that cope well with sparse data) may be a good starting point.

Whereas Ridge regression, with an ℓ2 penalty, shrinks the coefficients towards zero and reduces the importance of highly correlated variables, the Lasso and ElasticNet (with an ℓ1 penalty and a compromise between the ℓ1 and ℓ2 penalties, respectively) also regularize: when the control parameter of the ℓ1 penalty becomes sufficiently large, it forces some coefficients to be exactly zero, discarding the associated predictors from future predictions.
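For illustration, stripped-down versions of these pipelines might look as follows; the real ones also chain the imputation, feature-creation and encoding objects described earlier, and the regularization strengths shown here are placeholders rather than tuned values:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet

pipe_ridge = Pipeline([("scaler", StandardScaler()),
                       ("ridge", Ridge(alpha=10.0))])
pipe_lasso = Pipeline([("scaler", StandardScaler()),
                       ("lasso", Lasso(alpha=0.0005))])
pipe_enet = Pipeline([("scaler", StandardScaler()),
                      ("enet", ElasticNet(alpha=0.0005, l1_ratio=0.5))])
```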

Ridge

Lasso

Elastic Net

Algorithms with the kernel trick have also been used: a polynomial kernel in the case of Kernel Ridge Regression and an RBF kernel in the case of Support Vector Regression. The parameters of these estimators are optimized by cross-validated grid search over the parameter grids defined below:
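The actual grids are defined in the notebook; the values below are only illustrative, as are the bare pipelines used here to show the search setup:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

pipe_kr = Pipeline([("scaler", StandardScaler()),
                    ("kr", KernelRidge(kernel="polynomial"))])
pipe_svr = Pipeline([("scaler", StandardScaler()),
                     ("svr", SVR(kernel="rbf"))])

# Illustrative grids (parameter names are prefixed with the pipeline step name).
kr_grid = {"kr__alpha": [0.1, 1.0, 10.0],
           "kr__degree": [2, 3],
           "kr__coef0": [1.0, 2.5]}
svr_grid = {"svr__C": [1, 10, 100],
            "svr__gamma": [1e-4, 1e-3, 1e-2],
            "svr__epsilon": [0.01, 0.05, 0.1]}

kr_search = GridSearchCV(pipe_kr, kr_grid, scoring="neg_mean_squared_error", cv=5)
svr_search = GridSearchCV(pipe_svr, svr_grid, scoring="neg_mean_squared_error", cv=5)
```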

Kernel Ridge

Support Vector Regressor

Random Forest Regressor

I have also created a pipeline that contains a random forest regressor. Increasing the number of trees will slightly improve the performance, although it will still be worse than the performance of the preceding estimators for this dataset.

I set n_iter to 25 to randomize the search: if this value is lower than the number of parameter combinations in a grid, the search samples 25 combinations from it at random without replacement:
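A randomized search over a pipeline containing the forest could be set up as below; the search space is illustrative and the real pipeline also chains the preprocessing objects:

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

pipe_rf = Pipeline([("rf", RandomForestRegressor(n_estimators=300, random_state=0))])

# 25 combinations are sampled without replacement from this (finite) grid.
rf_space = {"rf__max_depth": [10, 20, 30, None],
            "rf__max_features": ["sqrt", 0.3, 0.5],
            "rf__min_samples_split": [2, 5, 10],
            "rf__min_samples_leaf": [1, 2, 4]}

rf_search = RandomizedSearchCV(pipe_rf, rf_space, n_iter=25,
                               scoring="neg_mean_squared_error",
                               cv=5, random_state=0)
```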

Gradient boosting regressors

eXtreme Gradient Boosting and Light Gradient Boosting Machine are the boosting regressors that are going to be explored. The former uses a depth-wise tree growth algorithm, while the latter uses a leaf-wise tree growth algorithm.

Given that the number of observations in our training set is not large and these ensembles tend to overfit small sets, parameter optimization is necessary: otherwise, our estimator would most likely follow the training data too closely and yield poor predictions. To that end, a class to optimize (or tune) the parameters of both regressors has been written: BoostingTunerCV.

The main idea is to automate the cross-validated grid searches over all the parameters. The parameters are grouped into different grids: first, the main parameters that control model complexity; then, parameters that add randomness by subsampling features and observations (to make training robust to noise); and finally, regularization parameters.

I have set the learning rate to 0.025 in the XGB regressor and 0.03 in the LGBM ensemble, and have used the native xgb.cv and lgb.cv cross-validation functions to evaluate the corresponding models with early stopping enabled (early_stopping_rounds).

In this way, training stops in any of the batches if the root mean squared error (the selected metric) on the validation data does not improve within the given number of rounds. The index of the last round is then taken as the number of estimators needed to perform optimally with the given parameters.
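A sketch of this evaluation for the XGB regressor, assuming X_train and y_train have already been prepared, might be:

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)

# Illustrative parameters; only the learning rate (eta) is taken from the text above.
params = {"eta": 0.025, "max_depth": 3, "objective": "reg:squarederror"}

# 5-fold CV; training stops once the validation RMSE has not improved for 100 rounds.
cv_results = xgb.cv(params, dtrain, num_boost_round=2000, nfold=5,
                    metrics="rmse", early_stopping_rounds=100, seed=0)
n_estimators = len(cv_results)  # rounds kept after early stopping
```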

eXtreme Gradient Boosting

As noted above, and as we can see next, the model tends to overfit the training data with the default parameters, even when stopping the training if there is no improvement after 100 boosting rounds. The performance of the ensemble improves greatly as the number of estimators increases to 300. Around 800, the training error keeps decreasing substantially while the test RMSE does not, i.e. the model is correcting errors with every subsequent tree in ways that will not help when predicting on new data.

Training can then be stopped earlier by setting a lower early_stopping_rounds value to avoid overfitting, i.e. to keep good generalization, as we can see in the plot on the right-hand side:

When setting early_stopping_rounds = 5, the test error remains similar but the model does not follow the training data so closely. Although there is still overfitting, it is not as severe:

Since the parameters have not yet been optimized for the dataset, this was only an approximation for the sake of introducing some inherent properties of these ensembles, getting an idea of the learning curve of our model, and seeing how we can partially attenuate overfitting while keeping a low test RMSE.

Below I'll start tuning the hyperparameters of the boosting machine so that the performance can be further improved:

As we can see in the learning curve, after optimizing the parameters the overfitting problem is considerably alleviated while keeping a similar test RMSE, i.e., the model improves its test error and no longer follows the training data so closely:

Given that I neither standardize nor impute values with averages in the frame used to feed the gradient boosting regressors (in fact np.nan is left as the specified missing value), the cross-validated RMSE of the XGB regressor is exactly the same as that of the pipeline in which it is chained:

Light Gradient Boosting Machine

I have followed a similar procedure to optimize the hyperparameters of the LGBM; however, early_stopping_rounds has now been set to 10 and the learning rate to 0.03:

The Light Gradient Boosting Machine performs slightly better on this set and also trains faster than the XGB regressor.

Residual plots

The following figure shows the residuals of every estimator after splitting the training set, fitting every estimator and predicting the validation set. Observations on the dark orange dashed line, or origin, are perfectly predicted (residual=0).

Although there are no clear patterns in the residuals, there are still some observations that are hard to predict after training the models on 70% of the data (the rest is left for model evaluation). However, all the estimators but the ensemble of trees perform quite well. The random forest regressor significantly overfits the training data and, as a result, performs significantly worse than the other regressors.

$R^2$ is used to assess the accuracy of the model. This metric measures the proportion of variance that an estimator is able to explain.
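For predictions $\hat{y}_i$, observed values $y_i$ and mean response $\bar{y}$, it is computed as

$$
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}.
$$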

Weighted average

Most of the predictions above are quite similar, but some differ, so my intention has been to come up with a very simple ensemble that computes the best weighted average of the pre-selected models' predictions, i.e. some of these predictions carry more importance, or weight, than others.

In order to select the best combination of weights for all the estimators, a frame of weights between 0 and 1, spaced at a given step, is created so that the weights of every row sum to exactly one, i.e. each row accounts for the full share of a prediction. Each column contains the weights to be applied to one estimator.

For instance, if one of the rows is [0.25, 0.0, 0.35, 0.1, 0.3], these are the weights needed to compute a prediction that takes 25% from the first estimator's predictions, zero from the second, 35% from the third, 10% from the fourth and 30% from the fifth. These weights are multiplied by the predictions of every model, and those predictions are made on a holdout set, since the whole process is evaluated by k-fold cross-validation.

Once the selected weights are multiplied by the predictions on this holdout set, the resulting average is scored with the rmse function and the result is stored in one cell of the predictions frame, which, at the end of this process, contains one column per fold of the k-fold cross-validation and one row per possible combination of the pre-specified weights, or rates.

The rates are computed once the fit method of an object of this class is called. The more models included in the ensemble for a given step, the longer it takes to compute the combinations of rates.

In conclusion, if we created a Weighted_Average object with five estimators and a step of five, the weights used in the example above would be one of the rows of the predictions data frame, and every value in that row would be the corresponding RMSE on the respective holdout fold. The mean along the row would be the cross-validated root mean squared error of that weighted average combination. The selected weights would be the ones with the lowest RMSE-CV among the 10626 combinations.
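A sketch of how such a grid of weights can be generated is shown below; the actual Weighted_Average class is in the repository, and this snippet only reproduces the counting, with the step expressed in percentage points:

```python
import itertools
import numpy as np

def weight_combinations(n_models, step=5):
    """All weight vectors on a `step`-percent grid whose entries sum to 100%."""
    grid = range(0, 101, step)
    combos = [np.array(c) / 100
              for c in itertools.product(grid, repeat=n_models)
              if sum(c) == 100]
    return np.array(combos)

weights = weight_combinations(n_models=5, step=5)
print(weights.shape)  # (10626, 5): the 10626 combinations mentioned above
```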

First, we include all the estimators except XGBRegressor and RandomForestRegressor, with a step of 10 so that creating the combinations of weights does not take too long:

The pipelines with the Lasso, Kernel Ridge, XGBRegressor and LGBMRegressor estimators are then selected to create another weighted average, now with a step of 0.01 so that the combinations of weights are computed every 1%:

Finally, this is the ranking of all the pipelines and of the most accurate ensemble of pipelines in this analysis (the best scorer is highlighted in blue):

Weighted_Average yields an ensemble of pipelines as the estimator with the lowest RMSE CV, namely 0.28 * Pipeline_Lasso + 0.08 * Pipeline_KernelRidge + 0.24 * Pipeline_XGBRegressor + 0.4 * Pipeline_LGBMRegressor.

Summary

The aim of this analysis was to develop an estimator, or ensemble of estimators, for regressing house prices using pipeline workflows, so that leaking information from the validation sets into the transformer objects during estimator training could be avoided.

When training the models, it is shown that on this data frame the pipelines with gradient boosting machines or with non-linear kernel regressors tend to generalize better.

It is also shown that an ensemble of those pipelines together with a pipeline containing a linear estimator is the most accurate estimator.

That is, even though some models may not perform best on their own, they may still capture valid information in the data that helps make better predictions.

