Predicting Ames housing prices: exploratory data analysis
This notebook presents an exploratory analysis to build a good understanding of the Ames housing prices dataset. I'll use two simple forms of exploratory data visualization: distribution plots (one variable at a time) and two-variable plots.
I am interested in the shape of the data and in the relations between the independent variables and the response. This visual exercise may reveal insights and patterns that would otherwise be harder to see.
Setup
The dataset contains 79 predictor values for every observation, which makes 80 columns in the training set once the target is included. There are 1460 observations in the training set and one fewer in the test set, for which the target variable, SalePrice, has to be predicted. The only column that will not be used is Id.
(1460, 80)
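A minimal sketch of how the training set could be loaded so that its shape matches the output above. The original setup cell is not shown, so the helper name load_ames and the file name train.csv are assumptions; reading Id as the index is just one way of keeping it out of the feature columns.

```python
import pandas as pd


def load_ames(path_or_buffer):
    """Read an Ames CSV file, using the Id column as the index.

    With Id as the index, the training frame holds the 79 predictors
    plus SalePrice, i.e. 80 columns.
    """
    return pd.read_csv(path_or_buffer, index_col="Id")


# Hypothetical usage; "train.csv" is an assumed file name:
# train = load_ames("train.csv")
# train.shape
```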
Input variables have been classified by their levels of measurement (ratio, interval, ordinal, nominal):
The response: SalePrice
The plots on the left-hand side of the following figure show the distribution of the prices that I will use to supervise the analysis, together with the probability plot of these samples.
We can see positive skewness and kurtosis: there is a long tail to the right and the distribution is very peaked around the mean, so our response does not follow a normal distribution (shown in black).
However, this is partially rectified by a log-transformation, as can be seen in the probability and distribution plots on the right-hand side: the points follow the diagonal closely in the corresponding probability plot and the density is closer to the normal shape.
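The visual impression can be backed with numbers. The sketch below computes sample skewness and excess kurtosis before and after a natural-log transform, using pandas' Series.skew and Series.kurt. The helper name shape_summary is hypothetical, and the data in the example is synthetic (a log-normal sample), not the Ames prices.

```python
import numpy as np
import pandas as pd


def shape_summary(prices: pd.Series) -> pd.DataFrame:
    """Sample skewness and excess kurtosis, before and after a log transform."""
    logged = np.log(prices)
    return pd.DataFrame(
        {
            "skewness": [prices.skew(), logged.skew()],
            "excess_kurtosis": [prices.kurt(), logged.kurt()],
        },
        index=["raw", "log"],
    )


# Synthetic right-skewed prices, for illustration only (not the Ames data).
rng = np.random.default_rng(0)
toy_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))
summary = shape_summary(toy_prices)
```

On the real data this would be called as shape_summary(train["SalePrice"]), assuming the training frame is named train. Values near zero in the log row would agree with the right-hand plots.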
The predictors
A graphical analysis will be performed first, looking at the predictors one by one, and then at plots of each predictor against the response. The figures are classified by level of measurement: ratio, interval, ordinal and nominal. I'll be using these two functions to draw the plots.
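The two plotting functions themselves are not reproduced here. As a stand-in, the sketch below shows what a minimal pair of helpers for numeric predictors could look like with plain matplotlib. The names plot_univariate and plot_bivariate, and the choice of plots, are assumptions and not the original implementation.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_univariate(df: pd.DataFrame, col: str):
    """Histogram and box plot of a single numeric predictor."""
    values = df[col].dropna().to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
    ax_hist.hist(values, bins=30)
    ax_hist.set_xlabel(col)
    ax_hist.set_ylabel("count")
    ax_box.boxplot(values)
    ax_box.set_ylabel(col)
    return fig


def plot_bivariate(df: pd.DataFrame, col: str, target: str = "SalePrice"):
    """Scatter plot of one numeric predictor against the response."""
    fig, ax = plt.subplots(figsize=(5, 4))
    ax.scatter(df[col].to_numpy(), df[target].to_numpy(), s=8, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel(target)
    return fig
```

Ordinal and nominal predictors would need count plots and grouped box plots instead, which this sketch leaves out.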
Univariate analysis
This analysis is helpful to find out some particularities in the data:
- The vast majority of houses have less than 2000 square feet above grade living area and less than 20000 square feet of lot area.
- The distribution plot of TotalBsmtSF shows that most of the houses have a basement.
- Most of the houses have between 5 and 7 rooms above grade; nonetheless, there are houses with up to 14 rooms above grade.
- There are a few houses with more than one kitchen above grade and some without bedrooms.
- Most of the properties have either 1 or 2 full baths but there are a few with 3 and some others without full bathrooms.
- Most of the properties have enough space for up to two cars in the garage.
- Most of the houses have neither a pool nor a porch.
- There are observations in YearRemodAdd from 1950 onwards and, given the large difference in counts between that year and each of the following fifty years, it is likely that the values have a lower bound at 1950. Moreover, as new properties most likely do not need remodelling and YearRemodAdd does not contain missing values, many of the values, apart from those for the houses with a YearRemodAdd of 1950, may have been somehow prescriptively imputed.
- Obviously, there is a tight relation between predictors such as GarageYrBlt and YearBuilt, or GarageArea and GarageCars. However, given that most of the houses have a garage and the distributions of GarageYrBlt and YearBuilt are slightly different, it is likely that in many properties either the garage was built later or the garage unit and the house are separate properties, so that they were most likely built in different years.
Number of houses with garage: 1379
Number of houses built the same year as the garage: 1089
- Many more houses have been sold during the summer season. Also, the number of houses sold per year since 2006 has not changed much except for 2010; however, this is because the training set only has records of houses sold until July 2010, as we can see below:
Last month with records in 2010: 7
- The vast majority of houses have central air conditioning and also gas forced warm air furnaces as their type of heating.
- Most of the houses have been built in a residential area with low density.
- Most of the properties are nearly flat (gentle land slope). This may obviously be due to the neighborhood or area where they have been built. However, there may be some exceptions that may distinguish the houses among others in the same area.
Some other possible distinctions are houses that have different ratings in OverallQual or in OverallCond, or a different LotArea, within the same neighborhood.
Since these differences might result in an increase or decrease of the price, we might need to create some features to carry this information in case our model is not able to capture or explain this proportion of variance in the response.
For example, outliers and patterns in a plot of the residuals once the model is fit might indicate a deficiency in the analysis, such as a missing feature.
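The counts quoted in the list above (houses with a garage, houses built the same year as their garage, last month with sales in 2010) can be reproduced with a few lines of pandas. This is a sketch with hypothetical helper names, and it assumes that a missing GarageYrBlt marks a house without a garage.

```python
import pandas as pd


def garage_counts(df: pd.DataFrame) -> dict:
    """Count houses with a garage and houses built the same year as it."""
    # Assumption: a missing GarageYrBlt means the house has no garage.
    has_garage = df["GarageYrBlt"].notna()
    same_year = has_garage & (df["GarageYrBlt"] == df["YearBuilt"])
    return {
        "with_garage": int(has_garage.sum()),
        "same_year": int(same_year.sum()),
    }


def last_month_sold(df: pd.DataFrame, year: int) -> int:
    """Latest MoSold value among the houses sold in the given year."""
    return int(df.loc[df["YrSold"] == year, "MoSold"].max())
```

Called as garage_counts(train) and last_month_sold(train, 2010), with train being the assumed name of the training frame.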
Bivariate analysis
The plots below help us to identify some observations that are out of the general trend of the rest of the data in each of the predictor and target spaces.
For example, we know from the box plot of GrLivArea that it contains observations beyond the upper whisker, i.e., with extreme values in the predictor space. Now, as we can see below, we can say that two of the four suspected outliers (the observations with more than 4000 sf) are true outliers, in the sense that they do not follow the general trend of the data.
Another clear example is the property with over 6000 sf of basement in TotalBsmtSF, which, judging by its price, might be one of the previously identified outliers.
On the other hand, the other two observations in GrLivArea with more than 4000 sf are in accordance with the rest of the points or, at least, we could say they follow the trend of the rest of the houses.
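To list the observations that lie beyond the upper whisker of a box plot, one can compute the usual fence Q3 + 1.5 × IQR directly. The helper below is a sketch with a hypothetical name. It only flags extreme values in the predictor space, so whether a flagged point also departs from the price trend still has to be judged from the scatter plot.

```python
import pandas as pd


def above_upper_whisker(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Boolean mask of the values above Q3 + whis * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return values > q3 + whis * (q3 - q1)


# Hypothetical usage, assuming the training frame is named `train`:
# mask = above_upper_whisker(train["GrLivArea"])
# train.loc[mask, ["GrLivArea", "SalePrice"]].sort_values("GrLivArea")
```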
Some houses have the same value in YearRemodAdd as in YearBuilt, so they have not yet been remodeled, i.e., some of these values might have been prescriptively imputed. This may be because the property has just been built or because it is in good condition.
However, this might not always be the case, as we can see in the Series below. There are 39 properties in the training set with an OverallCond lower than 5 whose remodel year in the data set is the same as the year they were built:
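A sketch of the filter behind that count: select the rows whose OverallCond is below 5 and whose YearRemodAdd equals YearBuilt. The function name is hypothetical, and returning the OverallCond column as a Series is an assumption about how the original output looked.

```python
import pandas as pd


def poor_condition_not_remodeled(df: pd.DataFrame, max_cond: int = 5) -> pd.Series:
    """OverallCond of houses rated below `max_cond` with no recorded remodel."""
    mask = (df["OverallCond"] < max_cond) & (df["YearBuilt"] == df["YearRemodAdd"])
    return df.loc[mask, "OverallCond"]


# Hypothetical usage, assuming the training frame is named `train`:
# len(poor_condition_not_remodeled(train))
```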
Creating some features might capture the information carried by the time elapsed between YearBuilt and YrSold, between YearBuilt and YearRemodAdd, or between YearRemodAdd and YrSold. That is, there might be a pattern that could help in predicting the price.
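These three time differences could be materialized as follows. The new column names (HouseAge, YearsToRemod, YearsSinceRemod) are hypothetical, chosen only for this illustration.

```python
import pandas as pd


def add_age_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three year-difference columns added."""
    out = df.copy()
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["YearsToRemod"] = out["YearRemodAdd"] - out["YearBuilt"]
    out["YearsSinceRemod"] = out["YrSold"] - out["YearRemodAdd"]
    return out
```

A YearsToRemod of zero corresponds to the houses discussed above whose YearRemodAdd equals YearBuilt.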
As we can see below, in both the left plot and the DataFrame, the month with the lowest median SalePrice is April. January, August and November are the months with the most variability in the price of the houses sold.
Although there are only 1460 observations in this set, we can say that the first half of the year and July are when more properties lie far from the bulk of the data; that is, perhaps these are the kind of houses that are not usually sold during the rest of the year.
Also, October does not look like the best month for real estate agents. Neither was 2008 in general.
Although the median for 2010 is even lower than the median in 2008, the set only contains houses sold until July 2010, as was shown earlier.
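The per-month and per-year figures discussed above come down to a grouped summary of SalePrice. A minimal sketch follows, with a hypothetical helper name; the standard deviation is used here as one possible measure of variability, which may differ from what the original plots show.

```python
import pandas as pd


def price_by_period(df: pd.DataFrame, period_col: str) -> pd.DataFrame:
    """Count, median and standard deviation of SalePrice per period."""
    return df.groupby(period_col)["SalePrice"].agg(["count", "median", "std"])


# Hypothetical usage, assuming the training frame is named `train`:
# price_by_period(train, "MoSold")["median"].idxmin()  # month with lowest median
# price_by_period(train, "YrSold")
```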
Below are boxplots of some encoded ordinal predictors against SalePrice. The encoding helps to understand how the order influences the sale price.
Most of the time there is a positive correlation between the price and the ordinal predictors. However, if we take into account only the predictors that measure condition, more is not always better: a better condition may not translate to a higher price.
In fact, 73.84% of the 730 most expensive houses in the set (half the dataset) have an OverallCond of five out of nine:
Or we can also say that 93% of the houses with an OverallQual of 9 or higher have been rated with an OverallCond of 5 out of 9:
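Both percentages are shares of rows with OverallCond equal to 5 inside a subset of the data. The sketch below shows one way to compute them; the function names are hypothetical, and ties in SalePrice at the cut-off are resolved by nlargest's default (first occurrence), which may differ slightly from the original computation.

```python
import pandas as pd


def cond_share_top_half(df: pd.DataFrame, cond: int = 5) -> float:
    """Percentage of the most expensive half of the houses with OverallCond == cond."""
    top = df.nlargest(len(df) // 2, "SalePrice")
    return 100 * (top["OverallCond"] == cond).mean()


def cond_share_high_quality(df: pd.DataFrame, min_qual: int = 9, cond: int = 5) -> float:
    """Percentage of houses with OverallQual >= min_qual and OverallCond == cond."""
    high = df[df["OverallQual"] >= min_qual]
    return 100 * (high["OverallCond"] == cond).mean()
```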
As we can see below, and as expected, the price of the properties fluctuates greatly depending on the neighborhood and also on the type of dwelling (MSSubClass), amongst others. For instance, one square foot is valued very differently in StoneBr than it is in IDOTRR.
The following table shows the average price per square foot in the most expensive and cheapest neighborhoods (considering only the price per square foot of the living area):
If we group by type of dwelling as well as neighborhood, we can see an even bigger difference:
On average, a square foot of a 2-story dwelling in Edwards is 5 times cheaper than it is in StoneBr. This is only an approximation for the sake of illustration, since many other features in the set influence, to some extent, the price per square foot.
This is further explored in this supervised analysis.
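For reference, the price-per-square-foot comparisons above can be sketched as a grouped mean. This assumes that the living area is GrLivArea and that the average is taken over per-house ratios (rather than total price over total area); the helper and the PricePerSqFt column name are hypothetical.

```python
import pandas as pd


def price_per_sqft(df: pd.DataFrame, by) -> pd.Series:
    """Mean SalePrice / GrLivArea per group, most expensive groups first."""
    per_house = df.assign(PricePerSqFt=df["SalePrice"] / df["GrLivArea"])
    return (
        per_house.groupby(by)["PricePerSqFt"]
        .mean()
        .sort_values(ascending=False)
    )


# Hypothetical usage, assuming the training frame is named `train`:
# price_per_sqft(train, "Neighborhood")
# price_per_sqft(train, ["Neighborhood", "MSSubClass"])
```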