Fear and loathing in Ames, Iowa

Ryland W Matthews
3 min read · May 22, 2021

OK, that title might be a bit of hyperbole, but my second DSI project was more enjoyable overall than the first. Dare I say, a lot of it was even fun-fun. I think much of that came from starting in the deep end: the slow-roll start of the first project had kinda narrowed my perspective on creative ways I could approach it, while starting from a blank slate made me spend more time considering different possible angles of attack. That led to probably my favorite part of the second project: my model ended up being pretty good.

I started with the Data Dictionary for the dataset, looking for relationships that made intuitive sense. At this point I was already thinking about ways I'd use hard metrics for discarding data that was likely noise, so I focused on finding columns in the dataset in a more qualitative way. The first thing that stood out to me was that the timespan of the dataset covered 2006–2010, so it encapsulates the Recession, the housing market crash, and even the first year of recovery. On first look, I could see that the mean and median Sale Price for 2010 were much lower, and I expected this to show up as a strong correlation in the data. I even tried making an Is2010 column with a Boolean value to give it more weight, as well as scaling the years to 0–4 to give them a clearer distinction. Despite all of that, my methods for filtering out columns (both correlation and LASSO regularization) would regularly score the Year Sold column as one of the least influential in the whole dataset. Eventually, to understand why a little better, I graphed the Sale Price distribution for every year layered on top of each other. Although prices were definitely suppressed in 2010, they are distributed very similarly to the other years, and that's the best explanation for why Year Sold doesn't end up being an important variable in the model.
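The Year Sold experiments above can be sketched roughly like this. This is a minimal toy example, not the project code: the column names ("Yr Sold", "SalePrice") follow the usual Ames data dictionary but may differ from the exact dataset used, and the prices here are made up.

```python
import pandas as pd

# Toy stand-in for the Ames data; real column names and values may differ.
df = pd.DataFrame({
    "Yr Sold": [2006, 2007, 2008, 2009, 2010, 2010],
    "SalePrice": [200000, 210000, 195000, 190000, 150000, 160000],
})

# Boolean flag for the depressed 2010 market, hoping to give it more weight
df["Is2010"] = (df["Yr Sold"] == 2010).astype(int)

# Rescale the years to 0-4 so the gap between them is more distinct
df["YearScaled"] = df["Yr Sold"] - df["Yr Sold"].min()

# Check how each version of the year feature correlates with sale price
corrs = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
print(corrs.sort_values())
```

In the real dataset, both engineered versions still scored near the bottom of the correlation and LASSO rankings.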

I also just really like the look of this graph. We can see that although there were fewer sales in 2010 and the values were depressed, they follow a similar distribution to the previous years.
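A layered distribution plot like the one described can be produced with a sketch along these lines. The sale prices here are synthetic (randomly generated) just to make the example self-contained; the real version would pull each year's prices from the dataset.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic prices per year: 2010 shifted lower, with fewer sales
data = {
    yr: rng.normal(180000 - (15000 if yr == 2010 else 0), 40000,
                   300 if yr < 2010 else 180)
    for yr in range(2006, 2011)
}

fig, ax = plt.subplots()
for yr, prices in data.items():
    # Semi-transparent, normalized histograms layered on top of each other
    ax.hist(prices, bins=40, alpha=0.4, density=True, label=str(yr))
ax.set_xlabel("Sale Price")
ax.set_ylabel("Density")
ax.legend(title="Year Sold")
fig.savefig("saleprice_by_year.png")
```

Using `density=True` is what makes the comparison fair: 2010 has fewer sales, so normalizing each histogram lets the shapes line up even when the counts don't.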

Some of the feature engineering did make a huge difference: combining all interior and exterior square footage, scale-encoding a couple of categorical columns, and mean-encoding many others. Some qualitative analysis paid off too. For example, although Overall Quality was highly correlated with price and Overall Condition showed almost no correlation, the interaction between the two ended up being much more highly correlated than just the halfway point between them would suggest. By the end of this process, I had almost a couple dozen more features with a Sale Price correlation greater than 0.5, and that's when my model started taking a pretty good shape.
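The three kinds of engineered features mentioned above can be illustrated with a small sketch. Again, this is a hypothetical toy frame, not the project's actual code: the column names follow the Ames conventions, and the handful of rows are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Gr Liv Area":   [1500, 2000, 1200],
    "Total Bsmt SF": [800, 1000, 600],
    "Neighborhood":  ["NAmes", "StoneBr", "NAmes"],
    "Overall Qual":  [6, 8, 5],
    "Overall Cond":  [5, 7, 6],
    "SalePrice":     [180000, 320000, 140000],
})

# Combine square-footage columns into a single total-area feature
df["Total SF"] = df["Gr Liv Area"] + df["Total Bsmt SF"]

# Mean-encode a categorical: replace each category with its average sale price
df["Neighborhood Enc"] = df.groupby("Neighborhood")["SalePrice"].transform("mean")

# Interaction term: quality x condition, which correlated better than either
# column's midpoint would suggest
df["Qual x Cond"] = df["Overall Qual"] * df["Overall Cond"]
```

One caveat with mean encoding worth noting: computed naively on the full dataset it leaks the target into the features, so in practice the encoding should be fit on the training split only.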

Most of the features with the highest correlation to Sale Price were engineered features

Then I was ready to start feeding these features to my models: a fairly straightforward Lasso pipeline, a broader GridSearch, and finally a straight, unscaled Linear Regression to serve as a simpler benchmark for the other models. The thing that surprised me most was how well the Linear Regression model ended up doing, especially in the train-test-split runs. The GridSearch pipeline did much better in the Kaggle competition attached to this project, I believe because I couldn't be as selective about the overall quality of the holdout Test set as I could with the test set from the train-test split. Even with null values in the holdout Test set filled with the column mean, the GridSearch was able to make slightly more accurate predictions despite this limitation.
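The three-model setup could look something like the sketch below. This uses scikit-learn's generic pieces with synthetic data standing in for the engineered features; the actual pipelines, hyperparameter grid, and imputation details in the project may well have differed.

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Benchmark: a straight, unscaled linear regression
lr = LinearRegression().fit(X_train, y_train)

# Lasso pipeline: mean-impute nulls (as with the holdout set), scale, regularize
lasso_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", Lasso()),
])

# Grid search over the regularization strength
grid = GridSearchCV(lasso_pipe, {"model__alpha": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(f"LinearRegression R^2: {lr.score(X_test, y_test):.3f}")
print(f"Best Lasso R^2:       {grid.score(X_test, y_test):.3f}")
```

Putting the imputer inside the pipeline matters here: it learns the column means from the training folds only, which mirrors the situation of filling nulls in a holdout set you never got to clean.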

The models turned out to be pretty accurate, with a good balance between bias and variance

I’d say the single biggest difference in the aftermath of Project 02 compared to Project 01 is that I was very happy to have Project 01 behind me, to eventually fade into memory. Project 02, on the other hand, I look forward to coming back to and really polishing into something that is truly portfolio-ready.


Ryland W Matthews

Data scientist in training by day, also data scientist in training by night