What Machine Learning Can and Can't Do: Zillow-Style House Price Prediction Challenge

April 9, 2019

Author: Sammy Lee

Playground

Tags: data science, zillow, home prices, lambda, machine learning, prediction, school



We shall not cease from exploration, and the end of all our exploring will be to arrive where we started and know the place for the first time. - T.S. Eliot

(TLDR: This is an end-to-end machine learning post. If you lack the time or patience, you can get a taste of the algorithm developed by hitting the Playground button above to interact directly with the model.)

For my first post I'm going to emulate the end-to-end machine learning project from Chapter 2 of Aurélien Géron's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

My hope by the end of this post is to help the reader get an intuitive idea of what machine learning can and can't do.

We start with a checklist provided by Aurélien Géron that will guide the overall process:

  1. Look at the big picture
  2. Get the data
  3. Discover and visualize the data to gain insights
  4. Prepare the data for Machine Learning algorithms
  5. Select a model and train it
  6. Fine-tune your model
  7. Present your solution
  8. Launch, monitor, and maintain your system

1. The Big Picture

Imagine you work for a real estate company called BlueFin.

Our bosses ask us to come up with a model for house prices in California. Apparently, the boss wants to know whether software can do a better job of predicting house prices than some of the experts at the company.

(*side-note: In my opinion you should not deploy machine learning to production in situations where a human being can do the job cheaper and faster than a computer)

You can get the dataset here from Aurélien Géron's GitHub.

Part of looking at the Big Picture is to Frame the Problem: What is the business objective? In this case your model's predictions will feed into a downstream system. Of course, the overarching goal of any business is to make a return on invested capital (ROIC) from all of its different business segments.

Another question to ask: is this a Supervised or Unsupervised machine learning problem? Is it Classification or Regression?

In this instance, because we are given labeled data, it's a Supervised problem. Furthermore, because we are asked to predict a continuous value (house prices) rather than a discrete class label, it's a Regression problem (Multiple Regression, since we predict from several features).

Géron's next suggestion is to Select a Performance Measure. A standard measure of regression performance is the mean squared error (MSE): the average of the squared differences between the actual and predicted values. Taking the square root of the MSE gives the root mean squared error (RMSE), which expresses the model's performance in the units of whatever we are predicting (in this case, dollar amounts). The smaller the RMSE, the better our model's performance.
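To make the RMSE concrete, here is a minimal sketch in NumPy. The function name rmse and the three house values are made up for illustration; they do not come from the housing dataset.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse = np.mean((actual - predicted) ** 2)  # mean squared error
    return np.sqrt(mse)                       # back to the target's units

# Made-up house values in dollars (illustrative only)
actual = [200_000, 350_000, 500_000]
predicted = [210_000, 330_000, 500_000]

example_rmse = rmse(actual, predicted)
print(f"RMSE: ${example_rmse:,.0f}")
```

Because the errors are squared before averaging, the single $20,000 miss weighs more heavily on the result than the $10,000 miss.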

2. Get the data

Those with Jupyter Notebook on their local machine or Google's Colab can follow from this point on.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

housing = pd.read_csv('https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv')
housing.head()

Housing DataFrame Feature Variables (Independent variables):

  • Longitude
  • Latitude
  • Housing median age
  • Total rooms
  • Total bedrooms
  • Population
  • Households
  • Median income
  • Ocean proximity

Target Variable (What we want to predict): Median house value

We can get a feel for the overall dataset by running:

housing.describe()

Housing Descriptive Stats

Pandas also gives us a quick way to get a feel for the dataset by allowing us to visualize the data with just one line of code:

housing.hist(bins=50, figsize=(12, 8));

Housing Histogram

Notice that most of the variables have skewed, non-normal distributions, often with a long tail to the right, which can hurt the performance of some models.
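One way to put a number on that skew is pandas' skew() method. The sketch below uses a synthetic right-skewed column (not the real housing data) and shows how a log transform, one common remedy and not necessarily the one used later in this project, pulls the distribution back toward symmetry.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed column standing in for a feature such as population
skewed = pd.Series(
    rng.lognormal(mean=7.0, sigma=1.0, size=5_000),
    name="synthetic_population",
)

raw_skew = skewed.skew()            # strongly positive: long right tail
log_skew = np.log1p(skewed).skew()  # much closer to 0 after the log transform

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On the real dataset you could run housing.skew(numeric_only=True) to get the same statistic for every numeric column at once.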

After loading the data and having a peek at the dataset, the first and most important thing for data scientists to do is to split the dataset into Training and Test sets. The training set is what we build and train our machine learning algorithm on. The test set is where we put our money where our mouth is and see how well the algorithm performs on data it has not yet seen.

This idea is fundamental.

If we're talking about classifiers, you can get 99.99% accuracy on the training set, and end up getting 50% on the test set.
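Here is a small sketch of how that can happen. It is a toy setup, not part of the housing project: the labels are coin flips, so there is nothing real to learn, and the "model" is a hand-rolled 1-nearest-neighbor classifier (the helper predict_1nn is my own) that simply memorizes the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random features and coin-flip labels: no real pattern exists
X_train = rng.normal(size=(500, 5))
y_train = rng.integers(0, 2, size=500)
X_test = rng.normal(size=(500, 5))
y_test = rng.integers(0, 2, size=500)

def predict_1nn(X_known, y_known, X_query):
    """Give each query point the label of its nearest known point."""
    # Pairwise squared Euclidean distances, shape (n_query, n_known)
    diffs = X_query[:, None, :] - X_known[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return y_known[dists.argmin(axis=1)]

# On the training set, every point's nearest neighbor is itself
train_accuracy = (predict_1nn(X_train, y_train, X_train) == y_train).mean()
# On unseen data, memorized noise is no better than guessing
test_accuracy = (predict_1nn(X_train, y_train, X_test) == y_test).mean()

print(f"train accuracy: {train_accuracy:.2%}, test accuracy: {test_accuracy:.2%}")
```

The training score looks perfect only because the model has memorized its inputs; the test score, which hovers around chance, is the honest one.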

If you take away anything from this post, I want it to be this: I don't like to think of myself as an ignorant person, but I know I am because of how frequently the state of the world surprises me by disproving my preconceived notions. The next-door neighbors you thought were super annoying one day turn out to be really decent people. The quiet and unassuming guy you wrote off fought to defend this country. The teacher you really hated turned out to be the best because he prepared you for the real world.

Putting a model into production that hasn't been validated by the test set is like putting a person out into the world full of book learning, but zero street smarts and expecting them to succeed. It's just not gonna happen: You don't get good at dealing with people by studying psychology books, you get good at dealing with people by dealing with people.

What we want our machine learning model to do is capture enough of these experiences where the assumptions turn out to be both right and completely wrong, so it knows what to expect when it goes out into the wild.

By splitting the dataset and only exposing the algorithm to the test set at the very end, you avoid putting a model into production that ends up embarrassing you and your team.
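A minimal sketch of such a split, assuming a random 80/20 division with a fixed seed so the same rows land in the test set on every run. The helper split_train_test and the tiny demo frame are illustrative stand-ins; on the real data you would pass in the housing DataFrame instead.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio=0.2, seed=42):
    """Shuffle row positions, then hold out a test_ratio share as the test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))      # random order of row positions
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Tiny made-up frame standing in for the housing DataFrame
demo = pd.DataFrame({
    "median_income": np.arange(10.0),
    "median_house_value": np.arange(10.0) * 50_000,
})

train_set, test_set = split_train_test(demo)
print(len(train_set), "train rows,", len(test_set), "test rows")
```

Scikit-Learn's train_test_split (in sklearn.model_selection) does a similar job in one call, for example train_test_split(housing, test_size=0.2, random_state=42). Either way, set the test set aside and leave it untouched until the very end.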


