Predicting house prices in Stockholm using TensorFlow

For a while now, I had been wanting to combine artificial neural networks (ANNs) and geographic information systems. An ANN can be trained to make many kinds of predictions, for example recognising images or predicting how much a 50m2 house in Stockholm with 2 rooms will cost. However, an advanced ANN with geographical features (especially locations) will have to be a future project. As this was my first attempt, I chose the simplest "housing price prediction" example. There are numerous examples of it on the internet, which also made it a good starting point for troubleshooting.

WHY DO WE NEED ARTIFICIAL NEURAL NETWORKS?

You might wonder why we need to train a neural network to predict house prices. We can simply check web pages for this, right? There are many benefits to an ANN, but for this example let's say you want to sell your house, which is 50m2 with 2 rooms, and assume there aren't many houses on the market with these specifications (once you add more parameters like garage, garden or building age, things get even more complicated). So what is the best price for your house? Will you check hundreds of online listings, or would you prefer to have Python and TensorFlow do the job for you in, say, 20 minutes?

PREPARING FOR THE ANN: SCRAPING A WEB PAGE TO GET OUR DATA

Before creating an ANN I needed data on house sizes, room counts and prices in Stockholm. I used Python 3 and Beautiful Soup to scrape the data from Blocket (a Swedish online marketplace for buying and selling). To keep the calculations quick, I restricted my data to 1000 houses. You can see the code on GitHub. The data I retrieved looks like the following. Each row holds the information for one house. So house #1 is 61m2, has 2 rooms and costs 1,295,000 SEK (about 135,568 Euro). House #2 is 44m2, has 2 rooms and costs 3,775,000 SEK. You get the point.

size(m2),  rooms,  price
61,  2,  1295000
44,  2,  3775000
80,  4,  3350000
79.5,  4,  3295000
80,  4,  3250000
94,  4,  2750000
70,  3,  3245000
70,  3,  3200000
...
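
To work with rows like these in Python, they first have to be turned into numbers. The following is a minimal sketch, assuming the scraped data is stored as comma-separated text with the header shown above. The function name load_houses and the inline SAMPLE string are made up for illustration and are not taken from the scraper on GitHub.

```python
import csv
import io

# Sample rows copied from the listing above; the full data set has about 1000 rows.
SAMPLE = """size(m2),rooms,price
61,2,1295000
44,2,3775000
80,4,3350000
79.5,4,3295000
80,4,3250000
94,4,2750000
70,3,3245000
70,3,3200000
"""


def load_houses(text):
    """Parse CSV text into a list of (size, rooms, price) tuples."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    houses = []
    for row in reader:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        size, rooms, price = (field.strip() for field in row)
        # Sizes can be fractional (79.5 m2); rooms and prices are whole numbers here.
        houses.append((float(size), int(rooms), int(price)))
    return houses


houses = load_houses(SAMPLE)
print(len(houses), houses[0])
```

With a real file, the same function could be fed the result of open(...).read() instead of SAMPLE.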

CREATING TWO-VARIABLE LINEAR REGRESSION IN TENSORFLOW

TensorFlow (TF) is an open source library for machine learning developed by the Google Brain team, and it is a good place to start learning how to build simple neural network models. After managing to build a one-variable linear regression in TF, I trained a two-variable linear regression. Two-variable linear regression means that we have two variables that affect our outcome. The equation is as follows:

y = X1 * W1 + X2 * W2 + b

X1 = house sizes

X2 = number of rooms

W1 & W2 = weights or slopes (they determine how much each variable affects the outcome)

b = y-intercept or constant

The aim is to feed the model with data so that it learns the relationship between X1, X2 and the price. So basically in every step (called an epoch) your machine learns how to make better price predictions (y) by adjusting W1, W2 and b.
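
To make the idea of "adjusting W1, W2 and b in every epoch" concrete, here is a minimal sketch in plain Python rather than TensorFlow. It is not the TF code from this post. It assumes a mean squared error cost and plain batch gradient descent, and the names train, mean_squared_error and toy_rows are made up for illustration. The toy data is generated from known weights so it is easy to see whether the loop finds them.

```python
def predict(x1, x2, w1, w2, b):
    """The model from the text: y = X1 * W1 + X2 * W2 + b."""
    return x1 * w1 + x2 * w2 + b


def mean_squared_error(rows, w1, w2, b):
    """Average squared gap between prediction and target (assumed cost)."""
    total = sum((predict(x1, x2, w1, w2, b) - y) ** 2 for x1, x2, y in rows)
    return total / len(rows)


def train(rows, epochs=2000, learning_rate=0.1):
    """Batch gradient descent: one update of W1, W2 and b per epoch."""
    w1 = w2 = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w1 = grad_w2 = grad_b = 0.0
        for x1, x2, y in rows:
            error = predict(x1, x2, w1, w2, b) - y
            # Partial derivatives of the mean squared error.
            grad_w1 += 2 * error * x1 / n
            grad_w2 += 2 * error * x2 / n
            grad_b += 2 * error / n
        # Step against the gradient to reduce the cost.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
        b -= learning_rate * grad_b
    return w1, w2, b


# Toy data generated from y = 2*x1 + 3*x2 + 1, so the right answer is known.
toy_rows = [(x1, x2, 2 * x1 + 3 * x2 + 1)
            for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]]

w1, w2, b = train(toy_rows)
print(round(w1, 3), round(w2, 3), round(b, 3))
```

On this toy data the loop should end up very close to W1 = 2, W2 = 3 and b = 1. The learning rate of 0.1 is tuned to these small numbers. With raw inputs such as square metres and prices in SEK, a much smaller learning rate or some feature scaling would probably be needed to keep the updates stable.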

The TF code can be seen here. However, as this was my first trial, I'm not sure the model is entirely correct. Nevertheless, the results were as follows, and we can see that in each step (epoch) TF is adjusting the W1, W2 and b variables.

Epoch: 0010 cost= 3350779920384.000000000 W1= 28691.1 W2= 39272.6 b= 25952.6
Epoch: 0020 cost= 3328240254976.000000000 W1= 27202.7 W2= 74169.0 b= 50580.0
......
Epoch: 0980 cost= 2927946366976.000000000 W1= 8579.74 W2= 396538.0 b= 919026.0
Epoch: 0990 cost= 2927024930816.000000000 W1= 8576.84 W2= 396039.0 b= 921854.0
Epoch: 1000 cost= 2926118961152.000000000 W1= 8573.94 W2= 395549.0 b= 924639.0

After all these steps the ANN gives us the following values for W1, W2 and b:

W1= 8573.94
W2= 395549.0
b= 924639.0

In the following graph we see all our data and the straight line our model fits to it.

[Figure: blocket.png]

LET'S PREDICT

We have now trained the model for 1000 epochs (more epochs generally give a better fit). So let's use our model to predict a price for the house we want to sell. Our house is 50m2 and has 2 rooms. Training gave us the following W1, W2 and b values:

W1= 8573.94
W2= 395549.0
b= 924639.0

And the house we want to sell is 50m2 with 2 rooms:
X1 = 50
X2 = 2

And our equation was:
y = X1 * W1 + X2 * W2 + b

So let’s find the value of y (price).
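
The arithmetic can be checked with a few lines of Python. This is just the equation above with the trained values plugged in; the function name predict_price is made up for illustration.

```python
# Values reported after epoch 1000 in the training log.
W1 = 8573.94
W2 = 395549.0
B = 924639.0


def predict_price(size_m2, rooms):
    """y = X1 * W1 + X2 * W2 + b"""
    return size_m2 * W1 + rooms * W2 + B


price = predict_price(50, 2)
# 50 * 8573.94 + 2 * 395549.0 + 924639.0
print(f"{price:,.0f} SEK")  # 2,144,434 SEK
```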

According to our model, our house is worth 2,144,434 SEK, a little over 2 million Swedish kronor. As the model is not perfect, the real price might be a bit higher or lower. But imagine adding more parameters to the equation, like the age of the building, proximity to public transportation and so on. Predictions from an ANN will then become even more valuable for decision-making.