AI and Real Estate: Predicting Rental Prices in Amsterdam

Deciding if an apartment is worth the price is never easy. Can machine-learning help us understand where we stand in the housing market?

Brunna Torino

Renting or buying a new house, whether you are a university student or a middle-class family, is always a daunting process that often seems impulsive or risky (a true economic market for lemons).

If renting is already hard in itself, doing it in Amsterdam doesn’t make it any easier. With increasing city regulations, long waitlists for student housing and overpopulation, renting an apartment in Amsterdam leaves many desperate for the first opportunity available, becoming vulnerable to scams and overpriced contracts.

In this tutorial, I will go through an entire data science project for the rental market of Amsterdam, from the basics of gathering data, data cleaning and visualization, up to using a machine-learning method to develop valuation models for the city’s houses. Feel free to adapt the code and apply the project to your own city to understand a bit more about where you stand as a renter/buyer!

The actual statistical methods you employ in your analysis are your own judgment call, and I will link deeper explanations of all the methods I apply, so make sure you check them out!

With (high-quality) data, it’s the more, the merrier. Machine-learning is a special field of statistics where we apply computer algorithms to very large datasets. After you have established the questions you want answered (should I rent? should I buy? should I move to another city?), you can start looking for websites that contain the data you need to answer them.

In my case, I wanted to find a good-value apartment in Amsterdam. Therefore, I searched for rental websites in the city of Amsterdam. Simple, right? However, always check their terms of service and robots.txt to make sure that you are allowed to scrape their data respectfully (we will talk about what this means later on in the tutorial). For this project, I will use the Amsterdam rental website Pararius.
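If you want to check robots.txt programmatically, Python’s built-in robotparser can do it. A minimal sketch, where the Pararius paths below are only illustrative placeholders, so check the real robots.txt and listing URL of whatever site you pick:

from urllib.robotparser import RobotFileParser

## the URLs below are illustrative placeholders, not necessarily the site's real paths
rp = RobotFileParser()
rp.set_url("https://www.pararius.com/robots.txt")
rp.read()

## True means robots.txt allows a generic crawler ("*") to fetch this listing page
print(rp.can_fetch("*", "https://www.pararius.com/apartments/amsterdam"))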

Pararius rental page for Amsterdam

Find the listing page you want to get your data from. Right-click and choose Inspect to learn more about how the data is structured on the website. Once you click, the sidebar will show the HTML element where the data is located. In the following example, the data I need is mainly inside the box <li class="search-list__item search-list__item--listing">. This means that I will refer to this data using this information later in the code. At this moment, we don’t need to specify whether we want price, location, zipcodes… we only want to know the “box” where all this data is.

Pararius rental page for Amsterdam

Now, we are going to get started with the coding! As with any web-scraping project, we start by making requests to the website we chose previously so that it can provide us with the data we want.

We should keep in mind that making a request to a website is similar to refreshing our browser on that specific page: it adds traffic to their servers and may overwhelm them if done at bot-like rates! I added random sleep times to the code so the script pauses for a few seconds before scraping the next page.
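Here is a minimal sketch of what such a scraping loop could look like. The page-numbered URL pattern, the number of pages and the CSS class name are assumptions based on the inspection step above, so adapt them to what you actually see on the site:

import time
import random
import requests
from bs4 import BeautifulSoup

houses = []  ## will hold one ad block per house across all pages

for page in range(1, 21):  ## the number of listing pages is an assumption
    ## hypothetical URL pattern: check the real pagination links on the site
    url = f"https://www.pararius.com/apartments/amsterdam/page-{page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    ## the class name comes from the <li> "box" we inspected earlier
    house_data = soup.find_all("li", class_="search-list__item--listing")
    houses.extend(house_data)

    ## random pause so we don't hammer the website's servers
    time.sleep(random.uniform(3, 8))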

If you are particularly interested in how requests and headers work, you can also make your requests with fully randomised headers. This will not be necessary for most websites (and should not break their TOS), but if you would like to add security and anonymity to your code, one approach is to randomise your request headers as much as possible.
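A minimal sketch of that idea: keep a small pool of User-Agent strings and pick one at random for every request (the agent strings below are purely illustrative placeholders):

import random
import requests

## a small illustrative pool of User-Agent strings; extend it as much as you like
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

## usage: pass the headers to every request
## response = requests.get(url, headers=random_headers())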

After scraping all the desired pages, you can run:

len(houses)

to find out exactly how many house ads were collected in total. Because we are training a model with this data later on, you should aim for at least 1000 housing ads.

After you have gathered all the data, you should run a few print commands to make sure that everything worked out:

print(response) will show whether the request was successful (i.e. that we were not blocked by the website)

len(houses) will print how many house ads you successfully scraped

print(house_data[1]) will print the second ad block you scraped in HTML format. I always prefer to look at the second one because the first might contain headers and confusing bits for the next part of our analysis: the data cleaning.

Data Cleaning

When I execute print(house_data[1]) I get this in my Jupyter Lab:

Ok, don’t run away from this tutorial just yet. What you’re seeing here is the beautiful HTML script that was scraped from your listing website! Look further down into it:

We can actually recognise a few things there! The apartment in this ad block seems to be located at 1078 RA Amsterdam, it costs 1500 euros a month and is 60m²! For the data cleaning, we need to find this information in the soup of HTML and write down where they are located (just like we did earlier with the website!)

For example: to get the price, you need to search for <span class="listing-search-item__price">, and to get the location you need <div class="listing-search-item__location">. You should do the same with all the information that you need for your analysis. However, just looking for the HTML tag and class can return more information than you’d like. Make sure to try adding [0], [1], [2], … to test which index will give you exactly the line you are looking for.
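As a minimal sketch of that extraction (assuming houses is the list of ad blocks gathered while scraping, and using the hypothetical column names house_price and address):

import pandas as pd

house_prices = []
addresses = []

## `houses` is the list of BeautifulSoup ad blocks collected while scraping
for house in houses:
    ## .find() returns the first element matching this tag + class inside the ad block
    price_tag = house.find("span", class_="listing-search-item__price")
    location_tag = house.find("div", class_="listing-search-item__location")

    ## .get_text(strip=True) drops the HTML tags and surrounding whitespace
    house_prices.append(price_tag.get_text(strip=True) if price_tag else None)
    addresses.append(location_tag.get_text(strip=True) if location_tag else None)

df = pd.DataFrame({"house_price": house_prices, "address": addresses})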


A lot of the time, even after finding the right line, there will be extra characters that you want to clean out of the data frame, such as letters in the rent price or whitespace in the postcode. Check out the amazing regex tester (https://regex101.com/) to learn how to str.replace() your problem away. A few examples:

  • to delete non-digit characters:
df["column name"].str.replace(r"\D", "", regex=True)
  • to delete digit characters:
df["column name"].str.replace(r"\d", "", regex=True)
  • to delete the word “new”:
df["column name"].str.replace("new", "")

This is what our data frame looks like after cleaning:

It is not only important to have the data, but also to know how to use it. What is important for tenants? What can make the difference in rental prices? Depending on the data you scraped earlier, these variables might be available to you:

  • Surface Size
  • Number of Bedrooms
  • If the apartment comes furnished (binary)
  • If the price is inclusive of utility bills (binary)
  • Distance to City Center
  • Trendiness of the neighbourhood
  • Rental Agency (binary)
  • Temporary vs. Long term contracts. (binary)

In this tutorial, we will focus on how to measure these variables. I won’t be able to use furniture, utility bills or length of contract for this analysis, as that data is not available from the website I am using, but if those are available to you I highly recommend including them.

Location, location, location. Does the location of apartments really matter that much when renting a house? Many people in Amsterdam would agree, as driving cars is impractical in the charming medieval streets, public transport is not very affordable and biking under windy rain just sucks.

But how do you measure that effect on rental prices? Geolocation! In this tutorial, we will use Nominatim to get the coordinates of every apartment in our dataset and later compare them with a (desirable) point in the city.

Here’s the code:
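This is a minimal sketch of that step rather than the original snippet: it assumes geopy’s Nominatim geocoder and an address column called address, and it produces the location, point, latitude, longitude and altitude columns referenced below:

import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

## Nominatim asks for a descriptive user agent and roughly one request per second
geolocator = Nominatim(user_agent="amsterdam-rental-analysis")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

## geocode every address; failed lookups simply return None
df["location"] = df["address"].apply(geocode)
df["point"] = df["location"].apply(lambda loc: tuple(loc.point) if loc else None)

## drop apartments that Nominatim could not find, then split the
## (latitude, longitude, altitude) tuple into separate columns
df = df.dropna(subset=["point"])
df[["latitude", "longitude", "altitude"]] = pd.DataFrame(df["point"].tolist(), index=df.index)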

The idea here is to get a latitude and longitude for each address in your data frame, so we can calculate the distances as precisely as possible. This is what you should end up with:

It should be noted that the columns point, location, and altitude will be deleted later. Point is only needed to obtain the latitude and longitude values, and altitude is not needed since we are researching houses in the Netherlands! (But it would be an interesting factor to take into account if you live in Switzerland, for example.)

Let’s calculate the distance: by now, you should have chosen a point in your city to calculate the distance between the apartments and this specific point. For Amsterdam, I chose the Amsterdam Centraal station with the following coordinates (centre point):

After choosing your point and getting its coordinates (also via Nominatim), you will create two columns with the latitude (52.370216) and the longitude (4.895168) of that point.
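A sketch of the distance calculation itself, using geopy’s geodesic distance; the helper columns lat2, lon2, coord1 and coord2 match the ones deleted further down, while the name distance_centre is just an illustrative choice:

from geopy.distance import geodesic

## Amsterdam Centraal as the reference point
df["lat2"] = 52.370216
df["lon2"] = 4.895168

## pair up the coordinates and compute the geodesic distance in kilometres
df["coord1"] = list(zip(df["latitude"], df["longitude"]))
df["coord2"] = list(zip(df["lat2"], df["lon2"]))
df["distance_centre"] = [geodesic(a, b).km for a, b in zip(df["coord1"], df["coord2"])]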

If you are analysing a bigger city that has multiple locations that are considered desirable, you can also run this code as many times as needed with different geographical points. (Don’t forget to change the column names so you don’t overwrite the previous point!).

For example, there is a financial district close to the Amsterdam Zuid station that could be equally (or even more) relevant to working tenants than living close to the city center. Measuring these various scenarios matters more if you are using methods like multiple linear regression rather than machine-learning algorithms, because the latter are inherently better at recognising non-linear relationships and clusters. For this reason, I won’t include it in this analysis, but it is an interesting factor to weigh depending on the statistical method being used.

Now that we have the geolocations of all the apartments in our dataset, we can further visualize how rental prices are distributed geographically and spot any trends that might be relevant for our analysis. We will do this with Google Maps, using the package gmaps for Jupyter:

conda install -c conda-forge gmaps ## to install the google maps package

You will also need a Google Maps API key that is easily requested (and free for most purposes). You can click here to request it.

With the simple code below, we can render an interactive Google Map in a Jupyter Notebook (it sometimes does not work correctly in JupyterLab) and add a heat-map layer that tells us where the highest rents in the city are.
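A minimal sketch of that heat map, assuming the latitude/longitude columns from the geocoding step and a numeric rent column (replace the API key placeholder with your own):

import gmaps

gmaps.configure(api_key="your-google-maps-api-key")

## one (latitude, longitude) pair per apartment, weighted by its monthly rent
## (house_price is assumed to already be numeric; otherwise convert it with pd.to_numeric first)
locations = df[["latitude", "longitude"]]
weights = df["house_price"]

fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations, weights=weights, point_radius=20))
fig  ## renders the interactive map in a Jupyter Notebook cell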

We should get this beautiful geographical representation of rental prices in Amsterdam:

Analysis of the heat map: As expected, most of the higher rents are concentrated around the city center, specifically De Wallen, and also around outer neighbourhoods such as De Pijp and around the city park, Vondelpark, where there is a concentration of luxury houses. The further-out neighbourhoods of Nieuw-West, Zuidoost, IJburg and Noord appear to have lower rental prices, represented in green. Location seems to have a strong effect on rental prices, but it is definitely not the sole factor.

For more examples and tutorials on gmaps click here.

We should also take a look at how the variables in our data frame relate to house prices. This is the code to check the relationship between house prices and surface area:

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(amsmodel1['surface'],amsmodel1['house_price'], s=20, edgecolor="black",c="darkorange", label="surface")
plt.xlabel("Surface Area")
plt.ylabel("House Price")
plt.title("Surface Area vs. House Price")
plt.legend()
plt.show()

When it comes to the surface area of the apartment, there is a very clear upward-sloping relationship between the two! However, it should be noted that:

As the house gets bigger, the marginal price of an additional square meter decreases dramatically. Thus, going for a bigger house (if you can afford it) will almost always get you more bang for your buck.

This is the code to check the relationship between house prices and number of bedrooms:

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(amsmodel1['bedrooms'],amsmodel1['house_price'], s=20, edgecolor="black",c="darkorange", label="bedrooms")
plt.xlabel("Bedrooms")
plt.ylabel("House Price")
plt.title("Bedrooms vs. House Price")
plt.legend()
plt.show()

In a conclusion somewhat similar to the surface area, the more bedrooms a house has, the higher the rental price should be. However, that is also not the only factor, as there are 5-room apartments going for as little as €3,000 a month and others as high as €10,000.

If you look into your data frame now, you will notice a lot of columns that were “step” columns in order to get more information about the houses. We can delete all these columns as we will not be using them anymore.

del df5['address']
del df5['address2']
del df5['altitude']
del df5['latitude']
del df5['longitude']
del df5['point']
del df5['lat2']
del df5['lon2']
del df5['coord1']
del df5['coord2']
del df5['location']

In every large city, a few neighbourhoods seem to be very popular (and thus have exceptionally high rents) even though they are not close to the city center or necessarily populated by bigger apartments. An example that can be found in our Google Maps heat map is De Pijp, which lies outside the Amsterdam ring and mainly offers small, non-renovated apartments, yet has a higher average rent than the apartments to the right of Centraal Station.

How do we measure the effect of popularity in a quantitative way?

One way is by using Yelp! Popular neighbourhoods tend to have popular bars and restaurants, with high ratings and potentially also high prices. Most people that pay higher rents to live in De Pijp justify it with:

“that’s where the life of the city is!”

“that’s where all the cool bars and restaurants are”.


Yelp can help us by giving us hundreds of restaurants and telling us exactly how well-rated they are and how much they cost in a standardised and easily quantifiable way: $ are cheap eats, $$ and $$$ are in the middle, and $$$$ are expensive. Furthermore, this also helps us understand the neighbourhoods that may not be popular at the moment, but are traditionally richer areas (with a lot of $$$$ restaurants) that consequently charge higher rents.

You will need to register for the Yelp API, and replace my api_key with yours, and also replace location with the city you are analysing. It is also possible to build a second outer loop to get data from different cities. If you have a data frame with the column city, you can transform that column into a list and iterate the requests over that list!

import requests
import json

api_key = 'your-api-key-here'
headers = {'Authorization': 'Bearer %s' % api_key}
url = 'https://api.yelp.com/v3/businesses/search'

## creating global empty lists so we don't overwrite them but keep adding data to them
rating = []
zipcode = []
cities2 = []  ## only needed if you later loop over multiple cities
prices = []

## offset tells Yelp how many results to skip, so stepping it by 50 pages through
## the results (the API returns at most 50 businesses per request and 1,000 per search)
offset = 0

## loop to iterate over 7 pages of 50 businesses each = 350 businesses in Amsterdam
while offset < 350:
    params = {'term': 'Restaurants', 'location': 'amsterdam', 'limit': 50, 'offset': offset}
    req = requests.get(url, params=params, headers=headers)
    parsed = json.loads(req.text)

    n = 0
    while n < 50:
        try:
            price_data = parsed["businesses"][n]['price']
            ratings_data = parsed["businesses"][n]['rating']
            zipcode_data = parsed["businesses"][n]["location"]["zip_code"]

            rating.append(ratings_data)
            zipcode.append(zipcode_data)
            prices.append(price_data)
        except (KeyError, IndexError):
            ## some businesses are missing the price or zipcode fields, so we skip those
            pass
        n += 1

    offset += 50

After gathering the data from Yelp, we need to match that with the rental data that we already have. Here’s how to do it:
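The snippet below is one hedged way to do the matching rather than the original code: average the Yelp ratings and price levels per zipcode (mapping ‘$’ to ‘$$$$’ onto a 1 to 4 scale) and map those averages onto each apartment’s 4-digit postcode. It assumes the rental data frame is called amsmodel1 and that postcode2 holds the 4-digit postcode, matching the column names used later on:

import pandas as pd

## build a Yelp data frame from the lists filled in the loop above
yelp = pd.DataFrame({"zipcode": zipcode, "rating": rating, "price": prices})

## turn the '$'..'$$$$' symbols into a 1-4 numeric scale
yelp["price"] = yelp["price"].str.len()

## Dutch zipcodes look like "1078 RA"; keep only the 4-digit part to match the listings
yelp["zipcode"] = yelp["zipcode"].str.extract(r"(\d{4})", expand=False)

## average rating and price level per zipcode
by_zip = yelp.groupby("zipcode")[["rating", "price"]].mean()

## map the averages onto each apartment via its 4-digit postcode
amsmodel1["yelp_ratings"] = amsmodel1["postcode2"].astype(str).map(by_zip["rating"])
amsmodel1["yelp_prices"] = amsmodel1["postcode2"].astype(str).map(by_zip["price"])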

After mapping the Yelp data into our data frame, dropping any empty rows and checking how many columns we have, this is what the data frame looks like:

Now that we have a dataset with more information on Yelp prices and ratings, we can visualize their relationships with house prices to understand whether these metrics are likely to improve our model.

The following code scatter-plots Yelp ratings against house prices:

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(amsmodel1['yelp_ratings'],amsmodel1['house_price'], s=20, edgecolor="black",c="darkorange", label = "yelp")
plt.xlabel("Yelp Ratings")
plt.ylabel("House Price")
plt.title("Yelp Ratings vs. House Price")
plt.legend()
plt.show()

The following code scatter-plots Yelp prices against house prices:

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(amsmodel1['yelp_prices'],amsmodel1['house_price'], s=20, edgecolor="black",c="darkorange", label="yelp")
plt.xlabel("Yelp Prices")
plt.ylabel("House Price")
plt.title("Yelp Prices vs. House Price")
plt.legend()
plt.show()

There definitely does not seem to be a strong linear relationship when inspecting only the two variables in isolation. Hopefully, the machine-learning algorithm will let us judge whether adding these variables has an effect on house prices after all, albeit one more complex than linear.

Are we finally ready to train the model? Nope. Our model will only take numeric values, and right now we have a few columns that are categorical (in other words, they don’t make sense numerically, only as categories for our apartments). We need to create something called dummy variables to represent these categories in a way that the computer can understand numerically.

Here’s a quick (but great) explanation about dummy variables and what they do: https://medium.com/@brian.collins0409/dummy-variables-done-right-588f58596aea

Here’s the code:

## creating dummy variables for the categorical columns "rental agency" and "postcode"
dummies = pd.get_dummies(amsmodel1.postcode2, prefix=['p'])
amsmodel1 = pd.concat([amsmodel1,dummies],axis = 1)
dummies2 = pd.get_dummies(amsmodel1.rental_agency,prefix=['ag'])
amsmodel1 = pd.concat([amsmodel1,dummies2],axis = 1)
del amsmodel1['rental_agency']
del amsmodel1['postcode2']
del amsmodel1['postcode']
amsmodel1['house_price'] = pd.to_numeric(amsmodel1['house_price'])
amsmodel1 = amsmodel1.dropna()
len(amsmodel1.columns)

Now we have a staggering 327 columns! In data science projects, we should scrutinise every step of our analysis to prevent bias and other misinterpretations. With high-dimensional data frames, we need to consider the curse of dimensionality, which can confuse machine-learning methods: the points end up so far apart that they all look the same, and no real conclusion or differentiation can be drawn from the analysis. If our dataset suffers from this problem, it could lower the accuracy of the random forest algorithm.

The most accepted rule of thumb is that we should have at least 5 training data points for every feature in our dataset. In this project, we have:

training data set = 3376 * 80% = 2700.8
ratio of training points to features = 2700.8 / 327 = 8.26

which (fortunately) passes our rule of at least 5 training points per feature!
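As a quick sanity check in code (assuming amsmodel1 is the final data frame and an 80% training split):

## quick sanity check of the training-points-per-feature ratio
n_rows, n_features = amsmodel1.shape
training_points = n_rows * 0.8  ## rough size of one training split
print(round(training_points / n_features, 2), "training points per feature")  ## should be at least 5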

At this point, our data frame should be looking like this:

We have 3376 rows, 327 columns, and data about the house prices, number of bedrooms, surface area, distance from the city center, mean yelp prices and ratings of their areas, dummy variables for postcodes and dummy variables for rental agencies.

The Random Forest algorithm is an ensemble learning method that builds decision trees, splitting the data and testing each decision to understand the weight of each feature, hopefully capturing the true effect of each feature of the dataset on the target (in this case, the house prices). Here is a great article about this algorithm.

With random forest, I will also combine the K-Fold Cross Validation approach (with K = 10), which means we slice the data into ten parts, train on nine of them and test against the tenth, using a different slice of that ten-piece pie as the test data in each iteration. This makes the most of our dataset and gives more reliable results. More about Cross Validation here.
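Before running the model, we need the target vector, the feature matrix and the list of feature names as plain NumPy arrays/lists. A minimal sketch, assuming amsmodel1 is the final data frame from above (feature_list is reused further down when we export one of the decision trees):

import numpy as np

## target: the house prices; features: every other column in the data frame
target = np.array(amsmodel1['house_price'])
feature_list = list(amsmodel1.drop('house_price', axis=1).columns)
features = np.array(amsmodel1.drop('house_price', axis=1))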

For this project, I will be using the open source package sklearn to apply the algorithm to the dataset. Here’s the code:

## RANDOM FOREST - KFOLD AND MODEL
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

## features, target and feature_list are the arrays/lists built from the data frame above
kf = KFold(n_splits=10, random_state=42, shuffle=True)
accuracies = []

for train_index, test_index in kf.split(features):
    data_train = features[train_index]
    target_train = target[train_index]
    data_test = features[test_index]
    target_test = target[test_index]

    ## note: newer scikit-learn versions use criterion='squared_error' instead of 'mse'
    rf = RandomForestRegressor(n_estimators=1000, random_state=42, criterion='mse', bootstrap=True)
    rf.fit(data_train, target_train)

    predictions = rf.predict(data_test)
    errors = abs(predictions - target_test)
    print('Mean Absolute Error:', round(np.mean(errors), 2))

    mape = 100 * (errors / target_test)
    accuracy = 100 - np.mean(mape)
    print('Accuracy:', round(accuracy, 2), '%.')

    accuracies.append(accuracy)

average_accuracy = np.mean(accuracies)
print('Average accuracy:', average_accuracy)

After a few minutes, you should start getting a few results. This is what I got with the Amsterdam dataset:

An average accuracy of 93.88% is not bad, right? Let’s understand how the algorithm achieved this!

## SAVING THE DECISION TREE 

from sklearn.tree import export_graphviz
import pydot
tree = rf.estimators_[5]
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')

If you open the image tree.png, you should get something like this: