What are the tangible benefits, and where can we learn these techniques?
There is huge pressure on the real estate industry to unlock the potential of big data and incorporate machine learning and evidence-based approaches in their workflows. In the KPMG Global PropTech Survey 2018, 49% of participants thought that artificial intelligence, big data, and data analysis were the technologies likely to have the biggest impact on the real estate industry in the long term.
Consequently, some forward-thinking top executives at real estate institutions with long operating histories are pushing their firms to unlock the potential of their decades of records of transaction, valuation, asset management, listing, and other data. Simultaneously, the data provision space is maturing (possibly even overcrowding) with successful startups such as HouseCanary and Reonomy joining established players like CoStar and Real Capital Analytics, making it possible for any company interested in real estate to quickly obtain large amounts of relevant data.
However, as noted in a recent NAIOP article, real estate professionals are facing challenges figuring out how to actually utilize data. The KPMG Global PropTech Survey 2019 confirms that 80% of firms still do not have “most or all” of their decision making led by data. The same report also hints at a “skills gap” — only 5% of real estate firms have transformation efforts led by someone with knowledge of data analytics.
So, how exactly can we apply data science to real estate? What are the tangible benefits? And, where can we learn the skills and techniques that will allow us to harness the potential of big data in real estate?
Property Price Indices
Data science applications to investing have proliferated in finance, where data-driven computer models now account for up to 80% of trading, according to news reports and expert commentary. Unlike publicly listed equities, however, every reported transaction in real estate represents the exchange of a unique asset: no two properties are ever identical. Even two units in the same building can differ drastically in characteristics, and their prices can differ considerably as a result.
This presents a specific problem for real estate — how do we harness large data sets to understand individual sub-market performance? Taking simple averages of historical transactions can be biased if the types of properties transacted in each period vary, and there is subjectivity in determining what properties to include or exclude in the average, to the extent that different researchers could end up with different pictures of historical performance.
Data science methods present several solutions to the problem. Hedonic regression techniques (used in countries like Singapore) operate on the principle that individual characteristics of each property can be separately priced, to control for differences across assets.
Alternatively, analysis can be restricted to just comparing price changes on properties that are sold more than once. This is known as the repeat sales method — i.e., tracking the price change on the same asset over time. The US Case-Shiller indices are a well-known example of this technique.
Fundamentally, these methods allow users to exceed human capacity by working on more data than any one person could manually make sense of, producing accurate signals of property market performance. Millions of rows of noisy transaction data can be combined with information about locality, property characteristics, demographics, and more, to produce granular sub-market indices. For example, indices can pinpoint property returns in specific postal districts (e.g., a WC1 index or E1 index in London) or for specific property types (e.g., a 2-bed-condo index vs. a 3-bed-condo index), while taking into account the implications of all transactions in the full dataset. Indexation helps determine historical trends and, in turn, is useful for current pricing and future return estimation.
Automated Valuation Models
Statistical approaches to valuation are gaining traction globally, with some examples being the Zillow Zestimate in the US, UrbanZoom in Singapore, and SkenarioLabs in Finland. The goal of any automated valuation model is to harness data to produce an estimate of a property’s market value — where it would transact between a willing buyer and seller, at arm’s length, without compulsion.
Approaches similar to those in indexation are employed, with potentially more advanced data science techniques deployed to take advantage of online learning and ensemble methods. However, the final output is different: rather than an index, the goal is a point (or range) estimate of an asset's value. The direct benefit is greater precision on the fair market value of a property, produced instantaneously and at low cost. These valuations are useful not just for pricing properties, but also for assessing the mortgages and portfolios of loans secured against these assets.
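To illustrate the ensemble idea, the sketch below trains gradient-boosted models (scikit-learn) on synthetic data to produce both a point estimate and a valuation range via quantile loss. All features, coefficients, and the subject property are invented for illustration, not drawn from any real AVM.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([
    rng.uniform(300, 2500, n),   # floor area (sq ft)
    rng.integers(1, 6, n),       # bedrooms
    rng.uniform(0, 50, n),       # building age (years)
    rng.uniform(0, 10, n),       # distance to CBD (km)
])
price = (200 * X[:, 0] + 20_000 * X[:, 1] - 1_500 * X[:, 2]
         - 8_000 * X[:, 3] + rng.normal(0, 20_000, n))

X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)

# Ensemble AVM: one model for the point estimate, two quantile models
# to bound a valuation range rather than a single number.
point = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
lo = GradientBoostingRegressor(loss="quantile", alpha=0.1,
                               random_state=0).fit(X_tr, y_tr)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.9,
                               random_state=0).fit(X_tr, y_tr)

subject = [[1200, 3, 10, 4.0]]  # hypothetical subject property
est = point.predict(subject)[0]
band = (lo.predict(subject)[0], hi.predict(subject)[0])
print(f"Estimated value: {est:,.0f} (80% range {band[0]:,.0f} to {band[1]:,.0f})")
```

Reporting a range alongside the point estimate is useful in practice, since lenders and investors often care as much about valuation uncertainty as about the central figure.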
Automated valuation models help us understand the property market in the present, by helping to assess a fair transaction price for a deal today.
Time Series Forecasting
Time series methods help us understand where property markets are headed tomorrow. There are clear benefits to better forecasts — better investment and development deal making, and higher financial returns.
If we have only one set of data, for example, a single market’s property price index, we can use Autoregressive Integrated Moving Average (ARIMA) models to make short-term predictions. This type of model can assess seasonal variations, and identify the trends and patterns in the data to make estimates of future performance.
More often, we will have several related series: property price indices for a few different (but related) markets; macroeconomic series such as GDP, unemployment, and inflation; and financial indicators such as interest and mortgage rates, stock market indices, FX rates, and more. These variables can all influence one another, and in this case, forecasts can be built using Vector Autoregression (VAR) and Vector Error Correction Models (VECM). By doing this, we take into account the evolution of a broad range of factors in producing predictions of the future of property prices.
A number of analytics providers (e.g., HouseCanary and Real Estate Foresight) use data science methods to predict future real estate performance. In the realm of macroeconomic data, Capital Economics and Oxford Economics, among others, use some of these statistical methods to forecast the path of GDP, inflation, interest rates, and more — all of which are critical inputs into the real estate deal evaluation process.
Cluster Analysis
Real estate performance can differ dramatically across locations. Different countries can vary due to divergent macroeconomic situations. Cities within the same country may vary due to local factors such as economic activity or supply. Within a city, some neighborhoods or sub-sectors (e.g., luxury condos vs. mass market houses) can also perform very differently.
Cluster analysis rigorously identifies patterns in the data, helping to determine which groups of properties are likely to perform more similarly, and which are more likely to diverge.
Another application of cluster analysis is to determine time periods in which property market performance might be more or less similar. Many real estate markets are heavily affected by government intervention. There may be many significant changes in legislation, causing pricing and investment behavior to vary over time. Cluster analysis can help identify pockets of time in which pricing performance is likely to be more similar.
Cluster analysis helps us build targeted models for each group (or time period), increasing accuracy. It can also be used to guide business strategy — by determining what segments of the market different teams should target, or what investment regime the market is likely to be in, leaders can make more profitable decisions backed by data.
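A sketch of the grouping step, clustering synthetic properties on two hypothetical features (price per square foot and historical return volatility) with k-means; the feature choice and separation are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic sub-markets: mass-market and luxury properties, described
# by price per sq ft and return volatility (hypothetical features).
rng = np.random.default_rng(5)
mass = np.column_stack([rng.normal(500, 50, 100), rng.normal(0.05, 0.01, 100)])
lux = np.column_stack([rng.normal(2000, 200, 100), rng.normal(0.12, 0.02, 100)])
X = np.vstack([mass, lux])

# Standardize so both features contribute comparably, then cluster.
Xs = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs)

# Properties sharing a cluster are candidates for a shared pricing model.
print(np.bincount(labels))
```

In practice the number of clusters would itself be chosen from the data (e.g., via silhouette scores) rather than fixed in advance, and the resulting segments would each get their own index or valuation model.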
Geographic Information Systems (GIS)
Location is one of the most important factors in real estate analysis, and GIS tools such as Quantum GIS or ESRI’s ArcGIS help us visualize, understand, and analyze locality intelligence. With the rise of government open data sources, more information is available than ever before, from population migration by neighborhood to the location of public amenities, and more.
One example of a task solvable with GIS: load all property transactions within a given year together with the locations of all train stations, automatically determine which properties fall within a specified radius of a station, and statistically test whether those properties command higher per-square-foot pricing than properties further away.
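The station-radius test can be sketched as follows. For simplicity this uses planar coordinates and synthetic data rather than a full GIS stack; a real workflow would use projected coordinates in a tool like QGIS or a library like geopandas.

```python
import numpy as np
from scipy import stats

# Synthetic layout: properties and stations placed in a 10 km x 10 km square.
rng = np.random.default_rng(6)
n = 400
props = rng.uniform(0, 10_000, size=(n, 2))      # property x/y in metres
stations = rng.uniform(0, 10_000, size=(5, 2))   # five train stations

# Distance from each property to its nearest station.
d = np.min(np.linalg.norm(props[:, None, :] - stations[None, :, :], axis=2),
           axis=1)
near = d <= 800  # within an 800 m radius of some station

# Synthetic per-sq-ft pricing with a genuine station premium baked in.
psf = 1_000 + 150 * near + rng.normal(0, 100, n)

# Welch two-sample t-test: do near-station properties price higher psf?
t_stat, p_value = stats.ttest_ind(psf[near], psf[~near], equal_var=False)
print(f"near mean={psf[near].mean():.0f}, "
      f"far mean={psf[~near].mean():.0f}, p={p_value:.4f}")
```

With real data, the same spatial-join-then-test pattern applies; the main added work is projecting latitude/longitude into a metric coordinate system before computing distances.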
GIS can also be used to estimate commute times or find properties matching specified criteria. Site selection (e.g., finding good retail locations based on population characteristics, transport links, and even competitor placement) is likewise enabled by GIS.
A quote from a UK REIT sums it up best: “If you don’t use data efficiently and effectively then you will miss a huge amount of value in your business/market. Others will not make this mistake and you will become increasingly uncompetitive. Ultimately the world is only going one way.” (KPMG Global PropTech Survey 2019)
This revolution has occurred in other industries. In finance, Bridgewater and Renaissance Technologies were early entrants in systematic investing, and have been hugely successful for decades. Today, data-driven quantitative investing in public markets is the norm rather than the exception. Even sports has become extremely statistically driven, and team draft decisions are often driven by sophisticated analysis and modeling. The movement towards greater reliance on data-driven decision making in real estate is to some extent inevitable.
For real estate, there are a number of commercial opportunities and trends to note.
The hypothesis that data itself is valuable is well supported by the success of companies such as Teranet and Compstak. Even collating, cleaning, and organizing data — public or otherwise — is a source of significant value, as seen in the rise of companies such as Cherre and Realyse. Converting raw data into usable analytics is another source of value, as demonstrated by companies like Walk Score and Local Logic.
The rise of automated valuation presents both an opportunity and a challenge for appraisal. In some sectors, basic valuation may begin to shift towards heavier reliance on statistical models, which are cheaper and faster to run. At the same time, new business models are emerging, such as that of the instant real estate buyers (iBuyers). This, in turn, could have implications for those whose livelihood depends on intermediating these traditionally opaque and illiquid markets.
Forecasting and analysis are opening new opportunities in real estate investing. Skyline.ai is a leading — but nascent — example of using data-driven methods to invest, and there is great room to grow. In finance, one-third of the USD 3 trillion hedge fund industry is run using quantitative strategies (Man Institute). Meanwhile, the top 100 real estate funds manage more than USD 3 trillion, yet to date essentially none of this capital is quantitatively invested.
So, where can we learn these methods, and harness the potential for big data in real estate?
Most options cover general data science, without a specific focus on real estate or the techniques listed above. But they offer a good starting point from which one can continue to build and self-learn the additional methods required.
In this field, General Assembly offers on-campus courses (as well as online options) in multiple global locations, conducting both full-time and part-time generalist data science courses. Coursera offers pre-recorded self-study videos on GIS as well as data science. There are also formal 1- to 2-year programs run by universities.
Alternatively, PropertyQuants offers a rapid 11-week live interactive online course specifically on “Applying Data Science and Machine Learning to Real Estate.” It includes a bootcamp module to help participants get started in the world of programming and data science, which is followed by real estate data science and GIS modules, thus covering all the major techniques listed above.
Participants also get to work on a capstone project of their choosing, producing real-world analysis using the methods taught in the course. Classes are supplemented by 1:1 meetings and graded assignments, to ensure participants fully understand the course material. This is perhaps the only course today that specifically focuses on the applications of data science to real estate.
The real estate industry is likely just at the beginning of a significant shift towards greater use of data and data-driven decision making. There are huge opportunities that are now starting to be unlocked by various startups and forward-thinking institutions. There is a range of concrete methods — as outlined above — to apply data science to real estate, to help move from millions of rows of data to granular understandings of past, present, and future real estate submarket performance, and make superior investment and business decisions.
However, the required skills are still scarce across much of the industry. There is now the opportunity to learn these techniques and methods — specifically for real estate — and investing the time to upgrade could benefit a range of participants. Real estate researchers could begin to use data and machine learning to produce game-changing insights and unlock the value of large datasets. Those in the Proptech industry (or investing in Proptech) could do well to understand these methods better and build (or back) disruptive ventures. Finally, real estate investors who learn these methods could use data-driven approaches to find exceptional opportunities and beat the market.