Accessing and analysing Six Nations Rugby data, as well as predicting the scores of the remaining postponed matches
The Six Nations Rugby Championship is the annual rugby union competition contested by the national teams of England, France, Ireland, Italy, Scotland, and Wales. The 2020 tournament was scheduled to conclude on 14 March 2020; however, due to the COVID-19 pandemic, Italy’s penultimate match against Ireland and all of the final-round matches were postponed with the intention of being rescheduled. It was the first time more than one match had been delayed since the outbreak of foot-and-mouth disease in 2001. Given this, the Welsh rugby fan and data scientist in me became curious about the data that surrounds the championship and what is possible to achieve with it. Here’s a piece I wrote about my analysis of Six Nations Rugby data, as well as predicting the scores of the 2020 postponed matches.
The first and foremost problem to overcome was obtaining some data I could use to model rugby matches. Given that the championship dates back to 1882, I decided early on that looking into every match ever played would take considerable time and effort. As the championship officially became the Six Nations in 2000, I turned to Google and started looking for 20 years’ worth of rugby data.
After looking online, I came across the ESPN Scrum website which contained basic data for all the Six Nations matches going back to 1882.
What immediately stood out was the fact that the data is in tabular format. When I see the word ‘table’, I can’t help but internally squeal with joy. Tables are our friends and make working with data so much cleaner and less of a pain. That said, this is only true when the source of the data is well structured.
Let’s check out what more the website has to offer. When you click on a championship year, in this case, 2020, you’re redirected to another page which contains a table with the results to date. We can see that the last four matches between Ireland and Italy, Wales and Scotland, Italy and England, and France and Ireland have no score associated with them as they were postponed. These are the ones we want to predict.
What’s important to consider at this stage is the available data; that is, are the variables useful for modelling matches? What we currently know is, for each team, whether the match was played at home or away, the score, and whether the team won or lost. At this stage, these attributes might be enough to make reasonable predictions. However, it’d be interesting to focus on team-level variables, such as the overall number of caps per team, and player-level variables, such as height, weight, recent performances, etc. We might add these later if we can find a reliable data resource with easily accessible data. But first, let’s extract this information from the ESPN website.
The ESPN website doesn’t offer any kind of downloadable data, so it was time to dive into the webpage’s source and brush up on my web scraping skills!
What’s nice about this website is that the HTML tables for the scoreboards between 2000 and 2020 all use the same ID. This means that, given all 20 pages, I can loop over the URLs and apply the same pre-processing steps to access each table.
First, I copied all the scoreboard URLs between 2000 and 2020 and stored them in an array. Then, I iterated over the array, making GET requests to fetch the raw HTML content from each page. Using BeautifulSoup, I parsed the HTML content so I could find the specific scoreboard table using its ID.
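The scraping step can be sketched as follows. In the real notebook each page would be fetched with `requests.get(url)`; here a toy HTML snippet stands in for a fetched page, and the table id (`scoresTable`) is an assumption for illustration — use the id you see in the actual page source.

```python
from bs4 import BeautifulSoup

# In the real loop, each page would be fetched with, e.g.:
#   response = requests.get(url); html = response.text
# A toy snippet stands in for a fetched ESPN page here, and the
# table id ("scoresTable") is an assumed placeholder.
html = """
<table id="scoresTable">
  <tr><th>Match</th><th>Date</th></tr>
  <tr><td>Wales 42 - 0 Italy</td><td>1 Feb 2020</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="scoresTable")  # locate the scoreboard by its id
table_html = str(table)                       # keep as a string for pandas later
```

Because every scoreboard page shares the same table id, the same `soup.find` call works unchanged inside a loop over all 20 URLs.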
Now that I have all the HTML tables in string form, I can parse them into Pandas using the read_html function. Because this function outputs a list of dataframes, I can combine them into one main dataframe using the concat function. I then delete the irrelevant columns that have snuck in from the HTML.
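A minimal sketch of that parse-and-combine step, using two toy table strings in place of the scraped ones:

```python
from io import StringIO

import pandas as pd

# Two toy scoreboard tables standing in for the scraped HTML strings
html_tables = [
    "<table><tr><th>Match</th><th>Date</th></tr>"
    "<tr><td>Wales 42 - 0 Italy</td><td>1 Feb 2020</td></tr></table>",
    "<table><tr><th>Match</th><th>Date</th></tr>"
    "<tr><td>Ireland 19 - 12 Scotland</td><td>1 Feb 2020</td></tr></table>",
]

# read_html returns a list of dataframes per string; take the first of each
frames = [pd.read_html(StringIO(h))[0] for h in html_tables]
df = pd.concat(frames, ignore_index=True)  # one main dataframe
```

Any stray columns picked up from the HTML can then be dropped with `df.drop(columns=[...])`.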
Some of the rows have empty dates, which is also reflected in the tables displayed on the ESPN website. I can fill those missing dates with the last known date in the column by using the ffill function.
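The forward fill looks like this on a toy dataframe (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date":  ["1 Feb 2020", None, "8 Feb 2020", None],
    "match": ["Wales v Italy", "Ireland v Scotland",
              "Ireland v Wales", "Scotland v England"],
})

# Forward-fill: each missing date takes the last known date above it
df["date"] = df["date"].ffill()
```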
To predict the scores achieved by each country in the postponed matches, we’re going to have to separate each match record so that each team is represented on an individual row. Before we do that, we also have to consider that the order of the countries’ names represents which country was playing at home or away. As the team names have a dash as a constant delimiter, we can split the string on the dash and parse the left name into the home_team column and the right name into the away_team column. Now that the teams are split, we also have to remove the scores from the team names. We can use regex to extract the numbers from the team names and parse them into the home_team_score and away_team_score columns.
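The split-and-extract step can be sketched like this. The `match` column name and the exact string format are assumptions based on the scraped layout described above:

```python
import pandas as pd

# Toy "match" column in the scraped format: "HomeTeam score - score AwayTeam"
df = pd.DataFrame({"match": ["Wales 42 - 0 Italy", "England 13 - 6 Scotland"]})

# Left of the dash is the home side, right is the away side
parts = df["match"].str.split("-", expand=True)

# Regex pulls the digits into score columns and the letters into team names
df["home_team_score"] = parts[0].str.extract(r"(\d+)", expand=False).astype(float)
df["away_team_score"] = parts[1].str.extract(r"(\d+)", expand=False).astype(float)
df["home_team"] = parts[0].str.extract(r"([A-Za-z][A-Za-z ]*)", expand=False).str.strip()
df["away_team"] = parts[1].str.extract(r"([A-Za-z][A-Za-z ]*)", expand=False).str.strip()
```

Using `float` for the score columns matters here: the postponed matches have no digits to extract, so their scores become NaN, which is exactly what we key off later when splitting training from testing data.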
We also want to know the name of the ground that each match was played in. Given that this data is in another table on another page, I’m going to play the lazy card and just assume that the matches have been played in the same stadium in each country for the past 20 years. I can associate the name of the ground with the home team by iterating over the teams and assigning the corresponding value from my stadium dictionary.
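A sketch of that lookup; the stadium names are my own assumed mapping and may differ from the real notebook:

```python
import pandas as pd

df = pd.DataFrame({"home_team": ["Wales", "England", "France"]})

# Assumed home grounds -- one fixed stadium per country, as per the
# simplification in the text (names are illustrative)
stadiums = {
    "Wales": "Millennium Stadium",
    "England": "Twickenham",
    "France": "Stade de France",
    "Ireland": "Aviva Stadium",
    "Italy": "Stadio Olimpico",
    "Scotland": "Murrayfield",
}
df["stadium"] = df["home_team"].map(stadiums)
```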
Based on this last dataframe, we know that if the home_team_score is not null, the game was played; if it is null, the match was postponed. We can split our data into training and testing sets based on this rule.
Let’s focus on the training data for a second. For the next part of this blog, I’m going to need to know which country won each match. This won’t be included as a variable in the game score prediction. To do this, I iterate over the training dataframe and record the winning team in a new winner column.
Finally, to split the data so that each country is a row within the dataframe, I can select the countries into separate frames based on whether they were playing home or away and append them on top of one another.
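Both steps can be sketched together. The `winner` column name and the vectorised `np.where` in place of an explicit loop are my own choices here:

```python
import numpy as np
import pandas as pd

# Toy training rows in the pre-split, one-row-per-match shape
train = pd.DataFrame({
    "home_team": ["Wales", "England"],
    "away_team": ["Italy", "France"],
    "home_team_score": [42.0, 17.0],
    "away_team_score": [0.0, 24.0],
})

# Record which side won (draws labelled separately)
train["winner"] = np.where(
    train["home_team_score"] > train["away_team_score"], train["home_team"],
    np.where(train["away_team_score"] > train["home_team_score"],
             train["away_team"], "draw"),
)

# One row per team: build a "home" frame and an "away" frame, then stack them
home = train.rename(columns={"home_team": "team", "home_team_score": "score"})
home["home_or_away"] = "home"
away = train.rename(columns={"away_team": "team", "away_team_score": "score"})
away["home_or_away"] = "away"

cols = ["team", "score", "home_or_away", "winner"]
long_df = pd.concat([home[cols], away[cols]], ignore_index=True)
```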
The main focus of this blog is to see whether we can make reasonable match score predictions. However, I think it’s also interesting to quickly check whether there are any underlying patterns in the data. One of the things I’m interested in is whether there’s a correlation between playing at home, as opposed to away, and winning.
There is a fruitful body of research surrounding home advantage; that is, the benefits the home team is said to gain over the away team. These benefits have been attributed to the psychological effects of supporting fans on the competitors or the referees, differences in time zones or climates, tiredness after travelling, and many other factors. So, I’m wondering whether 20 years’ worth of Six Nations data reflects this.
Pandas’ corr function is used to find pairwise correlations between the columns of a dataframe. Because this function ignores any non-numerical data, those columns must first be mapped to some kind of numerical representation. Pandas has another built-in feature which can handle this for you: when you cast the data type of a column as a category, the non-numerical values are translated into categorical codes. So in this case, if we’re looking at whether there’s a correlation between a team playing at home or away and whether they win or lose, these columns will encode win as 1 and lose as 0, for example.
If we apply the corr function across the entire dataset, we get a correlation score of 12% between playing at home and winning. Correlation coefficients range from -1 to 1; the further the value is from zero, in either direction, the stronger the relationship between the attributes. So in this case, a low score of 12% suggests that there isn’t much of a home advantage in this data.
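The encode-then-correlate step looks like this on a toy results table (column names assumed):

```python
import pandas as pd

# Toy results table; in the real data these columns come from the match rows
df = pd.DataFrame({
    "home_or_away": ["home", "away", "home", "away", "home", "away"],
    "result":       ["win",  "lose", "lose", "win",  "win",  "lose"],
})

# Cast each column to category and take its numeric codes so corr() can use it
# (alphabetical coding gives away=0/home=1 and lose=0/win=1)
encoded = df.apply(lambda col: col.astype("category").cat.codes)
corr = encoded["home_or_away"].corr(encoded["result"])
```

On this toy data the Pearson coefficient comes out at 1/3 — a weak positive association, much like the 12% found on the real dataset.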
Now, I’m not 100% convinced that this is true. When my home country, Wales, are playing at home in Cardiff’s Millennium Stadium, I am sure that when the Welsh crowd are singing Calon Lân at the top of their lungs that it ignites some kind of fire in the Welsh rugby team. I’m even getting goosebumps thinking about it myself! So, let’s check this.
If we focus on Wales’s performance across 20 years’ worth of Six Nations matches, we can plot some results. First, let’s look at their performance in total. Over those 20 years, Wales have played 100 matches in the championship. Out of these 100 matches, Wales have won 52, lost 45, and drawn 3.
If we look at the results from the 49 games Wales have played at home, they have won 28, lost 20, and drawn 1.
Lastly, if we look at the results from the 51 games Wales have played away, they have won 24, lost 25, and drawn 2.
Given these results, it’s safe to say that the corr metric reflects the same outcome: Wales have won only 4 more games at home than away. But this doesn’t rule out home advantage altogether; it only tells us about the correlations within this dataset. There are several other rugby tournaments which we could include in the analysis that might change this outcome.
Setting home advantage to one side, the next thing I wanted to look at was predicting the scores of the remaining postponed matches. There are two main types of supervised learning method: regression is used to predict a continuous value, while classification predicts discrete outputs. For instance, predicting the price of a house in pounds is a regression problem, whereas predicting whether a tumour is malignant or benign is a classification problem. Since the final score of a rugby game can technically be any positive number (or even zero), we’ll look into regression methods.
First, let’s add a label to the training and testing sets so we know which dataset is which. We’ll then combine them and encode the team, stadium and home_or_away columns. This ensures that we have the same columns and encodings in each set after we split them again. It’s important to note here that I don’t include the date the games were played as a variable in the modelling. At this stage, I just want to be able to predict the scores of the remaining games based on the scores of the previous ones. Also, as the championship is generally played at the same time each year, I don’t think it’d make much difference to the modelling. If the data were seasonal, it might be something interesting to look at.
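A sketch of the combine–encode–resplit dance, using one-hot encoding via `get_dummies` (the `dataset` label column is my own naming):

```python
import pandas as pd

# Toy training and testing frames (postponed matches have no score yet)
train = pd.DataFrame({
    "team": ["Wales", "Italy"],
    "stadium": ["Millennium Stadium", "Millennium Stadium"],
    "home_or_away": ["home", "away"],
    "score": [42.0, 0.0],
})
test = pd.DataFrame({
    "team": ["Ireland", "Italy"],
    "stadium": ["Aviva Stadium", "Aviva Stadium"],
    "home_or_away": ["home", "away"],
    "score": [None, None],
})

# Label each set, combine, one-hot encode, then split back again so
# both sets end up with identical columns and encodings
train["dataset"] = "train"
test["dataset"] = "test"
combined = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["team", "stadium", "home_or_away"])

train_enc = encoded[encoded["dataset"] == "train"].drop(columns="dataset")
test_enc = encoded[encoded["dataset"] == "test"].drop(columns="dataset")
```

Encoding before the split is what guarantees, for example, that a `team_Ireland` column exists in the training set even though Ireland’s remaining fixtures sit in the test set.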
Once we’ve encoded the data and split it back so that we have training and testing sets, we’ll need to split our training data once again to create a validation set. That is, we’ll hold out a sample of the training set so that we can evaluate the model’s performance on predicting scores for samples whose answers we already know. To do this, we’ll take 30% of the training set; out of 610 records, that’s 183 samples. We’ll also identify which columns are the attributes and which are the labels. The attributes are the independent variables, whilst the labels are the dependent variables whose values we want to predict. Now that we’ve encoded our data, we have 14 attributes, and we want to predict the score depending upon these. Therefore, our attributes are set as the X variable and the score column is set as the y variable.
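The hold-out split itself is a one-liner with scikit-learn; random arrays stand in for the encoded attributes and scores here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the encoded training set: 610 rows, 14 attribute columns
rng = np.random.default_rng(0)
X = rng.random((610, 14))                   # attributes (independent variables)
y = rng.integers(0, 60, 610).astype(float)  # score label (dependent variable)

# Hold out 30% of the training rows as a validation set (183 of 610)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)
```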
Now to train a model. Although this is a regression problem, it is not a linear one: we don’t expect the scores achieved by each team to increase steadily from one match to the next, so linear methods are ruled out. In this case, I decided to use a RandomForestRegressor due to the algorithm’s ease of use and relative accuracy, as well as it being less prone to overfitting than a single decision tree. The Random Forest algorithm builds several decision trees, injecting randomness through bootstrapped samples and random subsets of the features; these trees are then combined into a forest (hence a random forest of decision trees), and their outputs are aggregated for the final prediction. The algorithm supports both classification and regression, making it very flexible for diverse applications.
Before we build and train our model, we first need to set some hyperparameters. These are often the most challenging values to choose, as there generally isn’t a single perfect setting. A general rule of thumb is to start with the default values and, once a model is trained and tested, tweak them by trial and error until you achieve the best result. For this model, I found that an n_estimators of 50 and a max_depth of 4 provided the best result. More details on these specific settings can be found in the official scikit-learn documentation.
I’ll fit the model on X_train (the attributes) and y_train (the labels) and then predict the scores y_pred of the validation samples X_val that we held out from the training set. I’ll then compare the predictions to the actual scores y_val by measuring the Root Mean Square Error (RMSE).
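The fit–predict–evaluate loop can be sketched as below. Random arrays stand in for the encoded data, and `random_state` is my addition for reproducibility:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Random stand-ins for the encoded attribute matrices and score labels
rng = np.random.default_rng(0)
X_train, y_train = rng.random((427, 14)), rng.integers(0, 60, 427).astype(float)
X_val, y_val = rng.random((183, 14)), rng.integers(0, 60, 183).astype(float)

# Hyperparameters from the text
model = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

# RMSE: square root of the mean squared residual, reported in points
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
```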
The RMSE is the square root of the mean of the squared residuals. It indicates the absolute fit of the model to the data; that is, how close the observed data points are to the model’s predicted values. It also has the useful property of being in the same unit as the variable you’re trying to predict, so in this case, points.
The RMSE of the model came out at 10.1 points… good? Bad? Definitely not perfect. As most teams score fewer than 40 points per game, an error of 10 points either way isn’t particularly accurate. As demonstrated here, the model does its fair share of overpredicting scores across the full range of outcomes. It seems to perform better when predicting lower-scoring games, but not significantly so. Perhaps if we took the high-scoring matches out of the picture, we might see a reduction in the model’s error. But we want to capture those possibilities and wouldn’t consider them outliers. This is something to investigate further.
Let’s see what scores the model predicts for the postponed matches. We can apply the model on the
X_test and append the predictions as a column to the testing dataframe so it’s easier to read.
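That final step is a sketch like the following; a model trained on random stand-in data plays the part of the real one, and the eight rows (one per team across the four postponed fixtures) use random stand-ins for the encoded attributes:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# A model trained on random stand-in data, for illustration only
rng = np.random.default_rng(0)
model = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=42)
model.fit(rng.random((40, 14)), rng.integers(0, 60, 40).astype(float))

# Eight rows for the four postponed matches (one row per team)
X_test = rng.random((8, 14))
test_df = pd.DataFrame({"team": ["Ireland", "Italy", "Wales", "Scotland",
                                 "Italy", "England", "France", "Ireland"]})

# Append the rounded predictions so the output is easier to read
test_df["predicted_score"] = model.predict(X_test).round()
```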
Ok. The model hasn’t given unrealistic predictions! England are predicted to win 36 to 13 against Italy. As much as it pains me to say this, that seems like a respectable prediction considering England were in the lead to win the whole tournament. Ireland look like they’re predicted to lose 19 to 25 against France. Again, this doesn’t seem unrealistic: after beating England and Italy by around 10 points in each match, France were joint top of the leaderboard along with England. Wales are predicted to win 25 to 16 against Scotland (yay! 🏴). After only winning against Italy, this means we would’ve avoided the wooden spoon! Lastly, Ireland are predicted to win 25 to 11 against Italy.
So, what have I learnt from this analysis?
Nonlinear regression problems are not always easy. Although the RMSE of the model isn’t as low as I’d like it to be, I suspect this is because of the variability in the data: sometimes Wales go out on the pitch full steam ahead and win by over 40 points, and other times, they really don’t! This is something I’d like to investigate further. I’d also like to include other variables in the prediction, like weather conditions, who the referee was (because they are always to blame for Wales’ losses…), and more!
However, the scores predicted by the model weren’t that unrealistic. Based on the tournament’s results in 2020, the predicted outputs followed a similar trend. Overall, if no bonus points were awarded, it looks like France and England would be possible joint winners of the 2020 Rugby Six Nations.
For the full notebook, check out my GitHub repo below: https://github.com/LowriWilliams/Rugby_Six_Nations_2020
If you enjoyed following this post, don’t forget to like and share.