
Are series getting worse over time?

Now, it is time to make your own visualizations!

First, we present the third-party libraries used in this article. Since these libraries are not part of the Python Standard Library, we need to install them separately. This can easily be done with pip, Python's package installer. Verify that all libraries are installed before running the code on your computer.
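As a quick sanity check before running the code in this article, the sketch below reports which libraries cannot be found in the current environment. The helper missing_libraries is a hypothetical name introduced here, and the list assumes the usual import names (requests, bs4, pandas, seaborn).

```python
from importlib.util import find_spec

# Import names of the third-party libraries used in this article.
# Beautiful Soup is installed as "beautifulsoup4" but imported as "bs4".
REQUIRED = ["requests", "bs4", "pandas", "seaborn"]


def missing_libraries(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_libraries(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All libraries are installed.")
```

Anything reported as missing can then be installed with pip, for example: pip install requests beautifulsoup4 pandas seaborn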

Then, we provide the code used to scrape the data from the IMDb web page and to generate the heatmaps, so you can easily make your own visualizations!

Requests

Requests is a third-party library that allows you to send HTTP requests using Python. In this article, we will only use the requests library to get the HTML code of a web page.

Beautiful Soup

When performing data analysis, we are not always able to download the data in csv format, or to access it via an Application Programming Interface (API). In some cases, we have to obtain the data directly from the web page. That is when Beautiful Soup comes in handy.

Beautiful Soup is a Python library for extracting data from HTML and XML documents, and has many useful functions to scrape information from web pages (e.g. to extract hyperlinks, text from tags, or images).
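To make this concrete, here is a minimal sketch that extracts the text of a tag and the hyperlinks from a small HTML document. The HTML string is hand-written for illustration and is not taken from IMDb.

```python
import bs4

# A tiny hand-written HTML document, used here instead of a real web page.
html = """
<html><body>
  <h1>Episode Guide</h1>
  <a href="/season/1">Season 1</a>
  <a href="/season/2">Season 2</a>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Text inside the first <h1> tag.
heading = soup.find("h1").text

# The href attribute of every <a> tag.
links = [a.get("href") for a in soup.find_all("a")]

print(heading)
print(links)
```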

Pandas

Pandas is a Python open source library for data science that allows us to easily work with structured data, such as csv files, SQL tables, or Excel spreadsheets. It provides tools for reading and writing data in different formats, for carrying out exploratory analysis, and for data cleaning.

Seaborn

Seaborn is a Python data visualization library based on Matplotlib. In comparison to Matplotlib, Seaborn allows you to make plots using fewer lines of code, and provides more sophisticated visualization tools such as heatmaps, box plots, and violin plots. In addition, the visualizations look much better! ❤️

To scrape the data from IMDb, we create a function (episodes_rates) that accepts as input the web page of a series (its Episode Guide) and returns a data frame where the rows represent the seasons and the columns the episodes, as follows:

episodes_rates function
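The following is a minimal sketch of such a function, assembled from the steps explained in the rest of this section rather than copied from the original listing. It assumes the page uses the class names and the id described in this article, and that the first season is reached with ?season=1. The helper parse_season, the url_season default, and the timeout value are assumptions introduced here.

```python
import bs4
import pandas as pd
import requests


def parse_season(html):
    """Return (scores of one season, href of the next season or None)."""
    soup = bs4.BeautifulSoup(html, "html.parser")

    # One dictionary per season: {"Episode_1": 8.4, "Episode_2": 9.0, ...}
    rates_season = {}
    division_tags = soup.find_all(class_="ipl-rating-star small")
    for index, tag in enumerate(division_tags):
        rate = tag.find(class_="ipl-rating-star__rating").text
        rates_season["Episode_" + str(index + 1)] = float(rate)

    # The anchor tag that points to the next season, if there is one.
    next_season = soup.find("a", id="load_next_episodes")
    next_url = None if next_season is None else next_season.get("href")
    return rates_season, next_url


def episodes_rates(url_serie, url_season="?season=1"):
    """Scrape every season of a series into a seasons-by-episodes data frame."""
    rates_all = []
    while url_season:
        # Download the page of the current season (assumed timeout: 10 s).
        response = requests.get(url_serie + url_season, timeout=10)
        response.raise_for_status()
        rates_season, url_season = parse_season(response.text)
        rates_all.append(rates_season)

    index = ["Season_" + str(i + 1) for i in range(len(rates_all))]
    return pd.DataFrame(rates_all, index=index)
```

Keeping the parsing in its own function means it can be tried on a saved HTML string without sending any request.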

Let’s understand the code step by step 👌

1. We use the requests module to obtain the HTML code of the web page as a string. First, we send a GET request to the specified URL using the .get(url) method. Then, we get the HTML text of the page using the .text attribute.

# obtain the html code as a string 
response = requests.get(url_serie + url_season)
html = response.text

As we can observe, the URL consists of two parts: url_serie and url_season. url_serie is the input of the function (episodes_rates) and remains unchanged, while url_season (e.g. ?season=1, ?season=2, …) changes as we move from one season to the next.

2. We pass the string html into the BeautifulSoup constructor, obtaining a BeautifulSoup type object. Now, we can easily extract information from this object using the methods available in Beautiful Soup.

# create a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

3. Next, we explore the HTML code with Chrome Dev Tools. Chrome Dev Tools is a set of web developer tools built directly into Google Chrome, allowing you to easily inspect the HTML code of a page. We can open Chrome Dev Tools by pressing Ctrl + Shift + I or by selecting More Tools > Developer Tools.

In the elements tab, we can inspect and edit the HTML and CSS of a page. To display the HTML code for an element (e.g. episode rate), you can click the inspect element button (square with the arrow) and select the item in the browser as follows:

As we can observe, the element of interest (episode rate) is inside the span tag with a class attribute (“ipl-rating-star__rating”). This span tag is inside a division tag with a class attribute (“ipl-rating-star small”).

We use the .find_all() method to obtain all tags with the class attribute (“ipl-rating-star small”). This method returns a bs4.element.ResultSet object. Then, we loop through this object (the division tags), accessing the span tag that contains the score (“ipl-rating-star__rating”) using the .find() method. To obtain the text between the tags, we use the .text attribute as follows.

rates_season = {}
# we obtain all division tags with the class attribute "ipl-rating-star small"
division_tags = soup.find_all(class_="ipl-rating-star small")

# we loop through the tags and extract the scores
# we create a dictionary with the scores
for index, tag in enumerate(division_tags):
    rate = tag.find(class_="ipl-rating-star__rating").text
    episode = 'Episode_' + str(index + 1)
    # we insert the score in the dictionary
    rates_season[episode] = float(rate)

# we append the dictionary to a list
rates_all.append(rates_season)

As shown above, we insert the scores in a dictionary, where the keys are the episode numbers and the values are the scores of the episodes. Finally, we append this dictionary to a list.
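One detail worth checking is how the class_ argument behaves when its value contains two CSS classes. The sketch below runs .find_all() on a small hand-written HTML fragment and compares three searches. The third div and its "large" class are made up for illustration and are not claimed to exist on IMDb.

```python
import bs4

# Hand-written sample markup; only the first two divs mimic the article's tags.
html = """
<div class="ipl-rating-star small"><span class="ipl-rating-star__rating">8.4</span></div>
<div class="ipl-rating-star small"><span class="ipl-rating-star__rating">9.1</span></div>
<div class="ipl-rating-star large"><span class="ipl-rating-star__rating">7.0</span></div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Exact value of the class attribute: only the two "small" divs.
exact = soup.find_all(class_="ipl-rating-star small")

# A single class name: every tag that carries that class.
single = soup.find_all(class_="ipl-rating-star")

# Same two classes in a different order: no match.
reordered = soup.find_all(class_="small ipl-rating-star")

scores = [float(tag.find(class_="ipl-rating-star__rating").text) for tag in exact]
print(len(exact), len(single), len(reordered), scores)
```

In Beautiful Soup 4, a class_ string containing a space is compared against the exact value of the class attribute, so the order of the classes matters, whereas a single class name matches any tag that carries it.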

4. Once we have collected all scores of a season, we have to access the url of the next season to continue collecting scores. First, we check whether another season is available (anchor tag — id=”load_next_episodes”). If so, we access the url contained in the href attribute using the .get() method.

# get the next-season anchor tag
next_season = soup.find("a", id="load_next_episodes")
# if next_season is None, break the loop
if not next_season:
    break
# if next_season is not None, we access the url
url_season = next_season.get('href')

As shown above, if there is no anchor tag with id=”load_next_episodes”, meaning that no further season is available, the while loop is terminated.
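The check can be tried in isolation on two hand-written fragments, one with the next-season link and one without. The function name next_season_url and the id used in the second fragment are hypothetical and serve only this illustration.

```python
import bs4


def next_season_url(html):
    """Return the href of the next-season link, or None if there is none."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    next_season = soup.find("a", id="load_next_episodes")
    if next_season is None:
        return None
    return next_season.get("href")


# Sample fragments written by hand for this example.
with_link = '<a id="load_next_episodes" href="?season=2">Season 2</a>'
without_link = '<a id="some_other_link" href="?season=1">Season 1</a>'

print(next_season_url(with_link))
print(next_season_url(without_link))
```

Since .find() returns None when nothing matches, the None check doubles as the stopping condition of the loop.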

5. Finally, we obtain a list of dictionaries, where each dictionary contains the scores of one season. We use this list of dictionaries to create a pandas data frame as follows:

df = pd.DataFrame(rates_all, index=['Season_' + str(i + 1) for i in range(num_season)])
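To see what this conversion produces, here is a small sketch with invented scores for two seasons of different lengths. Taking num_season from the length of the list is an assumption made for this example.

```python
import pandas as pd

# Invented scores: the second season has one episode fewer than the first.
rates_all = [
    {"Episode_1": 8.1, "Episode_2": 8.4, "Episode_3": 8.9},
    {"Episode_1": 8.3, "Episode_2": 8.7},
]

# Assumption: the number of seasons equals the number of dictionaries collected.
num_season = len(rates_all)
index = ["Season_" + str(i + 1) for i in range(num_season)]

df = pd.DataFrame(rates_all, index=index)
print(df)
```

Each dictionary becomes one row, the keys become the columns, and episodes that a season does not have are filled with NaN.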

And voilà! We obtain a data frame from nothing more than the URL of a web page 💪 Next, we use this data frame to create a heatmap using Seaborn.

A heatmap is a data visualization technique where the value of each data point is indicated using colors (variation of hue or intensity). We can easily create a heatmap with Seaborn using the seaborn.heatmap() function.

As shown below, we create a function webpage_to_heatmap that accepts as input the web page of the series (its Episode Guide), the colormap of the image, and the title, and returns a heatmap visualization. As you can observe, we use the data frame returned by the function episodes_rates as input to the seaborn.heatmap() function.

webpage_to_heatmap
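A minimal sketch of the plotting step is given below. To keep it runnable without a network connection, it starts from a data frame instead of a URL, so the function is named ratings_to_heatmap here (a hypothetical name). The figure size, the annotation format, and the sample scores are also assumptions, not the settings of the original webpage_to_heatmap.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def ratings_to_heatmap(df, cmap="viridis", title=""):
    """Draw a seasons-by-episodes heatmap of scores and return the Axes."""
    fig, ax = plt.subplots(figsize=(12, 6))
    # annot=True writes each score in its cell; missing episodes stay blank.
    sns.heatmap(df, cmap=cmap, annot=True, fmt=".1f", linewidths=0.5, ax=ax)
    ax.set_title(title)
    return ax


# Invented scores standing in for the data frame returned by episodes_rates.
df = pd.DataFrame(
    [
        {"Episode_1": 8.1, "Episode_2": 8.4, "Episode_3": 8.9},
        {"Episode_1": 8.3, "Episode_2": 8.7},
    ],
    index=["Season_1", "Season_2"],
)

ax = ratings_to_heatmap(df, cmap="YlGnBu", title="Sample ratings")
```

Calling plt.show() afterwards displays the figure. A URL-based wrapper would only need to build the data frame with episodes_rates first and pass it to this function.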

Now, let’s pick a series and visualize its scores!

First, we go to the episode guide of a series and copy the URL up to and including the word “episodes”, that is, everything before the ?season=… part, as follows:

Then, we pick a colormap and a title for the visualization. Finally, we create the heatmap with the webpage_to_heatmap function.

As you can see, the ratings of Mad Men improve over time! In addition, the last episodes of each season are rated higher than the first ones.

Now, it is time to make your own visualizations!

Amanda 💜