2 Easy Ways to Get Tables From a Website with Pandas

Now, let’s try getting this table of summary statistics for Tesla stock:

Yahoo Finance summary table for Tesla stock

We’ll try the same code as before:

pd.read_html('https://finance.yahoo.com/quote/TSLA?p=TSLA')
Raw output of read_html #2

It looks like we got all the data we need, but there are two elements in the list now. This is because the table we see in the screenshot above is separated into two different tables in the HTML source code. We could do the same index trick as before, but if you want to combine both tables into one, all you need to do is concatenate the two list elements like this:

separate = pd.read_html('https://finance.yahoo.com/quote/TSLA?p=TSLA')
pd.concat([separate[0], separate[1]])
Output of pd.concat of two list elements from read_html

There’s plenty more you could do to process this data for analysis; just renaming the column headers would be a great start. But getting this far took about 12 seconds, which is great if you just need test data from a static site.
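As a quick sketch of that cleanup, here’s one way you might do it. Note that ignore_index=True gives the combined table a fresh index, and the column names here are my own placeholders, not anything the site provides:

import pandas as pd

separate = pd.read_html('https://finance.yahoo.com/quote/TSLA?p=TSLA')

# ignore_index=True replaces the two tables' overlapping indexes with 0..n-1
summary = pd.concat([separate[0], separate[1]], ignore_index=True)

# Each scraped table has two unnamed columns, so label them ourselves
# ('metric' and 'value' are placeholder names)
summary.columns = ['metric', 'value']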

Here’s a table with S&P 500 company information we can try to get:

S&P500 information from datahub.io

The data is distributed under an ODC license, which means you’re free to share, create works from, and adapt the data on the site. I was initially going to use this site for my read_html example, but after running the function for the third time, I was greeted with an error.

pd.read_html('https://datahub.io/core/s-and-p-500-companies')
HTTP 403 error from trying to read_html datahub.io

An HTTP 403 error occurs when the server understands your request but refuses to authorize it; in other words, you can reach the site, but you aren’t allowed to view that resource.

In this case, you can access the site from your browser, but the site won’t let a script access it. Many sites publish their rules about scraping in a “robots.txt” file, which you can find by appending “/robots.txt” to the root of the site’s URL. For example, Facebook’s would be “https://facebook.com/robots.txt”.
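If you want to check those rules programmatically, Python’s standard library ships a parser for them. Here’s a minimal sketch (whether a given path is allowed depends on the robots.txt the site serves at the time you check):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://datahub.io/robots.txt')
rp.read()  # fetch and parse the site's robots.txt

# True if the rules allow a generic crawler ('*') to fetch this page
print(rp.can_fetch('*', 'https://datahub.io/core/s-and-p-500-companies'))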

To avoid an error like this, you might be tempted to copy the data into an Excel sheet, then load that file with the pd.read_excel function.

Instead, pandas offers a function that reads data directly from your clipboard! The read_clipboard function has this description:

Read text from clipboard and pass to read_csv

If you’ve used pandas before, you’ve probably used pd.read_csv to load a local file for analysis. The read_clipboard function simply takes the text you’ve copied and treats it as if it were a csv, returning a DataFrame based on that text.
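Because the text is handed off to read_csv, any keyword arguments you’d pass there work here as well. For instance, here’s a sketch (the thousands separator is an assumption about how the copied numbers are formatted):

import pandas as pd

# Keyword arguments are forwarded to read_csv, so you can, for example,
# strip thousands separators from numeric columns while parsing
df = pd.read_clipboard(thousands=',')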

To get the S&P 500 table from datahub.io, select and copy the table from your browser, then enter the code below.

pd.read_clipboard()
Output of pd.read_clipboard

Perfect! We’ve got a ready-to-use DataFrame, exactly as it appeared on the website!
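If you want to keep the table around instead of re-copying it each session, you can write the DataFrame to disk (the filename is just an example):

# Persist the clipboard table for later analysis ('sp500.csv' is an example name)
df = pd.read_clipboard()
df.to_csv('sp500.csv', index=False)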