Automating data collection with Python on GCP

The first step is to gather the list of Ticker Symbols (the stock identifier for each security) for the relevant list of companies; in this case, I am using the SP500 Index components, whose list is readily available online at the relevant Wikipedia page.

In many cases, the list of ticker symbols may already be available in csv/excel format or automatically downloaded from other sources; if that applies to you, please skip ahead to step 2.

From the table, you can identify the Symbol corresponding to each security, which you will use later in order to fetch stock price data from the API. This is a useful table to keep as a reference in the target database, as you can then periodically re-run the script and obtain the latest view of the SP500 members. Given the market capitalization of these companies, changes are infrequent, but it is good practice to update this periodically to account for firms entering/exiting the Index. Once a quarter could be considered a good update frequency for this kind of use case.

In the below steps the script will be executed monthly, but the target frequency should be adapted to fit the task you are looking to accomplish.

To gather the table of member firms, you will need to write a Python script. The requests and Beautiful Soup Python packages are good options for scraping the Wikipedia Page and extracting the table element from its HTML code.

Let’s now go through the relevant Python code sections to put in place within the script in order to achieve this.

  1. Import relevant Packages: As a first step, you will need to import the relevant Python packages and define the url variable, which points to the online data source. You should also import the pandas package (“import pandas as pd”) as you will need it later on.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url="https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

2. Scrape data: You then need to fetch the relevant url content with the requests.get method. Then, you can extract its text with the .text attribute, and finally parse it into a Beautiful Soup object.

r = requests.get(url,timeout=2.5)
r_html = r.text
soup = BeautifulSoup(r_html, 'html.parser')
components_table = soup.find_all(id="constituents")
headers_html = soup.find_all("th")
df_columns=[item.text.rstrip("\n") for item in headers_html]

In order to locate the table within the HTML, you can inspect the code of the page (press Ctrl + Shift + I in Google Chrome) and see that the table element you are interested in has an ID of constituents. You can use .find_all to locate all elements in a page with a given ID, and store the matching table contents into the components_table variable. On this specific page, there are two tables with that ID, both referenced by the components_table object. Given this, you will only work with the first one, which you specifically select in step 3.

You can also use the same method (soup.find_all) to identify the table headers, which you will use as column names in your dataset, and store them into a list variable named df_columns.
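As a quick sanity check (purely illustrative, and assuming the page structure described above has not changed), you can print how many elements matched that ID and preview the extracted headers:

print(len(components_table))  # the article notes two tables match on this page
print(df_columns[:9])  # header names belonging to the first (constituents) table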

3. Store data into a Pandas Dataframe: next, you need to isolate the first table by indexing the components_table object (which resides at index 0), and then find all the rows within the body of the table, which are not headers; go ahead and store the results into data_rows.

Then you can iterate over the rows, split the text content of every row into a list, filter out null values, and store the result into a list named stock, which you then append to a list of lists named rows.

You can then turn rows into a pandas DataFrame object, passing components_headers as column names. components_headers is nothing but a sublist of df_columns, which identifies the columns pertaining to the first of the two tables with the ID of constituents.

components_headers=df_columns[:9]
data_rows=components_table[0].find("tbody").find_all("tr")[1:]
rows=[]
for row in range(len(data_rows)):
    stock=list(filter(None,data_rows[row].text.split("\n")))
    rows.append(stock)

S_P_500_stocks=pd.DataFrame(rows,columns=components_headers)
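At this point, it can help to take a quick look at the resulting DataFrame before saving it (purely illustrative):

print(S_P_500_stocks.head())  # preview the first few member firms
print(S_P_500_stocks.shape)   # number of rows (constituents) and columns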

4. Save dataset into csv file: as shown in the code below, you will then remove the redundant column named SEC filings, and use the .to_csv method to save the dataset locally. You can now use this table to fetch stock price data from the API.

S_P_500_stocks.drop("SEC filings",inplace=True,axis=1)
S_P_500_stocks.to_csv(r"/home/edo_romani1/my-python-directory/SP500stocks.csv",index=False)

Tiingo is a financial markets API which provides a variety of information relating to publicly-listed companies over several markets; in this example, we are interested in collecting the daily historical stock price information (Date, Open, Close, High, Low, Volume) and will therefore refer to the related Tiingo API documentation.

Let’s break the code down step by step and see how we can gather 20+ years of daily stock price information for each of the SP500 member firms.

1. Import relevant Packages: As before, proceed to import the necessary packages. This time, you’ll additionally need to import the datetime module, which you will use later on to define the time window over which to extract data from the API.

import requests
import pandas as pd
import datetime
from datetime import datetime,timedelta

2. Define API Token: to interact with the API, you need to get an API key, or token, in order to identify yourself as a registered Tiingo API user. With Tiingo, you can easily set up an account and get your token. Once done, all you need to do is declare it in the code and set it as a variable, as in the illustrative example below.

token1="123RUXBJKSJKFJD23Y348"
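If you prefer not to hard-code the token in the script, a minimal alternative sketch is to read it from an environment variable at runtime (the TIINGO_API_TOKEN name below is an illustrative assumption, not something Tiingo requires):

import os
# Assumes a TIINGO_API_TOKEN environment variable was exported beforehand (illustrative name);
# falls back to an empty string if it is missing
token1=os.environ.get("TIINGO_API_TOKEN","")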

3. Import the SP500 members table you saved as a csv file in Step 1; this can easily be done with the .read_csv method in pandas.

SP=pd.read_csv(r"/home/edo_romani1/my-python-repository/SP500stocks.csv")

4. Call the Tiingo API and get historical stock price data

You first have to set up a list of ticker symbols to iterate over, so that you can make an API call for each of them; store them in a variable named ticker_symbols, which you then extend with the SP500 Index ticker (“SPY”) on the following line.

You can also set up the latest day (end_date variable) for which you want to call data from the API: for each stock you will be requesting its price history, so a time window must be set. Set the end_date to yesterday; this is calculated below with the datetime and timedelta modules (datetime.now().date() gets you today’s date, from which you subtract 1 day using timedelta to get yesterday).

Note: Tiingo has hourly and daily API call limits, which were handled by using the rate_limiter package as a context manager. This is important to know if you are making many calls per hour/day.
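The setup of that rate limiter is not shown in this article; a minimal, illustrative sketch of a comparable throttle, compatible with the commented-out “with rate_limiter:” line further below and using an assumed fixed delay rather than Tiingo’s documented limits, could look like this:

import time

class RateLimiter:
    # Minimal illustrative throttle: sleep for a fixed delay before each request
    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds  # assumed value, tune to your API limits
    def __enter__(self):
        time.sleep(self.delay_seconds)
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        return False  # do not suppress exceptions raised inside the with-block

rate_limiter = RateLimiter(delay_seconds=1.0)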

Lastly, you need to initialize an empty list data1, which you will use to store the historical dataset for each stock price / API call.

ticker_symbols=list(SP["Symbol"])
ticker_symbols.append("SPY")
end_date=str(datetime.now().date()-timedelta(days=1))

The next lines of code are the ones where you will actually call the Tiingo API to get data for each ticker symbol of interest. First, you need to set the symbol, end_date and token parameters in the Tiingo url that you submit to the requests.get method; you then store the response into a variable r, which corresponds to the csv file containing the stock price history for the given ticker. The following few lines convert those values into a list and then a dataframe df, which you append to the data1 collection before moving on to the next ticker_symbol in iterative fashion.

Note: the try clause is used to handle timeout errors raised by the requests package; the corresponding except clause repeats the same code with an extended timeout to handle slower responses. That code is identical to the block below except for the value of the timeout parameter (which is doubled), and is therefore left out for simplicity.

data1=[]
for symbol in ticker_symbols:
    #with rate_limiter:
    try:
        url = "https://api.tiingo.com/tiingo/daily/{}/prices?startDate=2000-01-03&endDate={}&format=csv&token={}".format(symbol,end_date,token1)
        r = requests.get(url,timeout=10)
        rows=r.text.split("\n")[1:-1]
        df=pd.DataFrame(rows)
        df=df.iloc[:,0].str.split(",",n=13,expand=True)
        df["Symbol"]=symbol
        data1.append(df)
    except requests.exceptions.Timeout:
        # Same code with a doubled timeout goes here; omitted for simplicity, as noted above
        pass

5. Save dataset into csv file: As a final step, you can use the .concat method to concatenate the list of dataframes into a single final dataset, before finally saving it into a csv file like before.

df_final=pd.concat(data1)
df_final.drop(df_final.iloc[:,6:13],axis=1,inplace=True)
df_final.to_csv(r"/home/edo_romani1/my-python-repository/historicalSP_quotes.csv",index=False)

Below is the csv view of the final output, where you can observe the initial date and the corresponding price data for the first firm in the sample.

Output csv file containing stock price history for SP500 members; source: Author

Now that you have two fully functioning Python scripts which get stock data from the Tiingo API, let’s see how you can automate their execution with the Google Cloud Platform (GCP), so that every day the market is open you can gather the latest quotes from the prior day. I will show how you can make use of Google’s Compute Engine, Cloud Functions, Cloud Storage and Google BigQuery to set up a simple data ingestion workflow based on these two scripts running in the cloud, thus moving away from your local machine.