5 Top Tips for Data Scraping Using Selenium

5 Selenium Best Practice Tips

Tip 1: Place the webdriver executable in PATH

To begin our web scraping task, we must first navigate to the following page, ‘’. This step can be achieved in as few as three lines of code: first we import webdriver from selenium, then we create an instance of the Chrome webdriver, and finally we call the get method on the webdriver object, named driver.

To keep this code short and readable, the chromedriver executable can be placed in a folder of your choice. This folder can then be added to PATH under your environment variables. The webdriver is then ready to go by simply calling webdriver.Chrome() with no arguments passed in the parentheses.
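A minimal sketch of the start-up steps described above, with an optional check that chromedriver can be found on PATH. The URL in the usage comment is a placeholder, not the page used in this article, and the helper names chromedriver_on_path and open_page are hypothetical.

```python
import shutil


def chromedriver_on_path(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


def open_page(url):
    """Start Chrome with no arguments and navigate to url."""
    # Imported inside the function so the PATH check above can be used
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver

    driver = webdriver.Chrome()  # no executable path needed when chromedriver is on PATH
    driver.get(url)
    return driver


# Usage (placeholder URL, an assumption for illustration):
# if chromedriver_on_path() is None:
#     print("chromedriver was not found on PATH")
# else:
#     driver = open_page("https://example.com")
```

Recent Selenium releases can also locate a driver on their own, so the PATH step may be unnecessary there; the check is mainly useful for older set-ups like the one described here.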

Tip 2: Find any webelement using the console

Once we have navigated to the web page, we would like to find the search box, click on it, and type the search query ‘coronavirus global updates’.

To find this webelement, we can right-click on the page in Chrome and select Inspect. Using the element picker in the top left corner of the panel that opens, we can hover over and select web elements of interest. As shown, the search box has an input tag with an id value of ‘orb-search-q’.

How can we make sure this is the only search element we are interested in?

We can select the Console tab and type two dollar signs, followed by parentheses and quotation marks. Inside the quotes, we write the tag (input) followed by square brackets. Inside those square brackets, we add the id attribute and its value.

Format to find CSS selectors

$$('tag[attribute="attribute value"]')

As shown, an array of only one element is returned. We can be confident we now have the right search box to click on and start typing our search queries.

The content of the quotes within the console is a valid CSS selector, and we can use it in our script to find the webelement.
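The same selector format can be assembled in Python before it is handed to the webdriver. The helper css_selector is a hypothetical name introduced for illustration; it does no escaping, so it assumes the attribute value contains no double quotes.

```python
def css_selector(tag, attribute, value):
    """Build a selector of the form tag[attribute="value"]."""
    return f'{tag}[{attribute}="{value}"]'


# The search box from this tip: an input tag with id "orb-search-q"
search_box_selector = css_selector("input", "id", "orb-search-q")
print(search_box_selector)  # input[id="orb-search-q"]
```

Pasting the printed string between the quotes of $$('...') in the console is a quick way to confirm it matches exactly one element before using it in a script.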

This leads to the next tip.

Tip 3: Powerful data scraping One-liners: ActionChains and Keys

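We can now call the find_element_by_css_selector method on the webdriver object driver, passing it the CSS selector from Tip 2. (Recent Selenium 4 releases no longer provide the find_element_by_* methods; there the equivalent call is driver.find_element(By.CSS_SELECTOR, ...).)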

We want our webdriver to move to this webelement, click on it, type our search query ‘Global coronavirus updates’ and press enter.

This can easily be accomplished using the ActionChains and Keys classes from selenium. We pass driver to ActionChains and chain the methods move_to_element, click, send_keys (to type the query), and key_down with Keys.ENTER passed (to imitate pressing Enter). To run the chain, add the perform method at the end of the ActionChains call.
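The chain could be sketched roughly as follows. The function name search_for is hypothetical, and the element lookup uses the Selenium 4 spelling find_element(By.CSS_SELECTOR, ...). One deliberate swap: Enter is sent with send_keys rather than key_down, because the Selenium documentation describes key_down as intended for modifier keys such as Control, Alt, and Shift.

```python
def search_for(driver, selector, query):
    """Move to the element matched by selector, click it, type query, press Enter."""
    # Imported inside the function so it can be defined without Selenium installed.
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    # Selenium 4 spelling of find_element_by_css_selector
    search_box = driver.find_element(By.CSS_SELECTOR, selector)

    (
        ActionChains(driver)
        .move_to_element(search_box)  # move the mouse over the search box
        .click()                      # click at the current mouse position
        .send_keys(query)             # type the search query
        .send_keys(Keys.ENTER)        # press Enter (send_keys instead of key_down)
        .perform()                    # nothing runs until perform() is called
    )


# Usage, assuming `driver` is already on the page with the search box:
# search_for(driver, 'input[id="orb-search-q"]', "Global coronavirus updates")
```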

Running the ActionChain takes us here:

Tip 4: Capturing the data

The webelement shown returns an image, headline, sub-headline, and some accessory information such as the date published.

How can we capture just the headline from each story?

If we type the selector for the webelement shown below into the console, it returns a list of 10 web elements. We want to extract just the headline from each of these stories.

To do this, we can simply iterate over the 10 stories. We call find_elements_by_css_selector with the selector for the web element passed. This method returns a list that we can iterate over.

We can assign this to the variable name top_titles, and iterate over them using a for loop. In the for loop we can find the element associated with each headline using Tip number 2, and extract the text by calling .text on the webelement.

In addition to printing to the terminal console, we can also write to a .txt file so we have a permanent copy of the headlines whenever we run the script.
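A sketch of the loop and the file output, under stated assumptions: the two selectors are parameters because the ones used in this article are not reproduced in the text, the function names and the default file name headlines.txt are hypothetical, and the lookups use the Selenium 4 spelling find_elements(By.CSS_SELECTOR, ...).

```python
from pathlib import Path


def headline_texts(elements):
    """Return the .text of each webelement as a list of strings."""
    return [element.text for element in elements]


def collect_headlines(driver, story_selector, headline_selector):
    """Find every story, then the headline element inside each one."""
    # Imported inside the function so the helpers here work without Selenium.
    from selenium.webdriver.common.by import By

    # Selenium 4 spelling of find_elements_by_css_selector
    top_titles = driver.find_elements(By.CSS_SELECTOR, story_selector)
    headline_elements = [
        story.find_element(By.CSS_SELECTOR, headline_selector)
        for story in top_titles
    ]
    return headline_texts(headline_elements)


def save_headlines(headlines, path="headlines.txt"):
    """Print each headline and write one per line to a text file."""
    for headline in headlines:
        print(headline)
    Path(path).write_text("\n".join(headlines) + "\n", encoding="utf-8")


# Usage, assuming `driver` is on the results page and both selectors are known:
# save_headlines(collect_headlines(driver, story_selector, headline_selector))
```

Searching from each story element, rather than from driver, keeps every headline tied to the story it belongs to.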

Tip 5: Headless webdriver

When we run the script to extract the headlines, a browser window will pop up and run as shown in the video below.

Whilst this can be interesting to watch, most of the time it is not what we want.

To hide the browser window, import the Options class from selenium.webdriver.chrome.options, create an instance of this class, and call the add_argument method on this instance with the string argument ‘--headless’ passed. Finally, when creating the webdriver, pass this instance to the options parameter.
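Put together, the headless set-up might look like the sketch below. The names HEADLESS_ARGUMENT and make_headless_driver are hypothetical; the Selenium calls are Options, add_argument, and the options parameter of webdriver.Chrome.

```python
HEADLESS_ARGUMENT = "--headless"  # two plain hyphens, no spaces


def make_headless_driver():
    """Create a Chrome webdriver that runs without opening a browser window."""
    # Imported inside the function so it can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument(HEADLESS_ARGUMENT)
    return webdriver.Chrome(options=options)


# Usage (placeholder URL, an assumption for illustration):
# driver = make_headless_driver()
# driver.get("https://example.com")
# driver.quit()  # close the hidden browser when finished
```

Because no window appears, calling quit at the end matters more than usual: a forgotten headless browser keeps running in the background.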

Bonus Tip: Add waits to find elements in cases of slow connection

WebDriverWait, By, and expected_conditions

To make sure web scraping is successful, we can introduce a wait into our script. This feature is particularly useful when web pages load slowly. To do this, we import WebDriverWait, By, and expected_conditions.

A nice feature of waits is that they read almost like a sentence. Furthermore, they keep looking for the webelement for as long as we choose, and if the web element is found sooner, the script simply continues sooner.

Here, we pass the WebDriverWait class our driver object and tell it to wait a maximum of 10 seconds until the element is located. In the until method, we use the expected_conditions module, imported with the alias EC, and call its presence_of_element_located function. We pass this function a locator tuple containing the strategy used to search for the element (By.CSS_SELECTOR) and the selector for the webelement.
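As a sketch, the wait described here could be wrapped in a small function. The name wait_for_element and its timeout parameter are hypothetical; WebDriverWait, expected_conditions, and By are the Selenium names discussed in this section.

```python
def wait_for_element(driver, selector, timeout=10):
    """Wait up to timeout seconds for an element matching selector to be present."""
    # Imported inside the function so it can be defined without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Reads almost like a sentence:
    # wait (at most `timeout` seconds) until the element is located.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )


# Usage, assuming `driver` has already loaded a page:
# search_box = wait_for_element(driver, 'input[id="orb-search-q"]')
```

The call returns the webelement as soon as it is present, so it can replace a plain find_element call wherever a page may still be loading.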

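A wait will make your scripts more robust and less susceptible to errors caused by elements that have not finished loading. If the element still has not appeared when the time limit is reached, the wait raises a TimeoutException.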

The script for these examples is shown all together here.