6 Tricks to Overcome Them

Photo by Sebastian Herrmann on Unsplash

When Selenium and BeautifulSoup are just not enough.

Aw Khai Sheng

It was truly a dream come true when I first discovered some of Python’s web-scraping libraries. Think of all the things one could do! The possibilities were endless. Needless to say, my hopes were dashed when the pages I wanted to scrape were straight out of hell.

After hours of scouring StackOverflow, I found six simple ways to overcome the challenges I faced when I began automating my web processes:

The Problem: Sometimes the buttons you want to locate may be hidden, possibly due to annoying pop-ups. And it gets even more annoying when you can never be sure when these pop-ups are going to appear. Other times, your problems may stem from a slow internet connection…

The Solution: The road to smooth web-scraping is paved with exceptions — and learning how to deal with them! It’s always handy to know that you have a backup plan whenever any of these exceptions occur. First import them:

from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import ElementNotInteractableException

And then use try/except blocks to solve the problems that may occur!

Referring back to the previous problem: imagine a slow internet connection is causing a NoSuchElementException. One fix, as mentioned above, is a try/except block that makes your script wait 10 seconds and then retry:

from time import sleep

try:
    button = driver.find_element_by_id('button-id')
    button.click()
except NoSuchElementException:
    # The button may simply not have loaded yet: wait, then retry once
    sleep(10)
    button = driver.find_element_by_id('button-id')
    button.click()

This is great and all, and probably solves the problem (if it doesn’t, then you’ll really need to look into fixing that router), but it means waiting a solid 10 seconds every time the button isn’t found right away.

The Solution: How about a more elegant way? Use a WebDriverWait and specify the button’s presence as an expected condition! The second the button is located, your browser will stop waiting and let you click on it. This may save you precious seconds every single time you want to find a button:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, 'insert_button_xpath_here'))
)
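Note that presence only guarantees the element exists in the DOM; it may still be hidden behind a pop-up. If that bites, a stricter expected condition waits until the element is visible and enabled before you click. A minimal sketch, reusing the same placeholder XPath:

# Wait until the button is actually clickable, not merely present in the DOM
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, 'insert_button_xpath_here'))
)
button.click()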

The Problem: Now you’re filling out a form, and you realise that Selenium’s send_keys() option is helping, but not really helping. Sure, it types at top speed, but you soon wonder if you would actually be faster copy-pasting on your own. Self-doubt kicks in as you watch your huge chunk of text slowly string along the page, line by line.

The Solution: Input the text by executing a script! It’s much easier, and requires no imports either.

text = text.replace('\n', '\\n')  # escape new-lines so the JavaScript string stays valid
script = '''document.getElementById("source").value = '{}';'''.format(text)
driver.execute_script(script)

Now your text value is set at once, and the text appears almost instantaneously in the form! Take note that you should first replace raw new-lines with the escaped sequence '\n' (that extra backslash in the replace call above) or it won’t work. This is especially useful if you’re using huge chunks of text.
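Escaping gets fiddly if the text also contains quotes. One way to sidestep it entirely (a sketch, assuming the same "source" element id) is to hand the text to execute_script as an argument; Selenium exposes it inside the script as arguments[0]:

# No manual escaping needed: Selenium passes the Python string straight through
script = 'document.getElementById("source").value = arguments[0];'
driver.execute_script(script, text)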

The Problem: Some websites are so poorly formatted that locating buttons by their IDs (or any other attributes) just does not seem to work.

The Solution: In a frustrating situation like this, do not panic! There are other ways around this, such as clicking by position.

In a previous experience like this, I first found the buttons I knew I could definitely identify, then worked out the positions of the buttons I wanted to click (but couldn’t locate) relative to them.

With ActionChains, you can first find a working button, and then offset the click by chosen coordinates. This works beautifully most of the time, but finding the right coordinates is the crucial step.

from selenium.webdriver.common.action_chains import ActionChains

ac = ActionChains(driver)
# Move to a button you CAN find, shift by (x, y) pixels, then click
ac.move_to_element(driver.find_element_by_id('insert_id_here')).move_by_offset(x, y).click().perform()

And just like that, you can click anywhere on the screen!

The Problem: Finally, things are going great, you’re scraping all the information you need…until you get hit with the CAPTCHAs.

The Solution: Scrape responsibly! Introduce random wait times between actions to avoid overloading the server. More than that, it keeps your script’s repetitive pattern from being detected.

from time import sleep
import random
sleepTimes = [2.1, 2.8, 3.2]
sleep(random.choice(sleepTimes))
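A fixed list of pauses can still look mechanical over hundreds of actions. A small variant (my tweak, not from the original list) draws each pause from a continuous range with random.uniform:

from time import sleep
import random

sleep(random.uniform(2, 4))  # pause somewhere between 2 and 4 seconds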

The Problem: A novice mistake I made at the beginning was not having a backup option for when my scraping randomly failed halfway (say, when my internet cut off). I was only saving my scraped data to a file at the end of the script. Big mistake there.

The Solution: There are many ways to back up your data. Some ways that worked for me (see the sketch after this list):

  1. Saving each “entry” of my data to a JSON file (or any other convenient format). This kept my mind at ease, as I knew that every loop/page my script ran through was saved somewhere on my computer, and if anything failed I would still have my data. This is a great way to save on computer memory as well.
  2. Surrounding the loop in a try/except block that saved all my data into a file if anything went wrong.
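Here is what the first option might look like. A minimal sketch: scrape_page and entries.jsonl are placeholder names I’ve made up, and the JSON Lines format (one JSON object per line) makes appending trivial:

import json

def scrape_page(page_number):
    # Placeholder: swap in your actual scraping logic
    return {'page': page_number, 'data': 'scraped content'}

# Append each entry the moment it is scraped, so a crash loses at most one page
with open('entries.jsonl', 'a') as f:
    for page in range(1, 101):
        entry = scrape_page(page)
        f.write(json.dumps(entry) + '\n')
        f.flush()  # push the line to disk right away

The second option is the same idea at a coarser grain: wrap the whole loop in try/except and dump everything collected so far before the script dies.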