Web scraping is the process of retrieving data from websites. Firms extract this data to analyze it, migrate it to a data repository (a data warehouse), or use it in their businesses.
Before scraping, we need Python with the Selenium package installed, plus Google Chrome and its matching ChromeDriver.
Let’s write our scraper!
First, we import Selenium's webdriver package in our Python file:
from selenium import webdriver
Now, create an instance of Google Chrome. It lets our program open URLs in the Chrome browser.
driver = webdriver.Chrome()
Important: if Python can't find Chrome or the ChromeDriver executable, you can tell it where they are when creating the driver:
options = webdriver.ChromeOptions()
# add the Chrome browser exe path
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
# add the chromedriver.exe path
driver = webdriver.Chrome(options=options, executable_path="C:/Python/Scripts/chromedriver.exe")
Now we can open a website in Chrome. Note that Chrome displays a banner saying it is being controlled by automated test software.

driver.get("http://econpy.pythonanywhere.com/ex/001.html")

We just opened a URL from Python.
How to extract data from web page
We will inspect two items (buyer name and price) on the web page and see how to extract them.
- Buyer Name: Right-click on a buyer name and click Inspect. The developer tools open with the element's markup highlighted; that highlighted code represents the buyer name.
Now right-click on the highlighted code and choose Copy → Copy XPath.
The XPath for the buyer name is /html/body/div/div.
Selenium has a "find_element_by_xpath" function; we pass our XPath to it and get back a Selenium element. Once we have the element, we can read the text it contains through its 'text' attribute. In our case that text is the buyer name ('Carson Busses').
- Price: Similar to buyer name, we will now inspect the price.
Now right-click on it and choose Copy → Copy XPath.
The XPath for the price is /html/body/div/span.
So, how do we extract price from the above XPath?
Again we use the "find_element_by_xpath" function, pass our XPath, and get a Selenium element. Once we have the element, its 'text' attribute gives us the price ('$29.95').
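As a side note, once the price text is in hand, turning it into a number for analysis is a one-liner. This is a sketch that assumes the text always starts with a dollar sign, using the sample value quoted above:

```python
price_text = "$29.95"  # the text extracted via the XPath above
# strip the leading "$" and convert to a float
price_value = float(price_text.lstrip("$"))
print(price_value)  # 29.95
```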
Here is the code:
from selenium import webdriver

driver = webdriver.Chrome()
url = "http://econpy.pythonanywhere.com/ex/001.html"
driver.get(url)
buyer_name = driver.find_element_by_xpath('/html/body/div/div').text
price = driver.find_element_by_xpath('/html/body/div/span').text
print('Buyer : ' + buyer_name)
print('Price : ' + price)
When we run this code, we get the buyer name and the price.
We just learnt how to scrape different elements from a web page.
How to recursively extract all buyers name and prices
We notice that each buyer's div has the title attribute "buyer-name". So we change our XPath accordingly.
The XPath //div[@title="buyer-name"] matches any div tag, anywhere in the document, whose title attribute is "buyer-name" (// means "search the whole document").
Now we use the "find_elements_by_xpath" function (note the plural "elements") and pass our XPath to it. We get back a list of Selenium elements and can index into the list to reach each one.
Similarly, we change the XPath for the price to //span[@class="item-price"].
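To see what these relative XPaths match without launching a browser, here is a sketch using Python's built-in ElementTree on a simplified stand-in for the page's markup. The first row uses the sample values quoted above; the second row is a made-up placeholder:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page's markup; the second buyer is invented.
html = """<html><body>
<div title="buyer-name">Carson Busses</div><span class="item-price">$29.95</span>
<div title="buyer-name">Jane Doe</div><span class="item-price">$10.00</span>
</body></html>"""

root = ET.fromstring(html)
# ElementTree supports the same attribute-predicate syntax as our XPaths
names = [d.text for d in root.findall(".//div[@title='buyer-name']")]
prices = [s.text for s in root.findall(".//span[@class='item-price']")]
print(names)   # ['Carson Busses', 'Jane Doe']
print(prices)  # ['$29.95', '$10.00']
```

The same predicate form, passed to find_elements_by_xpath, is what selects every buyer (or price) on the live page at once.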
Here is the code to extract all elements:
from selenium import webdriver

driver = webdriver.Chrome()
url = "http://econpy.pythonanywhere.com/ex/001.html"
driver.get(url)
buyer_names = driver.find_elements_by_xpath('//div[@title="buyer-name"]')
prices = driver.find_elements_by_xpath('//span[@class="item-price"]')
number_of_buyers = len(buyer_names)
for x in range(number_of_buyers):
    print('Buyer : ' + buyer_names[x].text)
    print('Price : ' + prices[x].text)
Finally, you may have noticed that the website is paginated. We can visit the next pages simply by changing the page number in the URL and extract data from each one.
E.g. to extract data from page 4, we simply replace 001 with 004 in the URL. Looping over all pages will take some time, depending on the computational power of your computer.
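Assuming the pages follow the same three-digit naming scheme (001, 002, ...), the URLs for each page can be generated up front; the page count of 5 below is a placeholder, so adjust it to the real number of pages:

```python
# Build the URL for each page; the page count (5) is a placeholder.
base_url = "http://econpy.pythonanywhere.com/ex/{:03d}.html"
page_urls = [base_url.format(page) for page in range(1, 6)]
print(page_urls[0])   # first page (ends in 001.html)
print(page_urls[3])   # fourth page (ends in 004.html)
```

Each URL can then be passed to driver.get() in turn, running the same extraction loop on every page.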
To extract all pages, you can find the full code on GitHub.