Web Scraping

Web scraping is the process of retrieving data from websites. Firms extract data in order to analyze it, migrate it to a data repository (a data warehouse), or use it directly in their business.

Prerequisites

Before we start scraping, we need:

  1. Python
  2. Selenium WebDriver
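
If you don't have Selenium yet, you can install it with pip:

pip install selenium

You will also need the chromedriver executable matching your version of Chrome; download it and note its path, since we will point Selenium to it below.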

Let’s write our scraper!

First, we import the Selenium WebDriver package in our Python file:

from selenium import webdriver

Now, create an instance of Google Chrome. This lets our program open URLs in a Chrome window.

driver = webdriver.Chrome()

Important: if Selenium can't find Chrome or chromedriver on its own, you can tell it where they are when creating the driver:

options = webdriver.ChromeOptions()
# Path to the Chrome browser executable.
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
# Path to chromedriver.exe.
driver = webdriver.Chrome(options=options, executable_path="C:/Python/Scripts/chromedriver.exe")
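
Note: the executable_path argument is the Selenium 3 style used in this tutorial. If you are on Selenium 4, the driver path is passed through a Service object instead. A minimal sketch, reusing the same (assumed) paths as above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
# Selenium 4 style: wrap the chromedriver path in a Service object.
service = Service("C:/Python/Scripts/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)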

Now we can open the website in Chrome. Note that Chrome will show a banner indicating that it is being controlled by automated software.

driver.get("http://econpy.pythonanywhere.com/ex/001.html")

We just opened a URL from Python.

How to extract data from a web page

We will inspect two items (buyer name and price) on the web page and see how to extract each one.

  1. Buyer Name: Right-click on a buyer name on the page and then click Inspect. The developer tools will open, with the highlighted markup representing the buyer name.


    Now right-click on the highlighted element and choose Copy > Copy XPath.



    The XPath for the buyer name is shown below.

    /html/body/div[2]/div


    Selenium has a "find_element_by_xpath" function; we pass our XPath to it and get back a Selenium element. Once we have the element, we can read the text inside it through its 'text' attribute. In our case, that text is the buyer name ('Carson Busses').

  2. Price: Similar to the buyer name, we will now inspect the price.




    Now right-click on the highlighted element and choose Copy > Copy XPath.



    The XPath for the price is shown below.

    /html/body/div[2]/span

    So, how do we extract the price from the above XPath?

    Again, we use the "find_element_by_xpath" function, pass our XPath to it, and get back a Selenium element. We then read its 'text' attribute; in our case, that text is the price ('$29.95').

Here is the code:

from selenium import webdriver

driver = webdriver.Chrome()
url = "http://econpy.pythonanywhere.com/ex/001.html"
driver.get(url)
# Locate each element by its absolute XPath and read its text attribute.
buyer_name = driver.find_element_by_xpath('/html/body/div[2]/div').text
price = driver.find_element_by_xpath('/html/body/div[2]/span').text
print('Buyer : ' + buyer_name)
print('Price : ' + price)

When we run this code, we get the buyer name and price.
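
Based on the values we inspected earlier, the output should be:

Buyer : Carson Busses
Price : $29.95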

We just learned how to scrape individual elements from a web page.

How to extract all buyer names and prices

We noticed that each buyer's div has the title attribute "buyer-name", so we will change our XPath:

//div[@title="buyer-name"]

The leading // matches elements anywhere in the document, so this expression selects every div whose title attribute is "buyer-name".

Now we will use the "find_elements_by_xpath" function (note the plural "elements") and pass our XPath to it. It returns a list of Selenium elements, and we can use an index to access each element.

Similarly, we change the XPath for the price:

//span[@class="item-price"]

Here is the code to extract all elements:

from selenium import webdriver

driver = webdriver.Chrome()
url = "http://econpy.pythonanywhere.com/ex/001.html"
driver.get(url)
# find_elements_by_xpath (plural) returns a list of all matching elements.
buyer_name = driver.find_elements_by_xpath('//div[@title="buyer-name"]')
price = driver.find_elements_by_xpath('//span[@class="item-price"]')
number_of_buyers = len(buyer_name)
for x in range(number_of_buyers):
    print('Buyer : ' + buyer_name[x].text)
    print('Price : ' + price[x].text)


Finally, you may have noticed that the website has pagination, so we can visit the following pages by simply changing the page number in the URL and extract the data from each one.
E.g. to extract data from page number 4 we simply change 001 to 004 in the URL. This process will take some time, depending on the computational power of your computer.
To extract all pages, find the code on Github; a minimal sketch of the loop is shown below.
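
This sketch loops over the page URLs and collects every buyer/price pair (the page count of 5 is an assumption for illustration; check how many pages the site actually has):

from selenium import webdriver

driver = webdriver.Chrome()
results = []
# The page count of 5 is assumed for illustration only.
for page in range(1, 6):
    url = "http://econpy.pythonanywhere.com/ex/{:03d}.html".format(page)
    driver.get(url)
    buyers = driver.find_elements_by_xpath('//div[@title="buyer-name"]')
    prices = driver.find_elements_by_xpath('//span[@class="item-price"]')
    # Pair each buyer with the price listed next to it.
    for buyer, price in zip(buyers, prices):
        results.append((buyer.text, price.text))
driver.quit()
print(results)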

