Introduction

Sometimes when scraping a website with Python, the site blocks your IP address for sending too many requests to its server. To bypass this restriction, you have to use proxies. Proxies ensure that your IP address will not get blocked by a website that you are trying to access. Luckily, there are easy-to-use resources that make all of this possible.


Pre-Requisite

Before using proxies, we need:

  1. Python
  2. Selenium Webdriver

Use Proxies in Python 3 without Authentication

If you are using Selenium WebDriver, you can open Chrome through a proxy by configuring the --proxy-server option. For this configuration, we need proxies. Many websites provide free proxies on the internet. One such site is https://free-proxy-list.net/. Open it and pick a proxy.

Here is my proxy

IP: 207.148.124.248 Port: 8080

Note: That proxy probably won’t work when you test it. You should pick another proxy from the website if it doesn’t work.


Now let’s open https://whatismyipaddress.com/ and test whether the site opens through the proxy:

from selenium import webdriver

PROXY = "207.148.124.248:8080"  # IP:PORT or HOST:PORT

options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
options.add_argument("--proxy-server=http://%s" % PROXY)  # route Chrome's traffic through the proxy
driver = webdriver.Chrome(options=options)

url = "https://whatismyipaddress.com/"
driver.get(url)


You can see that the site opens through the proxy and my IP address has also changed. Now let’s open the site through a pool of IP addresses. We can get a list of proxies from https://free-proxy-list.net/ by manually copying and pasting, or write a script that builds the list automatically. We could also use private proxies, but we would have to buy them. Once we have the list of proxies, the rest is easy.
Below is the code that automatically scrapes proxies. (This code might need to change if the website updates its structure.)

from selenium import webdriver

def get_proxies(driver):
    url = "https://free-proxy-list.net/"
    driver.get(url)
    proxies = []
    # Each row of the proxy table holds an IP in the first cell and a port in the second.
    proxy_table = driver.find_elements_by_xpath('//*[@id="proxylisttable"]/tbody/tr')
    for row in proxy_table:
        row_data = row.find_elements_by_tag_name('td')
        proxies.append(row_data[0].text + ":" + row_data[1].text)
    return proxies

The function get_proxies will return a list of proxies.

['139.59.109.156:8080', '46.4.96.137:3128', '45.55.9.218:3128', '35.247.152.119:3128', '198.13.38.227:8888', '80.211.237.76:3128', '68.183.152.14:8080', '66.42.114.113:8080', '36.90.14.215:8080', '194.15.36.215:8080', '157.230.161.164:8080', '207.148.124.248:8080', '149.28.140.248:8080', '93.188.165.80:8080', '18.217.49.2:3128', '142.93.173.132:8080', '206.189.216.18:3128', '45.32.33.117:3128', '110.34.39.58:8080', '113.254.227.150:80']
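Free proxy lists often contain duplicates and malformed entries, so it can help to filter the scraped list before using it. Here is a minimal sketch using only the standard library; the helper name `clean_proxies` is my own choice, not part of the original code:

```python
def clean_proxies(proxies):
    """Deduplicate and keep only well-formed IPv4 host:port strings."""
    cleaned = []
    seen = set()
    for proxy in proxies:
        host, sep, port = proxy.partition(":")
        octets = host.split(".")
        if (sep == ":" and port.isdigit() and 1 <= int(port) <= 65535
                and len(octets) == 4
                and all(o.isdigit() and 0 <= int(o) <= 255 for o in octets)
                and proxy not in seen):
            seen.add(proxy)
            cleaned.append(proxy)
    return cleaned

print(clean_proxies(["139.59.109.156:8080", "not-a-proxy", "139.59.109.156:8080"]))
# → ['139.59.109.156:8080']
```

Running the scraped list through this filter means the rotation loop below never wastes a browser launch on an obviously broken entry.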

Now that we have the list of proxies, we’ll go ahead and rotate it.

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from time import sleep

def get_proxies(driver):
    url = "https://free-proxy-list.net/"
    driver.get(url)
    proxies = []
    proxy_table = driver.find_elements_by_xpath('//*[@id="proxylisttable"]/tbody/tr')
    for row in proxy_table:
        row_data = row.find_elements_by_tag_name('td')
        proxies.append(row_data[0].text + ":" + row_data[1].text)
    return proxies

driver = webdriver.Chrome()
proxies = get_proxies(driver)
driver.quit()

for PROXY in proxies:
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    options.add_argument('--proxy-server=http://%s' % PROXY)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://whatismyipaddress.com/")
        sleep(10)
    except WebDriverException:
        pass  # most free proxies will fail with proxy server errors
    driver.quit()
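Because most free proxies are dead or slow, you may want to check each one before spending a browser launch on it. Here is a minimal liveness check using only the standard library; the helper name and the test URL are my own choices, not part of the original code:

```python
import urllib.request

def proxy_is_alive(proxy, timeout=5):
    """Return True if a test URL can be fetched through the proxy in time."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy,
                                           "https": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open("http://example.com/", timeout=timeout)
        return True
    except Exception:
        return False

# Filter the scraped list down to working proxies before starting the loop:
# proxies = [p for p in proxies if proxy_is_alive(p)]
```

This trades a few seconds of up-front checking for far fewer failed page loads during the rotation.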

For Paid Proxies

If you want to do large-scale data extraction, we recommend purchasing some good paid proxies. Paid proxies are more reliable, but they usually require authentication with a username and password.
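Chrome's --proxy-server flag does not accept credentials, so one common approach for authenticated proxies uses the third-party selenium-wire package. The sketch below assumes that approach; the username, password, and host are placeholders, not real values:

```python
def proxy_url(user, password, host, port):
    """Build an authenticated proxy URL, e.g. http://user:pass@host:8080."""
    return "http://%s:%s@%s:%s" % (user, password, host, port)

def make_authenticated_driver(user, password, host, port):
    # selenium-wire is a third-party package: pip install selenium-wire
    from seleniumwire import webdriver
    proxy = proxy_url(user, password, host, port)
    return webdriver.Chrome(seleniumwire_options={
        "proxy": {"http": proxy, "https": proxy},
    })

# Usage (placeholder credentials -- replace with your provider's details):
# driver = make_authenticated_driver("myuser", "mypassword", "proxy.example.com", 8080)
# driver.get("https://whatismyipaddress.com/")
```

With paid proxies you can reuse the same rotation loop as above, swapping in `make_authenticated_driver` for the plain `webdriver.Chrome` call.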


You can find the source code on GitHub.

