Scrape Amazon Data With Python

Learn how to get past Amazon.com’s anti-bot mechanisms and use Python to crawl Amazon.com pricing data automatically.


You can’t crawl Amazon.com easily these days. Amazon blocks plain HTTP requests, turning naive automation attempts into failures.

For instance, run the following code:

import requests
page = requests.get('https://www.amazon.com/')
print(page.text)

Rather than the actual page HTML, your request gets a “sincere welcome” from Amazon.com that looks like this:

[Screenshot: Amazon’s anti-bot response page]

So let’s suppose you don’t want to use the official APIs (they are not free). Note also that in 2019, a US court ruled that scraping publicly available web data is legal.

Collecting Amazon.com’s public data is completely legal.

Extracting Amazon.com data is not easy, but it’s not impossible either. In this blog, we share our practices, which we hope will be useful whether you are a buyer, a seller, or a data scientist who needs the latest raw price data.

Be Gentle With the Target Server

No matter whether the scraping target is backed by a small company or a large one, we never send requests in a multi-process or multi-threaded asynchronous style.

Small servers without protection can be knocked over by waves of HTTP requests, while well-protected websites will block your IP address, if not ban it outright.

Adding a small interval between consecutive HTTP requests is an easy courtesy to build in before you start a crawler, for instance:

import time
time.sleep(0.1) # sleep for 0.1 second
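
Dropped into a crawl loop, it might look like this minimal sketch (the URL list here is hypothetical):

import time
import requests

urls = ['https://example.com/a', 'https://example.com/b']  # hypothetical URL list
for url in urls:
    page = requests.get(url)
    # ... parse page.text here ...
    time.sleep(0.1)  # pause between requests to avoid hammering the server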

Get Past the Web Browser Checks

To prevent large-scale scraping, many websites apply strict anti-bot policies, for example requiring the requesting browser to run a piece of JavaScript, among other complex checks. The easiest way to pass these checks is to use a headless browser, such as:

pyppeteer, a Python port of puppeteer.
Selenium with Python.

These headless browsers behave like Firefox or Chrome, just without rendering a visible window, and can be fully controlled from code.
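
For example, here is a minimal sketch of driving headless Chrome with Selenium (assuming Chrome and a matching chromedriver are installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')          # run Chrome with no visible window
driver = webdriver.Chrome(options=options)  # requires chromedriver on PATH
driver.get('https://www.amazon.com/')
print(driver.page_source[:500])             # first 500 chars of the rendered HTML
driver.quit()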

For a few pages that need complex interaction and genuinely require human eyes and hands, you might even consider building a Chrome extension that captures the web data and sends it to a locally running service, as sketched below.
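
The extension side is JavaScript, but the local receiving service can be plain Python. Here is a minimal sketch using only the standard library, assuming the extension POSTs the captured HTML to localhost:8000:

from http.server import BaseHTTPRequestHandler, HTTPServer

class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        html = self.rfile.read(length).decode('utf-8')
        with open('captured.html', 'w', encoding='utf-8') as f:
            f.write(html)  # save whatever the extension captured
        self.send_response(200)
        self.end_headers()

HTTPServer(('127.0.0.1', 8000), CaptureHandler).serve_forever()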

However, headless browsers have one disadvantage: enormous resource usage, both RAM and CPU, because every HTTP request is made by a real web browser.

For Amazon pricing data, we will use a different solution. Let’s continue.

Use the Cloudscraper Package

The cloudscraper package is built to bypass Cloudflare’s anti-bot page (known as IUAM, or “I’m Under Attack Mode”). We have found it works on various websites, and it handles Amazon’s pages well too.

To install it:

pip install cloudscraper

Run a quick test:

import cloudscraper
scraper = cloudscraper.create_scraper()
page = scraper.get('https://www.amazon.com/')
print(page.text)

Now you should see something like the following, rather than the warm welcome pointing you at the paid API:

[Screenshot: the Amazon.com homepage HTML]

The cloudscraper module offers many features; one of them is browser profile configuration. For instance, if we want amazon.com to return the HTML for Chrome on Windows (as opposed to the mobile version), we can recreate the scraper instance like this:

scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'desktop': True
    }
)

Many other combinations are listed in the project’s README.

Captcha, Captcha, & Captcha

Solving captchas is the most painful part of data scraping, and you will face them sooner or later.

[Screenshot: an Amazon captcha page]

To minimize the impact of captchas, we used the following tactics:

Tactic #1. Rather Than Solving It, We Ignore It

Whenever our crawler sees a captcha, we push the URL back into the URL queue, then shuffle the queue to randomize the order (to avoid hitting the same URL again within a short timespan).

import random
random.shuffle(url_list)
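
Put together, the pattern might look like this sketch, where looks_like_captcha() and parse() are hypothetical helpers, and scraper and url_list come from earlier:

import random

while url_list:
    url = url_list.pop()
    page = scraper.get(url)
    if looks_like_captcha(page.text):   # hypothetical captcha detection
        url_list.append(url)            # push the URL back into the queue
        random.shuffle(url_list)        # randomize the retry order
        continue
    parse(page.text)                    # hypothetical step: extract the fields you need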

Tactic #2. Apply Random Sleep

Apply a random sleep after each page request. After many tries, we found that sleeping between 0 and 1 seconds works well: not too slow, and it triggers the fewest captchas.

time.sleep(random.uniform(0 + sleep_base, 1 + sleep_base))

In the code above, we also define a sleep_base variable. Whenever the crawler triggers a captcha, we add 2 seconds to sleep_base (sleep_base += 2).
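
A sketch of the back-off, reusing the hypothetical looks_like_captcha() helper from above:

import random
import time

sleep_base = 0
for url in url_list:
    page = scraper.get(url)
    if looks_like_captcha(page.text):  # hypothetical captcha detection
        sleep_base += 2                # every captcha makes future sleeps longer
    time.sleep(random.uniform(0 + sleep_base, 1 + sleep_base))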

Tactic #3. Use Several Scraper Instances

A single scraper instance starts triggering captchas after roughly 200 requests, and triggers more and more of them no matter how many seconds we add to sleep_base. Using multiple scraper instances alleviates this efficiently; here we use 30 of them.

scrapers = []
for _ in range(30):
    scraper = cloudscraper.create_scraper(
        browser={
            'browser': 'chrome',
            'platform': 'windows',
            'desktop': True
        }
    )
    scrapers.append(scraper)  # indented into the loop so all 30 instances are kept

With that in place, take one instance from the scraper list per request, then shuffle the list (or randomize the pick index).
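
For instance, a random pick per request:

import random

scraper = random.choice(scrapers)  # pick one of the 30 instances at random
page = scraper.get(url)            # url comes from your queue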

Parse the Returned HTML with BeautifulSoup

Many tools exist to help parse HTML text; you could even use regular expressions to extract the key data you need. We found BeautifulSoup to be a convenient tool for navigating HTML elements.

To install the packages:

pip install beautifulsoup4
pip install html5lib

With the help of html5lib, you can enable an HTML5 parser in BeautifulSoup. Let’s see it in use with two simple (and real) examples.

Use the find function to retrieve data from an element with a particular id:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html5lib')
sku_title = soup.find(id='productTitle').get_text().strip()

Use the select function to navigate elements with a CSS selector:

chip = soup.select('.po-graphics_coprocessor > td')[1].get_text().strip()

Picking up further usage from its documentation is straightforward.

How Do I Know the Solution Still Works When I Read This?

We are not certain how long this solution will keep working; it might be a month, a year, or ten years. As we write this, a crawler powered by it is running on our server 24/7.

If the data is refreshed as of today, the crawler is still working and the solution still applies.

[Screenshot: price data refreshed today]

We will update this blog as new problems come up and get solved.

For more information about scraping Amazon.com data using Python, contact X-Byte Enterprise Crawling or ask for a free quote!
