We search the internet every day to buy something or to compare one product with another. How do we decide that one product is better than the other? We go straight to the reviews and check how many stars and how much positive feedback the product has received.
In this blog, we’re going to scrape reviews from amazon.com: not just the review text, but also the star rating, who posted the review, and so on.
We will save the data in a CSV file that opens in any spreadsheet program. Here are the data fields we are going to extract:
1. Review Title
2. Rating
3. Reviewer Name
4. Review Description/Review Content
5. Helpful Count
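To make the target layout concrete, here is a minimal sketch of how one review could be written out as a CSV row with the five fields above. The column names, the `reviews_to_csv` helper, and the sample review are illustrative assumptions, not data from Amazon.

```python
import csv
import io

# Column order follows the field list above; the exact names are illustrative.
FIELDS = ["Review Title", "Rating", "Reviewer Name", "Description", "Helpful Count"]


def reviews_to_csv(rows):
    """Serialize a list of review dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up review, used only to show the layout.
sample = [{
    "Review Title": "Great phone",
    "Rating": "5.0",
    "Reviewer Name": "Jane Doe",
    "Description": "Battery lasts two days.",
    "Helpful Count": "12",
}]
csv_text = reviews_to_csv(sample)
print(csv_text)
```

The first line of the output is the header and every following line is one review.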
So let’s get started.
We prefer Scrapy – a Python framework for large-scale web scraping. Along with it, a few other packages are required for scraping Amazon data.
- Requests – to send HTTP requests to a URL
- pandas – to export the CSV
- pymysql – to connect to the MySQL server and store data there
- math – to perform mathematical operations (part of Python’s standard library)
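As you know, you can always install such packages with pip or conda, as shown below (math ships with Python and needs no installation).
pip install scrapy requests pandas pymysql
OR
conda install -c conda-forge scrapy requests pandas pymysql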
Let’s define the start URL to extract reviews
Let’s first see what it’s like to scrape Amazon reviews for one product.
We are using this URL: https://www.amazon.com/dp/B07N9255CG
The page looks like the image below.
If we scroll to the review section, it looks like the image below. The names shown in the reviews may differ.
But if you closely inspect the requests made in the background while the page loads, and move back and forth between the next and previous review pages, you will notice a POST request whose response contains all the review content on the page.
Here we’ll look at the payload and headers required for a successful response. If you have inspected the pages properly, you’ll see how moving between pages changes the request that is sent.
NEXT PAGE --- PAGE 2
https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_arp_d_paging_btm_next_2
Headers:
accept: text/html,*/*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
content-type: application/x-www-form-urlencoded;charset=UTF-8
origin: https://www.amazon.com
referer: https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/B07N9255CG?ie=UTF8&reviewerType=all_reviews
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36
x-requested-with: XMLHttpRequest
Payload:
reviewerType: all_reviews
pageNumber: 2
shouldAppend: undefined
reftag: cm_cr_arp_d_paging_btm_next_2
pageSize: 10
asin: B07N9255CG
PREVIOUS PAGE --- PAGE 1
https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_prev_1
Headers:
accept: text/html,*/*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
content-type: application/x-www-form-urlencoded;charset=UTF-8
origin: https://www.amazon.com
referer: https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/B07N9255CG/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36
x-requested-with: XMLHttpRequest
Payload:
reviewerType: all_reviews
pageNumber: 1
shouldAppend: undefined
reftag: cm_cr_getr_d_paging_btm_prev_1
pageSize: 10
asin: B07N9255CG
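The request captured above can be rebuilt for any page number. Below is a minimal sketch that only assembles the URL, headers and payload and does not send anything. The function name `build_review_request` is our own, and it assumes the "next" style reftag is accepted for every page, which the captures above only confirm for page 2.

```python
BASE_URL = "https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref="
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36"
)


def build_review_request(asin, page, referer):
    """Return (url, headers, payload) for one page of reviews."""
    # The same reftag appears in the URL and in the payload.
    reftag = f"cm_cr_arp_d_paging_btm_next_{page}"
    headers = {
        "accept": "text/html,*/*",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/x-www-form-urlencoded;charset=UTF-8",
        "origin": "https://www.amazon.com",
        "referer": referer,
        "user-agent": USER_AGENT,
        "x-requested-with": "XMLHttpRequest",
    }
    payload = {
        "reviewerType": "all_reviews",
        "pageNumber": page,
        "shouldAppend": "undefined",
        "reftag": reftag,
        "pageSize": 10,
        "asin": asin,
    }
    return BASE_URL + reftag, headers, payload


url, headers, payload = build_review_request(
    "B07N9255CG",
    2,
    "https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/"
    "product-reviews/B07N9255CG?ie=UTF8&reviewerType=all_reviews",
)
print(url)
print(payload)
```

With Requests, the result would be sent as `requests.post(url, headers=headers, data=payload)`, so that the payload goes out form-encoded to match the content-type header.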
The Main Part: Code/Script
There are two different ways to make a script:
1. Create a whole Scrapy project
2. Just create a few files in a folder to keep the project small
In the last tutorial, we showed you a whole Scrapy project and how to create and modify it. This time we’re taking the leanest route possible: just a few files, and all the Amazon reviews will be right there!
Since we are using Scrapy and Python to extract the reviews, XPath is the convenient route to take.
The most important part of working with XPath is capturing a pattern. Copying an XPath from the browser’s inspect window and pasting it is simple, but it is old-school and often breaks as soon as the page differs slightly.
Here’s what we’re going to do: we’ll look at the XPath for the same field, say “Review Title”, in two different reviews and see what pattern they share, so that we can narrow the XPath down.
There are two examples of a similar xpath below.
(Review-1)
(Review-2)
As you can see, the tags that contain the “Review Title” share similar attributes.
Hence, the resulting XPath for Review Title will be:
- //a[contains(@class,"review-title-content")]/span/text()
In the same way, we’ve listed the XPaths for all the fields we are going to scrape.
- Review Title : //a[contains(@class,"review-title-content")]/span/text()
- Rating : //a[contains(@title,"out of 5 stars")]/@title
- Reviewer Name : //div[@id="cm_cr-review_list"]//span[@class="a-profile-name"]/text()
- Review Description/Review Content : //span[contains(@class,"review-text-content")]/span/text()
- Helpful Count : //span[contains(@class,"cr-vote-text")]/text()
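To see the patterns at work, here is a small sketch that runs the XPaths against a hand-written HTML fragment. It uses lxml directly (the library Scrapy’s selectors are built on) so that it runs outside a Scrapy project. The fragment is our own imitation of the structure described above; real Amazon markup is larger and may differ.

```python
from lxml import html

# Hand-written sample markup (an assumption), not a copy of a real Amazon page.
SAMPLE = """
<div id="cm_cr-review_list">
  <div class="a-section review">
    <span class="a-profile-name">Jane Doe</span>
    <a class="a-link-normal" title="5.0 out of 5 stars"></a>
    <a class="a-size-base review-title-content a-text-bold"><span>Great phone</span></a>
    <span class="a-size-base review-text-content"><span>Battery lasts two days.</span></span>
    <span class="a-size-base cr-vote-text">12 people found this helpful</span>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)

# contains() matches one class among several, which exact matching would miss.
titles = tree.xpath('//a[contains(@class,"review-title-content")]/span/text()')
ratings = tree.xpath('//a[contains(@title,"out of 5 stars")]/@title')
names = tree.xpath('//div[@id="cm_cr-review_list"]//span[@class="a-profile-name"]/text()')
texts = tree.xpath('//span[contains(@class,"review-text-content")]/span/text()')
helpful = tree.xpath('//span[contains(@class,"cr-vote-text")]/text()')

print(titles, ratings, names, texts, helpful)
```

Note that the title link carries three classes in the sample, and `contains()` still finds it. That is the advantage of matching on a pattern instead of pasting a copied XPath.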
The results of some XPaths need a bit of stripping and joining before the data is clean. Also, don’t forget to remove extra whitespace.
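As a rough sketch of that cleanup step, the helpers below join text nodes, collapse whitespace, and pull the numbers out of the rating and helpful-count strings. The helper names are our own, and the wording of the helpful-count text ("12 people found this helpful", "One person found this helpful") is an assumption about what the page shows.

```python
import re


def clean_text(parts):
    """Join text nodes and collapse runs of whitespace into single spaces."""
    return " ".join(" ".join(parts).split())


def parse_rating(title):
    """'4.0 out of 5 stars' -> 4.0; returns None if no leading number is found."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", title)
    return float(match.group(1)) if match else None


def parse_helpful_count(text):
    """Leading number of the vote text as an int; 'One ...' -> 1; otherwise 0."""
    first = text.strip().split(" ", 1)[0].replace(",", "")
    if first.isdigit():
        return int(first)
    return 1 if first.lower() == "one" else 0


print(clean_text(["  Battery lasts\n", "  two   days. "]))
print(parse_rating("4.0 out of 5 stars"))
print(parse_helpful_count("12 people found this helpful"))
```

Returning `None` or `0` for missing values keeps one odd review from stopping the whole run.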
Alright.
Now that we have seen how to move across the pages and how to extract information from them, it’s time to put it all together!
Below is the whole code for extracting all the reviews for one product.
import math

import pandas as pd
import requests
from scrapy.http import HtmlResponse
# pandas.to_sql needs an SQLAlchemy engine (pip install sqlalchemy); pymysql stays the MySQL driver
from sqlalchemy import create_engine

ASIN = 'B07N9255CG'


def first_word(text):
    # First whitespace-separated token, or '' when the field is missing
    parts = text.split()
    return parts[0] if parts else ''


def parse_reviews(response, product_name):
    # Build one row per review block found in the response
    rows = []
    for part in response.xpath('//div[contains(@class,"a-section review")]'):
        review_title = part.xpath('.//a[contains(@class,"review-title-content")]/span/text()').extract_first(default='').strip()
        rating = first_word(part.xpath('.//a[contains(@title,"out of 5 stars")]/@title').extract_first(default=''))
        reviewer_name = part.xpath('.//span[@class="a-profile-name"]/text()').extract_first(default='').strip()
        description = ''.join(part.xpath('.//span[contains(@class,"review-text-content")]/span/text()').extract()).strip()
        helpful_count = first_word(part.xpath('.//span[contains(@class,"cr-vote-text")]/text()').extract_first(default=''))
        rows.append([product_name, review_title, rating, reviewer_name, description, helpful_count])
    return rows


def scrape_reviews():
    # Replace root, password and database with your own MySQL credentials
    engine = create_engine('mysql+pymysql://root:password@localhost/database')
    res = requests.get(f'https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/{ASIN}?ie=UTF8&reviewerType=all_reviews')
    response = HtmlResponse(url=res.url, body=res.content)
    referer = response.url
    product_name = response.xpath('//h1/a/text()').extract_first(default='').strip()
    # The second-to-last word of the "Showing ..." text is the review count; drop thousands separators
    total_reviews = response.xpath('//span[contains(text(),"Showing")]/text()').extract_first(default='').strip().split()[-2]
    total_pages = math.ceil(int(total_reviews.replace(',', '')) / 10)
    # Page 1 is already loaded; pages 2..total_pages come from the POST endpoint
    raw_dataframe = parse_reviews(response, product_name)
    for page in range(2, total_pages + 1):
        reftag = f'cm_cr_arp_d_paging_btm_next_{page}'
        url = f'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref={reftag}'
        head = {
            'accept': 'text/html,*/*',
            # 'br' is left out because requests only decodes Brotli when a brotli package is installed
            'accept-encoding': 'gzip, deflate',
            'accept-language': 'en-US,en;q=0.9',
            'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
            'origin': 'https://www.amazon.com',
            'referer': referer,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
        }
        payload = {
            'reviewerType': 'all_reviews',
            'pageNumber': page,
            'shouldAppend': 'undefined',
            'reftag': reftag,
            'pageSize': 10,
            'asin': ASIN,
        }
        # The content type is form-urlencoded, so pass the dict itself rather than a JSON string
        res = requests.post(url, headers=head, data=payload)
        response = HtmlResponse(url=res.url, body=res.content)
        raw_dataframe.extend(parse_reviews(response, product_name))
    df = pd.DataFrame(raw_dataframe, columns=['Product Name', 'Review Title', 'Review Rating', 'Reviewer Name', 'Description', 'Helpful Count'])
    # inserting into MySQL table
    df.to_sql('review_table', con=engine, if_exists='append', index=False)
    # exporting csv
    df.to_csv('amazon reviews.csv', index=False)


if __name__ == '__main__':
    scrape_reviews()
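The script works out how many pages to request from the "Showing ..." text and a page size of 10. That step can be checked on its own, as in the sketch below. It is a variation on the script: it takes the last number in the string with a regular expression instead of splitting on spaces. The `count_pages` name and the sample strings are assumptions for illustration.

```python
import math
import re


def count_pages(showing_text, page_size=10):
    """Derive the number of review pages from a 'Showing ... of N reviews' string."""
    # Numbers may contain thousands separators, e.g. "1,234".
    numbers = re.findall(r"\d[\d,]*", showing_text)
    if not numbers:
        return 0
    total_reviews = int(numbers[-1].replace(",", ""))
    return math.ceil(total_reviews / page_size)


print(count_pages("Showing 1-10 of 1,234 reviews"))
print(count_pages("Showing 1-10 of 100 reviews"))
```

Returning 0 when no number is found means a changed or missing label results in no page requests instead of a crash.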
Pressure Points while scraping Amazon Reviews
- The whole process looks easy to implement, but issues can come up while it runs, such as failed responses or captchas. To work around them, keep some proxies or VPNs handy so that the process runs more smoothly.
- The website also changes its structure from time to time. If you plan to run the extraction over a long period, keep error logs in your script, or set up an error alert, so that you know the moment the structure changes.
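Both points can be sketched in a few lines. The first helper below retries a request through a rotating list of proxies, and the second logs an error when required fields come back empty, which often signals a layout change. The function names and the logger name are our own. The check for the word "captcha" in the response is an assumption about how a blocked page looks, and `fetch` stands for any function of yours that downloads a page through a given proxy.

```python
import itertools
import logging
import time

logger = logging.getLogger("amazon_reviews")


def fetch_with_retries(fetch, proxies, max_attempts=3, delay=0.0):
    """Call fetch(proxy) with a different proxy per attempt until one succeeds.

    fetch takes a proxy URL and returns the page text, raising on failure.
    Returns None when every attempt fails or hits a captcha page.
    """
    pool = itertools.cycle(proxies)
    for attempt in range(1, max_attempts + 1):
        proxy = next(pool)
        try:
            text = fetch(proxy)
        except Exception as exc:
            logger.warning("attempt %d via %s failed: %s", attempt, proxy, exc)
        else:
            # Assumption: a blocked response mentions "captcha" somewhere in the page.
            if "captcha" not in text.lower():
                return text
            logger.warning("attempt %d via %s hit a captcha page", attempt, proxy)
        time.sleep(delay)
    return None


def missing_fields(row, required=("Review Title", "Rating")):
    """Return the required fields that came back empty, logging them if any."""
    missing = [name for name in required if not row.get(name)]
    if missing:
        logger.error("possible layout change, empty fields: %s", ", ".join(missing))
    return missing
```

With Requests, `fetch` could be something like `lambda proxy: requests.get(url, proxies={"https": proxy}, timeout=10).text`. Keeping the download function separate also lets you test the retry logic without touching the network.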
Conclusion
Any kind of Amazon review scraping is very helpful. Why? Consider the cases below.
- To monitor customers’ views on your products if you are a seller on the same website
- To monitor other sellers
- To create a dataset for research, whether academic or industrial