Selling products on Amazon is one of the easiest, most convenient, and least expensive ways of starting an online business. It also means there is a wealth of seller information on Amazon that can be scraped. In this tutorial, we will build a simple Amazon scraper in Python for extracting sellers' details.
Here, we are going to scrape the details of all the sellers available on the Amazon website, such as:
- Seller Name
- Category (If specified)
- URL/ link of the seller-page
- Address
- Phone Number
So let’s get started.
We will use Scrapy, a Python framework well suited to large-scale web scraping. To scrape Amazon, you need a few modules and dependencies installed or set up on your machine:
- scrapy : a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of XPath-based selectors.
- re : the standard-library module for working with Regular Expressions.
- pandas : (Python Data Analysis Library) provides high-performance, easy-to-use data structures and data analysis tools.
- pymysql : an interface for connecting to a MySQL database server from Python.
- phonenumbers : a Python phone number parsing and formatting library.
- platform : a standard-library module used to access the underlying platform's data, such as hardware, operating system, and interpreter version information.
As you know, you can install the third-party packages with pip or conda, as shown below (re and platform ship with Python's standard library, so there is nothing to install for them).
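pip install scrapy pandas pymysql phonenumbers

or, using conda (the conda-forge channel is one common source for these packages):

conda install -c conda-forge scrapy pandas pymysql phonenumbers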
Creating our Amazon spider
To create a basic Scrapy project, we suggest creating a dedicated folder for it. Navigate to that folder's path in the command prompt and create the Scrapy project by executing the following:
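scrapy startproject amazon_seller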
It's a basic Scrapy project which, as you know, can be created with "scrapy startproject <_project name_>". Here the project name is amazon_seller.
We can proceed by creating a spider to extract seller links, so that we can get started on scraping. A spider can be created with "scrapy genspider <_spider name_> <_url to scrape_>".
Here's the command for that, using the spider name seller_links and the amazon.de domain that appear later in this tutorial:
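scrapy genspider seller_links amazon.de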
The flow of the Scrapy project follows the standard layout; however, since this project is created specifically for Amazon seller extraction, it has one extra file (databaseconfig.py). Why? We'll discuss that shortly.
Let's go through the flow of the Scrapy project and see how it works.
Here, in the directory "\amazon_seller\amazon_seller\", there is an additional file named "databaseconfig.py", as mentioned above.
- As you know, some connection strings and values often need to be defined more than once in a project.
- So why not define them in one universal file and reference them whenever needed?
- The file databaseconfig.py is created with exactly this idea.
The file contains the variables and values listed below.
1. Values required to form a database connection string
2. Names of schema tables
3. Schema table creation strings
4. Some file paths (to direct the process where to save output csv at the end of execution)
Here is a snippet for the same.
host = "localhost"
username = "root"
passwd = "your password here"
db = "amazon_seller"
table_name1 = "seller_list"
table_name2 = "seller_info"
table1_create_table = """CREATE TABLE IF NOT EXISTS %s
    (Id int NOT NULL AUTO_INCREMENT,
    seller_name varchar(100) NOT NULL,
    seller_category varchar(100) NOT NULL,
    seller_url varchar(255) NOT NULL,
    status varchar(10) DEFAULT 'pending', ...
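Elsewhere in the project, these values can then be pulled in with a single import. For instance, a connection could be opened like this (the dbc alias matches the one used in the export snippet near the end of this tutorial):

import pymysql
import databaseconfig as dbc

# One set of connection values, defined once and reused everywhere
con = pymysql.connect(host=dbc.host, user=dbc.username,
                      password=dbc.passwd, database=dbc.db)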
Let's define the Start URL to extract seller links
We have a spider created to extract seller links – seller_links.py
There is a small process to understand how the start URL is formed. After all, Amazon is not going to list all the sellers for each and every category together in one place for us, right?
- Here is the trick: when you are about to extract sellers from Amazon, you don't need to go inside every single product to check for its sellers.
- Instead, you can search for Amazon sellers in the browser, and it will directly provide a list of sellers for you (the link format is shown below).
- But, as you can see, that list is only for the category "Beauty : Hair Care : Hair Care Products : Hair Loss Products : minoxidil" (translated into English).
- So what if we want to extract sellers from other categories?
- Hence, we're going to take one id for each Amazon category found on the category page. Let's move to Amazon's category page to understand all of this.
- When you inspect each category link, you will find one id. Here the category "Camera & Photo" under "Electronics & Computers" is chosen for extraction; while inspecting that category, the id was found after "&node=<_id_>".
- Just like this one, ids for each category can be found (except for the categories whose products are sold by Amazon itself).
- In the script, we have gathered these ids and inserted them into the link of the seller-list page we just saw.
- The link can be changed by simply editing the id at the very end.
- So, this will be our Start URL, and it will be different for each category available on Amazon.
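Based on the request URL used in the parse() snippet later in this tutorial, the Start URL looks roughly like this; the id after rh=n%3A selects the category (the <_category id_> placeholder is ours):

https://www.amazon.de/mn/search/other?_encoding=UTF8&language=en_GB&page=1&pickerToList=enc-merchantbin&rh=n%3A<_category id_>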
All Geared Up To Extract Links Of Sellers Now?
If you have understood the process of creating and changing start URLs, we can now move further to use them.
The spider to extract the links of sellers is created by the method we discussed above; here it is named seller_links.py.
We have created 3 functions in seller_links.py to make the process faster and smoother, instead of writing all the code in one function:
1. parse()
In this function, a request is sent to the seller-list page for each category-id. See the snippet below.
for amazon_category in amazon_categories:
    yield scrapy.Request(
        url="https://www.amazon.de/mn/search/other?_encoding=UTF8&language=en_GB"
            "&page=1&pickerToList=enc-merchantbin&rh=n%3A" + amazon_category,
        callback=self.parse_next,
        method="GET",
        meta={'amazon_category': amazon_category}
    )
All category-ids are stored in the variable amazon_categories.
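For illustration only, amazon_categories might be a plain list of node-id strings gathered from the category pages; the ids below are made-up placeholders, not real Amazon node ids:

# Hypothetical category (node) ids collected from Amazon's category pages
amazon_categories = ['123456789', '234567891', '345678912']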
2. parse_next()
This function mainly executes two tasks:
1) Filters out categories whose products are sold by Amazon itself.
2) Otherwise, collects the alphabet-index links given on the page; each letter leads to the seller links starting with that letter.
This is needed because the page itself shows only the Top sellers for that particular category. A sketch of this function is shown below.
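This is a rough, hypothetical sketch of what such a function might look like; the index-bar class name is borrowed from seller_link_extract below, and the "no index bar means Amazon sells it itself" check is our assumption, not necessarily the exact filtering logic:

def parse_next(self, response):
    # Letter-index links ("A", "B", ...) that lead to sellers whose
    # names start with that letter
    index_links = response.xpath(
        '//*[@class="s-see-all-indexbar-column"]//a/@href').extract()
    # Assumption: categories sold by Amazon itself have no seller index bar
    if not index_links:
        return
    for href in index_links:
        yield response.follow(href, callback=self.seller_link_extract,
                              meta=response.meta)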
3. seller_link_extract()
The response of the seller-list page is parsed, and the following details are extracted from it:
- Seller-category
- Seller-name
- Seller-page-url
def seller_link_extract(self, response):
    item = AmazonSellerLinkItem()
    item['seller_category'] = "_".join(response.xpath(
        '//*[@class="a-row a-spacing-base"]//a/text()').extract())
    sellers = response.xpath('//*[@class="s-see-all-indexbar-column"]//a')
    for seller in sellers:
        item['seller_name'] = seller.xpath('./@title').extract_first()
        # The seller id sits after "6%3A" in the link's href
        item['sellers_page_url'] = ("https://www.amazon.de/sp?_encoding=UTF8&asin="
            "&isCBA=&marketplaceID=&orderID=&seller="
            + str(seller.xpath('./@href').re('6%3A(.*?)&')[0]))
        yield item
Let’s extract the Seller Details
The spider created for seller information extraction is get_seller_info.py.
As we have stored all the URLs, we can now use them to extract the sellers' data. There are a total of 2 functions in the spider created for data extraction.
1. start_requests()
Fetches all the stored links and sends a request for each one to the next function. That's it.
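A minimal sketch of that idea, assuming the seller links were saved in the seller_list table with status 'pending' (as the databaseconfig.py schema suggests):

def start_requests(self):
    # Read every seller-page URL still marked 'pending' from MySQL
    con = pymysql.connect(host=dbc.host, user=dbc.username,
                          password=dbc.passwd, database=dbc.db)
    with con.cursor() as cur:
        cur.execute("SELECT seller_url FROM seller_list WHERE status = 'pending'")
        for (seller_url,) in cur.fetchall():
            yield scrapy.Request(url=seller_url, callback=self.parse)
    con.close()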
2. parse()
- The response we get from the final seller-page link is parsed here. The details needed are obtained with XPath, and the data is cleaned with Regular Expressions where needed. Here is a snippet of the function parse().
try:
    merchantinfo['BusinessAddress'] = ",".join(response.xpath(
        '//*[contains(text(),"Business Address:") or '
        'contains(text(),"Geschäftsadresse:")]/following-sibling::ul//text()').extract())
except Exception as e:
    merchantinfo['BusinessAddress'] = ''
try:
    merchantinfo['PhoneNumber'] = response.xpath(
        '//*[@class="a-column a-span6"]//*[contains(text(),"Phone number:") or '
        'contains(text(),"Telefonnummer:")]/parent::span/text()').extract_first()
    if not merchantinfo['PhoneNumber']:
        # Fall back to regex over the raw body, capturing up to the next tag
        try:
            merchantinfo['PhoneNumber'] = re.findall(rb"Telefon:(.*?)<", response.body)[0]
        except:
            merchantinfo['PhoneNumber'] = re.findall(rb"Tel\+Fax\.:(.*?)<", response.body)[0]
except Exception as e:
    merchantinfo['PhoneNumber'] = ''
- To export all the extracted data in .csv format, the pandas module is used. You can also add your own headers to the csv. Below is a snippet for that.
try:
    con = pymysql.connect(host=dbc.host, user=dbc.username, password=dbc.passwd)
    Name = dbc.csv + 'amazon_data.csv'
    qry = "select * from amazon_seller.seller_info"
    df = pd.read_sql(qry, con)
    df.columns = ['Id', 'SellerName', 'Category', 'SellerPage',
                  'BusinessAddress', 'PhoneNumber', 'Email']
    df.to_csv(Name, index=None)
    print('CSV file generated')
except Exception as e:
    print(e)
- After the spider finishes processing, a csv file will be generated at the given path.
Conclusion
That's it! The sellers' information has now been extracted. In this way, you can run the code and extract seller information from any Amazon marketplace.
However, there are some things you need to take care of during the final execution. Amazon, on any of its marketplaces, does not always respond as quickly and smoothly as we would like. Hence, you may need to try various proxies or a list of user-agents, apply them dynamically/randomly, and see what works; I have used both here, alternately.
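For example, one lightweight way to rotate the user-agent is to pick a random value from a small pool for every request. The pool below is only an illustration; in practice you would keep a longer list, and rotate proxies in a similar fashion:

import random
import scrapy

# A small pool of ordinary browser user-agent strings (illustrative)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def request_with_random_ua(url, callback):
    # Attach a randomly picked User-Agent header to the outgoing request
    return scrapy.Request(url=url, callback=callback,
                          headers={'User-Agent': random.choice(USER_AGENTS)})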
Second, and particular to this requirement: phone numbers and emails vary enormously in type and pattern, so make sure you have them covered by all means. For phone numbers, a module called "phonenumbers" is used; give it a look.
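As a quick example of what phonenumbers can do with a scraped German number (the sample number below is made up):

import phonenumbers

raw = '030 1234567'  # a made-up number, as it might appear on a seller page
number = phonenumbers.parse(raw, 'DE')  # 'DE' region hint, since we scrape amazon.de
if phonenumbers.is_possible_number(number):
    # Normalize to E.164, e.g. +49301234567
    print(phonenumbers.format_number(number, phonenumbers.PhoneNumberFormat.E164))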
We hope this tutorial gave you a better idea of how to scrape seller information from Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help you.