BigBasket Web Scraping Made Easy Using Python

In this tutorial blog, we will explore web scraping techniques that let us extract useful data from websites using Python's BeautifulSoup library.

What is Web Scraping?

By definition, web scraping is the process of gathering large amounts of data from web pages and storing it in whatever format is needed for further analysis. BeautifulSoup is a Python package for parsing HTML and XML documents, which makes scraping data very easy.

Here are the steps we will follow to scrape data with Python:

  • First, pick the URL of the website to scrape.
  • Inspect the page.
  • Find the data we wish to scrape.
  • Write the Python code and run it.
  • Store the data in the required format.

BeautifulSoup is a popular web scraping library that parses HTML or XML content from web pages.

In this walkthrough, we will work through the two scraping exercises below, exploring more of BeautifulSoup's functionality along the way:

  1. Scrape the BigBasket grocery site for product data and store the data in a CSV or JSON file.
  2. Scrape tabular data from a website and load it into a DataFrame using pandas.

Before we move ahead, let's go through the fundamentals of HTML and how to inspect a web page yourself.

HTML is the markup language used to structure a web page. It provides tags such as <li> for list items, <div> for divisions, <p> for paragraphs, and more.

 
[Image: a sample HTML document]

Follow these steps to inspect a web page:

First, open the website URL in your browser.

Right-click on the page and choose ‘Inspect’.

A panel called ‘Chrome DevTools’ will open at the side of the page, where you can see the HTML of the web page.

How to Scrape Bigbasket Data Easily Using Python?

In the code, we will use the requests library to download the HTML content of the given website and BeautifulSoup to parse it.

Let's get started!

1. Scrape the BigBasket Website:

Here, we will walk step by step through scraping product data such as the product name, quantity, brand name, price, and product description from the site with BeautifulSoup, and store the data in a CSV file.

Step 1: Install and import the required libraries into a Jupyter notebook.

pip install beautifulsoup4
pip install requests
from bs4 import BeautifulSoup as bs
import requests   # the requests module fetches pages over HTTP

Step 2: Define the list of EAN codes for which we need to scrape data and assign it to a variable named ‘eanCodeLists’.

eanCodeLists = [126906,40139631,40041188,40075201,40053874,1204742,40046735,40100963,40067874,40045943]

Let's take the EAN code 40053874 to get the product's name, quantity, brand name, price, description, etc., and later use a for loop to iterate over the full list and collect data for all the products.

Step 3: Next, open the URL with the requests.get() method, which makes an HTTP request to the web page.

urlopen = requests.get('https://www.bigbasket.com/pd/40053874').text

Step 4: Use BeautifulSoup to parse the HTML and assign the result to a variable named ‘soup’.

soup = bs(urlopen,'html.parser')
Output:
[Image: the parsed HTML output]

Step 5: Next, open the URL https://www.bigbasket.com/pd/40053874 in the browser, right-click on the content we need, and find the corresponding HTML tags. We will then use these tags in the code to locate the required data.

[Image: inspecting the BigBasket product page in the browser]

Now right-click on the field ‘Weikfield Chilli Vinegar, 200 g’ to get its tag name. This single heading gives us the brand name, product name, and quantity. Please see the image below.

[Image: the inspected heading element]
<h1 class="GrE04" style="-webkit-line-clamp:initial">Weikfield Chilli Vinegar, 200 g </h1>

It's time to use BeautifulSoup to reference these tags and assign the result to a variable named ‘ProductInfo’.

ProductInfo = soup.find("h1", {"class": "GrE04"}).text  # .text will give us the text underlying that HTML element

Step 6: Now use the split() method to extract the individual fields.

Here, ProductInfo.split(' ', 1)[1] gives ‘Chilli Vinegar, 200 g ’, and splitting that on a comma with split(',')[0] gives ‘Chilli Vinegar’.
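The splits above can be sketched end to end like this, using the heading string scraped in Step 5:

```python
# Sketch of Step 6: carving brand, name and quantity out of the heading.
ProductInfo = 'Weikfield Chilli Vinegar, 200 g '

BrandName = ProductInfo.split(' ', 1)[0]    # 'Weikfield'
rest = ProductInfo.split(' ', 1)[1]         # 'Chilli Vinegar, 200 g '
ProductName = rest.split(',')[0]            # 'Chilli Vinegar'
ProductQty = rest.split(',')[1].strip()     # '200 g'

print(BrandName, ProductName, ProductQty)
```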

Step 7: Get the price and the product description.

Price field tag: <td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>
Product description field tag: <div class="_26MFu "><style ...
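As a hedged sketch, the two fields can be pulled out as below. The class names ‘IyLvo’ and ‘_26MFu’ come from the inspected page above and may change, since BigBasket generates them dynamically; a trimmed inline snippet stands in for the live page here:

```python
# Sketch of Step 7: extracting price and description from the parsed page.
from bs4 import BeautifulSoup as bs

# A trimmed stand-in for the live page, copied from the tags shown above.
html = '''
<td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>
<div class="_26MFu ">The spiciness of a fresh green chilli diffusing its
heat into sharp vinegar makes this a unique fusion of spicy and sour.</div>
'''
soup = bs(html, 'html.parser')

# Normalise whitespace, since the empty comment splits the price text.
ProductPrice = ' '.join(soup.find('td', {'class': 'IyLvo'}).text.split())
ProductDesc = soup.find('div', {'class': '_26MFu'}).text.strip()
print(ProductPrice)
```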

So, we now have:

ProductName= Chilli Vinegar
BrandName= Weikfield
ProductQty = 200 g
ProductPrice= Rs 35
ProductDesc = The spiciness of a fresh green chilli diffusing its heat into sharp vinegar makes this spicy vinegar a unique fusion of spicy and sour notes.

Step 8: Now we can collect the data for all the EAN codes using a for loop.

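The loop can be sketched as follows. The helper names parse_product and scrape_all are mine, not from the original; the ‘GrE04’ class is the heading class inspected in Step 5 and may change over time:

```python
# Sketch of Step 8: loop over every EAN code and collect the product data.
import requests
from bs4 import BeautifulSoup as bs

eanCodeLists = [126906, 40139631, 40041188, 40075201, 40053874, 1204742,
                40046735, 40100963, 40067874, 40045943]

def parse_product(html):
    """Extract brand, name and quantity from a product page's heading."""
    soup = bs(html, 'html.parser')
    heading = soup.find('h1', {'class': 'GrE04'}).text
    brand, rest = heading.split(' ', 1)
    name, qty = rest.split(',', 1)
    return {'BrandName': brand,
            'ProductName': name.strip(),
            'ProductQty': qty.strip()}

def scrape_all(ean_codes):
    """Fetch and parse every product page (needs network access)."""
    products = []
    for ean in ean_codes:
        page = requests.get('https://www.bigbasket.com/pd/%s' % ean).text
        products.append(parse_product(page))
    return products

# products = scrape_all(eanCodeLists)  # uncomment to fetch the live data
```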

Step 9: Use pandas to store the data in a DataFrame.

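A minimal sketch, assuming the per-product dictionaries collected in Step 8; the sample row mirrors the Weikfield product above:

```python
# Sketch of Step 9: load the scraped rows into a pandas DataFrame.
import pandas as pd

products = [
    {'EANCode': 40053874, 'BrandName': 'Weikfield',
     'ProductName': 'Chilli Vinegar', 'ProductQty': '200 g',
     'ProductPrice': 'Rs 35'},
]
df = pd.DataFrame(products)
print(df.head())
```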

Step 10: Finally, save the data as JSON and CSV files in the local directory.

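A hedged sketch of the save step; the file names are my choice, not from the original screenshots:

```python
# Sketch of Step 10: save the DataFrame locally as CSV and JSON.
import pandas as pd

df = pd.DataFrame([{'BrandName': 'Weikfield',
                    'ProductName': 'Chilli Vinegar',
                    'ProductPrice': 'Rs 35'}])
df.to_csv('bigbasket_products.csv', index=False)           # CSV copy
df.to_json('bigbasket_products.json', orient='records')    # JSON copy
```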
2. Scrape Data in Tabular Format

Here, we will scrape the page https://www.ssa.gov/OACT/babynames/decades/names2010s.html, which presents, in tabular format, the 200 most popular names for male and female babies born in the USA between 2010 and 2018. (This is sample data based on Social Security card application data as of March 2019.)

Step 1: Import the libraries and use BeautifulSoup to parse the HTML content.

import requests
from bs4 import BeautifulSoup as bs
url = requests.get('https://www.ssa.gov/OACT/babynames/decades/names2010s.html').text
soup = bs(url,'html.parser')

Step 2: Let's use the <table class="t-stripe"> tag to scrape the table data.

table_content = soup.find('table',{'class':'t-stripe'})

Here, we use the tag names ‘td’ (a table data cell), ‘th’ (a table header), and ‘tr’ (a table row). We will pull the ‘tr’ tags out of ‘table_content’; each row holds a combination of ‘th’ and ‘td’ cells.

data = table_content.findAll('tr')[0:202]  # the header row plus the 200 data rows

Step 3: Now use a loop to iterate over those rows and collect the data in a list variable named ‘rows_data’.

First, let's check the length of ‘data’ with len(data).
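The loop can be sketched as below; a trimmed two-row table stands in for the live SSA page here, and the counts are placeholders, not the live figures:

```python
# Sketch of Step 3: walk the <tr> rows and collect each cell's text.
from bs4 import BeautifulSoup as bs

html = '''
<table class="t-stripe">
  <tr><th>Rank</th><th>Male name</th><th>Number of males</th>
      <th>Female name</th><th>Number of females</th></tr>
  <tr><td>1</td><td>Noah</td><td>163661</td><td>Emma</td><td>164034</td></tr>
</table>
'''
table_content = bs(html, 'html.parser').find('table', {'class': 't-stripe'})
data = table_content.findAll('tr')

rows_data = []
for row in data:
    cells = row.findAll(['th', 'td'])   # the header row uses th, the rest td
    rows_data.append([cell.text.strip() for cell in cells])

print(len(rows_data))
print(rows_data[1])
```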

Step 4: Now use pandas to store the data in a DataFrame.

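A minimal sketch of building the DataFrame from ‘rows_data’, using the first row as the header; the renamed columns match the Male_Name/Male_Number/Rank fields used in Step 5, and the sample values are placeholders:

```python
# Sketch of Step 4: turn rows_data into a DataFrame, skipping the header row.
import pandas as pd

rows_data = [
    ['Rank', 'Male name', 'Number of males', 'Female name', 'Number of females'],
    ['1', 'Noah', '163661', 'Emma', '164034'],       # placeholder counts
    ['2', 'Liam', '144999', 'Olivia', '152285'],     # placeholder counts
]
df = pd.DataFrame(rows_data[1:],
                  columns=['Rank', 'Male_Name', 'Male_Number',
                           'Female_Name', 'Female_Number'])
print(df.head())
```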

Step 5: Now we can run a few operations on this data and pull out some insights.

Let's see how many babies were given the name ‘Samuel’ and where it ranks.

df[df['Male_Name'] == 'Samuel'][['Male_Name','Male_Number','Rank']]
Conclusion

That's how we can use Python to scrape a website and extract useful data, which can then be used for analysis. Some important use cases for web scraping include:

  • Business and market analysis, e-commerce, competitor monitoring, price comparison
  • Collecting data from different sources for analysis
  • Getting the latest news reports
  • Marketing
  • Media
  • Travel companies collecting live tracking data
  • Weather forecasting
