How Python Is Used to Extract Amazon Audible Books Data

Defining Web Scraping

Web scraping (also known as data scraping) is a method of collecting data from the web. The data is usually saved to a local file so that it can be cleaned and analyzed as needed. Web scraping is essentially the same as copying and pasting material from a website into an Excel spreadsheet, but automated and on a much larger scale.

Python is a popular language for scraping data from websites such as Amazon, and it offers various libraries for the job, such as BeautifulSoup and Scrapy. However, before scraping any website, please read its terms and conditions.

Audible is an Amazon-owned online audiobook and podcast service based in the United States that lets customers buy and listen to audiobooks and other spoken-word material.

Now let’s scrape Audible audiobook information in the Business and Careers category. Visit audible.in and explore the structure of the page by right-clicking anywhere on it and selecting the ‘Inspect’ option, which is available in practically all current browsers.

The Requests, BeautifulSoup4, and Pandas libraries will be used. Please review their documentation if you are unsure about any of them.

Installing the Required Python Packages and Importing Them

!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet

import requests as rq
from bs4 import BeautifulSoup as bs
import pandas as pd

Downloading the WebPage Using Requests

Each results page holds data for 50 audio books (the pageSize parameter in the URL), and there are almost 25 pages. The function below downloads one page of results.

def get_pageno(pageno):
    pageno= str(pageno)
    # Construct the URL
    books_pageno_url = 'https://www.audible.in/search?node=21881793031&pageSize=50&sort=&page=' + pageno
    
    # Get the HTML page content using requests
    response = rq.get(books_pageno_url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + books_pageno_url)
    
    # Construct a BeautifulSoup document; naming the parser avoids a parser-guessing warning
    doc = bs(response.text, 'html.parser')
    
    return doc

In the above code,

  • The search results page is downloaded using requests (imported as rq).
  • The request is validated by checking that .status_code equals 200.
  • The web page is parsed with BeautifulSoup (imported as bs) and stored in ‘doc’.
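
The URL built inside get_pageno can also be assembled from named query parameters, which makes the page size and page number easier to change. The sketch below uses only the standard library. build_search_url and BASE_SEARCH_URL are illustrative names that do not appear in the original script; the node and pageSize values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = 'https://www.audible.in/search'  # same endpoint as in get_pageno


def build_search_url(pageno, node='21881793031', page_size=50):
    """Return the search URL for one results page (hypothetical helper)."""
    params = {
        'node': node,            # category id taken from the URL above
        'pageSize': page_size,   # number of books requested per page
        'sort': '',              # left empty, as in the original URL
        'page': pageno,
    }
    return BASE_SEARCH_URL + '?' + urlencode(params)


print(build_search_url(3))
```

The returned string can be passed to rq.get() exactly as books_pageno_url is in get_pageno.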

Data Extraction

Let’s obtain the first page using the get_pageno(pageno) method and scrape the data from it.

Each <li class='bc-list-item productListItem'> tag holds the data for a single book, such as the Book Name, Ratings, Price, Cover Image, Author, Length, Language, and Links.
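
To make the structure concrete, the sketch below runs the same find_all call that is used later in this article on a small hand-written HTML fragment instead of the live page. The fragment is an invented stand-in that only mimics the class names; the real page contains far more markup.

```python
from bs4 import BeautifulSoup as bs

# Invented sample markup: two product items in the shape described above.
sample_html = """
<ul>
  <li class="bc-list-item productListItem">
    <h3><a href="/pd/Girlboss-Audiobook/B07B4GRNRB">Girlboss</a></h3>
  </li>
  <li class="bc-list-item productListItem">
    <h3><a href="/pd/Warren-Buffett-Audiobook/1663712441">Warren Buffett</a></h3>
  </li>
</ul>
"""

doc = bs(sample_html, 'html.parser')

# One <li> per book; every extraction function in this article loops over this list.
book_contents = doc.find_all('li', class_='bc-list-item productListItem')
print(len(book_contents))
```

On the live site, doc would come from get_pageno(1) rather than from a string.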

1. Book Title

The name is the text of the <a> tag inside each item’s <h3> heading, the same tag that get_book_links reads in the next step. The get_book_names(book_contents) function, which is called when all pages are parsed, collects these names into a list.
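
The listing for get_book_names is not included here, so the following is only a sketch of what it might look like. It assumes the title is the text of the first <a> directly inside each item's <h3>, which is the tag that get_book_links reads. The sample HTML is invented for the demonstration.

```python
from bs4 import BeautifulSoup as bs


def get_book_names(book_contents):
    """Collect the title text of each book item (assumed structure: li > h3 > a)."""
    book_names = []
    for tag in book_contents:
        try:
            a_tags = tag.h3.find_all('a', recursive=False)
            book_names.append(a_tags[0].text.strip())
        except (AttributeError, IndexError):
            # No <h3>, or no link inside it: keep the row and mark the title as missing.
            book_names.append(None)
    return book_names


# Invented sample markup for a quick check.
sample_html = """
<li class="bc-list-item productListItem"><h3><a href="/pd/Girlboss-Audiobook/B07B4GRNRB"> Girlboss </a></h3></li>
<li class="bc-list-item productListItem"><p>no heading here</p></li>
"""
book_contents = bs(sample_html, 'html.parser').find_all(
    'li', class_='bc-list-item productListItem')
print(get_book_names(book_contents))
```

Appending None for a missing title keeps this list the same length as the lists produced by the other extractors.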

2. Book URLs

The ‘href’ attribute of the same <a> tag carries the path to the book’s page. Using the code below, we can combine it with the site’s base URL to get each audio book’s full URL.

def get_book_links(book_contents):
    base_url='https://www.audible.in'
    book_links=[]
    for tag in book_contents:
        a_tag_name= tag.h3.find_all('a', recursive=False)
        url= a_tag_name[0]['href'].strip()
        book_link= base_url+url
        book_links.append(book_link)
    return book_links
 
get_book_links(book_contents)
['https://www.audible.in/pd/Project-Management-Podcast/B08JKMDQXR',
 'https://www.audible.in/pd/Podcast/B08JKMBWLR',
 'https://www.audible.in/pd/Talk-with-Ted-A-Show-About-Nothing-Podcast/B08JKMB9FQ',
 'https://www.audible.in/pd/Communication-Charm-Influence-Negotiation-Presence-Charisma-Podcast/B08JKLZYL4',
 'https://www.audible.in/pd/The-Elephant-in-the-Room-Podcast/B08JKLZ2NV',
 'https://www.audible.in/pd/Podcast/B08JKLDGTK',
 'https://www.audible.in/pd/The-SaaS-Podcast-SaaS-Startups-Growth-Hacking-Entrepreneurship-Podcast/B08JKL8Z8M',
 'https://www.audible.in/pd/Podcast/B08JKL6BVL',
 'https://www.audible.in/pd/The-Case-Interview-Podcast-Podcast/B08JKKVH8T',
 'https://www.audible.in/pd/Augmented-the-industry-40-podcast-Podcast/B08JKKPMQK',
 'https://www.audible.in/pd/Machine-Learning-Engineered-Podcast/B08JKKMMC4',
 'https://www.audible.in/pd/a16z-Podcast-Podcast/B08JKKCL3M',
 'https://www.audible.in/pd/Adventures-in-Machine-Learning-Podcast/B08JKKBLYJ',
 'https://www.audible.in/pd/Startup-Era-Show-Startup-Business-Entrepreneurship-Digital-Marketing-Podcast/B08JKK1L5B',
 'https://www.audible.in/pd/Classic-Influence-Podcast-Timeless-Lessons-from-the-Legends-Podcast/B08JKJXWR8',
 'https://www.audible.in/pd/Case-Interview-Preparation-Management-Consulting-Strategy-Critical-Thinking-Podcast/B08JKJVY5Y',
 'https://www.audible.in/pd/The-McKinsey-Podcast-Podcast/B08JKJG4S4',
 'https://www.audible.in/pd/SEO-Secrets-for-Explosive-Growth-in-Website-Traffic-Leads-and-Sales-from-Search-Podcast/B08JKJBPNJ',
 'https://www.audible.in/pd/Girlboss-Audiobook/B07B4GRNRB',
 'https://www.audible.in/pd/Warren-Buffett-Audiobook/1663712441',
 'https://www.audible.in/pd/Negocie-como-se-sua-vida-dependesse-disso-Never-Split-the-Difference-Audiobook/B09SGRXH57',
 'https://www.audible.in/pd/Hire-with-Your-Head-4th-Edition-Audiobook/B09HDRJLNS',
 'https://www.audible.in/pd/What-It-Means-to-Be-a-Courageous-Leader-Audiobook/B09FQG2NJ4',
 'https://www.audible.in/pd/Hyperfocus-German-edition-Audiobook/3962673725',
 'https://www.audible.in/pd/The-Innovation-Ultimatum-Audiobook/B088C1PKLB',
 'https://www.audible.in/pd/Frontiers-in-Social-Innovation-Audiobook/B09SRDN2M7',
 'https://www.audible.in/pd/Converted-Audiobook/B097CJTBF8',
 'https://www.audible.in/pd/Woke-Inc-Audiobook/B09FB4THKC',
 'https://www.audible.in/pd/The-Work-Life-Balance-Myth-Audiobook/B09CHBWZXQ',
 'https://www.audible.in/pd/The-Antidote-to-Suffering-Audiobook/1639297235',
 'https://www.audible.in/pd/Think-Like-Zuck-Audiobook/1638418853',
 'https://www.audible.in/pd/Ganbatte-Audiobook/B09W358N9Q',
 'https://www.audible.in/pd/What-Is-Six-Sigma-Audiobook/1638419590',
 'https://www.audible.in/pd/Market-Research-Like-a-Pro-Audiobook/B09HL69XQB',
 'https://www.audible.in/pd/Your-Greatest-Asset-Audiobook/B09GL64SRN',
 'https://www.audible.in/pd/Diversity-Intelligence-Audiobook/B09HN87FYQ',
 'https://www.audible.in/pd/Como-chegar-ao-sim-Getting-to-Yes-Audiobook/B09SGS2ZKW',
 'https://www.audible.in/pd/The-Man-Who-Mistook-His-Job-for-His-Life-Audiobook/B098KG39LL',
 'https://www.audible.in/pd/From-Paycheck-to-Purpose-Audiobook/B09KM7XPL2',
 'https://www.audible.in/pd/The-Business-Playbook-Audiobook/B09S8HL9K5',
 'https://www.audible.in/pd/The-Samsung-Way-Audiobook/1638418217',
 'https://www.audible.in/pd/Emotional-Intelligence-Audiobook/B09DJ1N6T6',
 'https://www.audible.in/pd/The-Dumb-Things-Smart-People-Do-with-Their-Money-Audiobook/B07MCW1PCZ',
 'https://www.audible.in/pd/Reach-for-a-Star-and-Generate-Ideas-for-Innovation-Audiobook/B09TFJ8HTJ',
 'https://www.audible.in/pd/Audiobook/B09SNPTMV7',
 'https://www.audible.in/pd/Savoir-se-vendre-Know-How-to-Sell-Audiobook/B09SBV4TZD',
 'https://www.audible.in/pd/How-To-Get-Your-Act-Together-Audiobook/0241553563',
 'https://www.audible.in/pd/The-8th-Habit-Live-Audiobook/B079VJQRJR',
 'https://www.audible.in/pd/Indian-Startup-Stories-Podcast/B08K5JP1NY',
 'https://www.audible.in/pd/Profit-over-Privacy-Audiobook/B09VMPWMPX']
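
String concatenation works as long as every href is a site-relative path. As an alternative sketch, urllib.parse.urljoin from the standard library also copes with hrefs that are already absolute. to_absolute is an illustrative helper name, not part of the original script.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.audible.in'


def to_absolute(href):
    """Resolve an href against the site root (hypothetical helper)."""
    return urljoin(BASE_URL, href.strip())


print(to_absolute('/pd/Girlboss-Audiobook/B07B4GRNRB'))
print(to_absolute('https://www.audible.in/pd/Woke-Inc-Audiobook/B09FB4THKC'))
```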

3. Audio Books Duration

There is another <li> tag; this time class='bc-list-item runtimeLabel' must be given to select it. Nested inside that tag is a <span> whose text holds the duration of the audio book. The end of the resulting list looks like this:

None,
 None,
 None,
 None,
 'Length: 4 hrs and 40 mins',
 'Length: 9 hrs and 50 mins',
 'Length: 9 hrs and 30 mins',
 'Length: 9 hrs and 56 mins',
 'Length: 31 mins',
 'Length: 7 hrs and 40 mins',
 'Length: 10 hrs and 54 mins',
 'Length: 14 hrs and 33 mins',
 'Length: 3 hrs and 19 mins',
 'Length: 10 hrs and 26 mins',
 'Length: 7 hrs and 6 mins',
 'Length: 7 hrs and 14 mins',
 'Length: 6 hrs and 31 mins',
 'Length: 2 hrs and 54 mins',
 'Length: 2 hrs and 17 mins',
 'Length: 48 mins',
 'Length: 4 hrs and 4 mins',
 'Length: 7 hrs and 21 mins',
 'Length: 8 hrs and 6 mins',
 'Length: 8 hrs and 45 mins',
 'Length: 6 hrs and 4 mins',
 'Length: 3 hrs and 17 mins',
 'Length: 8 hrs and 45 mins',
 'Length: 6 hrs and 38 mins',
 'Length: 9 hrs and 16 mins',
 'Length: 4 hrs and 19 mins',
 'Length: 34 hrs and 56 mins',
 'Length: 3 hrs and 6 mins',
 'Length: 6 hrs and 59 mins',
 'Length: 45 mins',
 None,
 'Length: 6 hrs and 52 mins']

We’ve gathered all the lengths using the get_book_length(book_contents) method. If the duration isn’t available for a book, wrapping the lookup in try and except avoids an error and lets us record None instead.
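
The function itself is not reproduced above. A minimal sketch, assuming it follows the same pattern as the other extractors in this article (find the <li>, read its <span>, fall back to None), could look like this. The sample HTML is invented.

```python
from bs4 import BeautifulSoup as bs


def get_book_length(book_contents):
    """Return the runtime text of each book, or None where it is missing."""
    book_length = []
    for tag in book_contents:
        length_tag = tag.find('li', class_='bc-list-item runtimeLabel')
        try:
            # length_tag is None when the item has no runtime,
            # so calling .find on it raises AttributeError.
            book_length.append(length_tag.find('span').text.strip())
        except AttributeError:
            book_length.append(None)
    return book_length


# Invented sample markup: one item with a runtime, one without.
sample_html = """
<li class="bc-list-item productListItem">
  <ul><li class="bc-list-item runtimeLabel"><span>Length: 4 hrs and 40 mins</span></li></ul>
</li>
<li class="bc-list-item productListItem"><ul></ul></li>
"""
doc = bs(sample_html, 'html.parser')
book_contents = doc.find_all('li', class_='bc-list-item productListItem')
print(get_book_length(book_contents))
```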

4. Book Authors

There is another <li> tag, similar to the one for the book length, but this time class='bc-list-item authorLabel' must be given to select it. Nested inside it is a <span> that holds the author’s name.

We gathered all of the author names using the get_written_by(book_contents) method.
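
The author extractor is not shown either. The sample output at the end of this article lists bare author names, so the sketch below assumes the <span> text carries a label followed by a colon and keeps only the part after it. That label, the "By:" wording in the sample HTML, and the sample HTML itself are assumptions made for illustration.

```python
from bs4 import BeautifulSoup as bs


def get_written_by(book_contents):
    """Return the author text of each book, or None where it is missing."""
    written_by = []
    for tag in book_contents:
        author_tag = tag.find('li', class_='bc-list-item authorLabel')
        try:
            # Collapse the line breaks and repeated spaces found in page markup.
            text = ' '.join(author_tag.find('span').text.split())
            # Assumption: the span reads like "By: Name"; keep what follows the colon.
            written_by.append(text.split(':', 1)[-1].strip())
        except AttributeError:
            written_by.append(None)
    return written_by


# Invented sample markup: one item with an author, one without.
sample_html = """
<li class="bc-list-item productListItem">
  <ul><li class="bc-list-item authorLabel"><span>By:
    <a href="#">Robin Sharma</a></span></li></ul>
</li>
<li class="bc-list-item productListItem"><ul></ul></li>
"""
doc = bs(sample_html, 'html.parser')
book_contents = doc.find_all('li', class_='bc-list-item productListItem')
print(get_written_by(book_contents))
```

If the text contains no colon, the whole string is kept unchanged.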

5. Audio Book Description

The description sits in a <span> nested inside one of the book’s list-item tags. Let us fetch the information using the script below.

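    # Assumption: the selector is missing from this listing; the class name below is a guess.
    def get_description(book_contents):
        description = []
        for tag in book_contents:
            about_tag = tag.find('li', class_='bc-list-item subtitle')  # assumed class name
            try:
                description_tag = about_tag.find('span').text.strip()
                description.append(description_tag)
            except AttributeError:
                description.append(None)
        return description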
     
    get_description(book_contents)
    [None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     'Inside the Ultimate Money Mind',
     "Um ex-agente do FBI revela as técnicas da agência para convencer as pessoas [A Former FBI Agent Reveals the Agency's Techniques to Convince People]",
     'Using Performance-Based Hiring to Build Outstanding Diverse Teams',
     None,
     'Wie man weniger arbeitet und mehr erreicht',
     'How Six Strategic Technologies Will Reshape Every Business in the 2020s',
     'The Essential Handbook for Creating, Deploying, and Sustaining Creative Solutions to Systemic Problems',
     "The Data-Driven Way to Win Customers' Hearts",
     'Inside the Social Justice Scam',
     'Rethinking Your Optimal Balance for Success',
     'How Compassionate Connected Care Can Improve Safety, Quality, and Experience',
     'The Japanese Art of Always Moving Forward',
     None,
     'The Easiest Guide to Market Research',
     'Creative Vision and Empowered Communication',
     'How to Create a Culture of Inclusion for Your Business',
     'Como negociar acordos sem fazer concessões [How to Negotiate Agreements Without Making Concessions]',
     'How to Thrive at Work by Leaving Your Emotional Baggage Behind',
     'The Clear Path to Doing Work You Love',
     'How to Document and Delegate What You Do So Your Company Can Grow Beyond You',
     'Transformational Management Strategies from the World Leader in Innovation and Design',
     'A Simple and Actionable Guide to Increasing Performance, Engagement and Ownership',
     'Thirteen Ways to Right Your Financial Wrongs',
     'Tools for Creating Your Innovation Project for Business or School',
     'オンラインコース制作はこれ一冊',
     'Le plus grand vendeur du monde [The Biggest Seller in the World]',
     'A Judgement-Free Guide to Diversity and Inclusion for Straight White Men',
     None,
     None,
     'How Surveillance Advertising Conquered the Internet']
    

    We gathered all the descriptions using the get_description(book_contents) method.

     

    6. Audio Book Language

    The language field is represented by <li class='bc-list-item languageLabel'> and the <span> inside it. Let’s use the function below to get this information.

    def get_language(book_contents):
        language=[]
        for tag in book_contents:
            lang_tag= tag.find('li', class_='bc-list-item languageLabel')
            try:
                language_tag = lang_tag.find('span').text.split()
                language.append(language_tag)
            except AttributeError:
                language.append(None)
        return language
    get_language(book_contents)
    [None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'portuguese'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'german'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'portuguese'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     ['Language:', 'japanese'],
     ['Language:', 'french'],
     ['Language:', 'English'],
     ['Language:', 'English'],
     None,
     ['Language:', 'English']]
    

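    We gathered all the audio book languages using the get_language(book_contents) method. Because the function calls split(), each value is a list of tokens such as ['Language:', 'English'].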
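
Because get_language returns token lists rather than plain strings, a small post-processing step can be handy before analysis. The helper below is a sketch with an illustrative name (clean_language); it drops the 'Language:' label and normalizes the capitalization seen in the output above.

```python
def clean_language(tokens):
    """Turn ['Language:', 'portuguese'] into 'Portuguese'; pass None through."""
    if not tokens:
        return None
    # Drop the leading label and rejoin whatever remains.
    words = [t for t in tokens if t != 'Language:']
    return ' '.join(words).title() if words else None


print(clean_language(['Language:', 'portuguese']))
print(clean_language(None))
```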

     

    7. Book Ratings

    <li class='bc-list-item ratingsLabel'> and <span class='bc-text bc-pub-offscreen'> represent the ratings field. Let’s retrieve the number of stars using the function below.

    def get_rating(book_contents):
        rating=[]
        for tag in book_contents:
            star_tag= tag.find('li', class_='bc-list-item ratingsLabel')
            try:
                rating_tag = star_tag.find('span', class_='bc-text bc-pub-offscreen').text.strip()
                rating.append(rating_tag)
            except AttributeError:
                rating.append(None)
        return rating
    get_rating(book_contents)
    [None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     '4 out of 5 stars',
     '4.5 out of 5 stars',
     None,
     None,
     '3.5 out of 5 stars',
     None,
     '5 out of 5 stars',
     None,
     None,
     '5 out of 5 stars',
     None,
     None,
     None,
     None,
     '5 out of 5 stars',
     '4 out of 5 stars',
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     '4.5 out of 5 stars',
     None,
     None,
     None,
     None,
     '4 out of 5 stars',
     None,
     None]
    

    We found the number of stars for each book using the get_rating(book_contents) method.
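
The star ratings come back as text such as '4.5 out of 5 stars'. For sorting or averaging, a numeric value is more useful. The following sketch (stars_to_float is an illustrative name) pulls out the leading number with a regular expression.

```python
import re


def stars_to_float(rating_text):
    """'4.5 out of 5 stars' -> 4.5; None (no rating) stays None."""
    if rating_text is None:
        return None
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of', rating_text)
    return float(match.group(1)) if match else None


print(stars_to_float('4.5 out of 5 stars'))
print(stars_to_float(None))
```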

     

    8. Number of Ratings

    <li class='bc-list-item ratingsLabel'> and <span class='bc-text bc-size-small bc-color-secondary'> represent the field that shows how many people rated the book. Let’s retrieve the number of ratings using the function below.

    def get_no_of_ratings(book_contents):
        no_of_ratings=[]
        for tag in book_contents:
            star_tag= tag.find('li', class_='bc-list-item ratingsLabel')
            try:
                rating_tag = star_tag.find('span', class_='bc-text bc-size-small bc-color-secondary').text.strip()
                no_of_ratings.append(rating_tag)
            except AttributeError:
                no_of_ratings.append(None)
        return no_of_ratings
     
    get_no_of_ratings(book_contents)
    ['Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     '8 ratings',
     '4 ratings',
     'Not rated yet',
     'Not rated yet',
     '11 ratings',
     'Not rated yet',
     '4 ratings',
     'Not rated yet',
     'Not rated yet',
     '3 ratings',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     '1 rating',
     '1 rating',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     '8 ratings',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     'Not rated yet',
     '26 ratings',
     'Not rated yet',
     'Not rated yet']
    

    We found the number of ratings for each book using the get_no_of_ratings(book_contents) method.
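
The counts are strings such as '8 ratings', '1 rating', or 'Not rated yet'. The sketch below converts them to integers and treats 'Not rated yet' as zero. ratings_to_int is an illustrative name, and the handling of thousands separators is an assumption, since no such value appears in the output above.

```python
import re


def ratings_to_int(count_text):
    """'8 ratings' -> 8, '1 rating' -> 1, 'Not rated yet' -> 0."""
    if count_text is None:
        return None
    match = re.search(r'\d[\d,]*', count_text)
    if match is None:
        return 0  # no digits at all, e.g. 'Not rated yet'
    # Assumption: large counts may be written with commas, e.g. '1,234 ratings'.
    return int(match.group().replace(',', ''))


print(ratings_to_int('26 ratings'))
print(ratings_to_int('Not rated yet'))
```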

     

    9. Audio Book Pricing

     

    Inside <p class='bc-text buybox-regular-price bc-spacing-none bc-spacing-top-none'> there are two <span class='bc-text bc-size-base bc-color-base'> tags, and the second one holds the price. Let us extract the regular price of the audio book using the function below.

    def get_regular_price(book_contents):
        regular_price=[]
        for tag in book_contents:
            buy_tag= tag.find('p', class_='bc-text buybox-regular-price bc-spacing-none bc-spacing-top-none')
            try:
                price_tag = buy_tag.find_all('span', class_='bc-text bc-size-base bc-color-base')
                price= price_tag[1].text.strip()
                regular_price.append(price)
            except (AttributeError, IndexError):  # no price block, or fewer than two spans
                regular_price.append(None)
        return regular_price
    get_regular_price(book_contents)
    [None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     '₹615.00',
     '₹586.00',
     '₹164.00',
     '₹586.00',
     '₹233.00',
     '₹568.00',
     '₹586.00',
     '₹703.00',
     '₹754.00',
     '₹759.00',
     '₹586.00',
     '₹586.00',
     '₹586.00',
     '₹398.00',
     '₹351.00',
     '₹234.00',
     '₹1,003.00',
     '₹586.00',
     '₹164.00',
     '₹888.00',
     '₹668.00',
     '₹469.00',
     '₹586.00',
     '₹586.00',
     '₹1,005.00',
     '₹568.00',
     '₹1,395.00',
     '₹363.00',
     '₹615.00',
     '₹304.00',
     None,
     '₹586.00']
    

    The get_regular_price(book_contents) function collects the prices of the audio books.
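
Prices such as '₹1,003.00' are still text. If they need to be compared or summed, one option is to strip the currency symbol and the thousands separator and convert the rest to a Decimal. price_to_decimal is an illustrative name, not part of the original script.

```python
from decimal import Decimal


def price_to_decimal(price_text):
    """'₹1,003.00' -> Decimal('1003.00'); None (no price listed) stays None."""
    if price_text is None:
        return None
    cleaned = price_text.replace('₹', '').replace(',', '').strip()
    return Decimal(cleaned)


print(price_to_decimal('₹1,003.00'))
```

Decimal is used here instead of float so that currency amounts keep their exact two decimal places.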

     

    10. Book Cover Images

    <img class='bc-pub-block bc-image-inset-border js-only-element'> and its src attribute represent the image link. Now, we will retrieve the links of the images using the function below.

    def get_cover_img(book_contents):
        cover_img=[]
        for tag in book_contents:
            img_tag= tag.find_all('img', class_='bc-pub-block bc-image-inset-border js-only-element')
            try:
                book_image_url= img_tag[0]['src'].strip()
                cover_img.append(book_image_url)
            except (IndexError, KeyError):
                # find_all returns a list, so a missing image shows up as an empty list
                cover_img.append(None)
        return cover_img
     
    get_cover_img(book_contents)
    ['https://m.media-amazon.com/images/I/41hxEX0QtSS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/31Sj1TU4icS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51Dn7zu5HsS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41VNnS19XWS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41LJ36QPJ5S._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41SRcn-iLTL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41d76Ssd4NL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51qgoisVutS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41zi-MhxY1L._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41SCP2yQl9L._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51vOrhK8yGS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/510vhr-x++S._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51hS6VOUAML._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41t2GRh+lcS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51pne+9xZvS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41t3tu3PF5L._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51-JUrcMwMS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41kMVH6eWyL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51ZC-zHE6ZL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51c1CaVkQJL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51NlVkxywiL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41hk7EszwmL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51tWjmKtoWL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41pn49ph8qS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51d7ghqWCiL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51VbY5tSEXL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41A7nJTUTXL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41EbS2T01kL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51yMGil0ldL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51+BGKYQ+LL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41xjEeHV0RL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41085kMCZ0L._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41QCi1aRBJL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/61V2vqRj1SL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/415kxyNsRdL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41-9YE8wnNL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41NaPw5uxcL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/410ZHV7pWDS._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51rvBgy3fxL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51ayHAAQSpL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51sGAJasAqL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/4155g4IzesL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/5188Sm0OuVL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51EnU0l1noL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51tMGdL+8eL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51GyPgNvU1L._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41S+wVCOcNL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51fwK6blFHL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/51dwRb-5HJL._SL500_.jpg',
     'https://m.media-amazon.com/images/I/41SXSBiFu0L._SL500_.jpg']
    

    The get_cover_img(book_contents) function collects all the image links.

    Creating a Dictionary Using all the Functions

    Now that we’ve gathered all of the information we want from the website, let’s create a function that downloads the HTML of several pages, runs each extraction function on it, and compiles the resulting lists into a dictionary.

    def parse_pages_ranged(end_page):
        all_page_contents = {
                'Book_Name':[],
                'Description':[],
                'Author':[],
                'Rating':[],
                'No_of_Ratings':[],
                'Regular_Price':[],
                'Language':[],
                'Book_Audio_Length':[],
                'Cover_IMG':[],
                'Book_URL':[],
                }    
        # range() stops before end_page, so pages 1 to end_page - 1 are scraped
        for page in range(1, end_page):
            pageno_x = get_pageno(page)
            book_contents = pageno_x.find_all('li', class_='bc-list-item productListItem')
            all_page_contents['Book_Name'] += get_book_names(book_contents)
            all_page_contents['Description'] += get_description(book_contents)
            all_page_contents['Author'] += get_written_by(book_contents)
            all_page_contents['Rating'] += get_rating(book_contents)
            all_page_contents['No_of_Ratings'] += get_no_of_ratings(book_contents)
            all_page_contents['Regular_Price'] += get_regular_price(book_contents)
            all_page_contents['Language'] += get_language(book_contents)
            all_page_contents['Book_Audio_Length'] += get_book_length(book_contents)
            all_page_contents['Cover_IMG'] += get_cover_img(book_contents)
            all_page_contents['Book_URL'] += get_book_links(book_contents)
        return all_page_contents
    

    The function returns the data collected from the Audible website as a dictionary of lists, which we will view in tabular form using pandas.

    Python Pandas – pandas.DataFrame()

    Data Frame: A data frame is a two-dimensional data structure in which data is organized in rows and columns in a tabular format.

    all_pages_scraped = pd.DataFrame(parse_pages_ranged(24))
    all_pages_scraped
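
pandas raises a ValueError when the lists in the dictionary have different lengths, which would happen if one extractor skipped a book instead of appending None. A quick check before building the data frame can point to the column at fault. check_column_lengths is an illustrative helper and the sample dictionary is made up.

```python
def check_column_lengths(columns):
    """Return each column's length; raise ValueError if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: ' + str(lengths))
    return lengths


# Made-up sample in the same shape as the dictionary built above.
sample = {
    'Book_Name': ['Girlboss', 'Warren Buffett'],
    'Rating': ['4 out of 5 stars', None],
}
print(check_column_lengths(sample))
```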
    
    Saving the Extracted Data to a CSV File

    all_pages_scraped.to_csv('Audible_Business_and_Careers_Books_2022.csv', index=False)

    The output will look like:

    Book_Name,Description,Author,Rating,No_of_Ratings,Regular_Price,Language,Book_Audio_Length,Cover_IMG,Book_URL
    The Everyday Hero Manifesto,"Activate Your Positivity, Maximize Your Productivity, Serve The World",Robin Sharma,,Not rated yet,"₹1,519.00","['Language:', 'English']",Length: 9 hrs and 27 mins,https://m.media-amazon.com/images/I/51LP52ob7CL._SL500_.jpg,https://www.audible.in/pd/The-Everyday-Hero-Manifesto-Audiobook/B08XY8T574
    Start with Why,How Great Leaders Inspire Everyone To Take Action,Simon Sinek,4.5 out of 5 stars,84 ratings,₹888.00,"['Language:', 'English']",Length: 7 hrs and 18 mins,https://m.media-amazon.com/images/I/41Px2q4eSiL._SL500_.jpg,https://www.audible.in/pd/Start-with-Why-Audiobook/B09J5J1PTZ
    HBR at 100,The Most Influential and Innovative Articles from Harvard Business Review's First Century,Harvard Business Review,,Not rated yet,₹703.00,"['Language:', 'English']",Length: 17 hrs and 7 mins,https://m.media-amazon.com/images/I/41xwPia5dAL._SL500_.jpg,https://www.audible.in/pd/HBR-at-100-Audiobook/B09WFVS56M
    The Design of Everyday Things,Revised and Expanded Edition,Don Norman,4.5 out of 5 stars,129 ratings,₹500.00,"['Language:', 'English']",Length: 10 hrs and 39 mins,https://m.media-amazon.com/images/I/51Dl6lXXesL._SL500_.jpg,https://www.audible.in/pd/The-Design-of-Everyday-Things-Audiobook/B07L5T1Q55
    

    For any web scraping service requirement, contact X-Byte Enterprise Crawling today or request a quote!
