OVERVIEW
Web scraping with Python becomes straightforward once you break it into the ten steps below.
This blog post covers the first part: extracting news article data with Python. We will build a script that pulls the latest news articles from different newspapers and saves their text, which can later be fed to a model to predict each article's category.
A Short Introduction about HTML and Webpage Design
If we want to extract news articles from a website, the first step is to understand how a website works.
Let's walk through an example to understand this:
Whenever we enter a URL into a web browser (e.g., Firefox, Google Chrome) and load the page, what we see is the combination of three different technologies:
- HTML (HyperText Markup Language): the standard language for adding content to a website. It lets us insert text, images, and other elements into a page; in a word, HTML defines the content of every webpage on the internet.
- CSS (Cascading Style Sheets): this language lets us set the visual design of a website, i.e., it determines the presentation or style of a webpage, such as its layout, fonts, and colors.
- JavaScript: a dynamic programming language that makes the content and styling interactive and provides dynamic behavior between the client side and the page.
Together, these three technologies let us create and manipulate every aspect of a webpage's design.
Let's illustrate these concepts with an example. When we visit a Politifact page, we see something like this:
If we disable JavaScript, we can no longer use the pop-up, so the video pop-up window is not visible:
If we delete the CSS from the page (after locating it with Ctrl+F in the Inspect window), we see something like this:
So, here is a question for you:
“If you want to extract a webpage's content by web scraping, which source would you look at?”
By now, we hope it is clear what kind of source code we need to work with. If you guessed HTML, you are absolutely right.
So, the last step before applying any web-scraping technique is to understand a bit of HTML.
HTML
HTML is the language that defines a webpage's content, and it is made up of elements and attributes. To extract data, you need to be comfortable inspecting those elements.
An element might be a paragraph, a division, a heading, an anchor tag, and so on.
An attribute might specify, for example, that a heading should appear in bold letters.
Tags are written with an opening symbol (<tag>) and a closing symbol (</tag>), e.g.:
<p>This is a paragraph.</p>
<h1><b>This is heading one in bold letters</b></h1>
Scrape Data with BeautifulSoup using Python
Step-1: Package Installation
We start by installing the required packages:
1. beautifulsoup4
To install it, run the following command in your Python distribution:
! pip install beautifulsoup4
BeautifulSoup (from the bs4 package) is a library for parsing HTML and XML documents in Python in a very convenient way, giving access to elements by identifying them through their tags and attributes.
It is an easy-to-use yet powerful package that can scrape most kinds of online data in merely 5–6 lines.
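As a rough illustration of that claim, here is a minimal sketch (the URL is only a placeholder; the exact selectors depend on the page you target) that fetches a page and prints its title and every link on it:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://example.com')       # placeholder URL; any public page works
soup = BeautifulSoup(page.text, 'html.parser')   # parse the HTML into a searchable tree
print(soup.title.text)                           # print the page title
for a in soup.find_all('a'):                     # loop over every anchor tag
    print(a.get('href'))                         # print each link target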
2. requests
To install it, run the command below from your IDE/notebook, or run it without the exclamation mark in a command shell.
! pip install requests
BeautifulSoup needs HTML code to work on, and we will use the requests module to provide it.
3. urllib
No installation is needed here: urllib ships with the Python standard library, so it is available as soon as you import it.
The urllib module is Python's URL-handling module and is used for fetching URLs (Uniform Resource Locators).
In this project, however, we import it on the same line as two other standard-library modules that serve different purposes:
- time: lets us call the sleep() function to delay or suspend execution for a given number of seconds.
- sys: lets us retrieve exception information such as the error type, the error object, and details about the error.
Step-2: Import Libraries
Now we import all the necessary libraries:
1. BeautifulSoup
To import it, use the following command in the IDE:
from bs4 import BeautifulSoup
This library gives us the HTML structure of the pages we want to work with and provides functions for selecting particular elements and extracting the relevant data.
2. urllib
To import it, type the following command:
import urllib.request,sys,time
urllib.request: defines the classes and functions that help open URLs.
sys: its functions and objects let us retrieve exception details.
time: Python's time module offers many useful functions for time-related tasks; one of the best known is sleep().
3. requests
To import it, put the import keyword before the library name:
import requests
This module lets us send HTTP requests to a web server from Python. (HTTP messages consist of requests from client to server and responses from server to client.)
4. pandas
import pandas as pd
This is a high-level data-manipulation tool that we need to structure and visualize the extracted data.
We will use this library to create a DataFrame (the library's main data structure). DataFrames let us store and manipulate tabular data, with rows as observations and columns as variables.
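As a quick, hedged illustration (the row below is made-up sample data, not real scraped output), this is what storing tabular data in a DataFrame looks like:

import pandas as pd

# one made-up observation per row, one variable per column
rows = [['Some claim', 'https://example.com/article', 'October 5, 2021', 'Some source', 'false']]
df = pd.DataFrame(rows, columns=['Statement', 'Link', 'Date', 'Source', 'Label'])
print(df.head())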
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd
Step-3: Make Easy Requests
With the requests module, it is easy to fetch the HTML content and store it in a page variable.
# Make a simple GET request (just fetching a page)
# url of the page that we want to scrape
# str() converts the integer page number so it can be concatenated to the URL for pagination purposes
url = 'https://www.politifact.com/factchecks/list/?page=' + str(page)
# Use requests to fetch the URL. This is a risky call that might blow up.
page = requests.get(url)
Because requests.get(url) is a risky call that can throw an exception, we wrap it in a try-except block.
try:
    # this might throw an exception if something goes wrong
    page = requests.get(url)
# this describes what to do if an exception is thrown
except Exception as e:
    # get the exception information
    error_type, error_obj, error_info = sys.exc_info()
    # print the link that caused the problem
    print('ERROR FOR LINK:', url)
    # print error info and the line that threw the exception
    print(error_type, 'Line:', error_info.tb_lineno)
    continue
We also use an outer for loop for pagination purposes.
Step-4: Inspect the Response Object
I. See what response code the server sent back (helpful for spotting 4XX or 5XX errors).
page.status_code
Output:
An HTTP 200 OK status code indicates that the request succeeded.
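If the status code is anything other than 200, there is usually no point in parsing the page; one possible (hedged) way to guard against that:

# skip pages that did not come back successfully
if page.status_code != 200:
    print('Request failed with status code:', page.status_code)
else:
    print('Page fetched successfully')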
II. Access the complete response as text (the page's HTML as one big string).
page.text
Output:
This returns the HTML content of the response object as a Unicode string.
Alternatively:
page.content
Output:
This, in contrast, returns the content of the response as bytes.
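A tiny sketch to see the difference between the two for yourself:

print(type(page.text))     # <class 'str'>   - decoded Unicode text
print(type(page.content))  # <class 'bytes'> - raw bytes of the response body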
III. Search for a particular substring within the response text.
if "Politifact" in page.text:
    print("Yes, scrape it")
IV. Check the response's Content-Type header (to see whether you got back HTML, JSON, XML, etc.).
print (page.headers.get("content-type", "unknown"))
Output:
Step-5: Delay the Request Time
Using the time module, we can call the sleep() function with a value of 2 seconds. This delays sending requests to the web server by 2 seconds.
time.sleep(2)
The sleep() function suspends execution of the current thread for the given number of seconds.
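In our scraper this call typically sits inside the pagination loop so that consecutive requests are spaced out; a rough sketch (the page range and the 2-second delay are arbitrary choices here):

import time
import requests

for page_no in range(1, 4):   # hypothetical: first three listing pages
    url = 'https://www.politifact.com/factchecks/list/?page=' + str(page_no)
    response = requests.get(url)   # fetch the listing page
    time.sleep(2)                  # pause 2 seconds before the next request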
Step-6: Extract Content from the HTML
Now that your HTTP request has returned some HTML content, you can parse it and extract the values you are looking for.
A) With Regular Expressions
Using regular expressions to parse HTML content is not recommended at all.
However, regular expressions are very useful for finding specific string patterns such as prices, phone numbers, or email addresses.
Run a regular expression over the response text to search for a particular string pattern:
import re  # put this at the top of the file
...
print(re.findall(r'\$[0-9,.]+', page.text))
Output:
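The same idea works for the other patterns mentioned above. For example, here is a deliberately simple (and therefore imperfect) sketch that pulls anything shaped like an email address out of the response text:

import re
# a basic email pattern; it will miss unusual but valid addresses
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', page.text)
print(emails)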
B) With Object Soup from BeautifulSoup
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it typically saves programmers hours or days of work.
soup = BeautifulSoup(page.text, "html.parser")
The following command looks for all <li> tags that have the class attribute 'o-listicle__item':
links=soup.find_all('li',attrs={'class':'o-listicle__item'})
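Before moving on, a small hedged sanity check can confirm the selector matched something (it assumes at least one article is listed, and only peeks at the first match):

# peek at the first matched item to confirm the selector worked
if links:
    print(links[0].prettify()[:300])   # first 300 characters of the first <li> item's HTML
else:
    print('No items matched - check the tag and class name')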
INSPECT WEBPAGE
To understand the code above, you need to inspect the webpage and follow along:
1. Go to the URL listed above.
2. Press Ctrl+Shift+I to inspect the page.
3. This is what the Inspect window looks like:
Press Ctrl+Shift+C to select an element on the page for inspection, or click the arrow icon at the top left of the Inspect window's header.
4. Find the specific elements and attributes in the Inspect window.
First, move through each section of the webpage and watch how the Inspect window changes; you will quickly get a feel for how the page is built, which element is which, and what the specific attributes contribute to the page.
Once you have finished this step, you should understand how the <li> element and its attribute work.
Since we need the news section of each article, we go to the article section by choosing the inspect-element option in the Inspect window; this highlights the article section on the webpage and its HTML source in the Inspect window. Voila!
Did you locate the same tag on your machine?
If yes, you are ready to understand all the HTML tags used in our code.
Let's continue with our code:
print(len(links))
This command shows how many news articles are listed on a given page, which in turn helps you decide how far the pagination loop needs to run when scraping a large amount of data.
Step-7: Find Attributes and Elements
Search for all the anchor tags on a page (useful if you are building a crawler and need to collect the next pages to visit). Note that we store them under a separate name so we don't overwrite the article list from the previous step:
anchors = soup.find_all("a")
This fetches the division tag inside each <li> tag that has the listed class attribute value. Here 'j' is the loop variable that iterates over the result object 'links', i.e., over all the news articles listed on the page.
Statement = j.find("div",attrs={'class':'m-statement__quote'})
The .text.strip() call returns the text enclosed within the tag and strips any extra spaces, '\n', or '\t' characters from the resulting string.
Statement = j.find("div",attrs={'class':'m-statement__quote'}).text.strip()
Hurrah! We have extracted the first attribute of our dataset: the Statement.
Within the same division, the next line looks for an anchor tag and returns the value of its hyperlink. Once again, strip() is used to keep the values tidy so the CSV file looks clean.
Link=j.find("div",attrs={'class':'m-statement__quote'}).find('a')['href'].strip()
To get the Date attribute, you have to inspect the webpage first, because the date is embedded in a longer string. If you call .text without any indexing, you get something like this:
However, since we only need the date, we use indexing (slicing); you can also clean the attribute later with a regex or two, as sketched after the next line of code. The 'footer' element is the component that contains the text we need.
Date = j.find('div',attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
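As mentioned above, a regex is one hedged alternative to the hard-coded slice. The sketch below assumes the footer text contains a date written like 'October 5, 2021' (the sample string is made up for illustration):

import re
# made-up footer text, shaped like what the page shows
footer_text = 'stated on October 5, 2021 in a tweet:'
match = re.search(r'[A-Z][a-z]+ \d{1,2}, \d{4}', footer_text)
Date = match.group(0) if match else footer_text[-14:-1].strip()  # fall back to the slice used above
print(Date)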
Here we do the same as before, except that get() extracts the content of the attribute passed to it (here, title).
Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()
On this site the articles already come with labels, but the label text is not directly retrievable because it is embedded in an image. For this kind of task you can use get() to pull out the specific text: here we pass 'alt' as the attribute to get(), since it holds the label text.
Label = j.find('div', attrs ={'class':'m-statement__content'}).find('img',attrs={'class':'c-image__original'}).get('alt').strip()
In the following lines we put these pieces together and extract the data for the five attributes of our dataset.
for j in links:
    Statement = j.find("div", attrs={'class':'m-statement__quote'}).text.strip()
    Link = j.find("div", attrs={'class':'m-statement__quote'}).find('a')['href'].strip()
    Date = j.find('div', attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
    Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()
    Label = j.find('div', attrs={'class':'m-statement__content'}).find('img', attrs={'class':'c-image__original'}).get('alt').strip()
    frame.append([Statement, Link, Date, Source, Label])
upperframe.extend(frame)
Step-8: Make Dataset
For every article, append the attribute values to the (initially empty) list 'frame':
frame.append([Statement,Link,Date,Source,Label])
Then, for every page, extend the (initially empty) list 'upperframe' with that frame (a skeleton showing how the two lists relate follows below):
upperframe.extend(frame)
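Putting this together, the two lists live at different loop levels. Here is a skeleton of that structure; the fetching and extraction details are omitted because they appear in the earlier steps, so treat this as an outline rather than runnable code:

upperframe = []                              # collects rows across all pages
for page in range(1, pagesToGet + 1):
    frame = []                               # reset for every page
    # ... fetch the page and build `links` as shown in Step-6 ...
    for j in links:
        # ... extract Statement, Link, Date, Source, Label as shown in Step-7 ...
        frame.append([Statement, Link, Date, Source, Label])
    upperframe.extend(frame)                 # add this page's rows to the overall dataset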
Step-9: Visualize Datasets
If you want to visualize the data in Jupyter, you can use a pandas DataFrame to do so.
data = pd.DataFrame(upperframe, columns=['Statement','Link','Date','Source','Label'])
data.head()
Step-10: Make a CSV File & Save It on Your PC
A) Open & Write to a File
The following commands write the CSV file and save it on your machine in the same directory as the Python file.
filename="NEWS.csv" f=open(filename,"w") headers="Statement,Link,Date, Source, Label\n" f.write(headers) .... f.write(Statement.replace(",","^")+","+Link+", "+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")
This last line writes each attribute to the file, replacing every ',' with '^':
f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")
So whenever you run the file from a command shell, it creates the CSV file in the .py file's directory.
When you open it, you may notice odd-looking data if you did not use strip() while extracting, and it will also look odd if you do not swap the '^' characters back to ','.
To replace them, follow these easy steps:
Open the Excel or .csv file.
Press Ctrl+H (a pop-up window opens asking 'Find what' and 'Replace with').
Enter '^' in the 'Find what' field and ',' in the 'Replace with' field.
Then click 'Replace All'.
Click the Close button and, voila, your dataset is in good shape. Don't forget to close the file with the following command after both for loops have completed,
f.close()
Also note that running the same code repeatedly may throw an error if the dataset has already been created with this file-writing technique.
B) Convert the DataFrame to a CSV File with to_csv()
Instead of the longer method above, you can use an alternative: to_csv() converts a DataFrame into a CSV file and also accepts an argument for specifying the path.
path = 'C:\\Users\\Kajal\\Desktop\\KAJAL\\Project\\Datasets\\'
data.to_csv(path+'NEWS.csv')
To avoid ambiguity and keep the code portable, you can use this:
import os
data.to_csv(os.path.join(path,r'NEWS.csv'))
This joins the CSV file name to the destination path correctly.
CONCLUSION
We would still suggest the first method: opening the file, writing to it, and closing it. It is admittedly a bit clunky and longer to implement, but at least it does not leave you with garbled data the way the to_csv() approach often does.
You can see in the image above how it produces garbled data for the Statement attribute.
So, rather than spending hours cleaning the data manually, we recommend writing the few extra lines of code used in the main technique.
And now you are done!
IMPORTANT: If you copy and paste this source code to scrape other websites and run it, it will most likely throw errors, because every webpage's layout is different; you will need to adapt the code accordingly.
Full Code
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet = 1
upperframe = []
for page in range(1, pagesToGet+1):
    print('processing page :', page)
    url = 'https://www.politifact.com/factchecks/list/?page='+str(page)
    print(url)