How to Do News Data Extraction with AI-Powered Auto Extraction

A huge share of the internet consists of news. It is an extremely important content type, as there are always things happening, locally or globally, that we want to know about. The volume of news published daily across sites is staggering. Sometimes it is good news, sometimes bad, but one thing is certain: it is physically impossible to read all of it every day.

In this blog, we will show how to scrape data from two well-known news websites and run some basic exploratory analysis on the articles to surface overarching themes, both across different articles and across the news websites themselves.

Along the way, you will get some insight into our latest AI data scraping API, X-Byte Auto Extraction, and how it can be used to scrape news data without writing any selectors or XPath.

You will also learn a few simple but remarkably effective methods for analyzing text data. The exact procedure we will follow for each news website:

  • Determine news URLs on a website
  • Pass URLs to X-Byte Auto Extraction API
  • Output results in JSON
  • Analyze data

Determine News URLs on a Website

Determining article URLs on a website is an important first step; without them we have no input data for the X-Byte Auto Extraction API. The input is simply the URL of a news article, and nothing else is needed. So first, let's collect all the article URLs from the site's key pages. We set up a new Scrapy spider and use Scrapy's LinkExtractor to get the right URLs:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class NewsSpider(Spider):
    name = "news"
    start_urls = ["https://somenewsite.com"]

    def parse(self, response):
        # .headline-text assumes headline links carry this CSS class;
        # adjust the selector for the site being scraped
        extractor = LinkExtractor(restrict_css=".headline-text")
        urls = [link.url for link in extractor.extract_links(response)]

Pass URLs to X-Byte Auto Extraction API

Now that we have the URLs, we can go ahead and use the X-Byte Auto Extraction API. These are the things you need, packaged in a JSON object, to make an API request:

  • A valid API key (sign up for free to get one!)
  • pageType (article or product)
  • URL

This is how the X-Byte Auto Extraction API call looks within Scrapy:

import json
from scrapy import Request
from w3lib.http import basic_auth_header

xod = "https://autoextract.scrapinghub.com/v1/extract"
# substitute your own key for API_KEY: it goes in as the basic-auth
# username, and the password stays empty
headers = {"Authorization": basic_auth_header(API_KEY, ""),
           "Content-Type": "application/json"}
params = [{"pageType": "article", "url": url}]
req = Request(xod, method="POST",
              body=json.dumps(params),
              headers=headers,
              callback=self.callback_func)
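Note that params is a list rather than a single object: the request body is a JSON array of query objects, so a single POST can in principle carry several URLs at once (subject to the API's batch limits). In this post we keep one URL per request to keep the callback logic simple.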

Let's add this code to the spider we created earlier:

def parse(self, response):
    extractor = LinkExtractor(restrict_css=".headline-text")
    urls = [link.url for link in extractor.extract_links(response)]
    xod = "https://autoextract.scrapinghub.com/v1/extract"
    headers = {"Authorization": basic_auth_header(API_KEY, ""),
               "Content-Type": "application/json"}
    for url in urls:
        params = [{"pageType": "article", "url": url}]
        yield Request(xod, method="POST", body=json.dumps(params),
                      headers=headers, callback=self.extract_news)

This function first collects all the news article URLs on the main page with Scrapy's LinkExtractor. Then we pass each URL, one by one, to the X-Byte Auto Extraction API. X-Byte data extraction services return all the data associated with an article, such as the author(s), article body, publish date, and language. The best part is that this requires no XPath or HTML parsing at all.
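For reference, the API responds with a JSON array holding one result object per query. A simplified sketch of its shape, showing only the fields this post uses (everything else is elided):

[
    {
        "article": {
            "url": "https://somenewsite.com/some-article",
            "headline": "...",
            "articleBody": "...",
            "authorsList": ["..."]
        }
    }
]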

X-Byte Auto Extraction uses machine learning to scrape all the important data points from a page, so we don't have to write locators manually. It also means that if a website changes its layout or design, we don't have to touch our code; it keeps working and keeps delivering data.

Output Results in JSON

In the API request, we set extract_news as the callback function. It parses the API response, a JSON document containing all the data fields of the scraped article, such as URL, authors, and headline, and populates a Scrapy item with them.

def extract_news(self, response):
    item = ArticleItem()
    # the response is a JSON array with one result object per query
    data = json.loads(response.body_as_unicode())
    item["url"] = data[0]["article"].get("url")
    item["text"] = data[0]["article"].get("articleBody")
    item["headline"] = data[0]["article"].get("headline")
    item["authors"] = data[0]["article"].get("authorsList")
    return item
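The snippet assumes an ArticleItem class is defined somewhere in the project (typically items.py); a minimal sketch with matching field names might look like this:

import scrapy

class ArticleItem(scrapy.Item):
    # one field per data point pulled from the API response
    url = scrapy.Field()
    text = scrapy.Field()
    headline = scrapy.Field()
    authors = scrapy.Field()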

With this code in place, we populate an ArticleItem with the data from the X-Byte Auto Extraction API. Now we can run the full spider and write the data out for later analysis:

scrapy crawl news -o news_output.json

Analyze Data

At this point we have scraped the news data. Raw data on its own can already be useful if you simply want to pull news from a website at any time. But let's go a step further and do some exploratory analysis on the scraped text.

Word Cloud

When it comes to text analysis, one of the best-known visualization methods is the word cloud. This visualization shows which words or phrases occur most frequently in a given text: the bigger the word, the higher its frequency. It is ideal for discovering which words are used most often in the headlines, as well as in the articles themselves.

Installation of Word Cloud and Requests

In Python, there is an open-source library for producing word clouds: wordcloud. It is a very easy-to-use library for creating simple or highly customized word clouds (even ones shaped by actual images). To use wordcloud, we first need to install matplotlib, pillow, numpy, and pandas, and of course wordcloud itself. All of these packages are easy to install with pip.
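For example:

pip install wordcloud matplotlib pillow numpy pandas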

Producing a Word Cloud

Let's say we want to know the most common words in the headlines on a homepage. To do this, we first read the data from the JSON file that Scrapy generated, join the headlines into one string, and then build a simple word cloud:

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_json("news_output.json")
text = " ".join(df["headline"])  # all headlines in one string
cloud = WordCloud(max_font_size=80, max_words=100,
                  background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
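The same few lines work for the article bodies too; swapping the headline column for the text field our spider filled is enough, and the plotting lines stay the same:

text = " ".join(df["text"].dropna())  # article bodies instead of headlines
cloud = WordCloud(max_font_size=80, max_words=100,
                  background_color="white").generate(text)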

News in the UK vs. the US

UK news website homepage headlines

The other website we scraped news from is among the most visited news sites in the UK. Even without looking at the results, we could guess that the words used most frequently on the UK site would differ from what we found on the US site. The most used words in the headlines include “Boris Johnson”, “US”, and “Brexit”. There are similarities, though: “Hurricane Dorian” is used frequently here as well, and the same is true of “Trump”.

US news website homepage headlines

At the time of writing, the US news website we examined puts ample focus on “Hurricane Dorian” in its headlines, and many articles cover the topic. The second most common word is “Trump”; other frequently used words include “American”, “million”, “Carolinas”, “Bahamas”, and “climate”.

When we analyze not the headlines but the article bodies, things get much noisier on the US news website. “Trump” is used frequently in the text as well; other common words include “people”, “year”, “said”, and “will”. The word frequencies do not differ much from what we saw in the headlines.

On the UK news website, the article text is less noisy. Several words stand out as used relatively often, though it is hard to single out just one or two: “Johnson”, “government”, “people”, “Trump”, and “US” are among the most used words in the articles.

Conclusion

At the core of X-Byte Auto Extraction is an AI-enabled data extraction engine that scrapes data from web pages without custom code or per-site design work. Using computer vision, deep learning, and X-Byte Proxy Manager, our advanced proxy management solution, the engine can identify the common fields on product and article pages and extract them without the need to develop and maintain scraping rules for every website.

To use our news data scraping tool, contact X-Byte Enterprise Crawling now!
