Web Scraping with AWS Lambda

Web scraping is an important technique for extracting data from targeted websites. Businesses and developers use this data for several purposes, including market research, data analysis, and competitive intelligence. AWS Lambda, with its serverless computing capabilities, offers a scalable and cost-effective way to implement web scraping tasks. In this article, we will explore how to set up a web scraping pipeline using AWS Lambda, Python, and Chalice.

What is AWS?

Amazon offers a comprehensive and dynamic cloud computing platform known as AWS (Amazon Web Services). It consists of a combination of software-as-a-service (SaaS), platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) products. AWS offers a wide range of tools, including compute power, database storage, and content delivery services.

Amazon initially launched its web services in 2002, using the internal infrastructure it had built to run its online retail business. It began providing its distinguishing IaaS services in 2006. AWS was one of the pioneers of the pay-as-you-go cloud computing model, which scales to give consumers access to compute, storage, and throughput as required.

Introduction to AWS Lambda

AWS Lambda is a serverless compute service that executes your code in response to various events and automatically manages the underlying compute resources. These events could be updates or status changes, such as a user adding an item to their online shopping cart on an e-commerce website.

Did you know?

You can build your own backend services that function at AWS scale, performance, and security or you can use AWS Lambda to augment existing AWS services with custom functionality.

AWS Lambda automatically runs your code in response to multiple kinds of events, such as HTTP requests made through Amazon API Gateway, changes to objects in Amazon Simple Storage Service (Amazon S3) buckets, table updates in Amazon DynamoDB, and state transitions in AWS Step Functions.

Lambda runs your code on highly available compute infrastructure and handles all administration of your compute resources. This covers code and security patch deployment, capacity provisioning and automatic scaling, server and operating system maintenance, and code monitoring and logging. You only have to provide the code.
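For illustration, here is a minimal, hypothetical sketch of the code you might hand to Lambda in Python: a handler function that Lambda calls with the triggering event. The function name and the returned fields are only examples, not the article's code.

import json

def lambda_handler(event, context):
    # Lambda invokes this handler with the triggering event (for example, an S3
    # notification or an API Gateway request) and a context object with runtime info.
    print("Received event:", json.dumps(event))

    # Return a simple response; for an API Gateway trigger this becomes the HTTP reply.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello from Lambda"})
    }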

When to Use AWS Lambda?

Lambda is an ideal compute service for application scenarios that need to scale up quickly and scale down to zero when not in use. You can use Lambda for:

File Processing: Use Amazon Simple Storage Service (Amazon S3) to trigger Lambda data processing immediately after a file is uploaded.

Stream Processing: Use Lambda and Amazon Kinesis to process real-time streaming data for clickstream analysis, data cleansing, log filtering, indexing, social media analysis, and Internet of Things (IoT) device telemetry.

Web Applications: By combining Lambda with other AWS services, developers can create robust web applications that run in a highly available configuration across several data centers and scale up and down automatically.

IoT Backends: Use Lambda to create serverless backends that handle requests from the web, mobile applications, and third-party APIs.

Mobile Backends: Build backends with Lambda and Amazon API Gateway to authenticate and handle API requests. Use AWS Amplify for easy integration with your web, React Native, iOS, and Android frontends.

All you need to worry about while using Lambda is your code. Lambda manages the compute fleet, which provides a balance of memory, CPU, network, and other resources to run your code. Because Lambda manages these resources, you cannot log in to the compute instances or customize the operating system of the provided runtimes.

Lambda handles all operational and administrative tasks on your behalf, such as managing capacity, monitoring functions, and keeping logs.

What is Serverless Web Scraping?

Serverless web scraping makes use of serverless computing platforms like AWS Lambda together with web crawling frameworks like Scrapy to retrieve data effectively from targeted websites. By integrating these technologies, developers can build reliable, scalable, and affordable web scraping solutions without having to manage servers or pay for idle time.

Due to the serverless design, resources are allocated optimally, enabling the application to scale up or down in response to spikes in demand. This elasticity makes serverless web scraping a wise option for projects with erratic loads or for occasional large-scale tasks.

In addition, adopting a powerful web crawling framework such as Scrapy provides extensive capabilities for scraping websites efficiently and precisely. The framework makes it possible to manage intricate data extraction and store the scraped data in the required format.

What is Scrapy?

Scrapy is a Python-based open-source, collaborative web crawling framework. It can be used for a variety of web scraping activities, including processing the data that has been scraped. Scrapy is a powerful framework for web scraping because it has built-in features for extracting and storing data in the format and structure of your choice.

For efficient web scraping, it is recommended to combine Scrapy with datacenter proxies. The main benefit of using proxies with Scrapy is that they conceal your real IP address from the servers of the websites you are scraping. When you use an automated tool instead of manually copying and pasting data from a website, proxies protect your privacy and keep you from being blacklisted by the target sites.
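As a rough sketch of how a proxy can be attached to Scrapy requests (the spider name and proxy URL below are placeholder assumptions, not a real endpoint):

import scrapy

class ProxiedSpider(scrapy.Spider):
    # Hypothetical spider demonstrating per-request proxy usage.
    name = "proxied_example"

    def start_requests(self):
        # Placeholder datacenter proxy endpoint; substitute your provider's URL.
        proxy = "http://user:password@proxy.example.com:8000"
        for url in ["https://books.toscrape.com"]:
            # Scrapy's built-in HttpProxyMiddleware reads the 'proxy' key from request meta.
            yield scrapy.Request(url, callback=self.parse, meta={"proxy": proxy})

    def parse(self, response):
        # Minimal parse callback: yield the page title as a sanity check.
        yield {"title": response.css("title::text").get()}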

AWS Lambda for Web Scraping

Now that you understand AWS Lambda, let us focus on why you should use it for web scraping. AWS is a reliable and cost-effective option for scraping tasks that run on a regular schedule. With AWS Lambda, you can set up automated schedules, run functions without supervision, and use various programming languages. Additionally, you have access to the Serverless Framework and container tools for building web scraping solutions.

Build your Serverless Web Scraper with AWS Lambda, Python, and Chalice

Setting Up the Development Environment

The initial step is to set up a Python virtual environment and install Chalice. You will also need pip, which greatly simplifies installing the packages the web scraper depends on.

screenshot-1

Set up Chalice with AWS

Make sure the machine is authenticated against your AWS account. Access keys are required; to obtain them, navigate to the security credentials page using the drop-down menu in the upper right corner of the screen. Expand the access keys section and select “Create New Access Key”. Lastly, the keys need to be saved in the AWS configuration file. To do this, create the AWS configuration folder, create a new file inside it, and open it.

screenshot-2

Copy and paste the lines below, replacing the placeholder keys and region with the keys and region you have set up:

screenshot-3

Create a Scraper Script

Create the Chalice project using the command below:

screenshot-4

Replace the contents of the app.py file inside the Chalice project with the following lines:

screenshot-5

The code can now be simplified:

screenshot-6

Note that Chalice’s serverless functions look just like the standard Python functions you are used to. The only addition is an @app decorator that wires up the function. In the example above, the @app.route decorator causes the function to be called when an HTTP request is made.

Note that the requests_html package is used in the main portion of the function to carry out activities like parsing the HTML document and extracting items based on class names and HTML tags. You can also see that it returns either an error or an object containing the best product.
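The screenshots above contain the article's actual code. As a hypothetical sketch of what such a Chalice function built around requests_html could look like (the app name, URL, and CSS selector below are placeholder assumptions):

from chalice import Chalice
from requests_html import HTMLSession

app = Chalice(app_name="product-scraper")  # placeholder app name

@app.route("/")
def scrape():
    # Fetch the target page; the URL and selector are illustrative placeholders.
    session = HTMLSession()
    response = session.get("https://www.producthunt.com/")
    if response.status_code != 200:
        return {"error": f"Request failed with status {response.status_code}"}

    # Extract elements by class name and return the first (top-ranked) product,
    # or an error object if nothing was found.
    products = response.html.find(".product-item")
    if not products:
        return {"error": "No products found"}
    return {"best_product": products[0].text}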

Deployment

After building the scraper, use the chalice local command to test it locally.

When you are ready to move forward, deploy the app by running the command chalice deploy. Chalice handles the remaining work, such as creating the AWS Lambda function in the console, and bundles together all the requirements AWS Lambda needs. The deploy command returns the Product Hunt scraper along with the public URL for the serverless function.

How to Run Scrapy in AWS Lambda?

Step 1: Creating Scrapy Spider

The initial step in serverless web scraping is to set up a Scrapy crawler. Spiders are the self-contained crawlers, each with its own set of instructions, that Scrapy uses.

Here is an example of a basic Scrapy spider:

screenshot-7

In this example, we’ve built a basic spider that starts at https://books.toscrape.com, gathers the title and price of every book, moves on to the next page, and repeats. The end result is data scraped from 1,000 books.
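The article's spider is shown in the screenshot above; a minimal sketch along the same lines, with selectors based on the site's markup, might look like this:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # Each book on a listing page is wrapped in <article class="product_pod">.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }

        # Follow the "next" link until the last page (50 pages of 20 books = 1,000 books).
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)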

The output of the spider can be sent to a file by using the -o switch when running it on your computer. A books.json file, for instance, can be created using the following command:

screenshot-8

Step 2: Modify Scrapy Spider for AWS

When the Scrapy spider runs inside an AWS Lambda function, we cannot rely on the terminal or on a local file system to send and retrieve the output. Instead, the output can be kept in an S3 bucket.

To add these customizations, edit the spider code:

screenshot-9

Replace your-bucket-name-here with the name of the S3 bucket you created.
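The screenshot above shows the article's edited spider. One possible way to achieve this, sketched below, is to point Scrapy's feed export at an S3 URI through custom_settings; this assumes botocore is installed and that the Lambda role (or your configured credentials) can write to the bucket.

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    # Instead of a local output file (not persistent on Lambda), write the feed
    # straight to S3. Replace the placeholder bucket name with your own.
    custom_settings = {
        "FEEDS": {
            "s3://your-bucket-name-here/books.json": {"format": "json", "overwrite": True}
        }
    }

    def parse(self, response):
        # Same parsing logic as the basic spider shown earlier.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)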

Step 3: Setting the Environment for the Lambda Function

Install the required executables before configuring the AWS Lambda local environment:

  • Docker
  • AWS CLI
  • Serverless Framework
  • Boto3 package

To configure the Lambda function, we need to build a Docker image of our Scrapy spider and push it to AWS. With Docker, you can ship and run an application anywhere by packaging it together with its dependencies and environment inside a container. This procedure ensures that our application behaves the same on the AWS Lambda servers as it does in our local development environment, even if that environment has unique settings or previously installed software.

The AWS Command Line Interface (AWS CLI) is a robust tool that lets you interact with Amazon Web Services (AWS) from your operating system’s command line. It offers a quick and easy way to manage AWS resources and services without a graphical user interface.

Go to https://aws.amazon.com/cli/ and download the package for your operating system to install the AWS CLI.

For AWS Lambda to run Scrapy, we must build a Docker image of our Scrapy spider and upload it to the AWS platform.

The Serverless Framework is installed with npm. If you have not set up Node.js yet, visit the official website at https://nodejs.org/, download the LTS release, and install it. Then run the following command:

screenshot-serverless

Botocore is a Python module created by Amazon Web Services (AWS). It serves as the foundation of the AWS SDK for Python (Boto3) and is a low-level interface that gives Python code the essential capabilities for communicating with AWS services.
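As a brief, hypothetical illustration of Boto3 in use (the file, bucket, and key names are placeholders):

import boto3

# Create an S3 client; credentials come from the environment, the AWS CLI
# configuration, or the Lambda execution role.
s3 = boto3.client("s3")

# Upload a local file to a bucket (names are placeholders).
s3.upload_file("books.json", "your-bucket-name-here", "output/books.json")

# List the objects now stored under that prefix.
response = s3.list_objects_v2(Bucket="your-bucket-name-here", Prefix="output/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])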

After creating and activating a virtual environment, perform the following actions:

screenshot-10

Note that Scrapy needs to be installed in this virtual environment as well. If you haven’t already, use pip to install it:

screenshot-11

Step 4: Prepare your Code for Lambda Function Deployment

To keep the Docker container manageable and limited to the necessary dependencies, create a requirements.txt file listing all the Python packages your Scrapy spider needs to run. This file may look as follows:

screenshot-12

Creating the Docker image comes next. To do this, create a file named Dockerfile with the following contents:

screenshot-13

In this Dockerfile, we begin with a basic Python image, change the working directory to /app, copy our application into the container, install the prerequisites, and configure the command that launches our spider (the entrypoint.sh script).

Make a new file, Lambda_function.py, with the following contents:

screenshot-14
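The screenshot above contains the article's version of this file. As a hedged sketch of what such a handler might look like if the spider is launched in-process with Scrapy's CrawlerProcess (the import path and handler name are assumptions and must match what the Dockerfile and serverless.yml expect):

import json

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Illustrative import path; adjust it to wherever the spider module lives in the image.
from books import BooksSpider

def handler(event, context):
    # Run the spider in-process; the FEEDS setting in the spider writes the output to S3.
    process = CrawlerProcess(get_project_settings())
    process.crawl(BooksSpider)
    process.start()  # blocks until the crawl finishes

    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Crawl finished; output written to S3."}),
    }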

Finally, a YML file is required for deployment. Create a new file, give it the name serverless.yml, and insert the following code in it:

screenshot-15

YOUR_REPO_URI will be updated in the next step. Also note that scrapy-lambda is simply the name of the Docker image we will build in the next phase.

Step 5: Docker Image Deployment

As the initial step, create a user in AWS IAM and make a note of the access key and secret key.

Perform the following actions and input these keys as prompted:

screenshot-16

Next, execute the following commands to establish a new ECR repository:

screenshot-17

Note the value of repositoryUri in the JSON output. It will look something like 76890223446.dkr.ecr.us-east-1.amazonaws.com/scrapy-images.

In the serverless.yml file, replace YOUR_REPO_URI with this value.

If you have not yet created an S3 bucket, create one from the AWS console.

Make a note of the bucket name and update the Scrapy spider code: in books.py, replace YOUR-BUCKET-NAME-HERE with the real S3 bucket name.

Now, use the following command to build your Docker image:

screenshot-18

screenshot-19

Use these commands to tag your Docker image and push it to Amazon ECR:

$ aws ecr get-login-password --region region | docker login --username AWS --password-stdin YOUR_REPO_URI

$ docker tag scrapy-lambda:latest YOUR_REPO_URI:latest

$ docker push YOUR_REPO_URI:latest

Substitute your AWS region for region and your Amazon ECR repository URI for YOUR_REPO_URI.

Finally, use the following command to deploy the image:

screenshot-20

The output of the command will be:

screenshot-21

Step 6: Executing the Lambda Function

The output of the sls deploy command will display the service URL endpoint.

Send the following POST request to this URL to start the function:

screenshot-22
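For instance, with Python's requests library the call might look like this (the endpoint URL is a placeholder for the service URL printed by the deploy command):

import requests

# Placeholder endpoint; use the service URL printed by the sls deploy command.
url = "https://abc123.execute-api.us-east-1.amazonaws.com/dev/crawl"

# Trigger the Lambda function with an empty JSON body and inspect the reply.
response = requests.post(url, json={})
print(response.status_code)
print(response.text)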

The Lambda function will start executing, and the output will be stored in the S3 bucket.

screenshot-23

What are the Benefits of Web Scraping with AWS Lambda?

Data Acquisition: Lambda functions allow short, effective code execution, which streamlines the data extraction process. Web scraping is an effective method of gathering data from websites.

Automation: Automating data collection by pairing Lambda functions with web scraping saves time and effort compared to manual techniques.

Competitive Analysis: With this integration, companies can make well-informed decisions by quickly analyzing competitor data, including pricing and product details.

Real-Time Insights: Scraping websites with Lambda functions gives you access to real-time data, which stays up to date and is useful for trend analysis and market research.

Efficiency and Customization: Web scraping jobs can be made more efficient and customized by using Lambda functions, which can be programmed to extract specific data.

Scalability: Combining web scraping with Lambda functions enables scalable data extraction that handles load efficiently and accommodates different data requirements.

Conclusion

To set up a Lambda function for web scraping, you must first access the Lambda console, create a function, configure some basic settings, upload the code with its dependencies, define the Lambda handler, and, if desired, set environment variables and triggers. Finally, you can invoke the function from make.com.

This comprehensive guide was provided by X-Byte.io. Before putting the function into use, save it and give it a try. The tutorial covers the fundamentals of using Lambda for web scraping and shows how to use Lambda functions effectively for make.com data extraction, offering guidance on how to make web scraping activities more effective and productive.
