How Concurrency in Python Is Used to Speed Up the Web Scraping Process

A common task for developers is scraping websites for data. There are various reasons to scrape the web, whether you are working on a side project or launching a new business. Web scraping can take a lot of time because you have to wait for server responses and deal with rate limiting.

Prerequisites

You must have Python 3 installed for the code to work; it comes pre-installed on some platforms. After that, install all the required libraries with pip:

pip install requests beautifulsoup4 aiohttp numpy

Concurrency

The word “concurrency” refers to the ability to carry out several computations at the same time. Sequential requests are ones you send to a website one at a time: you wait for the response, then send the next request.

With concurrency, you can send out several requests at once and handle the responses as they arrive. This technique yields a remarkable speed increase. Concurrent requests, whether or not they run in parallel (on multiple CPUs), will be significantly faster than sequential ones – more on this later.

To appreciate the advantages of concurrency, you need to understand the difference between processing tasks sequentially and processing them simultaneously. Imagine, for illustration, that we have five tasks that each take 10 seconds to complete.

Processed in order, the five tasks take 50 seconds in total. Handled simultaneously, everything finishes in just 10 seconds.

By splitting the web scraping workload across many concurrent tasks, concurrency lets us accomplish more work in less time.

Requests can be parallelized in a number of ways, including asyncio and multiprocessing. From a web scraping standpoint, these libraries can be used to parallelize requests to different websites or to different pages of the same website. The main focus of this post is asyncio, a Python module that provides infrastructure for writing single-threaded concurrent code using coroutines.

Because concurrency calls for more complex systems and code, consider whether its benefits outweigh the drawbacks for your use case.

Concurrency Advantages
  • More work gets done in less time.
  • Idle network time is put to use on other requests.

Concurrency Drawbacks

  • Harder to write and debug.
  • Race conditions.
  • Need to use and verify thread-safe functions.
  • Higher chance of getting blocked if not managed carefully.
  • Concurrency has a system cost, so keep it at a manageable level.
  • Flooding a small site with requests can amount to an unintentional DDoS.

Why Asyncio?

To choose the right technology, we need to understand the difference between asyncio and multiprocessing, and also the difference between I/O-bound and CPU-bound workloads.

asyncio is a library for writing concurrent code using the async/await syntax. It runs on a single processor.

multiprocessing is a package that supports spawning processes via an API, letting a program make full use of the multiple processors on a machine. Each process launches its own Python interpreter and can run on a separate CPU.

I/O-bound means a program’s speed is limited by input/output operations – mostly network requests in our case.

CPU-bound means a program’s speed is limited by work on the central processor, such as mathematical computations.

Why does this affect which concurrency library we should use? Because a large part of the cost of concurrency lies in creating and maintaining threads and processes. For CPU-bound problems, running several processes on different CPUs pays off. For I/O-bound cases, however, that overhead may not be worth it.

We choose asyncio because scraping is mostly I/O-bound. But if you are unsure (or just for fun), you can replicate the approach with multiprocessing and compare the results.

Sequential Version

We will scrape the website scrapeme.live, a fake Pokémon e-commerce site built for testing, starting with a sequential scraper. Several snippets are shared by all the examples and will not change.

By browsing the site, we can see that there are 48 pages. Since it is a testing environment, that won’t change any time soon. Our first constants are the base URL and the range of pages.

Now, take a page and extract the basics from each product: use requests to fetch the HTML and BeautifulSoup to parse it, then loop over each product to collect some fundamental data.

The extract_details function builds a URL by concatenating the base seen earlier with a page number, fetches the content, builds a list of items, and returns them. The return value is therefore a list of dictionaries – an important detail to remember later.
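
Here is a minimal sketch of that sequential extraction. The base URL, the page range, and the CSS selectors (".product", "h2", ".price") are assumptions about the store’s markup rather than the article’s original listing, so adjust them as needed.

```python
import requests
from bs4 import BeautifulSoup

base_url = "https://scrapeme.live/shop/page"
pages = range(1, 3)  # two pages for the demo; the full site has 48

def extract_details(page):
    # Build the page URL from the base and the page number, then fetch and parse it
    response = requests.get(f"{base_url}/{page}/")
    soup = BeautifulSoup(response.text, "html.parser")
    # Loop over every product on the page and collect some basic data
    pokemon_list = []
    for product in soup.select(".product"):
        pokemon_list.append({
            "name": product.find("h2").get_text(strip=True),
            "price": product.find(class_="price").get_text(strip=True),
            "url": product.find("a")["href"],
        })
    return pokemon_list  # a list of dictionaries
```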

To collect and save all the results, we call that function once per page.

Running the code above with two product pages extracts 32 products and saves them to a CSV file named pokemon.csv. The store_results function has no bearing on whether scraping is sequential or concurrent, so feel free to skip over it.

Each result is itself a list, so we have to flatten them for writerows to work properly. That is why the variable is named list_of_lists (a little odd, but it emphasizes that it is not flat).
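
A sketch of store_results and the sequential loop, continuing the assumptions above (the CSV column names simply match the fields collected in the sketch):

```python
import csv

def store_results(list_of_lists):
    # Flatten the per-page lists into a single list of product dictionaries
    pokemon_list = [item for sublist in list_of_lists for item in sublist]
    with open("pokemon.csv", "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["name", "price", "url"])
        writer.writeheader()
        writer.writerows(pokemon_list)

# Sequential loop: one page at a time, waiting for each response
list_of_lists = [extract_details(page) for page in pages]
store_results(list_of_lists)
```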

The output file is a plain CSV: a header row followed by one line per product.

Running the script once for each of the 48 pages takes around 30 seconds and produces a CSV with 755 products.

Introducing Asyncio

Concurrency should make things run faster, but it carries some overhead, so the improvement is not linear. Still, we will do much better than the sequential time.

For that, we will use the asyncio mentioned earlier. It runs several tasks on a single thread using an event loop (much like JavaScript does): it runs a function and switches context when it is allowed to. In our case, HTTP requests allow that switch.

To start, we will look at an example that simply sleeps for a moment; the script should take about a second to run. Note that we cannot call main directly – we must tell asyncio that an async function needs to be run.
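
A minimal version of that first example:

```python
import asyncio

async def main():
    print("Hello ...")
    await asyncio.sleep(1)  # stand-in for some I/O work
    print("... World!")

# main() cannot be called directly; asyncio must run the coroutine
asyncio.run(main())
```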

A Simple Script in Parallel

Next, we extend the example to run 100 functions. Each one sleeps briefly before printing a message. Run consecutively, they would take around a hundred seconds; with asyncio, they take only one!

That is where the strength of concurrency lies. Sleeping is not I/O-bound work, but it serves the example: the same speed-up applies to genuinely I/O-bound tasks.

We need a helper function that sleeps briefly and prints a message. Then we modify main so that it calls the function 100 times, storing each call in a tasks list. The last and most important part is running the tasks and waiting for them all to complete, which asyncio.gather does.
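
A sketch of that 100-task example; the helper’s name and message are placeholders:

```python
import asyncio

async def demo_function(number):
    await asyncio.sleep(1)  # stand-in for an I/O-bound operation
    print(f"Hello from task {number}")

async def main():
    # Create 100 coroutines and run them concurrently on the event loop
    tasks = [demo_function(number) for number in range(100)]
    await asyncio.gather(*tasks)

asyncio.run(main())  # finishes in roughly one second, not one hundred
```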

Web Scraping Using Asyncio

Now we need to apply that knowledge to scraping. The approach is to make the requests concurrently and return the product lists, then save them once all the requests have completed. In real-world scenarios it may be better to save the data after each request, or in batches, to avoid losing it.

We will use aiohttp because requests does not support async out of the box, and that will save us trouble. requests could do the job, and the performance difference is small, but aiohttp makes the code more readable.

The CSV should contain every product (755), just as before. Since all the page calls go out at once, the results do not arrive in order; if we wrote them to the file inside extract_details, they could end up unordered. Because we wait for every task to finish before processing the results, order is not a problem here.
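
Below is a sketch of that asyncio version. It reuses base_url, pages, and store_results from the sequential sketch, along with the same assumed selectors; it is an illustration, not the original listing.

```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def extract_details(page, session):
    # Non-blocking request: the event loop switches tasks while waiting
    async with session.get(f"{base_url}/{page}/") as response:
        soup = BeautifulSoup(await response.text(), "html.parser")
        return [{
            "name": product.find("h2").get_text(strip=True),
            "price": product.find(class_="price").get_text(strip=True),
            "url": product.find("a")["href"],
        } for product in soup.select(".product")]

async def main():
    # One shared session, one task per page, all fired at once
    async with aiohttp.ClientSession() as session:
        tasks = [extract_details(page, session) for page in pages]
        list_of_lists = await asyncio.gather(*tasks)
        store_results(list_of_lists)

asyncio.run(main())
```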

If the run is slower than expected, that may be intentional on the server’s side. To protect themselves from excessive traffic coming from a single IP address, many servers and service providers limit the number of requests they will handle concurrently. It is more of a queue than a block: you will be served, but it may take a little while.

To see the genuine speed-up, you can run a test against a delay page – another testing page that pauses for two seconds before responding.

With all the extracting and storing logic removed, the combined delays add up to about 48 seconds, yet the concurrent run finishes in under 3 seconds.

Limiting Concurrency with Semaphore

As mentioned earlier, we should limit the number of concurrent requests, especially against a single domain.

asyncio includes Semaphore, an object that can acquire and release a lock. Its internal logic blocks some of the calls until the lock is acquired, which caps the maximum concurrency.

We create the semaphore with the maximum concurrency we want, then wrap the extraction function in async with sem so that each call waits until a slot is available.
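
A minimal sketch of that change, with the limit set to 3 as in the run described below; the structure otherwise mirrors the aiohttp sketch above, with the parsing left as a comment.

```python
import asyncio
import aiohttp

async def extract_details(page, session, sem):
    async with sem:  # wait here until one of the slots is free
        async with session.get(f"{base_url}/{page}/") as response:
            page_html = await response.text()
            # ... parse page_html with BeautifulSoup exactly as before ...
            return page_html

async def main():
    sem = asyncio.Semaphore(3)  # at most three requests in flight at once
    async with aiohttp.ClientSession() as session:
        tasks = [extract_details(page, session, sem) for page in pages]
        list_of_lists = await asyncio.gather(*tasks)

asyncio.run(main())
```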

It does the job and is quite simple to use. We ran it with the maximum concurrency set to 3.

The result shows that the unbounded version was not actually running at full speed. If we raise the limit to 10, the overall duration is about the same as the unbounded script.

Using TCPConnector to Configure Concurrency

aiohttp offers an alternative with more settings: we can create the client session with a custom TCPConnector.

It can be constructed with two relevant parameters (see the sketch after the list):

  • limit – the total number of simultaneous connections.
  • limit_per_host – the number of simultaneous connections to the same endpoint (same host, port, and is_ssl).
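
A sketch of that session setup; the limits shown (10 total, 3 per host) are illustrative values, and the rest reuses the pieces from the earlier sketches.

```python
import asyncio
import aiohttp

async def main():
    # Cap total concurrent connections and connections per host
    connector = aiohttp.TCPConnector(limit=10, limit_per_host=3)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [extract_details(page, session) for page in pages]
        list_of_lists = await asyncio.gather(*tasks)
        store_results(list_of_lists)

asyncio.run(main())
```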

It is also simple to set up and maintain. We ran it with the per-host limit set to 3, as in the semaphore example.

Its edge over Semaphore is the ability to limit both the total number of concurrent calls and the number of requests per domain: the same session could scrape several different sites, each subject to its own cap.

The drawback is that it appears a little slower. Run some tests with more pages and real data to simulate a real-world workload.

Multiprocessing

Scraping is I/O-bound. But what happens if we need to mix in some CPU-intensive processing? To test that scenario, we will use a function that counts a lot (up to 100 million) after each scraped page – an easy (and silly) way to keep a CPU busy for a while.
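
A sketch of that deliberately wasteful helper (the name count_a_lot is a placeholder):

```python
def count_a_lot():
    # Burn CPU cycles on purpose: pure Python counting up to 100 million
    counter = 0
    for _ in range(100_000_000):
        counter += 1
    return counter
```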

For the asyncio version, just run it as before. It might take a while.

Adding multiprocessing takes a bit more work. We need to create a ProcessPoolExecutor, which “uses a pool of processes to execute calls asynchronously.” It will create and manage each process on a different CPU.

However, it will not split the load for us. For that we use NumPy’s array_split function, which divides the pages range into roughly equal chunks according to the number of CPUs.
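
A sketch of that split, assuming we use multiprocessing.cpu_count() to find the number of cores:

```python
import multiprocessing
import numpy as np

num_cores = multiprocessing.cpu_count()  # e.g. 8
pages = range(1, 49)
# Split the 48 pages into one chunk per CPU core
chunks = np.array_split(list(pages), num_cores)
```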

The remainder of the main function is comparable to the asyncio version, with a few syntactic tweaks to accommodate multiprocessing.

The main difference is that we do not call extract_details directly. We could, but we want to combine asyncio with multiprocessing to squeeze out as much power as possible.

In short, each CPU process will scrape a few pages. With eight CPUs and 48 pages, each process handles six (6 × 8 = 48).

And each process scrapes its six pages concurrently! The counting then has to wait, since it demands a lot of CPU power, but because several CPUs are now working, this version should finish faster than the asyncio-only one.

Each CPU process launches an asyncio event loop for its subset of the pages (e.g., pages 1 to 6 for the first one).

Each of them then calls a number of URLs using the familiar extract_details function.

Think about that for a second. The entire procedure is as follows:

  • Create the executor.
  • Split the pages.
  • Start asyncio in every process.
  • Create an aiohttp session and the tasks for that subset of pages.
  • Extract the data for each page.
  • Gather the results and store them.

The stage that combines asyncio with multiple processes looks like this.
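
Below is a minimal sketch of that combined stage. It follows the steps in the list above and reuses the count_a_lot helper from earlier; names such as asyncio_wrapper and extract_details_task are placeholders rather than the article’s original code.

```python
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import aiohttp
import numpy as np

base_url = "https://scrapeme.live/shop/page"

async def extract_details(page, session):
    async with session.get(f"{base_url}/{page}/") as response:
        page_html = await response.text()
        count_a_lot()  # the CPU-heavy step after every scraped page
        # ... parse page_html with BeautifulSoup as in the earlier sketches ...
        return page_html

async def extract_details_task(pages_chunk):
    # One aiohttp session per process, one task per page in the chunk
    async with aiohttp.ClientSession() as session:
        tasks = [extract_details(page, session) for page in pages_chunk]
        return await asyncio.gather(*tasks)

def asyncio_wrapper(pages_chunk):
    # Entry point executed inside each process: start its own event loop
    return asyncio.run(extract_details_task(pages_chunk))

def main():
    num_cores = multiprocessing.cpu_count()
    chunks = np.array_split(list(range(1, 49)), num_cores)
    with ProcessPoolExecutor(max_workers=num_cores) as executor:
        # One chunk of pages per process, each process running asyncio
        results_per_process = list(executor.map(asyncio_wrapper, chunks))
    # Flatten and store the results as before, e.g.:
    # store_results([pages for chunk in results_per_process for pages in chunk])

if __name__ == "__main__":
    main()
```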

The first version took over two minutes, while the second took only about 40 seconds. However, the second one’s total CPU (user) time was more than three minutes! That is mostly due to system overhead and related factors.

This shows how parallel processing “wastes” more time overall yet finishes sooner. Which approach to take is up to you.

Final Words

asyncio has proven to be enough for scraping, since networking consumes most of the execution time. Scraping is I/O-heavy, and asyncio multi-tasks well on a single core.

The scenario is different if processing the collected data demands a lot of CPU power.

asyncio with aiohttp, which suits async work better than requests, usually gets the job done. Add a custom connector to limit the total number of concurrent requests and the number of requests per domain. Those three components are all you need to start building a scalable scraper.

For any web scraping services, contact X-Byte Enterprise Crawling today!
