How to use asyncio to scrape websites with Python

01 July 2024 | 10 min read

In this article, we'll take a look at how you can use Python and its coroutines, with their async/await syntax, to efficiently scrape websites, without having to go all-in on threads 🧵 and semaphores 🚦. For this purpose, we'll check out asyncio, along with the asynchronous HTTP library aiohttp.

What is asyncio?

asyncio is part of Python's standard library (yay, no additional dependency to manage 🥳) and enables you to implement concurrency using the same asynchronous patterns you may already know from JavaScript and other languages: async and await.

Asynchronous programming is a convenient alternative to Python threads, as it allows you to run tasks concurrently without having to dive fully into multi-threading, with all the complexities that might involve.

When using the asynchronous approach, you write your code in a seemingly good old synchronous/blocking fashion, just sprinkle the two mentioned keywords at the relevant spots, and the Python runtime automatically takes care of executing your code concurrently.

Asynchronous Python basics

asyncio uses the following three key concepts to provide asynchronicity:

  • Coroutines
  • Tasks
  • Futures

Coroutines are the basic building blocks and allow you to declare asynchronous functions, which are executed concurrently by asyncio's event loop and eventually provide their result via a Future (more on that in a second). A coroutine is declared by prefixing the function declaration with async (e.g. async def my_function():) and typically uses await itself to invoke other asynchronous functions.

Tasks are the components used for scheduling and the actual concurrent execution of coroutines in an asyncio context. They are instantiated with asyncio.create_task() and automatically handled by the event loop.

Futures are awaitable objects which represent the eventual result of an asynchronous operation - the value a coroutine will have computed once it has finished.

You can find a full list of technical details at https://docs.python.org/3/library/asyncio-task.html#awaitables.
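
To see how these three concepts fit together, here is a minimal sketch (the function names are purely illustrative): a coroutine is wrapped into a Task, and the Task is itself a Future which eventually carries the coroutine's result.

import asyncio

async def compute():
  # A coroutine: declared with async def, suspends itself with await
  await asyncio.sleep(0.1)
  return 42

async def main():
  # A Task wraps the coroutine and schedules it on the event loop
  task = asyncio.create_task(compute())
  print(isinstance(task, asyncio.Future))  # True - a Task is a Future subclass
  print(await task)                        # 42 - awaiting the Task yields its result

asyncio.run(main())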

How does async/await work?

If you happen to be already familiar with async/await in JavaScript, you'll feel right at home, as the underlying concept is the same. While asynchronous programming per se is nothing new, it was usually achieved with callbacks, which eventually led to the infamous pyramid of doom of nested and chained callbacks. That was pretty unmanageable in both JavaScript and Python - async/await came to the rescue here.

When you have a task which takes a while to complete (a typical case where you'd reach for multi-threading), you can mark the function with async and turn it into a coroutine. Let's take the following code as a quick example.

import asyncio

async def wait_and_print(text):
  await asyncio.sleep(1)
  print(text)

async def main():
  tasks = []

  for i in range(10):
    tasks.append(asyncio.create_task(wait_and_print(i)))

  for task in tasks:
    await task

asyncio.run(main())

Here, we define a function which prints a value and we call the function ten times. Pretty simple, but the catch is that the function waits a second before it prints its text. If we called this in a regular, sequential fashion, the whole execution would take ten times one second. However, with coroutines, we use create_task() to schedule all calls as tasks on the event loop and then simply await each task until they have all finished. Each call still pauses for one second, but since they all run at the same time, the whole script completes after roughly one second. Yet, the code does not utilise any (obvious) multi-threading and looks mostly like traditional single-threaded code.
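
If you want to verify the timing difference yourself, here is a small sketch of both variants side by side (the timing code with time.perf_counter() is my addition, not part of the example above):

import asyncio
import time

async def wait_and_print(text):
  await asyncio.sleep(1)
  print(text)

async def sequential():
  for i in range(10):
    await wait_and_print(i)    # each call blocks the next one -> roughly 10 seconds

async def concurrent():
  tasks = [asyncio.create_task(wait_and_print(i)) for i in range(10)]
  for task in tasks:
    await task                 # the tasks already run side by side -> roughly 1 second

for func in (sequential, concurrent):
  start = time.perf_counter()
  asyncio.run(func())
  print(f'{func.__name__}: {time.perf_counter() - start:.2f}s')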

One thing to note is that we need to encapsulate our top-level code in its own async function main(), which we then pass to asyncio.run(). This starts the event loop and provides us with all the asynchronous goodies.
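
Under the hood, asyncio.run() roughly corresponds to managing the event loop by hand, which is how older code (before Python 3.7) had to do it - a rough sketch for comparison:

import asyncio

async def main():
  await asyncio.sleep(1)
  print('done')

# Roughly what asyncio.run(main()) does for you
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
  loop.run_until_complete(main())
finally:
  loop.close()

With that covered, let's take a quick look at what asyncio provides out-of-the-box!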

Asyncio feature/function overview

The following table provides a quick overview of the core features and functions of asyncio, which you'll mostly come across when programming asynchronously with asyncio.

Feature Description
as_completed() Runs the given awaitables concurrently and returns an iterator which yields their results in the order they complete
create_task() Wraps the given coroutine in a Task and schedules it for concurrent execution on the event loop
ensure_future() Wraps the given awaitable (e.g. a coroutine) in a Future/Task object and schedules it on the event loop
gather() Concurrently executes the passed awaitables using tasks and returns their combined results
get_event_loop() Provides access to the currently active event loop instance
run() Executes the given coroutine - typically used for the main function
sleep() Suspends the current task for the indicated number of seconds - akin to the standard time.sleep() function, but without blocking the event loop
wait() Executes a list of awaitables and waits until the condition specified in the return_when parameter is met
wait_for() Similar to wait, but cancels the future when a timeout occurs
Lock() Provides access to a mutex object
Semaphore() Provides access to a semaphore object
Task() The task object, as returned by create_task()
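
To make a few of these calls more concrete, here is a short, self-contained sketch (the work() coroutine and its delays are made up purely for illustration):

import asyncio

async def work(name, seconds):
  await asyncio.sleep(seconds)
  return f'{name} finished'

async def main():
  # gather(): run awaitables concurrently and collect all results in order
  print(await asyncio.gather(work('a', 1), work('b', 2)))

  # as_completed(): iterate over the results in the order they finish
  for future in asyncio.as_completed([work('c', 2), work('d', 1)]):
    print(await future)                  # 'd finished' is printed first

  # wait_for(): cancel the awaitable if it exceeds the timeout
  try:
    await asyncio.wait_for(work('e', 5), timeout=1)
  except asyncio.TimeoutError:
    print('e timed out')

asyncio.run(main())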

Scrape Wikipedia asynchronously with Python and asyncio

Now, that we have a basic understanding of how asynchronous calls work in Python and the features asyncio provides, let's put our knowledge to use with a real-world scraping example, shall we?

The idea of the following example is to compile a list of creators of programming languages. For this purpose, we first crawl Wikipedia for the articles it has on programming languages.

With that list, we then scrape these pages in the second step and extract the respective information from the pages' infoboxes. Voilà, that should then get us a list of programming languages with their respective creators. Let's get coding! 👨🏻‍💻

Installing dependencies

For starters, let's install the necessary library dependencies using pip:

pip install aiohttp
pip install beautifulsoup4

Splendid! We have the basic libraries installed and can continue with getting the links to scrape.

Crawling

Create a new file scraper.py and save the following code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pprint


BASE_URL = 'https://en.wikipedia.org'


async def fetch(url):
  async with aiohttp.ClientSession() as session:
    async with session.get(url) as resp:
      return await resp.text()


async def crawl():
  pages = []

  content = await fetch(urljoin(BASE_URL, '/wiki/List_of_programming_languages'))
  soup = BeautifulSoup(content, 'html.parser')
  for link in soup.select('div.div-col a'):
    pages.append(urljoin(BASE_URL, link['href']))

  return pages


async def main():
  links = await crawl()

  pp = pprint.PrettyPrinter()

  pp.pprint(links)


asyncio.run(main())

Lovely! That code is ready to run, but before we do that, let's take a look at the individual steps we are performing here:

  1. The usual imports
  2. Then, we declare BASE_URL for our base URL
  3. Next, we define a basic fetch() function handling the asynchronous aiohttp calls and returning the URL's content (see the note on session reuse right after this list)
  4. We also define our central crawl() function, which does the crawling and uses the CSS selector div.div-col a to get a list of all relevant language pages
  5. Lastly, we call our asynchronous main() function, using asyncio.run(), to run crawl() and print the list of the links we found
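
One note on fetch() (as mentioned in step 3 above): it opens a brand-new aiohttp.ClientSession for every single request. That works fine for this tutorial, but aiohttp generally recommends reusing one session so connections can be pooled. A possible variant looks like this - just a sketch, not the code used in the rest of the article (the example URLs are arbitrary):

import asyncio
import aiohttp

async def fetch(session, url):
  # The same ClientSession (and its connection pool) is reused for every request
  async with session.get(url) as resp:
    return await resp.text()

async def main():
  async with aiohttp.ClientSession() as session:
    pages = await asyncio.gather(
      fetch(session, 'https://en.wikipedia.org/wiki/ALGOL'),
      fetch(session, 'https://en.wikipedia.org/wiki/ABAP'),
    )
    print([len(page) for page in pages])

asyncio.run(main())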

💡 Using CSS selectors with Python

If you want to learn more about how to use CSS selectors specifically in Python, please check out How to use CSS Selectors in Python?.

Great so far, but just theory. Let's run it and check if we actually get the links we are after ....

$ python3 scraper.py
['https://en.wikipedia.org/wiki/A_Sharp_(.NET)',
 'https://en.wikipedia.org/wiki/A-0_System',
 'https://en.wikipedia.org/wiki/A%2B_(programming_language)',
 'https://en.wikipedia.org/wiki/ABAP',
 'https://en.wikipedia.org/wiki/ABC_(programming_language)',
 'https://en.wikipedia.org/wiki/ABC_ALGOL',
 'https://en.wikipedia.org/wiki/ACC_(programming_language)',
 'https://en.wikipedia.org/wiki/Accent_(programming_language)',

 LOTS MORE

Well, we did, and it was easier than one might think - without lots of boilerplate code (public static void main(String[] args), anyone? ☕).

But crawling was only the first step on our journey. It's certainly crucial, because without a list of URLs we won't be able to scrape them, but what we are really interested in is the information on each individual language. So let's continue to the scraping bit of our project!

Scraping

All right, so far we managed to get the list of URLs we want to scrape and now we are going to implement the code which will perform the actual scraping.

For this, let's start with the core function that implements the scraping logic and add the following asynchronous function to our scraper.py file.

async def scrape(link):
  content = await fetch(link)
  soup = BeautifulSoup(content, 'html.parser')

  # Select name
  name = soup.select_one('caption.infobox-title')

  if name is not None:
    name = name.text

    creator = soup.select_one('table.infobox tr:has(th a:-soup-contains("Developer", "Designed by")) td')
    if creator is not None:
      creator = creator.text

    return [name, creator]

  return []

Once again, quite a manageable piece of code, isn't it? What do we do here in detail, though?

  1. First, scrape() takes one argument: link, the URL to scrape
  2. Next, we call our fetch() function to get the content of that URL and save it into content
  3. Now, we instantiate an instance of Beautiful Soup and use it to parse content into soup
  4. We quickly use the CSS selector caption.infobox-title to get the language name
  5. As the last step, we use the :-soup-contains pseudo-class to select the table row containing the name of the language's creator and return everything as a list - or an empty list if we did not find the information (the snippet right after this list shows the selector on a minimal HTML fragment)
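
As referenced in step 5, here is a quick, standalone snippet showing what that selector matches - the HTML is a stripped-down stand-in for a Wikipedia infobox, not the real markup:

from bs4 import BeautifulSoup

html = '''
<table class="infobox">
  <caption class="infobox-title">Python</caption>
  <tr><th><a href="#">Designed by</a></th><td>Guido van Rossum</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('caption.infobox-title').text
creator = soup.select_one(
  'table.infobox tr:has(th a:-soup-contains("Developer", "Designed by")) td').text
print([name, creator])  # ['Python', 'Guido van Rossum']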

Now that we have this in place, we just need to pass the links obtained with our crawler to scrape() and our scraper is almost ready!

Let's quickly adjust the main() function as follows:

async def main():
  links = await crawl()

  tasks = []
  for link in links:
    tasks.append(scrape(link))
  authors = await asyncio.gather(*tasks)

  pp = pprint.PrettyPrinter()
  pp.pprint(authors)

We still call crawl(), but instead of printing the links, we create a scrape() coroutine for each link and add it to the tasks list. The real magic then happens when we pass that list to asyncio.gather(), which wraps each coroutine in a task and runs them all concurrently.
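
One caveat: gather() fires off all scrape() calls at once - several hundred requests for this list. If you'd rather be gentle with Wikipedia, a common pattern is to cap the concurrency with asyncio.Semaphore. Here is a sketch of such a variant of main() - it assumes the crawl() and scrape() functions from above, and the limit of 10 is an arbitrary choice, not part of the final code below:

async def scrape_throttled(semaphore, link):
  # At most 10 scrape() calls run at the same time; the rest wait for a free slot
  async with semaphore:
    return await scrape(link)

async def main():
  links = await crawl()

  semaphore = asyncio.Semaphore(10)
  tasks = [scrape_throttled(semaphore, link) for link in links]
  authors = await asyncio.gather(*tasks)

  pp = pprint.PrettyPrinter()
  pp.pprint(authors)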

And here is the full code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pprint


BASE_URL = 'https://en.wikipedia.org'


async def fetch(url):
  async with aiohttp.ClientSession() as session:
    async with session.get(url) as resp:
      return await resp.text()


async def crawl():
  pages = []

  content = await fetch(urljoin(BASE_URL, '/wiki/List_of_programming_languages'))
  soup = BeautifulSoup(content, 'html.parser')
  for link in soup.select('div.div-col a'):
    pages.append(urljoin(BASE_URL, link['href']))

  return pages


async def scrape(link):
  content = await fetch(link)
  soup = BeautifulSoup(content, 'html.parser')

  # Select name
  name = soup.select_one('caption.infobox-title')

  if name is not None:
    name = name.text

    creator = soup.select_one('table.infobox tr:has(th a:-soup-contains("Developer", "Designed by")) td')
    if creator is not None:
      creator = creator.text

    return [name, creator]

  return []


async def main():
  links = await crawl()

  tasks = []
  for link in links:
    tasks.append(scrape(link))
  authors = await asyncio.gather(*tasks)

  pp = pprint.PrettyPrinter()
  pp.pprint(authors)


asyncio.run(main())

Summary

What we learned in this article is that Python provides an excellent environment for running concurrent tasks without the need to implement full multi-threading. Its async/await syntax enables you to write your scraping logic in a straightforward, seemingly blocking fashion and, nonetheless, run an efficient scraping pipeline which makes full use of the time otherwise spent waiting on network responses.

Our examples provide a good initial overview of how to approach async programming in Python, but there are still a few factors to take into account to make sure your scraper is successful:

  • the user-agent - you can simply pass a headers dictionary as argument to session.get and indicate the desired user-agent (see the sketch after this list)
  • request throttling - make sure you are not overwhelming the server and send your requests with reasonable delays
  • IP addresses - some sites may be limited to certain geographical regions or impose restrictions on concurrent or total requests from one, single IP address
  • JavaScript - some sites (especially SPAs) make heavy use of JavaScript and you need a proper JavaScript engine to support your scraping
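
As a small illustration of the first two points, here is a sketch of a "polite" fetch - the user-agent string and the one-second delay are placeholders you'd adapt to your own project:

import asyncio
import aiohttp

HEADERS = {'User-Agent': 'my-scraper/0.1 (contact@example.com)'}  # placeholder value
DELAY_SECONDS = 1                                                 # placeholder value

async def polite_fetch(session, url):
  # Pass the headers dictionary to session.get to send a custom user-agent
  async with session.get(url, headers=HEADERS) as resp:
    text = await resp.text()
  await asyncio.sleep(DELAY_SECONDS)   # crude per-request throttling
  return text

async def main():
  async with aiohttp.ClientSession() as session:
    content = await polite_fetch(session, 'https://en.wikipedia.org/wiki/ABAP')
    print(len(content))

asyncio.run(main())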

If you wish to find out more on these issues, please drop by our other article on this very subject: Web Scraping without getting blocked

If you don't feel like having to deal with all the scraping bureaucracy of IP address rotation, user-agents, geo-fencing, browser management for JavaScript and rather want to focus on the data extraction and analysis, then please feel free to take a look at our specialised web scraping API. The platform handles all these issues on its own and comes with proxy support, a full-fledged JavaScript environment, and straightforward scraping rules using CSS selectors and XPath expressions. Registering an account is absolutely free and comes with the first 1,000 scraping requests on the house - plenty of room to discover how ScrapingBee can help you with your projects.

Happy asynchronous scraping with Python!

Alexander M

Alexander is a software engineer and technical writer with a passion for everything network related.