In this article, we'll take a look at how you can use Python and its coroutines, with their async/await syntax, to efficiently scrape websites without having to go all-in on threads 🧵 and semaphores 🚦. For this purpose, we'll check out asyncio, along with the asynchronous HTTP library aiohttp.
What is asyncio?
asyncio is part of Python's standard library (yay, no additional dependency to manage 🥳) and enables concurrency using the same asynchronous patterns you may already know from JavaScript and other languages: async and await.
Asynchronous programming is a convenient alternative to Python threads, as it allows you to run tasks concurrently without the need to fully dive into multi-threading, with all the complexities this might involve.
When using the asynchronous approach, you write your code in a seemingly good-old synchronous/blocking fashion, sprinkle the aforementioned keywords at the relevant spots, and the Python runtime automatically takes care of executing your code concurrently.
Asynchronous Python basics
asyncio uses the following three key concepts to provide asynchronicity:
- Coroutines
- Tasks
- Futures
Coroutines are the basic building blocks and allow you to declare asynchronous functions, which are executed concurrently by asyncio's event loop and eventually provide their result via a Future (more on that in a second). A coroutine is declared by prefixing the function declaration with async (i.e. async def my_function():) and it typically uses await itself to invoke other asynchronous functions.
Tasks are the components used for scheduling and the actual concurrent execution of coroutines in an asyncio context. They are instantiated with asyncio.create_task() and automatically handled by the event loop.
Futures are awaitable objects which represent the eventual result of an asynchronous operation - a value that may not have been computed yet at the moment you receive the Future.
You can find a full list of technical details at https://docs.python.org/3/library/asyncio-task.html#awaitables.
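To make these three concepts a little more tangible, here is a minimal sketch (the compute() coroutine is purely illustrative):
import asyncio

# A coroutine: declared with `async def` and suspended at each `await`.
async def compute():
    await asyncio.sleep(0.1)
    return 42

async def main():
    # A task wraps the coroutine and schedules it on the event loop.
    task = asyncio.create_task(compute())
    # Task is a subclass of Future; awaiting it yields the computed value.
    result = await task
    print(result, task.done())  # prints: 42 True

asyncio.run(main())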
How does async/await work?
If you happen to be already familiar with async/await in JavaScript, you'll feel right at home, as the underlying concept is the same. While asynchronous programming per se is not something new, it was usually achieved with callbacks and the eventual pyramid of doom, with all the nested and chained callbacks. This was pretty unmanageable in both JavaScript and Python. async/await came to the rescue here.
When you have a task which takes longer to compute (a typical example of when to use multi-threading), you can mark the function with async and turn it into a coroutine. Let's take the following code as a quick example.
import asyncio

async def wait_and_print(value):
    await asyncio.sleep(1)
    print(value)

async def main():
    tasks = []
    for i in range(10):
        tasks.append(asyncio.create_task(wait_and_print(i)))

    for task in tasks:
        await task

asyncio.run(main())
Here, we define a function which prints a value and call it ten times. Pretty simple, but the catch is that the function waits a second before it prints its text. If we called it in a regular, sequential fashion, the whole execution would take about ten seconds. With coroutines, however, we use create_task to schedule a list of tasks on the event loop, and the subsequent await statements simply wait for each of them to finish. Each function still pauses for one second, but since they all run at the same time, the whole script completes after roughly one second. Yet, the code does not utilise any (obvious) multi-threading and looks mostly like traditional single-threaded code.
One thing to note is that we need to encapsulate our top-level code in its own async function main(), which we then pass to asyncio.run(). This starts the event loop and provides us with all the asynchronous goodies. Let's take a quick look at what asyncio provides out-of-the-box!
Asyncio feature/function overview
The following table provides a quick overview of the core features and functions of asyncio, which you'll mostly come across when programming asynchronously with asyncio.
Feature | Description |
---|---|
as_completed() | Executes a list of awaitables and returns an iterator which yields their results as they complete |
create_task() | Schedules the given coroutine for concurrent execution as a task |
ensure_future() | Wraps the given awaitable in a Future-like object (e.g. turns a coroutine into a Task) and schedules it |
gather() | Concurrently executes the passed awaitables as tasks and returns their combined results |
get_event_loop() | Provides access to the currently active event loop instance |
run() | Executes the given coroutine - typically used for the main() function |
sleep() | Suspends the current task for the indicated number of seconds - akin to the standard time.sleep() function |
wait() | Executes a list of awaitables and waits until the condition specified in the return_when parameter is met |
wait_for() | Similar to wait(), but cancels the awaitable when a timeout occurs |
Lock() | Provides access to a mutex object |
Semaphore() | Provides access to a semaphore object |
Task() | The task object, as returned by create_task() |
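To see a few of these functions in action, here is a small, self-contained sketch combining gather(), wait_for(), and Semaphore() (the limited_work() helper and the concrete numbers are just illustrative):
import asyncio

async def limited_work(sem, i):
    # The semaphore caps how many of these run at the same time.
    async with sem:
        await asyncio.sleep(0.5)  # stand-in for real I/O
        return i * i

async def main():
    sem = asyncio.Semaphore(3)  # at most three tasks at once
    # gather() schedules all awaitables and collects their results in order.
    results = await asyncio.gather(*(limited_work(sem, i) for i in range(10)))
    print(results)

    # wait_for() cancels its awaitable if the timeout expires.
    try:
        await asyncio.wait_for(asyncio.sleep(5), timeout=1)
    except asyncio.TimeoutError:
        print('timed out')

asyncio.run(main())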
Scrape Wikipedia asynchronously with Python and asyncio
Now that we have a basic understanding of how asynchronous calls work in Python and of the features asyncio provides, let's put our knowledge to use with a real-world scraping example, shall we?
The idea of the following example is to compile a list of creators of programming languages. For this purpose, we first crawl Wikipedia for the articles it has on programming languages.
With that list, we then scrape these pages in the second step and extract the respective information from the pages' infoboxes. Voilà, that should then get us a list of programming languages with their respective creators. Let's get coding! 👨🏻💻
Installing dependencies
For starters, let's install the necessary library dependencies using pip:
pip install aiohttp
pip install beautifulsoup4
Splendid! We have the basic libraries installed and can continue with getting the links to scrape.
Crawling
Create a new file scraper.py and save the following code:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pprint

BASE_URL = 'https://en.wikipedia.org'

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def crawl():
    pages = []

    content = await fetch(urljoin(BASE_URL, '/wiki/List_of_programming_languages'))
    soup = BeautifulSoup(content, 'html.parser')

    for link in soup.select('div.div-col a'):
        pages.append(urljoin(BASE_URL, link['href']))

    return pages

async def main():
    links = await crawl()

    pp = pprint.PrettyPrinter()
    pp.pprint(links)

asyncio.run(main())
Lovely! That code is ready to run, but before we do that, let's take a look at the individual steps it performs:
- The usual imports
- Then, we declare BASE_URL for our base URL
- Next, we define a basic fetch() function handling the asynchronous aiohttp calls and returning the URL's content
- We also define our central crawl() function, which does the crawling and uses the CSS selector div.div-col a to get a list of all relevant language pages
- Lastly, we call our asynchronous main() function, using asyncio.run(), to run crawl() and print the list of the links we found
💡 Using CSS selectors with Python
If you want to learn more about how to use CSS selectors specifically in Python, please check out How to use CSS Selectors in Python?
Great so far, but that's just theory. Let's run it and check if we actually get the links we are after ...
$ python3 scraper.py
['https://en.wikipedia.org/wiki/A_Sharp_(.NET)',
'https://en.wikipedia.org/wiki/A-0_System',
'https://en.wikipedia.org/wiki/A%2B_(programming_language)',
'https://en.wikipedia.org/wiki/ABAP',
'https://en.wikipedia.org/wiki/ABC_(programming_language)',
'https://en.wikipedia.org/wiki/ABC_ALGOL',
'https://en.wikipedia.org/wiki/ACC_(programming_language)',
'https://en.wikipedia.org/wiki/Accent_(programming_language)',
LOTS MORE
Well, we did, and it was easier than one might think - no heaps of boilerplate code (public static void main(String[] args), anyone? ☕).
But crawling was only the first step on our journey. It's certainly crucial, because without a list of URLs we won't be able to scrape them, but what we are really interested in is the information of each individual language. So let's continue to the scraping bit of our project!
Scraping
All right, so far we managed to get the list of URLs we want to scrape and now we are going to implement the code which will perform the actual scraping.
For this, let's start with the core function that implements the scraping logic and add the following asynchronous function to our scraper.py file.
async def scrape(link):
    content = await fetch(link)
    soup = BeautifulSoup(content, 'html.parser')

    # Select name
    name = soup.select_one('caption.infobox-title')
    if name is not None:
        name = name.text

        creator = soup.select_one('table.infobox tr:has(th a:-soup-contains("Developer", "Designed by")) td')
        if creator is not None:
            creator = creator.text
            return [name, creator]

    return []
Once again, quite a manageable piece of code, isn't it? What do we do here in detail, though?
- First, scrape() takes one argument: link, the URL to scrape
- Next, we call our fetch() function to get the content of that URL and save it into content
- Now, we create a Beautiful Soup instance and use it to parse content into soup
- We quickly use the CSS selector caption.infobox-title to get the language name
- As the last step, we use the :-soup-contains pseudo-selector to precisely select the table entry with the name of the language's creator and return everything as an array - or an empty array if we did not find the information
Now that we have this in place, we just need to pass the links obtained by our crawler to scrape() and our scraper is almost ready! Let's quickly adjust the main() function as follows:
async def main():
    links = await crawl()

    tasks = []
    for link in links:
        tasks.append(scrape(link))

    authors = await asyncio.gather(*tasks)

    pp = pprint.PrettyPrinter()
    pp.pprint(authors)
We still call crawl(), but instead of printing the links, we schedule a scrape() call for each link and collect the resulting coroutines in the tasks list. The real magic then happens when we pass that list to asyncio.gather(), which wraps each coroutine in an asyncio task and runs them all concurrently.
And here is the full code:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pprint

BASE_URL = 'https://en.wikipedia.org'

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def crawl():
    pages = []

    content = await fetch(urljoin(BASE_URL, '/wiki/List_of_programming_languages'))
    soup = BeautifulSoup(content, 'html.parser')

    for link in soup.select('div.div-col a'):
        pages.append(urljoin(BASE_URL, link['href']))

    return pages

async def scrape(link):
    content = await fetch(link)
    soup = BeautifulSoup(content, 'html.parser')

    # Select name
    name = soup.select_one('caption.infobox-title')
    if name is not None:
        name = name.text

        creator = soup.select_one('table.infobox tr:has(th a:-soup-contains("Developer", "Designed by")) td')
        if creator is not None:
            creator = creator.text
            return [name, creator]

    return []

async def main():
    links = await crawl()

    tasks = []
    for link in links:
        tasks.append(scrape(link))

    authors = await asyncio.gather(*tasks)

    pp = pprint.PrettyPrinter()
    pp.pprint(authors)

asyncio.run(main())
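One refinement worth considering: the fetch() above opens a new ClientSession for every single request, and gather() fires off all requests at once. aiohttp generally recommends reusing one session, and an asyncio.Semaphore can cap the number of requests in flight. A self-contained sketch of that pattern might look like this (the URL list and the limit of ten are just placeholders):
import asyncio
import aiohttp

async def fetch(session, url, sem):
    # Share one session across all requests; the semaphore throttles them.
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    urls = ['https://en.wikipedia.org/wiki/ABAP',
            'https://en.wikipedia.org/wiki/ABC_(programming_language)']
    sem = asyncio.Semaphore(10)  # at most ten requests in flight
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url, sem) for url in urls))
        print([len(page) for page in pages])

asyncio.run(main())
Wiring this into the scraper above would simply mean passing the shared session (and the semaphore) along to crawl() and scrape() as extra arguments.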
💡 Love web scraping in Python? Check out our expert list of the Best Python web scraping libraries.
Summary
What we learned in this article is that Python provides an excellent environment for running concurrent tasks without the need to implement full multi-threading. Its async/await syntax enables you to write your scraping logic in a straightforward, seemingly blocking fashion and, nonetheless, run an efficient scraping pipeline which makes the most of the time otherwise spent waiting for network responses.
Our examples provide a good initial overview of how to approach async programming in Python, but there are still a few factors to take into account to make sure your scraper is successful:
- the user-agent - you can simply pass a headers dictionary as an argument to session.get and indicate the desired user-agent (a minimal sketch of this, together with a basic delay, follows this list)
- request throttling - make sure you are not overwhelming the server and send your requests with reasonable delays
- IP addresses - some sites may be limited to certain geographical regions or impose restrictions on concurrent or total requests from a single IP address
- JavaScript - some sites (especially SPAs) make heavy use of JavaScript and you need a proper JavaScript engine to support your scraping
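For the first two points, a minimal sketch could look like the following (the user-agent string and the one-second pause are just placeholders, not recommendations):
import asyncio
import aiohttp

HEADERS = {'User-Agent': 'my-wiki-scraper/0.1 (contact@example.com)'}  # placeholder user-agent

async def polite_fetch(session, url):
    # Pass the headers dictionary to session.get() to set the user-agent.
    async with session.get(url, headers=HEADERS) as resp:
        text = await resp.text()
    await asyncio.sleep(1)  # simple per-request delay; combine with a semaphore for a global limit
    return text

async def main():
    async with aiohttp.ClientSession() as session:
        html = await polite_fetch(session, 'https://en.wikipedia.org/wiki/ABAP')
        print(len(html))

asyncio.run(main())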
If you wish to find out more on these issues, please drop by our other article on this very subject: Web Scraping without getting blocked
If you don't feel like having to deal with all the scraping bureaucracy of IP address rotation, user-agents, geo-fencing, browser management for JavaScript and rather want to focus on the data extraction and analysis, then please feel free to take a look at our specialised web scraping API . The platform handles all these issues on its own and comes with proxy support, a full-fledged JavaScript environment, and straightforward scraping rules using CSS selectors and XPath expressions . Registering an account is absolutely free and comes with the first 1,000 scraping requests on the house - plenty of room to discover how ScrapingBee can help you with your projects.
Happy asynchronous scraping with Python!
Alexander is a software engineer and technical writer with a passion for everything network related.