Crawlee for Python Tutorial with Examples

29 July 2024 | 9 min read

Crawlee is a brand new, free and open-source (FOSS) web scraping library built by the folks at Apify. While it is available for both Node.js and Python, we'll be looking at the Python library in this brief guide. Barely a few weeks after its release, the library has already amassed about 2,800 stars on GitHub! Let's see what it's all about and why it got all those stars.

Crawlee Homepage Screenshot

Key Features

Crawlee Python's webpage claims to help us build and maintain 'reliable' crawlers. Essentially, it bundles several libraries, technologies, and techniques, such as BeautifulSoup, Playwright, and proxy rotation, into one package. It also brings structure to our crawling code by providing a queue of links to crawl and handling the storage of scraped data. Some of its key features are:

  1. Code Maintainability: First, it provides Python classes with type hints. Second, it makes it easy to switch between HTTP crawling and headless browser crawling.
  2. Playwright Support: Crawlee can also use Playwright for headless browsing, and its API is very similar to the HTTP scraping API. This makes switching very easy. Popular browsers such as Chrome and Firefox are supported.
  3. Concurrency & Auto-scaling: Crawlee makes thorough use of asyncio to support concurrency and automatically scales based on available system resources (see the sketch after this list).
  4. Proxy Management: Crawlee also handles proxy rotation in a smart way. We can specify a list of proxies to be used. It automatically tries everything and discards the ones that cause timeouts or return error codes such as 401 and 403.
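
For instance, once the library is installed (see the next section), concurrency limits can be tuned when constructing a crawler. Here is a minimal sketch; it assumes the ConcurrencySettings class and the concurrency_settings parameter as named in the Crawlee docs, so check the current documentation for the exact API:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main() -> None:
    # Keep between 2 and 10 requests in flight; within these bounds,
    # Crawlee still auto-scales based on available system resources.
    crawler = BeautifulSoupCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=2,
            max_concurrency=10,
        ),
    )
    # ...register a request handler and call crawler.run() as usual

if __name__ == '__main__':
    asyncio.run(main())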

Installation

We can install Crawlee with pip using the package name crawlee:

pip install crawlee

If we plan to use BeautifulSoup or Playwright with Crawlee, we need to install it with the respective extras:

pip install 'crawlee[beautifulsoup]'
# OR
pip install 'crawlee[playwright]'

We can also install it with both the extras:

pip install 'crawlee[beautifulsoup,playwright]'

If we installed the Playwright extra, we also need to install the browser binaries it uses:

playwright install

Basic Usage

Let's dive into Crawlee Python with a very basic illustration of its use. We'll try to get a list of all the blog posts published on our website (https://scrapingbee.com/blog).

ScrapingBee's Blog Page

Above is a screenshot of our blog's homepage. It shows individual blog cards and pagination links. In our code, we'll visit each page and scrape information from all the blog cards on that page.

Now, let's see the code:

import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    # initialize the crawler
    crawler = BeautifulSoupCrawler()

    # define how to handle each request's response
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        url = context.request.url
        context.log.info(f'\nCrawling URL: {url}')

        # Extract Info from Blog Links
        blog_link_els = context.soup.select("#content>div a.shadow-card")
        context.log.info(f'Found {len(blog_link_els)} Blogs')
        for el in blog_link_els:
            await context.push_data({
                "url": el.get("href"),
                "title": el.select_one("h4").text,
                "author": el.select_one("strong").text,
                "date": el.select_one("time").get("datetime"),
            })

        # Add more pages to the queue
        await context.enqueue_links(selector="ul.paging a")

    # Start the crawler with a URL
    await crawler.run(['https://www.scrapingbee.com/blog/'])

if __name__ == '__main__':
    asyncio.run(main())

To accomplish this, we used the BeautifulSoupCrawler, as there is no JavaScript-rendered content and plain HTTP requests were sufficient to get what we needed. The key part of our code is inside the main function, where we initialize the crawler, define how each request should be handled, and start the run with one URL.

The Crawlee framework works by using a queue, and the URLs we specify with the run function are initially added to the queue. The crawler runs until all the URLs in the queue are visited. Each time a URL is visited, the request handler which we have defined will be executed with that URL.

Let's look at our request handler in detail. Firstly, Crawlee supplies it with a helpful context object that we can use to get the URL of the page visited, the title, etc. Since we're using the BeautifulSoup crawler, we also have a context.soup object, which is a BeautifulSoup object. In the first step, we extract the necessary info from the blog cards and store it using the context.push_data method provided by Crawlee.

In the second step, we take all the links from the pagination and add them to the crawl queue using a CSS selector. Crawlee takes care to crawl each URL only once, even though we may add the same URL to the queue multiple times.

Every time we call the context.push_data function, the dictionary we pass is saved as a JSON file in the storage/datasets/default directory. The JSON files are sequentially numbered as 000000001.json, 000000002.json, and so on. Our run yielded 158 JSON files. One example is below:

{
  "url": "/blog/how-to-scrape-tiktok/",
  "title": "How to Scrape TikTok in 2024: Scrape Profile Stats and Videos",
  "author": "Ismail Ajagbe",
  "date": "2021-09-27"
}
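
Instead of reading the individual JSON files, we can also export the whole dataset to a single file at the end of the run using the crawler's export_data method (the same method we'll use in the Playwright example below). The file name here is just an example:

    # Start the crawler with a URL
    await crawler.run(['https://www.scrapingbee.com/blog/'])

    # Export everything collected via context.push_data to one JSON file
    await crawler.export_data('blogs.json')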

So, that's a quick example to get familiar with how we're supposed to work with Crawlee. To accomplish the same task without Crawlee, we would surely have written more code. Next, let's see Crawlee handle more complex tasks.

Crawlee Playwright Usage

We can also use a Playwright crawler with Crawlee, which spins up a headless browser to visit the enqueued URLs. A headless browser, by definition, runs without showing a graphical interface, but we can have it run with the interface visible if we want to. For each page that opens in the browser, the request handler we write is executed, where we can get data from that page or enqueue more links from it to be visited.

To illustrate this, let's redo part of the scraping from a previous blog of ours, where we studied Amazon's best-selling and most-read books. In that post, we used the ScrapingBee API to render each book's product page and get some metadata from those pages. Amazon blocked us when we initially tried to do it without a headless browser.

Here, let's do it using Crawlee and Playwright.

Amazon Charts Landing Page

Above is a screenshot of amazon.com/charts, with the Most Read Fiction chart shown by default. It has 20 book cards. For the sake of demonstration, let's scrape just this chart and pick up all the book cards from there.

Then, let's visit each book's URL and get more data about the books. Let's see the code to accomplish this:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,
        headless=False,
        browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # queue the book links when on chart page
        if "/charts" in context.request.url:
            await context.enqueue_links(selector="div.kc-horizontal-rank-card a.kc-cover-link")
        # when on product page, extract data
        elif "/dp/" in context.request.url or "/gp/" in context.request.url:
            title_el = await context.page.query_selector("#productTitle")
            author_el = await context.page.query_selector("#bylineInfo")
            description_el = await context.page.query_selector("div.a-expander-content")

            await context.push_data({
                "url": context.request.url,
                "title": await title_el.inner_text(),
                "author": await author_el.inner_text(),
                "description": await description_el.inner_text(),
            })

    # start the crawler with an initial list of URLs.
    await crawler.run(['https://www.amazon.com/charts'])

    # export the data
    await crawler.export_data('books.json')

if __name__ == '__main__':
    asyncio.run(main())

The first thing we notice about the above code is that it is very similar in structure to the previous example, except that we initialize a PlaywrightCrawler instead of a BeautifulSoupCrawler. We've also passed some parameters while initializing the crawler: limiting the number of requests per crawl, disabling headless mode, and choosing the Firefox browser by specifying browser_type='firefox'.

We started the crawler with the URL of the charts page, and in the request handler, we have two parts. If the URL is the charts page, we simply enqueue all the book links, and if the URL is a product page, then we extract some data there. Once the crawler finishes running, we also write all the extracted data to a JSON file using crawler.export_data.
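
As a side note, instead of branching on the URL inside a single handler, we could also tag the enqueued links with a label and register a separate handler for that label on Crawlee's router. This is a minimal sketch, assuming the label argument of enqueue_links and the router.handler decorator behave as described in the Crawlee docs:

    @crawler.router.default_handler
    async def chart_handler(context: PlaywrightCrawlingContext) -> None:
        # On the charts page: enqueue the book links and tag them with a label
        await context.enqueue_links(
            selector='div.kc-horizontal-rank-card a.kc-cover-link',
            label='BOOK',
        )

    @crawler.router.handler('BOOK')
    async def book_handler(context: PlaywrightCrawlingContext) -> None:
        # Only requests enqueued with the 'BOOK' label end up here
        title_el = await context.page.query_selector('#productTitle')
        if title_el:
            await context.push_data({
                'url': context.request.url,
                'title': await title_el.inner_text(),
            })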

Since we disabled headless mode, we can see the automated browser in action: Crawlee Playwright

In the above video, we can see the code open browser windows for the URLs, read the data, and close them. We can also see concurrency in action, with multiple browser windows opening different URLs. Finally, all the data is saved to books.json.

Crawlee Proxy Management

Crawlee also offers us a neat way of using and managing multiple proxies for crawling. Let's see a demonstration of this by making requests to ipinfo.io and checking the location detected in the response. A different location each time would essentially mean that the request went through a different proxy.

import asyncio
import json

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'your-proxy-url-1',
            'your-proxy-url-2',
            'your-proxy-url-3',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # read raw HTTP response
        # this is a JSON, no scraping required
        data = json.loads(context.http_response.read().decode())

        # print the URL and relevant ipinfo
        print(f'{context.request.url}: {data["city"]}, {data["region"]}')

    # start the crawler with an initial list of requests.
    await crawler.run([
        'https://ipinfo.io/?i=1',
        'https://ipinfo.io/?i=2',
        'https://ipinfo.io/?i=3',
        'https://ipinfo.io/?i=4',
    ])


if __name__ == '__main__':
    asyncio.run(main())

In the above code, we run a crawler with 3 proxies. We also supply 4 URLs to crawl, which are essentially the same URL with junk query parameters to make each request unique. For each request, we print the URL and the detected location.

The output of the code is below:

https://ipinfo.io/?i=1: Lake Mary, Florida
https://ipinfo.io/?i=2: Zagreb, Zagreb
https://ipinfo.io/?i=3: Gladeview, Florida
https://ipinfo.io/?i=4: Lake Mary, Florida

The proxies are rotated in a round-robin fashion, so we see 3 unique locations in the output because we used 3 proxy URLs. The proxy used for the 1st and 4th URLs is the same.
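
To confirm which proxy served each request, we could also log it from inside the handler. This is a minimal sketch, assuming the crawling context exposes a proxy_info attribute as described in the Crawlee docs:

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        data = json.loads(context.http_response.read().decode())

        # context.proxy_info (when a proxy is configured) describes the proxy used
        proxy_url = context.proxy_info.url if context.proxy_info else None
        print(f'{context.request.url}: {data["city"]}, {data["region"]} (via {proxy_url})')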

IP & Session Management

With minimal additions to the code, we can also persist the session for each proxy, i.e., have one session mapped to each proxy, and reuse the session whenever that proxy is used.

Let's see the code for this:

async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'your-proxy-url-1',
            'your-proxy-url-2',
            'your-proxy-url-3',
        ]
    )
    crawler = BeautifulSoupCrawler(
        proxy_configuration=proxy_configuration,
        use_session_pool=True,
        persist_cookies_per_session=True,
    )
    # ...rest of the above code

Tiered Proxies

Not all proxies are equal. Some proxies cost more but have a much lower chance of getting blocked. Hence, a natural strategy is to use a cheaper proxy first and switch to an expensive one only if the cheaper proxy fails.

This can be accomplished in Crawlee with very little code:

async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
          # first tier of proxies
          ['free-proxy-url-1', 'free-proxy-url-2'],

          # if the above fail, the next tier will be used
          ['premium-proxy-url-1', 'premium-proxy-url-2'],
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)
    # ...rest of the above code

Summary

In this guide, we looked at the key features of the newly released Crawlee Python library. To demonstrate each of the key features, we wrote some code to accomplish a task.

First, we looked at a basic HTTP scraper, then tried our hands at something that required browser automation, and finally, we saw how to manage Proxies.

While all these techniques and features have been used in scraping for a long time, the main value of Crawlee is that it bundles all of these things together and enables us to write clean, maintainable code.

Karthik Devan

I work freelance on full-stack development of apps and websites, and I'm also trying to work on a SaaS product. When I'm not working, I like to travel, play board games, hike and climb rocks.