Scrapy Playwright Tutorial: How to Scrape Dynamic Websites

06 January 2025 | 25 min read

Playwright for Scrapy enables you to scrape JavaScript-heavy dynamic websites at scale, with advanced web scraping features out of the box.

In this tutorial, we’ll show you the ins and outs of scraping with this popular browser automation library, originally developed by Microsoft, and how to combine it with Scrapy to extract the content you need with ease.

We’ll cover everything from setting up your Python environment and submitting form data through to dealing with infinite scroll and scraping multiple pages.

What is Scrapy-Playwright?

Scrapy-Playwright is a Scrapy plugin that integrates the Playwright browser automation library into Scrapy’s request/response cycle, so your spiders can render and interact with JavaScript-driven pages in a real browser. Here are some of its key benefits:

  • Extract Dynamic Content: Scrape content that’s rendered by JavaScript, such as interactive elements and data loaded after page load.
  • Automate User Interactions: Perform actions like clicking buttons, scrolling, or waiting for specific page events.
  • Scrape Modern Websites: Effectively scrape modern websites, including single-page applications (SPAs) that rely heavily on JavaScript.

💡In my experience, Scrapy-Playwright is an excellent integration. It’s valuable for scraping websites where content only becomes visible after interacting with the page—whether it’s clicking buttons or waiting for JavaScript to load—while still leveraging Scrapy’s powerful crawling capabilities.

How to Use Scrapy-Playwright

Let’s go step-by-step on how to set up and use Scrapy-Playwright effectively.

1. Setup and Prerequisites

Before getting started with Scrapy-Playwright, make sure your development environment is properly set up:

  • Install Python: Download and install the latest version of Python from the official Python website.
  • Choose an IDE: Use a code editor like Visual Studio Code for your development.
  • Basic Knowledge: It’s helpful to be familiar with CSS selectors and browser DevTools for inspecting page elements.
  • Library Knowledge: Having a basic understanding of Scrapy and Playwright is recommended, but this guide will walk you through the process step by step.

To keep your project dependencies isolated, it’s a best practice to use a virtual environment. You can create one using the venv module in Python.

python -m venv scrapy-playwright-tuto

Activate the virtual environment:

  • On Windows:

    cd scrapy-playwright-tuto
    Scripts\activate
    
  • On macOS/Linux:

    source scrapy-playwright-tuto/bin/activate
    

Next, install Scrapy-Playwright and the browser binaries with the following commands:

pip install scrapy-playwright
playwright install  # This will install Firefox, WebKit, and Chromium

If you only need a specific browser (e.g., Chromium), you can install it with:

playwright install chromium

For this tutorial, we'll be using Chromium, so feel free to install just that.

2. Creating a Scrapy Project

Let’s start by creating a Scrapy project and setting up a spider.

Create a new Scrapy project using the following command:

scrapy startproject scrpy_plwrt_demo

Navigate into the project directory:

cd scrpy_plwrt_demo

Generate a new spider called google_finance to scrape the Google Finance page for NVIDIA’s stock data:

scrapy genspider google_finance https://www.google.com/finance/quote/NVDA:NASDAQ

Move back to the parent directory and open the project in your code editor for further coding:

cd ..
code .

At this point, the files for your project are in place.
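A freshly generated Scrapy project typically looks like this (a sketch of the usual layout; the exact files can vary slightly between Scrapy versions):

scrpy_plwrt_demo/
├── scrapy.cfg
└── scrpy_plwrt_demo/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── google_finance.py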

Now, you're ready to start building the spider and integrating Playwright into your project!

3. Configuring Scrapy Settings for Playwright

To integrate Playwright with Scrapy, you'll need to update Scrapy's settings so that it can handle JavaScript-heavy websites. Here's how to configure Scrapy-Playwright.

  1. Replace Scrapy’s default HTTP/HTTPS download handlers with the ones provided by Scrapy-Playwright. Add the following lines to your settings.py file:

    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    

    This configuration ensures that requests flagged with the playwright=True meta key will be processed by Playwright. Requests without this flag will be handled by Scrapy’s default download handler.

  2. Playwright requires an asyncio-compatible Twisted reactor to handle asynchronous tasks. Add this line to ensure Scrapy uses the required AsyncioSelectorReactor:

    # settings.py
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    

    Without this setting, Scrapy's default reactor won’t support Playwright, and asynchronous requests will fail.

Here's how your settings.py should look after the necessary changes:

# Enable Scrapy-Playwright
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

### (Optional)
PLAYWRIGHT_BROWSER_TYPE = "chromium"  # Choose 'chromium', 'firefox', or 'webkit'
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # Set to True if you prefer headless mode
}

Here, PLAYWRIGHT_BROWSER_TYPE specifies which browser Playwright will use. PLAYWRIGHT_LAUNCH_OPTIONS customizes Playwright’s browser launch options, such as enabling or disabling headless mode.
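These launch options are passed straight through to Playwright’s browser launch call, so any of its keyword arguments can be set here. As a rough sketch (the values below are illustrative, not recommendations):

# settings.py -- illustrative values only
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,           # Run without a visible browser window
    "slow_mo": 250,             # Slow each browser operation down by 250 ms (handy when debugging)
    "timeout": 30 * 1000,       # Give up launching the browser after 30 seconds
    "args": ["--disable-gpu"],  # Extra command-line flags passed to Chromium
}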

4. Scraping JavaScript-Rendered Websites

To scrape JavaScript-rendered content, you need to enable Playwright for specific requests in Scrapy. You can do this by passing the meta dictionary with the key "playwright": True in your Scrapy Request.

Here’s an example that shows how to scrape stock prices from Google Finance using Scrapy-Playwright:

import scrapy
from scrapy_playwright.page import PageMethod

class GoogleFinanceSpider(scrapy.Spider):
    name = "google_finance"

    def start_requests(self):
        # Send a request to Google Finance with Playwright enabled
        yield scrapy.Request(
            url="https://www.google.com/finance/quote/GOOG:NASDAQ",
            meta={
                "playwright": True,  # Enables Playwright for JavaScript rendering
                "playwright_page_methods": [
                    # Wait for the stock price element to appear
                    PageMethod("wait_for_selector", "div.YMlKec.fxKbKc"),
                ],
            },
            callback=self.parse,  # Call the parse method after the page loads
        )

    def parse(self, response):
        # Extract the stock price from the loaded page
        price = response.css("div.YMlKec.fxKbKc::text").get()
        if price:
            self.log(f"Google stock price: {price}")
        else:
            self.log("Failed to extract the price.")

Explanation:

  • In the start_requests() method, the meta dictionary with "playwright": True activates Playwright for handling JavaScript on the page.
  • The playwright_page_methods parameter uses PageMethod("wait_for_selector", "div.YMlKec.fxKbKc") to ensure Playwright waits for the stock price element to load before proceeding.
  • The parse() method extracts the stock price using CSS selectors once the page is fully rendered.

To run this spider, follow these steps:

cd scrpy_plwrt_demo
scrapy crawl google_finance

When executed (with headless mode disabled), you’ll see the browser open and navigate to the Google Finance page. Once the stock price element has loaded, the spider logs it:

Google stock price: $163.18

5. Automating Button Clicks with Scrapy-Playwright

Sometimes, you may need to interact with web pages by clicking buttons to access some hidden data. With Scrapy-Playwright, you can easily automate this by using the PageMethod function.

Here’s the code:

import scrapy
from scrapy_playwright.page import PageMethod

class GoogleFinanceSpider(scrapy.Spider):
    name = "google_finance"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/finance/quote/NVDA:NASDAQ",
            meta={
                "playwright": True,  # Enable Playwright for JavaScript rendering
                "playwright_include_page": True,  # Include Playwright page object in the response
                "playwright_page_methods": [
                    # Click the Share button
                    PageMethod("click", "button[aria-label='Share']"),
                    # Click the Copy link button
                    PageMethod("click", "button[jsname='SbVnGf']"),
                ],
            },
            callback=self.parse,
        )

When you run the spider, it will first click the Share button, which opens the share popup on the page. It will then click the Copy link button, copying the page link to the clipboard.

6. Waiting for Dynamic Content

Web elements might not load immediately, which can cause your scraper to fail if it tries to interact with elements before they are fully available. To address this, you can instruct Playwright to wait for specific elements to appear before performing any actions, using PageMethod("wait_for_selector", selector).

Here’s the modified code:

import scrapy
from scrapy_playwright.page import PageMethod

class GoogleFinanceSpider(scrapy.Spider):
    name = "google_finance"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/finance/quote/NVDA:NASDAQ",
            meta={
                "playwright": True,  # Enable Playwright for JavaScript rendering
                "playwright_include_page": True,  # Include Playwright page object
                "playwright_page_methods": [
                    # Wait for the Share button to appear before interacting
                    PageMethod("wait_for_selector", "button[aria-label='Share']"),
                    # Click the Share button
                    PageMethod("click", "button[aria-label='Share']"),
                    # Wait for the Copy link button to appear before interacting
                    PageMethod("wait_for_selector", "button[jsname='SbVnGf']"),
                    # Click the Copy link button
                    PageMethod("click", "button[jsname='SbVnGf']"),
                ],
            },
        )

7. Waiting for Page Load States

In some cases, waiting for individual elements isn’t enough, especially when dealing with pages that load content dynamically or have ongoing network activity. To handle such scenarios, Playwright provides the wait_for_load_state() method, which lets you wait for the page to reach a specific loading state.

Available Load States:

  • "load": The page has fully loaded.
  • "domcontentloaded": The DOM has been completely loaded.
  • "networkidle": No more than two network connections are active for at least 500 ms, making it useful for pages with continuous network requests.

For static pages, waiting for the "load" state is typically sufficient. However, for dynamic websites or pages with frequent background activity, the "networkidle" state is more reliable.

Here’s how you can implement page load states in your spider:

import scrapy
from scrapy_playwright.page import PageMethod

class GoogleFinanceSpider(scrapy.Spider):
    name = "google_finance"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/finance/quote/NVDA:NASDAQ",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the network to be idle
                    PageMethod("wait_for_load_state", "networkidle"),
                    # Wait for the Share button to appear
                    PageMethod("wait_for_selector", "button[aria-label='Share']"),
                    # Click the Share button
                    PageMethod("click", "button[aria-label='Share']"),
                    # Wait for the Copy link button to appear
                    PageMethod("wait_for_selector", "button[jsname='SbVnGf']"),
                    # Click the Copy link button
                    PageMethod("click", "button[jsname='SbVnGf']"),
                ],
            },
            callback=self.parse,
        )

8. Waiting for a Specific Amount of Time

In Playwright, you can introduce a delay between actions using the wait_for_timeout() method, which pauses the script for a specified duration (in milliseconds). While it’s generally better to rely on event-based waits like wait_for_selector() or wait_for_load_state(), wait_for_timeout() can be useful in situations where precise timing is needed, such as waiting for animations or transitions to complete.

You can pause between actions using:

PageMethod("wait_for_timeout", milliseconds)

Here, milliseconds refers to the duration of the pause. For example, 2000 would pause for 2 seconds.

In the following code, the scraper waits for 2 seconds between clicking the Share and Copy link buttons on the Google Finance page:

import scrapy
from scrapy_playwright.page import PageMethod

class GoogleFinanceSpider(scrapy.Spider):
    name = "google_finance"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/finance/quote/NVDA:NASDAQ",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the network to be idle before interacting with the page
                    PageMethod("wait_for_load_state", "networkidle"),
                    # Wait for the Share button to appear
                    PageMethod("wait_for_selector", "button[aria-label='Share']"),
                    # Click the Share button
                    PageMethod("click", "button[aria-label='Share']"),
                    # Wait for 2 seconds before the next action
                    PageMethod("wait_for_timeout", 2000),
                    # Wait for the Copy link button to appear
                    PageMethod("wait_for_selector", "button[jsname='SbVnGf']"),
                    # Click the Copy link button
                    PageMethod("click", "button[jsname='SbVnGf']"),
                    # Wait for 2 seconds before closing the browser
                    PageMethod("wait_for_timeout", 2000),
                ],
            },
            callback=self.parse,
        )

Best Practices:

  • Whenever possible, use wait_for_selector() or wait_for_load_state() for more efficient, event-driven waits.
  • While wait_for_timeout() can handle specific cases like animations or slow-loading elements, overuse may lead to unnecessary delays in your scraping process.

9. Typing Text into an Input Field

You can use PageMethod("fill", ...) to call Playwright’s fill() method and type text into form fields. Below is an example that navigates to YouTube, types "NASA" into the search bar, and triggers a search.

import scrapy
from scrapy_playwright.page import PageMethod

class YoutubeSearchSpider(scrapy.Spider):
    name = "youtube_search"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.youtube.com",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the network to become idle
                    PageMethod("wait_for_load_state", "networkidle"),
                    # Wait for the search input field to become available
                    PageMethod("wait_for_selector", "input#search"),
                    # Focus on the search input field
                    PageMethod("click", "input#search"),
                    # Type the search query 'NASA' into the search box
                    PageMethod("fill", "input#search", "NASA"),
                    # Wait briefly to ensure the input is processed
                    PageMethod("wait_for_timeout", 1000),
                    # Press Enter to initiate the search
                    PageMethod("press", "input#search", "Enter"),
                    # Wait for the search results to appear
                    PageMethod("wait_for_selector", "ytd-channel-renderer"),
                ],
            },
            callback=self.parse,
        )

To run the spider, use the following command:

cd scrpy_plwrt_demo
scrapy crawl youtube_search

Once the spider runs successfully, you’ll see the search results for "NASA" displayed on YouTube in the browser window.

10. Extracting Text from an Element

A common task in web scraping is extracting text from specific elements, such as product descriptions, prices, or, in this case, YouTube subscriber counts. Scrapy-Playwright makes this straightforward by allowing you to interact with and extract text from the page’s HTML structure.

In this example, we’ll extract the subscriber count from a YouTube channel page. Here’s how to do it:

import scrapy
from scrapy_playwright.page import PageMethod

class YoutubeSearchSpider(scrapy.Spider):
    name = "youtube_search"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.youtube.com",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_load_state", "networkidle"),
                    PageMethod("wait_for_selector", "input#search"),
                    PageMethod("click", "input#search"),
                    PageMethod("fill", "input#search", "NASA"),
                    PageMethod("wait_for_timeout", 1000),
                    PageMethod("press", "input#search", "Enter"),
                    PageMethod("wait_for_selector", "ytd-channel-renderer"),
                ],
            },
            callback=self.parse,
        )

    async def parse(self, response):
        # Extract the number of subscribers
        subscribers = response.css("#video-count::text").get()
        if subscribers:
            self.log(f"NASA Channel Subscribers: {subscribers.strip()}")
        else:
            self.log("Could not find the subscriber count.")

The response.css("#video-count::text").get() method is used to extract the subscriber count from the page using the CSS selector #video-count.

When the spider successfully runs, it should log the subscriber count for the NASA channel, like this:

NASA Channel Subscribers: 12.2M subscribers

11. Taking Screenshots

Screenshots are very useful when scraping websites, as they allow you to verify the state of a webpage at any point during the scraping process. This is particularly helpful for debugging, tracking page changes, and ensuring that your scraper interacts with the page as expected. Playwright's built-in screenshot() method makes it easy to capture screenshots, and with Scrapy-Playwright, you can use PageMethod to take screenshots during your scraping tasks.

To capture a screenshot using Scrapy-Playwright, you can use:

PageMethod("screenshot", path="<screenshot_file_name>.<format>")

Where <format> can be either jpg or png.

Additional Screenshot Options:

  • full_page=True: Captures the entire webpage, not just the visible portion.
  • omit_background=True: Removes the background (usually white) and captures the page with transparency.

Example of a full-page screenshot with transparency:

PageMethod(
    "screenshot",
    path="full_page_screenshot.png",
    full_page=True,
    omit_background=True
)

Here’s the code that navigates to YouTube and takes a screenshot of the page:

import scrapy
from scrapy_playwright.page import PageMethod

class YoutubeSearchSpider(scrapy.Spider):
    name = "youtube_search"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.youtube.com",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the page and network activity to settle
                    PageMethod("wait_for_load_state", "networkidle"),
                    # Take a screenshot
                    PageMethod("screenshot", path="youtube.png", full_page=True),
                ],
            },
        )

Once the spider is executed, the screenshot will be saved under the name you specified, capturing the YouTube homepage as it was rendered.

12. Running Custom JavaScript Code

With Scrapy-Playwright, you can run custom JavaScript directly on the pages you're scraping, which is useful for simulating user interactions or manipulating page content when Playwright’s API doesn’t support the action you need. For example, you can use JavaScript to scroll a page, update the DOM, or modify the displayed text.

In this example, we’ll modify the stock price on Google Finance from USD to EUR using a mock conversion rate. The custom JavaScript will:

  1. Select the stock price element using querySelector().
  2. Convert the USD price to EUR using a conversion rate.
  3. Replace the content with the updated EUR price.

Here’s the JavaScript we’ll run:

const element = document.querySelector('div.YMlKec.fxKbKc');
if (element) {
    let usdPrice = parseFloat(element.innerText.replace(/[$,]/g, ''));  // Remove dollar signs and commas
    let eurPrice = (usdPrice * 0.85).toFixed(2);  // Mock conversion rate from USD to EUR
    element.innerText = eurPrice + ' €';  // Update the price to EUR
}

We’ll pass this JavaScript to Playwright using the evaluate method within PageMethod(). The modified content will be available in the parse() method after the JavaScript runs.

import scrapy
from scrapy_playwright.page import PageMethod

class GoogleFinanceSpider(scrapy.Spider):
    name = "google_finance"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.google.com/finance/quote/GOOG:NASDAQ",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the stock price element to load
                    PageMethod("wait_for_selector", "div.YMlKec.fxKbKc"),
                    # Run the custom JavaScript to convert the price from USD to EUR
                    PageMethod(
                        "evaluate",
                        """
                        () => {
                            const element = document.querySelector('div.YMlKec.fxKbKc');
                            if (element) {
                                let usdPrice = parseFloat(element.innerText.replace(/[$,]/g, ''));  // Remove dollar signs and commas
                                let eurPrice = (usdPrice * 0.85).toFixed(2);  // Mock conversion rate (USD to EUR)
                                element.innerText = eurPrice + ' €';  // Replace price with EUR
                            }
                        }
                    """,
                    ),
                    # Take a screenshot after modifying the price
                    PageMethod(
                        "screenshot",
                        path="google_finance_eur_price.png",
                        full_page=True,
                    ),
                ],
            },
            callback=self.parse,  # Call the parse method after actions are completed
        )

    async def parse(self, response):
        # Extract the modified stock price in EUR after the page is loaded
        price = response.css("div.YMlKec.fxKbKc::text").get()
        if price:
            self.log(f"Google stock price (EUR): {price}")
        else:
            self.log("Failed to extract the price.")

Here are some key concepts:

  1. The PageMethod("evaluate") method allows you to run custom JavaScript on the page. In this case, we manipulate the stock price element.
  2. After running the JavaScript, we take a full-page screenshot using PageMethod("screenshot"), which captures the modified price.
  3. The parse() method logs the updated price in EUR, showing that the custom JavaScript worked as expected.

When you run the spider, the stock price will be logged in EUR:

[google_finance] DEBUG: Google stock price (EUR): 141.40 €

Additionally, a screenshot of the modified price will be saved as google_finance_eur_price.png.


13. Handle Infinite Scrolling

Infinite scrolling is a technique where more content is loaded dynamically as you scroll down a webpage. To scrape such pages, you need to implement custom scrolling logic.

The following JavaScript code will continuously scroll to the bottom of the page and wait for new content to load before stopping.

let previousHeight = 0;

while (true) {
    window.scrollBy(0, document.body.scrollHeight); // Scroll to the bottom
    await new Promise(r => setTimeout(r, 3000)); // Wait for new content to load
    
    let newHeight = document.body.scrollHeight;
    if (newHeight === previousHeight) {
        break; // Stop if no new content is loaded
    }
    previousHeight = newHeight;
}

Here’s how you can implement infinite scrolling in Scrapy-Playwright for the Nike page:

import scrapy
from scrapy_playwright.page import PageMethod

class NikeShoeSpider(scrapy.Spider):
    name = "nike_shoes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.nike.com/in/w/new-shoes-3n82yzy7ok",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod(
                        "wait_for_load_state", "networkidle"
                    ),  # Wait for initial page load
                    # Scroll the page until no more new content is loaded
                    PageMethod(
                        "evaluate",
                        """
                        async () => {
                            let previousHeight = 0;
                            while (true) {
                                window.scrollBy(0, document.body.scrollHeight);  // Scroll to the bottom
                                await new Promise(r => setTimeout(r, 3000));    // Wait for new content to load
                                
                                let newHeight = document.body.scrollHeight;
                                if (newHeight === previousHeight) {
                                    break;  // Stop if no new content is loaded
                                }
                                previousHeight = newHeight;
                            }
                        }
                        """,
                    ),
                ],
            },
            callback=self.parse,
        )

    async def parse(self, response):
        # Extract product names after scrolling
        product_names = response.css("div.product-card__title::text").getall()

        # Log the extracted product names
        for product in product_names:
            self.log(f"Product: {product}")

Explanation:

  1. PageMethod("evaluate"): Executes custom JavaScript directly on the page to simulate user scrolling.
  2. window.scrollBy(0, document.body.scrollHeight): Scrolls the page to the bottom, triggering the loading of new content.
  3. await new Promise(r => setTimeout(r, 3000)): Pauses the execution for 3 seconds, allowing time for the page to load more content.
  4. if (newHeight === previousHeight): If the page height doesn't change, the loop stops, indicating that no new content is being loaded.

Once the spider runs, it logs the names of the loaded products:

[nike_shoes] DEBUG: Product: Nike Metcon 9
[nike_shoes] DEBUG: Product: Nike Dunk Low Twist
[nike_shoes] DEBUG: Product: Air Max 1
[nike_shoes] DEBUG: Product: Nike Air Max 1 EasyOn
...


14. Scraping Multiple Pages

Websites often have "Next" or "More" buttons to load additional content. The goal is to click these buttons to retrieve the next batch of data.

On Hacker News, the "More" link at the bottom of each page is identified by the class morelink.

In the example below, we’ll scrape multiple pages from Hacker News by navigating to the "More" link. The spider will follow the "More" link to load subsequent pages until no more pages are available or a set page limit is reached.

import scrapy
from scrapy_playwright.page import PageMethod

class HackerNewsSpider(scrapy.Spider):
    name = "hacker_news"

    def __init__(self, max_pages=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_pages = int(max_pages) if max_pages else None  # Optional page limit
        self.pages_crawled = 0  # Track the number of pages crawled

    def start_requests(self):
        yield scrapy.Request(
            url="https://news.ycombinator.com/",
            meta={
                "playwright": True,
                "playwright_include_page": True,  # Include Playwright page object for reuse
                "playwright_page_methods": [
                    PageMethod(
                        "wait_for_load_state", "networkidle"
                    )  # Ensure the page fully loads
                ],
            },
            callback=self.parse,
        )

    async def parse(self, response):
        # Extract post titles from the current page
        titles = response.css("span.titleline > a::text").getall()

        # Log the extracted titles
        for title in titles:
            self.log(f"Post title: {title}")
        # Increment the counter for pages crawled
        self.pages_crawled += 1

        # Check if we have reached the max page limit (if defined)
        if self.max_pages and self.pages_crawled >= self.max_pages:
            self.log(f"Reached the maximum number of pages: {self.max_pages}")
            return  # Stop crawling
        # Find the "More" link to navigate to the next page
        more_link = response.css("a.morelink::attr(href)").get()
        if more_link:
            next_page_url = response.urljoin(more_link)

            # Reuse the same Playwright page for the next request
            page = response.meta["playwright_page"]

            yield scrapy.Request(
                url=next_page_url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                    "playwright_page": page,  # Reuse the current page
                    "playwright_page_methods": [
                        PageMethod("wait_for_load_state", "networkidle")
                    ],
                },
                callback=self.parse,
            )

Here are some key points:

  1. Page Limit: You can define a maximum number of pages to scrape using the max_pages argument. If max_pages is not set, the spider will continue scraping until no more "More" links are available.
  2. Page Reuse: To improve performance, we reuse the same Playwright page object when navigating to the next page. This avoids the overhead of opening a new browser session for each request.
  3. Pagination: The spider identifies the "More" link at the bottom of the page using the a.morelink selector and sends a new request to the next page if the link is found.

To run the spider and scrape up to 5 pages, use the following command:

scrapy crawl hacker_news -a max_pages=5

If you don’t set the max_pages argument, the spider will continue until there are no more "More" links.

Here’s a sample of the extracted post titles:

[hacker_news] DEBUG: Post title: Gleam: A Basic Introduction
[hacker_news] DEBUG: Post title: Bear is a tool that generates a compilation database for Clang tooling
[hacker_news] DEBUG: Post title: A Lisp compiler to RISC-V written in Lisp
[hacker_news] DEBUG: Post title: Grokking at the edge of linear separability
[hacker_news] DEBUG: Post title: Techie took five minutes to fix problem Adobe and Microsoft couldn't solve
...

15. Capturing AJAX Data

Many modern websites use AJAX (Asynchronous JavaScript and XML) to update specific parts of a webpage dynamically without a full reload. Capturing AJAX data allows you to directly retrieve structured data (such as JSON) returned from server requests, making the scraping process more efficient than parsing HTML.

For example, the Nike website uses AJAX to load more products as you scroll. Instead of scraping the entire HTML, you can intercept AJAX requests to retrieve the product data directly.

If you open your browser’s DevTools Network tab while the page loads more products, you can see these AJAX requests going out to Nike’s API and inspect the JSON data they return.

Below is the code to capture AJAX requests:

import json
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class NikeShoeSpider(Spider):
    name = "nike_shoe"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.captured_ajax_data = []  # List to store captured AJAX data

    def start_requests(self):
        # Initial request to load the target page
        yield Request(
            url="https://www.nike.com/w/mens-dunk-shoes-90aohznik1zy7ok",
            meta={
                "playwright": True,
                "playwright_include_page": True,  # Include Playwright page object
                "playwright_page_methods": [
                    PageMethod(
                        "wait_for_load_state", "networkidle"
                    ),  # Wait for AJAX requests to finish
                ],
            },
            callback=self.parse,
        )

    async def parse(self, response):
        # Retrieve Playwright page object from the response metadata
        page = response.meta["playwright_page"]

        # Function to intercept and capture AJAX requests
        async def intercept_request(route):
            request = route.request
            # Intercept AJAX requests to Nike's API
            if "api.nike.com" in request.url and request.method == "GET":
                # Fetch the intercepted request's response
                response = await route.fetch()
                body = await response.text()  # Get response body as text
                try:
                    # Parse response body as JSON
                    ajax_data = json.loads(body)
                    self.captured_ajax_data.append(ajax_data)  # Store the JSON data
                    self.logger.info(f"Captured data from: {request.url}")
                except json.JSONDecodeError:
                    self.logger.warning(f"Failed to parse JSON from: {request.url}")
            # Continue processing the request
            await route.continue_()

        # Intercept network requests to capture AJAX data
        await page.route("**/*", intercept_request)

        # Function to scroll the page to load more data via AJAX
        async def scroll_page():
            previous_height = await page.evaluate("document.body.scrollHeight")
            # Scroll to the bottom of the page
            while True:
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(2000)
                new_height = await page.evaluate("document.body.scrollHeight")
                if new_height == previous_height:
                    break
                previous_height = new_height

        # Scroll to load more products via AJAX
        await scroll_page()

        # Wait for network activity to settle after scrolling
        await page.wait_for_load_state("networkidle")

        # Close the Playwright page
        await page.close()

        # Save captured AJAX data to JSON
        self.save_data_to_json()

    def save_data_to_json(self):
        # Save captured AJAX data into a local JSON file
        file_path = "captured_nike_data.json"
        with open(file_path, "w") as f:
            json.dump(self.captured_ajax_data, f, indent=4)
        self.log(f"Saved AJAX data to {file_path}")

The page.route() method intercepts all network requests, and the intercept_request() function captures the JSON responses of requests to Nike’s API while letting all other requests pass through. The scroll_page() function scrolls to the bottom of the page to trigger additional product loads via AJAX.

Once the spider completes, you’ll have a JSON file containing all the captured AJAX data. This file will include all the product details that were dynamically loaded on the Nike page.


16. Automating Form Submissions

Automating form submissions with Playwright is a powerful way to simulate user interactions like logging into websites or submitting search queries. In this example, we will show how to use Scrapy-Playwright to automate the login process on Hacker News. The script will:

  1. Fill in the login form with your username and password.
  2. Submit the form.
  3. Take a screenshot to confirm that the login was successful.

Here’s the code:

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class HackerNewsLoginSpider(Spider):
    name = "hn_login"

    def start_requests(self):
        yield Request(
            url="https://news.ycombinator.com/login",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Fill in the username and password fields
                    PageMethod(
                        "fill", 'input[name="acct"]', "your_username"
                    ),  # Replace with your username
                    PageMethod(
                        "fill", 'input[name="pw"]', "your_password"
                    ),  # Replace with your password
                    # Click the login button
                    PageMethod("click", 'input[type="submit"]'),
                    # Take a screenshot after logging in
                    PageMethod("screenshot", path="hn_login.png", full_page=False),
                ],
            },
        )

You might run into an issue where Scrapy blocks the request because of the robots.txt file:

[scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://news.ycombinator.com/login>

In case you’re not familiar with it, the robots.txt file tells web crawlers which pages they are allowed to access. Scrapy respects this file by default, which may prevent you from accessing the login page.

If you need to access the page, you can disable the robots.txt middleware by adding this setting in your settings.py:

ROBOTSTXT_OBEY = False

Note: Disabling the robots.txt middleware can be risky, and it’s essential to check the terms of service for the website.

After running the spider, a screenshot of the logged-in state will be saved as hn_login.png.

17. Running Parallel Requests to Test Session Persistence

When scraping websites that require login, session persistence is critical for maintaining access to authenticated pages. This ensures the spider remains logged in while making subsequent requests, avoiding the need to log in repeatedly.

Here's how to implement session persistence with Scrapy-Playwright:

import scrapy
from scrapy_playwright.page import PageMethod

class HackerNewsSpider(scrapy.Spider):
    name = "parallel_sessions"

    def start_requests(self):
        # Log in to Hacker News
        yield scrapy.Request(
            url="https://news.ycombinator.com/login",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Fill in the login form
                    PageMethod(
                        "fill", 'input[name="acct"]', "your_username"
                    ),  # Replace with your username
                    PageMethod(
                        "fill", 'input[name="pw"]', "your_password"
                    ),  # Replace with your password
                    PageMethod(
                        "click", 'input[value="login"]'
                    ),  # Click the login button
                    PageMethod(
                        "wait_for_load_state", "networkidle"
                    ),  # Wait for the post-login page to finish loading
                ],
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # URLs to visit after login
        urls = [
            "https://news.ycombinator.com/?p=2",
            "https://news.ycombinator.com/?p=3",
            "https://news.ycombinator.com/?p=4",
            "https://news.ycombinator.com/?p=5",
            "https://news.ycombinator.com/?p=6",
        ]

        # Send parallel requests to other pages while maintaining the logged-in session
        for url in urls:
            yield scrapy.Request(
                url=url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,  # Use the logged-in session
                },
                callback=self.parse_headlines,
            )

    async def parse_headlines(self, response):
        # Access the Playwright page object
        page = response.meta["playwright_page"]

        # Extract headlines from the page
        headlines = await page.query_selector_all(".titleline > a")
        for headline in headlines:
            text = await headline.inner_text()
            yield {"headline": text}
        # Close the Playwright page after processing
        await page.close()

Because all of these requests run in the same Playwright browser context by default, the cookies set during login are reused by the subsequent requests, so the session persists. Setting playwright_include_page=True simply exposes the Playwright page object in each response so the callback can work with it (and close it when done).

18. Aborting Unwanted Requests

When scraping web pages, unnecessary requests like images, ads, and external scripts can slow down the process and consume bandwidth. By intercepting and aborting these requests, you can make your scraper faster and more efficient. Scrapy-Playwright allows you to filter out such requests, helping you to focus only on the actual data.

Here’s how to implement this:

import scrapy
from scrapy_playwright.page import PageMethod

def abort_request(request):
    return (
        request.resource_type in ["image", "media", "stylesheet"]  # Block resource-heavy types
        or any(ext in request.url for ext in [".jpg", ".png", ".gif", ".css", ".mp4", ".webm"])  # Block specific file extensions
    )

class AbortRequestSpider(scrapy.Spider):
    name = "abort_request"

    custom_settings = {
        "PLAYWRIGHT_ABORT_REQUEST": abort_request,  # Aborting unnecessary requests
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.nike.com/w/mens-lifestyle-shoes-13jrmznik1zy7ok",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_load_state", "domcontentloaded"),
                ],
            },
            callback=self.parse,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        # Scroll down until no more new items are loaded
        previous_height = -1
        while True:
            current_height = await page.evaluate("() => document.body.scrollHeight")
            if current_height == previous_height:
                break
            previous_height = current_height
            await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
            await page.wait_for_timeout(3000)

        # Extract product names from the page
        content = await page.content()
        response = scrapy.Selector(text=content)
        product_names = response.css('div.product-card__title::text').getall()

        # Log the extracted product names
        for product in product_names:
            self.log(f"Product: {product}")

        # Close the Playwright page to release resources
        await page.close()

In this code, we implement a function that blocks requests for resource-heavy elements such as images, media files, and stylesheets. The PLAYWRIGHT_ABORT_REQUEST setting applies the abort_request logic to intercept and block unwanted requests during scraping.


The extracted product names:

[abort_request] DEBUG: Product: Nike Dunk Low By You
[abort_request] DEBUG: Product: Nike LeBron 9 Low
[abort_request] DEBUG: Product: Air Jordan 6 Retro Low x PSG
...

19. Managing Playwright Pages and Contexts

When scraping multiple pages, efficient management of browser pages and contexts is important to avoid overloading your system.

The PLAYWRIGHT_MAX_PAGES_PER_CONTEXT setting in Scrapy-Playwright helps you limit the number of open pages within a single browser context. Here’s how you can set this in your settings.py file:

PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 5  # Allow up to 5 pages per context

In addition to managing Playwright pages, you can also control Scrapy’s own concurrency: the CONCURRENT_REQUESTS setting defines how many requests Scrapy processes at once.

# settings.py
CONCURRENT_REQUESTS = 8  # Set to 8 concurrent requests at a time
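
scrapy-playwright can also spread requests across named browser contexts, which is handy when you want separate cookie jars or tighter control over resource usage. Below is a minimal sketch; the PLAYWRIGHT_MAX_CONTEXTS value and the context names are illustrative assumptions:

# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 4  # Keep at most 4 browser contexts open at any time

Then, in a spider, tag each request with the context it should run in. Requests that share a context name also share cookies and storage:

def start_requests(self):
    for i, url in enumerate(self.start_urls):
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_context": f"context_{i % 4}",  # Illustrative context names
            },
        )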

20. Restarting Disconnected Browsers

During long or resource-intensive scraping tasks, browsers may crash or become disconnected. To handle such situations, Scrapy-Playwright provides a setting that allows the browser to restart automatically, ensuring your scraping job continues without manual intervention.

To enable this feature, you can add the following setting to your Scrapy settings.py file:

PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER = True

Using Proxies With Scrapy Playwright

When scraping data from the web, it’s common to encounter anti-scraping measures like rate limiting, IP bans, and CAPTCHAs. One of the most effective ways to avoid these blocks is by using proxies.

Here’s how you can set up and use proxies with Scrapy Playwright.

To configure a global proxy for all requests in Scrapy Playwright, add the following settings to your settings.py file:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy_ip:proxy_port",  # Replace with your proxy server
        "username": "your_username",  # Proxy username
        "password": "your_password",  # Proxy password
    }
}

If your proxy doesn’t require authentication (as is often the case with free proxies), you can simplify the configuration by omitting the username and password:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy_ip:proxy_port",
    }
}

If you need to use different proxies for different requests, you can specify a proxy for each request by passing proxy details through playwright_context_kwargs in the meta field:

yield scrapy.Request(
    url="https://example.com",
    meta={
        "playwright": True,
        "playwright_context": "new",  # Create a new context for this request
        "playwright_context_kwargs": {  # Custom proxy configuration for this request
            "proxy": {
                "server": "http://proxy_ip:proxy_port",
                "username": "proxy_username",
                "password": "proxy_password",
            },
        },
    }
)

If you find managing proxies manually to be complicated, or if you want a hassle-free solution to bypass anti-scraping measures, you should consider using ScrapingBee. ScrapingBee is a web scraping API that handles IP rotation, user-agent management, and anti-bot bypassing for you. It integrates easily with Scrapy and allows you to scrape JavaScript-heavy and dynamic websites without worrying about getting blocked.

Conclusion

That's all folks! This tutorial showed you how to leverage Playwright with Scrapy to scrape dynamic websites at scale with ease.

If, while using Playwright with Scrapy, you run into anti-scraping technologies that block you, such as Cloudflare or DataDome, we can help you bypass those hurdles while handling all of the infrastructure needed to scrape at scale.

Want to learn more about becoming a web scraping master? Then check out our handpicked resources to get you on track to becoming a data mining legend.

  1. Easy Web Scraping with Scrapy
  2. Playwright for Python Web Scraping Tutorial with Examples
  3. Web Scraping Without Getting Blocked
  4. How to Bypass Cloudflare Antibot Protection at Scale
Satyam Tripathi

Satyam is a senior technical writer who is passionate about web scraping, automation, and data engineering. He has delivered over 130 blog posts since 2021.