Playwright for Scrapy lets you scrape JavaScript-heavy, dynamic websites at scale, with advanced web scraping features out of the box.
In this tutorial, we’ll show you the ins and outs of scraping with this popular browser automation library, originally developed by Microsoft, and how to combine it with Scrapy to extract the content you need with ease.
We’ll cover everything from setting up your Python environment and filling in and submitting form data, all the way through to handling infinite scroll and scraping multiple pages.
What is Scrapy-Playwright?
Scrapy-Playwright is a library that plugs Playwright into Scrapy’s download pipeline, so selected Scrapy requests are rendered in a real browser before they reach your spider. Here are some of its key benefits:
- Extract Dynamic Content: Scrape content that’s rendered by JavaScript, such as interactive elements and data loaded after page load.
- Automate User Interactions: Perform actions like clicking buttons, scrolling, or waiting for specific page events.
- Scrape Modern Websites: Effectively scrape modern websites, including single-page applications (SPAs) that rely heavily on JavaScript.
💡In my experience, Scrapy-Playwright is an excellent integration. It’s valuable for scraping websites where content only becomes visible after interacting with the page—whether it’s clicking buttons or waiting for JavaScript to load—while still leveraging Scrapy’s powerful crawling capabilities.
How to Use Scrapy-Playwright
Let’s go step-by-step on how to set up and use Scrapy-Playwright effectively.
1. Setup and Prerequisites
Before getting started with Scrapy-Playwright, make sure your development environment is properly set up:
- Install Python: Download and install the latest version of Python from the official Python website .
- Choose an IDE: Use a code editor like Visual Studio Code for your development.
- Basic Knowledge: It’s helpful to be familiar with CSS selectors and browser DevTools for inspecting page elements.
- Library Knowledge: Having a basic understanding of Scrapy and Playwright is recommended, but this guide will walk you through the process step by step.
To keep your project dependencies isolated, it’s a best practice to use a virtual environment. You can create one using the venv module in Python.
python -m venv scrapy-playwright-tuto
Activate the virtual environment:
On Windows:
scrapy-playwright-tuto\Scripts\activate
On MacOS/Linux:
source scrapy-playwright-tuto/bin/activate
Next, install Scrapy-Playwright and the browser binaries with the following commands:
pip install scrapy-playwright
playwright install # This will install Firefox, WebKit, and Chromium
If you only need a specific browser (e.g., Chromium), you can install it with:
playwright install chromium
For this tutorial, we'll be using Chromium, so feel free to install just that.
2. Creating a Scrapy Project
Let’s start by creating a Scrapy project and setting up a spider.
Create a new Scrapy project using the following command:
scrapy startproject scrpy_plwrt_demo
Navigate into the project directory:
cd scrpy_plwrt_demo
Generate a new spider called google_finance to scrape the Google Finance page for NVIDIA’s stock data:
scrapy genspider google_finance https://www.google.com/finance/quote/NVDA:NASDAQ
Move back to the parent directory and open the project in your code editor for further coding:
cd ..
code .
At this point, your project structure should look something like this:
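For reference, a freshly generated project typically contains the following files (a rough sketch; exact contents can vary slightly between Scrapy versions):

scrpy_plwrt_demo/
├── scrapy.cfg                  # Deploy configuration
└── scrpy_plwrt_demo/
    ├── __init__.py
    ├── items.py                # Item definitions
    ├── middlewares.py          # Spider and downloader middlewares
    ├── pipelines.py            # Item pipelines
    ├── settings.py             # Project settings (we'll edit this next)
    └── spiders/
        ├── __init__.py
        └── google_finance.py   # The spider generated above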
Now, you're ready to start building the spider and integrating Playwright into your project!
3. Configuring Scrapy Settings for Playwright
To integrate Playwright with Scrapy, you'll need to update Scrapy's settings so that it can handle JavaScript-heavy websites. Here's how to configure Scrapy-Playwright.
Replace Scrapy’s default HTTP/HTTPS download handlers with the ones provided by Scrapy-Playwright. Add the following lines to your settings.py file:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

This configuration ensures that requests flagged with the playwright=True meta key are processed by Playwright, while requests without this flag are handled by Scrapy’s default download handler.

Playwright requires an asyncio-compatible Twisted reactor to handle asynchronous tasks. Add this line to ensure Scrapy uses the required AsyncioSelectorReactor:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Without this setting, Scrapy's default reactor won’t support Playwright, and asynchronous requests will fail.
Here's how your settings.py should look after the necessary changes:
# Enable Scrapy-Playwright
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
### (Optional)
PLAYWRIGHT_BROWSER_TYPE = "chromium" # Choose 'chromium', 'firefox', or 'webkit'
PLAYWRIGHT_LAUNCH_OPTIONS = {
"headless": False, # Set to True if you prefer headless mode
}
Here, PLAYWRIGHT_BROWSER_TYPE specifies which browser Playwright will use, and PLAYWRIGHT_LAUNCH_OPTIONS customizes Playwright’s browser launch options, such as enabling or disabling headless mode.
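PLAYWRIGHT_LAUNCH_OPTIONS accepts the same keyword arguments as Playwright’s browser launch call, so you can pass extra options when you need them. A small sketch (the slow_mo delay and the extra Chromium flag below are illustrative, not required):

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,  # Run without a visible browser window
    "slow_mo": 250,  # Slow each Playwright operation down by 250 ms (handy for debugging)
    "args": ["--disable-gpu"],  # Extra command-line flags passed to Chromium
}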
4. Scraping JavaScript-Rendered Websites
To scrape JavaScript-rendered content, you need to enable Playwright for specific requests in Scrapy. You can do this by passing the meta
dictionary with the key "playwright": True
in your Scrapy Request
.
Here’s an example that shows how to scrape stock prices from Google Finance using Scrapy-Playwright:
import scrapy
from scrapy_playwright.page import PageMethod
class GoogleFinanceSpider(scrapy.Spider):
name = "google_finance"
def start_requests(self):
# Send a request to Google Finance with Playwright enabled
yield scrapy.Request(
url="https://www.google.com/finance/quote/GOOG:NASDAQ",
meta={
"playwright": True, # Enables Playwright for JavaScript rendering
"playwright_page_methods": [
# Wait for the stock price element to appear
PageMethod("wait_for_selector", "div.YMlKec.fxKbKc"),
],
},
callback=self.parse, # Call the parse method after the page loads
)
def parse(self, response):
# Extract the stock price from the loaded page
price = response.css("div.YMlKec.fxKbKc::text").get()
if price:
self.log(f"Google stock price: {price}")
else:
self.log("Failed to extract the price.")
Explanation:

- In the start_requests() method, the meta dictionary with "playwright": True activates Playwright for handling JavaScript on the page.
- The playwright_page_methods parameter uses PageMethod("wait_for_selector", "div.YMlKec.fxKbKc") to ensure Playwright waits for the stock price element to load before proceeding.
- The parse() method extracts the stock price using CSS selectors once the page is fully rendered.
To run this spider, follow these steps:
cd scrpy_plwrt_demo
scrapy crawl google_finance
When executed, the browser opens the Google Finance page and the spider scrapes the stock price:
Google stock price: $163.18
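Logging is fine for a quick check, but in a real project you would usually yield the scraped data as an item so Scrapy can export it. Here’s a minimal sketch of an alternative parse() method (the field names are arbitrary):

def parse(self, response):
    # Yield the scraped price as an item instead of only logging it
    price = response.css("div.YMlKec.fxKbKc::text").get()
    if price:
        yield {"ticker": "GOOG", "price": price}

You can then export the results with scrapy crawl google_finance -O prices.json.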
5. Automating Button Clicks with Scrapy-Playwright
Sometimes, you may need to interact with web pages by clicking buttons to reveal hidden data. With Scrapy-Playwright, you can easily automate this using PageMethod.
Here’s the code:
import scrapy
from scrapy_playwright.page import PageMethod
class GoogleFinanceSpider(scrapy.Spider):
name = "google_finance"
def start_requests(self):
yield scrapy.Request(
url="https://www.google.com/finance/quote/NVDA:NASDAQ",
meta={
"playwright": True, # Enable Playwright for JavaScript rendering
"playwright_include_page": True, # Include Playwright page object in the response
"playwright_page_methods": [
# Click the Share button
PageMethod("click", "button[aria-label='Share']"),
# Click the Copy link button
PageMethod("click", "button[jsname='SbVnGf']"),
],
},
            callback=self.parse,
        )

    async def parse(self, response):
        # Close the Playwright page that was included via playwright_include_page
        page = response.meta["playwright_page"]
        await page.close()
        self.log(f"Finished interacting with {response.url}")
When you run the spider, it first clicks the Share button, which opens the share popup on the page. It then clicks the Copy link button, and the shareable link is copied to the clipboard:
6. Waiting for Dynamic Content
Web elements might not load immediately, which can cause your scraper to fail if it tries to interact with elements before they are fully available. To address this, you can instruct the Playwright to wait for specific elements to appear before performing any actions using PageMethod("wait_for_selector", selector)
.
Here’s the modified code:
import scrapy
from scrapy_playwright.page import PageMethod
class GoogleFinanceSpider(scrapy.Spider):
name = "google_finance"
def start_requests(self):
yield scrapy.Request(
url="https://www.google.com/finance/quote/NVDA:NASDAQ",
meta={
"playwright": True, # Enable Playwright for JavaScript rendering
"playwright_include_page": True, # Include Playwright page object
"playwright_page_methods": [
# Wait for the Share button to appear before interacting
PageMethod("wait_for_selector", "button[aria-label='Share']"),
# Click the Share button
PageMethod("click", "button[aria-label='Share']"),
# Wait for the Copy link button to appear before interacting
PageMethod("wait_for_selector", "button[jsname='SbVnGf']"),
# Click the Copy link button
PageMethod("click", "button[jsname='SbVnGf']"),
],
},
)
7. Waiting for Page Load States
In some cases, waiting for individual elements isn’t enough, especially when dealing with pages that load content dynamically or have ongoing network activity. To handle such scenarios, Playwright provides the wait_for_load_state() method, which lets you wait for the page to reach a specific loading state.
Available load states:

- "load": The page has fully loaded.
- "domcontentloaded": The DOM has been completely loaded.
- "networkidle": There have been no network connections for at least 500 ms, which makes it useful for pages with continuous background requests.
For static pages, waiting for the "load" state is typically sufficient. However, for dynamic websites or pages with frequent background activity, the "networkidle" state is more reliable.
Here’s how you can implement page load states in your spider:
import scrapy
from scrapy_playwright.page import PageMethod
class GoogleFinanceSpider(scrapy.Spider):
name = "google_finance"
def start_requests(self):
yield scrapy.Request(
url="https://www.google.com/finance/quote/NVDA:NASDAQ",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Wait for the network to be idle
PageMethod("wait_for_load_state", "networkidle"),
# Wait for the Share button to appear
PageMethod("wait_for_selector", "button[aria-label='Share']"),
# Click the Share button
PageMethod("click", "button[aria-label='Share']"),
# Wait for the Copy link button to appear
PageMethod("wait_for_selector", "button[jsname='SbVnGf']"),
# Click the Copy link button
PageMethod("click", "button[jsname='SbVnGf']"),
],
},
callback=self.parse,
)
8. Waiting for a Specific Amount of Time
In Playwright, you can introduce a delay between actions using the wait_for_timeout() method, which pauses the script for a specified duration (in milliseconds). While it’s generally better to rely on event-based waits like wait_for_selector() or wait_for_load_state(), wait_for_timeout() can be useful when precise timing is needed, such as waiting for animations or transitions to complete.
You can pause between actions using:

PageMethod("wait_for_timeout", milliseconds)

Here, milliseconds is the duration of the pause; for example, 2000 pauses for 2 seconds.
In the following code, the scraper waits for 2 seconds between clicking the Share and Copy link buttons on the Google Finance page:
import scrapy
from scrapy_playwright.page import PageMethod
class GoogleFinanceSpider(scrapy.Spider):
name = "google_finance"
def start_requests(self):
yield scrapy.Request(
url="https://www.google.com/finance/quote/NVDA:NASDAQ",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Wait for the network to be idle before interacting with the page
PageMethod("wait_for_load_state", "networkidle"),
# Wait for the Share button to appear
PageMethod("wait_for_selector", "button[aria-label='Share']"),
# Click the Share button
PageMethod("click", "button[aria-label='Share']"),
# Wait for 2 seconds before the next action
PageMethod("wait_for_timeout", 2000),
# Wait for the Copy link button to appear
PageMethod("wait_for_selector", "button[jsname='SbVnGf']"),
# Click the Copy link button
PageMethod("click", "button[jsname='SbVnGf']"),
# Wait for 2 seconds before closing the browser
PageMethod("wait_for_timeout", 2000),
],
},
callback=self.parse,
)
Best Practices:

- Whenever possible, use wait_for_selector() or wait_for_load_state() for more efficient, event-driven waits.
- While wait_for_timeout() can handle specific cases like animations or slow-loading elements, overusing it adds unnecessary delays to your scraping process.
9. Typing Text into an Input Field
Scrapy-Playwright exposes Playwright’s fill method through PageMethod("fill", ...), letting you type text into form fields. Below is an example that navigates to YouTube, types "NASA" into the search bar, and triggers a search.
import scrapy
from scrapy_playwright.page import PageMethod
class YoutubeSearchSpider(scrapy.Spider):
name = "youtube_search"
def start_requests(self):
yield scrapy.Request(
url="https://www.youtube.com",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Wait for the network to become idle
PageMethod("wait_for_load_state", "networkidle"),
# Wait for the search input field to become available
PageMethod("wait_for_selector", "input#search"),
# Focus on the search input field
PageMethod("click", "input#search"),
# Type the search query 'NASA' into the search box
PageMethod("fill", "input#search", "NASA"),
# Wait briefly to ensure the input is processed
PageMethod("wait_for_timeout", 1000),
# Press Enter to initiate the search
PageMethod("press", "input#search", "Enter"),
# Wait for the search results to appear
PageMethod("wait_for_selector", "ytd-channel-renderer"),
],
},
callback=self.parse,
)
To run the spider, use the following command:
cd scrpy_plwrt_demo
scrapy crawl youtube_search
Once the spider runs successfully, you’ll see the search results for "NASA" displayed on YouTube, as shown below:
10. Extracting Text from an Element
A common task in web scraping is extracting text from specific elements, such as product descriptions, prices, or, in this case, YouTube subscriber counts. Scrapy-Playwright makes this straightforward by allowing you to interact with and extract text from the page’s HTML structure.
In this example, we’ll extract the subscriber count from a YouTube channel page. Here’s how to do it:
import scrapy
from scrapy_playwright.page import PageMethod
class YoutubeSearchSpider(scrapy.Spider):
name = "youtube_search"
def start_requests(self):
yield scrapy.Request(
url="https://www.youtube.com",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
PageMethod("wait_for_load_state", "networkidle"),
PageMethod("wait_for_selector", "input#search"),
PageMethod("click", "input#search"),
PageMethod("fill", "input#search", "NASA"),
PageMethod("wait_for_timeout", 1000),
PageMethod("press", "input#search", "Enter"),
PageMethod("wait_for_selector", "ytd-channel-renderer"),
],
},
callback=self.parse,
)
async def parse(self, response):
# Extract the number of subscribers
subscribers = response.css("#video-count::text").get()
if subscribers:
self.log(f"NASA Channel Subscribers: {subscribers.strip()}")
else:
self.log("Could not find the subscriber count.")
The response.css("#video-count::text").get() call extracts the subscriber count from the page using the CSS selector #video-count.
When the spider successfully runs, it should log the subscriber count for the NASA channel, like this:
NASA Channel Subscribers: 12.2M subscribers
11. Taking Screenshots
Screenshots are very useful when scraping websites, as they let you verify the state of a webpage at any point during the scraping process. This is particularly helpful for debugging, tracking page changes, and confirming that your scraper interacts with the page as expected. Playwright’s built-in screenshot() method makes it easy to capture screenshots, and with Scrapy-Playwright you can use PageMethod to take them during your scraping tasks.
To capture a screenshot using Scrapy-Playwright, you can use:
PageMethod("screenshot", path="<screenshot_file_name>.<format>")
Where <format> can be either jpg or png.
Additional screenshot options:

- full_page=True: Captures the entire webpage, not just the visible portion.
- omit_background=True: Removes the default white background and captures the page with transparency.
Example of a full-page screenshot with transparency:
PageMethod(
"screenshot",
path="full_page_screenshot.png",
full_page=True,
omit_background=True
)
Here’s the code that navigates to YouTube and takes a screenshot of the page:
import scrapy
from scrapy_playwright.page import PageMethod
class YoutubeSearchSpider(scrapy.Spider):
name = "youtube_search"
def start_requests(self):
yield scrapy.Request(
url="https://www.youtube.com",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Wait for the page and network activity to settle
PageMethod("wait_for_load_state", "networkidle"),
# Take a screenshot
PageMethod("screenshot", path="youtube.png", full_page=True),
],
},
)
Once the spider is executed, the screenshot will be saved with the name you specified. The captured screenshot might look something like this:
12. Running Custom JavaScript Code
With Scrapy-Playwright, you can run custom JavaScript directly on the pages you're scraping, which is useful for simulating user interactions or manipulating page content when Playwright’s API doesn’t support the action you need. For example, you can use JavaScript to scroll a page, update the DOM, or modify the displayed text.
In this example, we’ll modify the stock price on Google Finance from USD to EUR using a mock conversion rate. The custom JavaScript will:
- Select the stock price element using querySelector().
- Convert the USD price to EUR using a mock conversion rate.
- Replace the element’s content with the updated EUR price.
Here’s the JavaScript we’ll run:
const element = document.querySelector('div.YMlKec.fxKbKc');
if (element) {
let usdPrice = parseFloat(element.innerText.replace(/[$,]/g, '')); // Remove dollar signs and commas
let eurPrice = (usdPrice * 0.85).toFixed(2); // Mock conversion rate from USD to EUR
element.innerText = eurPrice + ' €'; // Update the price to EUR
}
We’ll pass this JavaScript to Playwright using the evaluate method within PageMethod(). The modified content will be available in the parse() method after the JavaScript runs.
import scrapy
from scrapy_playwright.page import PageMethod
class GoogleFinanceSpider(scrapy.Spider):
name = "google_finance"
def start_requests(self):
yield scrapy.Request(
url="https://www.google.com/finance/quote/GOOG:NASDAQ",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Wait for the stock price element to load
PageMethod("wait_for_selector", "div.YMlKec.fxKbKc"),
# Run the custom JavaScript to convert the price from USD to EUR
PageMethod(
"evaluate",
"""
() => {
const element = document.querySelector('div.YMlKec.fxKbKc');
if (element) {
let usdPrice = parseFloat(element.innerText.replace(/[$,]/g, '')); // Remove dollar signs and commas
let eurPrice = (usdPrice * 0.85).toFixed(2); // Mock conversion rate (USD to EUR)
element.innerText = eurPrice + ' €'; // Replace price with EUR
}
}
""",
),
# Take a screenshot after modifying the price
PageMethod(
"screenshot",
path="google_finance_eur_price.png",
full_page=True,
),
],
},
callback=self.parse, # Call the parse method after actions are completed
)
async def parse(self, response):
# Extract the modified stock price in EUR after the page is loaded
price = response.css("div.YMlKec.fxKbKc::text").get()
if price:
self.log(f"Google stock price (EUR): {price}")
else:
self.log("Failed to extract the price.")
Here are some key concepts:

- PageMethod("evaluate", ...) runs custom JavaScript on the page; in this case, it manipulates the stock price element.
- After running the JavaScript, we take a full-page screenshot using PageMethod("screenshot"), which captures the modified price.
- The parse() method logs the updated price in EUR, showing that the custom JavaScript worked as expected.
When you run the spider, the stock price will be logged in EUR:
[google_finance] DEBUG: Google stock price (EUR): 141.40 €
Additionally, a screenshot of the modified price will be saved as google_finance_eur_price.png:
13. Handling Infinite Scrolling
Infinite scrolling is a technique where more content is loaded dynamically as you scroll down a webpage. To scrape such pages, you need to implement custom scrolling logic.
The following JavaScript code will continuously scroll to the bottom of the page and wait for new content to load before stopping.
let previousHeight = 0;
while (true) {
window.scrollBy(0, document.body.scrollHeight); // Scroll to the bottom
await new Promise(r => setTimeout(r, 3000)); // Wait for new content to load
let newHeight = document.body.scrollHeight;
if (newHeight === previousHeight) {
break; // Stop if no new content is loaded
}
previousHeight = newHeight;
}
Here’s how you can implement infinite scrolling in Scrapy-Playwright for the Nike page:
import scrapy
from scrapy_playwright.page import PageMethod
class NikeShoeSpider(scrapy.Spider):
name = "nike_shoes"
def start_requests(self):
yield scrapy.Request(
url="https://www.nike.com/in/w/new-shoes-3n82yzy7ok",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
PageMethod(
"wait_for_load_state", "networkidle"
), # Wait for initial page load
# Scroll the page until no more new content is loaded
PageMethod(
"evaluate",
"""
async () => {
let previousHeight = 0;
while (true) {
window.scrollBy(0, document.body.scrollHeight); // Scroll to the bottom
await new Promise(r => setTimeout(r, 3000)); // Wait for new content to load
let newHeight = document.body.scrollHeight;
if (newHeight === previousHeight) {
break; // Stop if no new content is loaded
}
previousHeight = newHeight;
}
}
""",
),
],
},
callback=self.parse,
)
async def parse(self, response):
# Extract product names after scrolling
product_names = response.css("div.product-card__title::text").getall()
# Log the extracted product names
for product in product_names:
self.log(f"Product: {product}")
Explanation:

- PageMethod("evaluate"): Executes custom JavaScript directly on the page to simulate user scrolling.
- window.scrollBy(0, document.body.scrollHeight): Scrolls the page to the bottom, triggering the loading of new content.
- await new Promise(r => setTimeout(r, 3000)): Pauses execution for 3 seconds, giving the page time to load more content.
- if (newHeight === previousHeight): If the page height doesn’t change, the loop stops, indicating that no new content is being loaded.
Once the spider runs, it logs the names of the loaded products:
[nike_shoes] DEBUG: Product: Nike Metcon 9
[nike_shoes] DEBUG: Product: Nike Dunk Low Twist
[nike_shoes] DEBUG: Product: Air Max 1
[nike_shoes] DEBUG: Product: Nike Air Max 1 EasyOn
...
Here’s what the scrolling process looks like:
14. Scraping Multiple Pages
Websites often have "Next" or "More" buttons to load additional content. The goal is to click these buttons to retrieve the next batch of data.
Here’s an image showing the "More" button on Hacker News, identified by class="morelink":
In the example below, we’ll scrape multiple pages from Hacker News by navigating to the "More" link. The spider will follow the "More" link to load subsequent pages until no more pages are available or a set page limit is reached.
import scrapy
from scrapy_playwright.page import PageMethod
class HackerNewsSpider(scrapy.Spider):
name = "hacker_news"
def __init__(self, max_pages=None, *args, **kwargs):
super().__init__(*args, **kwargs)
self.max_pages = int(max_pages) if max_pages else None # Optional page limit
self.pages_crawled = 0 # Track the number of pages crawled
def start_requests(self):
yield scrapy.Request(
url="https://news.ycombinator.com/",
meta={
"playwright": True,
"playwright_include_page": True, # Include Playwright page object for reuse
"playwright_page_methods": [
PageMethod(
"wait_for_load_state", "networkidle"
) # Ensure the page fully loads
],
},
callback=self.parse,
)
async def parse(self, response):
# Extract post titles from the current page
titles = response.css("span.titleline > a::text").getall()
# Log the extracted titles
for title in titles:
self.log(f"Post title: {title}")
# Increment the counter for pages crawled
self.pages_crawled += 1
# Check if we have reached the max page limit (if defined)
if self.max_pages and self.pages_crawled >= self.max_pages:
self.log(f"Reached the maximum number of pages: {self.max_pages}")
return # Stop crawling
# Find the "More" link to navigate to the next page
more_link = response.css("a.morelink::attr(href)").get()
if more_link:
next_page_url = response.urljoin(more_link)
# Reuse the same Playwright page for the next request
page = response.meta["playwright_page"]
yield scrapy.Request(
url=next_page_url,
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page": page, # Reuse the current page
"playwright_page_methods": [
PageMethod("wait_for_load_state", "networkidle")
],
},
callback=self.parse,
)
Here are some key points:

- Page limit: You can define a maximum number of pages to scrape using the max_pages argument. If max_pages is not set, the spider continues scraping until no more "More" links are available.
- Page reuse: To improve performance, we reuse the same Playwright page object when navigating to the next page. This avoids the overhead of opening a new browser page for each request.
- Pagination: The spider identifies the "More" link at the bottom of the page using the a.morelink selector and sends a new request to the next page if the link is found.
To run the spider and scrape up to 5 pages, use the following command:
scrapy crawl hacker_news -a max_pages=5
If you don’t set the max_pages argument, the spider will continue until there are no more "More" links.
Here’s a sample of the extracted post titles:
[hacker_news] DEBUG: Post title: Gleam: A Basic Introduction
[hacker_news] DEBUG: Post title: Bear is a tool that generates a compilation database for Clang tooling
[hacker_news] DEBUG: Post title: A Lisp compiler to RISC-V written in Lisp
[hacker_news] DEBUG: Post title: Grokking at the edge of linear separability
[hacker_news] DEBUG: Post title: Techie took five minutes to fix problem Adobe and Microsoft couldn't solve
...
15. Capturing AJAX Data
Many modern websites use AJAX (Asynchronous JavaScript and XML) to update specific parts of a webpage dynamically without a full reload. Capturing AJAX data allows you to directly retrieve structured data (such as JSON) returned from server requests, making the scraping process more efficient than parsing HTML.
For example, the Nike website uses AJAX to load more products as you scroll. Instead of scraping the entire HTML, you can intercept AJAX requests to retrieve the product data directly.
Here’s an example of capturing the AJAX request:
Here’s a more detailed look at the AJAX request data:
Below is the code to capture AJAX requests:
import json
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod
class NikeShoeSpider(Spider):
name = "nike_shoe"
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.captured_ajax_data = [] # List to store captured AJAX data
def start_requests(self):
# Initial request to load the target page
yield Request(
url="https://www.nike.com/w/mens-dunk-shoes-90aohznik1zy7ok",
meta={
"playwright": True,
"playwright_include_page": True, # Include Playwright page object
"playwright_page_methods": [
PageMethod(
"wait_for_load_state", "networkidle"
), # Wait for AJAX requests to finish
],
},
callback=self.parse,
)
async def parse(self, response):
# Retrieve Playwright page object from the response metadata
page = response.meta["playwright_page"]
# Function to intercept and capture AJAX requests
async def intercept_request(route):
request = route.request
# Intercept AJAX requests to Nike's API
if "api.nike.com" in request.url and request.method == "GET":
# Fetch the intercepted request's response
response = await route.fetch()
body = await response.text() # Get response body as text
try:
# Parse response body as JSON
ajax_data = json.loads(body)
self.captured_ajax_data.append(ajax_data) # Store the JSON data
self.logger.info(f"Captured data from: {request.url}")
except json.JSONDecodeError:
self.logger.warning(f"Failed to parse JSON from: {request.url}")
# Continue processing the request
await route.continue_()
# Intercept network requests to capture AJAX data
await page.route("**/*", intercept_request)
# Function to scroll the page to load more data via AJAX
async def scroll_page():
previous_height = await page.evaluate("document.body.scrollHeight")
# Scroll to the bottom of the page
while True:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == previous_height:
break
previous_height = new_height
# Scroll to load more products via AJAX
await scroll_page()
# Wait for network activity to settle after scrolling
await page.wait_for_load_state("networkidle")
# Close the Playwright page
await page.close()
# Save captured AJAX data to JSON
self.save_data_to_json()
def save_data_to_json(self):
# Save captured AJAX data into a local JSON file
file_path = "captured_nike_data.json"
with open(file_path, "w") as f:
json.dump(self.captured_ajax_data, f, indent=4)
self.log(f"Saved AJAX data to {file_path}")
The page.route() method intercepts all network requests, and the intercept_request() function filters for requests to Nike’s API and captures their JSON responses. The scroll_page() function scrolls to the bottom of the page to trigger additional product loads via AJAX.
Once the spider completes, you’ll have a JSON file containing all the captured AJAX data. This file will include all the product details that were dynamically loaded on the Nike page.
Here’s what the captured JSON data looks like:
16. Automating Form Submissions
Automating form submissions with Playwright is a powerful way to simulate user interactions like logging into websites or submitting search queries. In this example, we will show how to use Scrapy-Playwright to automate the login process on Hacker News. The script will:
- Fill in the login form with your username and password.
- Submit the form.
- Take a screenshot to confirm that the login was successful.
Here’s the code:
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod
class HackerNewsLoginSpider(Spider):
name = "hn_login"
def start_requests(self):
yield Request(
url="https://news.ycombinator.com/login",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Fill in the username and password fields
PageMethod(
"fill", 'input[name="acct"]', "your_username"
), # Replace with your username
PageMethod(
"fill", 'input[name="pw"]', "your_password"
), # Replace with your password
# Click the login button
PageMethod("click", 'input[type="submit"]'),
# Take a screenshot after logging in
PageMethod("screenshot", path="hn_login.png", full_page=False),
                ],
            },
        )

    async def parse(self, response):
        # Close the Playwright page that was included via playwright_include_page
        page = response.meta["playwright_page"]
        await page.close()
        self.log("Login flow finished; screenshot saved as hn_login.png")
You might run into an issue where Scrapy blocks the request because of the robots.txt file:
[scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://news.ycombinator.com/login>
In case you’re not familiar with it, the robots.txt file tells web crawlers which pages they are allowed to access. Scrapy respects this file by default, which may prevent you from accessing the login page. If you need to access the page, you can disable the robots.txt middleware by adding this setting to your settings.py:
ROBOTSTXT_OBEY = False
Note: Disabling the robots.txt middleware can be risky, so make sure to check the website’s terms of service first.
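If you’d rather not change this project-wide, you can also disable it for just this spider using Scrapy’s custom_settings attribute. A minimal sketch:

class HackerNewsLoginSpider(Spider):
    name = "hn_login"
    # Override robots.txt handling only for this spider
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }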
After running the spider, you should see a screenshot of the logged-in state:
17. Running Parallel Requests to Test Session Persistence
When scraping websites that require login, session persistence is critical for maintaining access to authenticated pages. This ensures the spider remains logged in while making subsequent requests, avoiding the need to log in repeatedly.
Here's how to implement session persistence with Scrapy-Playwright:
import scrapy
from scrapy_playwright.page import PageMethod
class HackerNewsSpider(scrapy.Spider):
name = "parallel_sessions"
def start_requests(self):
# Log in to Hacker News
yield scrapy.Request(
url="https://news.ycombinator.com/login",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
# Fill in the login form
PageMethod(
"fill", 'input[name="acct"]', "your_username"
), # Replace with your username
PageMethod(
"fill", 'input[name="pw"]', "your_password"
), # Replace with your password
PageMethod(
"click", 'input[value="login"]'
), # Click the login button
                    PageMethod(
                        "wait_for_load_state", "networkidle"
                    ),  # Wait for the page to finish loading after login
],
},
callback=self.after_login,
)
def after_login(self, response):
# URLs to visit after login
urls = [
"https://news.ycombinator.com/?p=2",
"https://news.ycombinator.com/?p=3",
"https://news.ycombinator.com/?p=4",
"https://news.ycombinator.com/?p=5",
"https://news.ycombinator.com/?p=6",
]
# Send parallel requests to other pages while maintaining the logged-in session
for url in urls:
yield scrapy.Request(
url=url,
meta={
"playwright": True,
"playwright_include_page": True, # Use the logged-in session
},
callback=self.parse_headlines,
)
async def parse_headlines(self, response):
# Access the Playwright page object
page = response.meta["playwright_page"]
# Extract headlines from the page
headlines = await page.query_selector_all(".titleline > a")
for headline in headlines:
text = await headline.inner_text()
yield {"headline": text}
# Close the Playwright page after processing
await page.close()
Because the login request and the follow-up requests run in the same browser context, the session cookies set at login are reused, so the spider stays logged in across the parallel requests. Setting "playwright_include_page": True additionally gives each callback direct access to its Playwright page object.
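Before fanning out the parallel requests, it’s worth verifying that the login actually succeeded. Here’s a small sketch you could drop into after_login(); it assumes Hacker News renders a logout link for signed-in users, so the selector is an assumption you should confirm in DevTools:

def after_login(self, response):
    # Hypothetical check: signed-in Hacker News pages should include a logout link
    if not response.css("a#logout"):
        self.logger.warning("Login may have failed: no logout link found.")
        return
    # ...continue with the parallel requests shown above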
18. Aborting Unwanted Requests
When scraping web pages, unnecessary requests like images, ads, and external scripts can slow down the process and consume bandwidth. By intercepting and aborting these requests, you can make your scraper faster and more efficient. Scrapy-Playwright allows you to filter out such requests, helping you to focus only on the actual data.
Here’s how to implement this:
import scrapy
from scrapy_playwright.page import PageMethod
def abort_request(request):
return (
request.resource_type in ["image", "media", "stylesheet"] # Block resource-heavy types
or any(ext in request.url for ext in [".jpg", ".png", ".gif", ".css", ".mp4", ".webm"]) # Block specific file extensions
)
class AbortRequestSpider(scrapy.Spider):
name = "abort_request"
custom_settings = {
"PLAYWRIGHT_ABORT_REQUEST": abort_request, # Aborting unnecessary requests
}
def start_requests(self):
yield scrapy.Request(
url="https://www.nike.com/w/mens-lifestyle-shoes-13jrmznik1zy7ok",
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
PageMethod("wait_for_load_state", "domcontentloaded"),
],
},
callback=self.parse,
)
async def parse(self, response):
page = response.meta["playwright_page"]
# Scroll down until no more new items are loaded
previous_height = -1
while True:
current_height = await page.evaluate("() => document.body.scrollHeight")
if current_height == previous_height:
break
previous_height = current_height
await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
await page.wait_for_timeout(3000)
# Extract product names from the page
content = await page.content()
response = scrapy.Selector(text=content)
product_names = response.css('div.product-card__title::text').getall()
# Log the extracted product names
for product in product_names:
self.log(f"Product: {product}")
# Close the Playwright page to release resources
await page.close()
In this code, we implement a function that blocks requests for resource-heavy elements such as images, media files, and stylesheets. The PLAYWRIGHT_ABORT_REQUEST setting applies the abort_request logic so that unwanted requests are intercepted and blocked during scraping.
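The same predicate can be extended to drop other noise, such as fonts or third-party analytics and ad scripts. Here’s a sketch; the domain list is purely illustrative:

def abort_request(request):
    blocked_domains = ["googletagmanager.com", "doubleclick.net"]  # Example third-party hosts
    return (
        request.resource_type in ["image", "media", "stylesheet", "font"]  # Heavy resource types
        or any(domain in request.url for domain in blocked_domains)  # Known analytics/ad domains
    )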
When running the spider, here’s what the process looks like:
The extracted product names:
[abort_request] DEBUG: Product: Nike Dunk Low By You
[abort_request] DEBUG: Product: Nike LeBron 9 Low
[abort_request] DEBUG: Product: Air Jordan 6 Retro Low x PSG
...
19. Managing Playwright Pages and Contexts
When scraping multiple pages, efficiently managing browser pages and contexts is important to avoid overloading your system. The PLAYWRIGHT_MAX_PAGES_PER_CONTEXT setting in Scrapy-Playwright limits the number of pages open at once within a single browser context. Here’s how you can set it in your settings.py file:
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 5 # Allow up to 5 pages per context
In addition to managing Playwright pages, you can also configure how many requests Scrapy processes simultaneously with the CONCURRENT_REQUESTS setting:
# settings.py
CONCURRENT_REQUESTS = 8 # Set to 8 concurrent requests at a time
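Putting these together, a minimal settings.py sketch might look like this (PLAYWRIGHT_MAX_CONTEXTS is another scrapy-playwright setting that caps how many browser contexts stay open; the numbers here are only starting points to tune for your own workload):

# settings.py
CONCURRENT_REQUESTS = 8  # How many requests Scrapy handles at once
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 5  # Max open pages per browser context
PLAYWRIGHT_MAX_CONTEXTS = 4  # Max browser contexts open at the same time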
20. Restarting Disconnected Browsers
During long or resource-intensive scraping tasks, browsers may crash or become disconnected. To handle such situations, Scrapy-Playwright provides a setting that allows the browser to restart automatically, ensuring your scraping job continues without manual intervention.
To enable this feature, add the following setting to your settings.py file:
PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER = True
Using Proxies With Scrapy-Playwright
When scraping data from the web, it’s common to encounter anti-scraping measures like rate limiting, IP bans, and CAPTCHAs. One of the most effective ways to avoid these blocks is by using proxies.
Here’s how you can set up and use proxies with Scrapy Playwright.
To configure a global proxy for all requests in Scrapy-Playwright, add the following settings to your settings.py file:
PLAYWRIGHT_LAUNCH_OPTIONS = {
"proxy": {
"server": "http://proxy_ip:proxy_port", # Replace with your proxy server
"username": "your_username", # Proxy username
"password": "your_password", # Proxy password
}
}
If your proxy doesn’t require authentication (when using free proxies), you can simplify the configuration by omitting the username and password:
PLAYWRIGHT_LAUNCH_OPTIONS = {
"proxy": {
"server": "http://proxy_ip:proxy_port",
}
}
If you need different proxies for different requests, you can specify a proxy per request by passing proxy details through playwright_context_kwargs in the meta field:
yield scrapy.Request(
url="https://example.com",
meta={
"playwright": True,
"playwright_context": "new", # Create a new context for this request
"playwright_context_kwargs": { # Custom proxy configuration for this request
"proxy": {
"server": "http://proxy_ip:proxy_port",
"username": "proxy_username",
"password": "proxy_password",
},
},
}
)
If you find managing proxies manually to be complicated, or if you want a hassle-free solution to bypass anti-scraping measures, you should consider using ScrapingBee. ScrapingBee is a web scraping API that handles IP rotation, user-agent management, and anti-bot bypassing for you. It integrates easily with Scrapy and allows you to scrape JavaScript-heavy and dynamic websites without worrying about getting blocked.
Conclusion
That's all, folks! This tutorial showed you how to combine Playwright with Scrapy to scrape dynamic websites at scale with ease.
If, while using Playwright with Scrapy, you run into anti-scraping technologies that block you, such as Cloudflare or Datadome, we can help you bypass those hurdles while handling all of the infrastructure needed to scrape at scale.
Want to learn more about becoming a web scraping master? Then check out our handpicked resources to get you on track to becoming a data mining legend.