Puppeteer is a robust headless browser library created mainly to automate user interactions. However, it can be easily detected and blocked by anti-scraping measures due to its lack of built-in stealth capabilities. This is where Puppeteer Extra comes in, offering plugins like Stealth to address this limitation.
This tutorial will explore how to utilize Puppeteer Stealth to attempt to evade detection while scraping websites effectively. We'll also cover solutions and alternatives for bypassing the latest cutting-edge anti-bot technology, which Puppeteer Stealth sometimes struggles to evade.
Without further ado, let’s get started!
What is puppeteer stealth and how does it work?
Puppeteer Stealth, also known as puppeteer-extra-plugin-stealth, is a powerful extension for Puppeteer, built on top of Puppeteer Extra. It is designed to address browser fingerprinting and evasion issues, enhancing Puppeteer's ability to avoid detection by the anti-scraping mechanisms used by many websites. With 17 evasion modules, this plugin tackles various fingerprinting discrepancies that can lead to bot detection.
Key Features and Mechanisms
- Browser Fingerprint Modification: Puppeteer Stealth modifies key browser properties, or "fingerprints", that websites use to detect and block automated requests. It masks default headless properties such as headless: true and navigator.webdriver, making the browser appear more like a regular browser, and it changes other properties and behaviors that might reveal automation (a quick check of some of these properties is sketched after this list).
- Built-In Evasion Modules: This plugin comes packed with several evasion modules that address various detection techniques. For instance, it hides the sourceurl attribute and alters the navigator.languages property to simulate a standard Chrome browser. You can find a complete list of these modules in the Puppeteer Stealth documentation.
- Human-Like Behavior: To further avoid triggering anti-scraping scripts, Puppeteer Stealth mimics real user interactions, such as mouse movements and keyboard inputs.
Puppeteer Stealth is effective at bypassing many anti-scraping mechanisms, but it's not foolproof. According to the official documentation, some advanced detection systems may still be able to detect automated activity. However, the goal of the project is to make detection so challenging and resource-intensive that it becomes impractical.
How to scrape with puppeteer stealth enabled
Let's see how to integrate Puppeteer Stealth into a Puppeteer scraping script to avoid getting blocked. Before diving into stealth mode, let's run two tests to see how easily the default Puppeteer setup is detected.
Initial tests without stealth mode
Test 1: Detecting Headless Mode
First, we'll check how a website detects a headless browser. We'll use the bot detection test by Antoine Vastel. If you open this page in a regular browser, you'll see a message confirming that you are not using a headless browser, which is expected.
Now, let’s try visiting the same site using Puppeteer with the following script:
import puppeteer from 'puppeteer';
(async () => {
  // Launch a default (headless) Puppeteer browser
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Visit the headless-detection test page and capture the result
  await page.goto('https://arh.antoinevastel.com/bots/areyouheadless');
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
})();
When you run this script, the website detects that you are using a headless browser, and the result looks like this:
This confirms that the test failed, as the page successfully detected the headless browser.
Test 2: Detecting Browser Automation
Next, let's examine another bot detection test on Sannysoft. We'll use the following script to test whether Puppeteer's default setup can bypass the detection.
import puppeteer from 'puppeteer';
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://bot.sannysoft.com/');
await page.screenshot({ path: 'screenshot.png', fullPage: true });
await browser.close();
})();
The screenshot below shows red bars, which indicate that bot fingerprints were detected, highlighting how the default Puppeteer setup fails the browser fingerprinting tests.
By default, Puppeteer has limited ability to bypass bot detection. To work around these detection mechanisms, you would typically need to manually tweak and override its default settings.
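For illustration, here's a minimal sketch of what such manual overrides can look like with plain Puppeteer APIs. The user-agent string is just an example value, not a recommendation:
import puppeteer from 'puppeteer';
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Replace the default headless user agent (example string, adjust as needed)
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );
  // Hide the navigator.webdriver flag before any page script runs
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });
  await page.goto('https://bot.sannysoft.com/');
  await page.screenshot({ path: 'manual-overrides.png', fullPage: true });
  await browser.close();
})();
Every individual leak has to be patched this way, which quickly becomes tedious to maintain. That is exactly what the stealth plugin automates.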
Instead of maintaining such overrides by hand, let's do this using the Puppeteer Stealth plugin from Puppeteer Extra.
Step 1: Installing puppeteer stealth
First, you need to install Puppeteer Extra and the Puppeteer Stealth plugin. puppeteer-extra is required because it allows you to add plugins like puppeteer-extra-plugin-stealth to extend Puppeteer's capabilities (Puppeteer itself should already be installed from the earlier tests). Install both packages using the following command:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Step 2: Configuring puppeteer stealth
To configure Puppeteer Stealth, start by importing Puppeteer Extra:
import puppeteer from "puppeteer-extra"
Next, import the StealthPlugin from puppeteer-extra-plugin-stealth:
import StealthPlugin from "puppeteer-extra-plugin-stealth";
If you’re using CommonJS, use the following instead:
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
Add the Stealth plugin and apply it using the default settings, which include all evasion techniques:
// Add the Stealth plugin and use default settings (all evasion techniques)
puppeteer.use(StealthPlugin());
Note: If you want to customize the evasion modules, refer to the plugin's README file.
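For instance, the plugin exposes its evasion modules as a set you can trim before registering it. Here's a minimal sketch based on the enabledEvasions API described in the plugin's README; use it in place of the default puppeteer.use(StealthPlugin()) call above:
// Create a plugin instance and drop an evasion module you don't need
const stealth = StealthPlugin();
stealth.enabledEvasions.delete("user-agent-override");
// Register the customized instance instead of the default one
puppeteer.use(stealth);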
Next, launch Puppeteer Stealth with the headless option and open an async function to write your code:
(async () => {
// Launch Puppeteer Stealth with the headless option
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
})();
Step 3: Launching the browser
Let’s set the viewport size and navigate to the target website.
(async () => {
// ...
// Set the viewport size
await page.setViewport({ width: 1280, height: 720 });
// Navigate to the target website
await page.goto("https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT");
})();
Next, capture a screenshot and close the browser instance.
// Take a screenshot of the page
await page.screenshot({ path: "screenshot.png"});
// Close the browser
await browser.close();
Putting it all together, here’s the code so far:
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 });
await page.goto("https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT");
await page.screenshot({ path: "screenshot.png" });
await browser.close();
})();
Here's what our result looks like:
Great! You've successfully avoided bot detection using Puppeteer Extra Stealth. Now, let's take it a step further and scrape the page.
Step 4: Scraping the page
In this step, we’ll extract key product data such as the title, price, ratings, reviews, product description, and image URL.
Extracting the product title:
The product title is located in a span element with the ID productTitle. To extract it, use the following code:
// Extract the product title
const title = await page.$eval("#productTitle", el => el.textContent.trim());
The $eval method selects the element with the ID productTitle and retrieves its text content, trimming any extra whitespace.
Extracting the product price:
The product price can be found in a span element with the class a-offscreen, typically located near the Buy Now box. Use the following code to extract it:
// Extract the product price
const price = await page.$eval("span.a-offscreen", el => el.textContent.trim());
This code selects the element with the class a-offscreen and retrieves its text content, ensuring it's properly cleaned.
Extracting product ratings and reviews:
To scrape the product rating and the number of reviews, use the following code:
// Extract the product rating
const rating = await page.$eval("#acrPopover", el => {
const title = el.getAttribute("title");
return title ? title.replace("out of 5 stars", "").trim() : null;
});
// Extract the number of reviews
const reviews = await page.$eval("#acrCustomerReviewText", el => el.textContent.trim());
The rating is retrieved from the title attribute of the element with the ID acrPopover, while the number of reviews is obtained from the span element with the ID acrCustomerReviewText.
Extracting the product image URL:
The URL of the product image is found in an element with the ID landingImage. Use the following code to get it:
// Extract the product image URL
const imageUrl = await page.$eval("#landingImage", el => el.getAttribute("src"));
This code retrieves the URL from the src attribute of the image element.
Extracting the product description:
The product description is contained in an element with the ID "productDescription".
Here’s how to extract it:
// Extract the product description
const description = await page.$eval("#productDescription", el => el.textContent.trim());
Creating a product data object:
Store all the extracted information in an object:
const productData = {
"Title": title,
"Price": price,
"Description": description,
"Image Link": imageUrl,
"Rating": rating,
"Number of Reviews": reviews,
};
Saving data to a JSON file:
Save the scraped data to a JSON file using the fs/promises module:
import { writeFile } from 'fs/promises';
await writeFile("product_data.json", JSON.stringify(productData, null, 2), "utf-8");
Closing the browser:
Finally, close the browser to free up resources:
await browser.close();
Putting it all together:
Here’s the full code for extracting and saving product data:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { writeFile } from 'fs/promises';
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Navigate to the Amazon product page
await page.goto('https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT');
// Extract product details
const title = await page.$eval("#productTitle", el => el.textContent.trim());
const price = await page.$eval("span.a-offscreen", el => el.textContent.trim());
const description = await page.$eval("#productDescription", el => el.textContent.trim());
const imageUrl = await page.$eval("#landingImage", el => el.getAttribute("src"));
const rating = await page.$eval("#acrPopover", el => {
const title = el.getAttribute("title");
return title ? title.replace("out of 5 stars", "").trim() : null;
});
const reviews = await page.$eval("#acrCustomerReviewText", el => el.textContent.trim());
// Create a product data object
const productData = {
"Title": title,
"Price": price,
"Description": description,
"Image Link": imageUrl,
"Rating": rating,
"Number of Reviews": reviews,
};
// Save the product data to a JSON file
try {
await writeFile("product_data.json", JSON.stringify(productData, null, 2), "utf-8");
console.log("Product data saved to product_data.json");
} catch (err) {
console.error("Error writing file:", err);
}
// Close the browser
await browser.close();
})();
Once executed, this code will save all the extracted product data to a JSON file.
Awesome! You’ve successfully scraped all the desired data from Amazon and stored it in a JSON file. The data looks clean and readable, ready for further analysis or processing.
Verifying stealth mode with puppeteer
Test 1: Verifying Stealth Mode
In this test, we'll use Puppeteer Stealth to check if our browser automation is detectable.
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://arh.antoinevastel.com/bots/areyouheadless");
await page.screenshot({ path: "screenshot.png" });
await browser.close();
})();
The result is:
The browser is not detected as headless, indicating that the stealth mode is functioning correctly.
Test 2: Detecting Browser Automation
In the second test, we will use a different website to further ensure that our browser automation is undetectable.
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://bot.sannysoft.com/");
await page.screenshot({ path: "screenshot.png", fullPage: true });
await browser.close();
})();
The result is:
The stealth mode successfully prevents detection, as shown in the screenshot above.
Both tests passed successfully, demonstrating that Puppeteer’s Stealth Plugin effectively masks browser automation, making it undetectable by most fingerprinting techniques.
Drawbacks of puppeteer stealth
The puppeteer-extra-plugin-stealth package is designed to make headless Puppeteer scripts less detectable by anti-bot mechanisms. However, it has certain drawbacks:
- Challenges with Advanced Anti-Bot Systems: Even with the stealth plugin, sophisticated anti-bot services like Cloudflare and DataDome can still detect and block Puppeteer scripts.
- Performance Overheads: Using the stealth plugin can introduce additional overhead, potentially slowing down the execution of scripts.
- Maintenance Complexity: The plugin needs to be configured correctly, and keeping it up to date with the latest anti-bot detection methods can be time-consuming.
When scraping data from Amazon, making too many requests in a short amount of time can trigger anti-bot mechanisms. This can lead to challenges such as having to complete CAPTCHA verifications. For example, when attempting to run a web scraper multiple times, I encountered Amazon's CAPTCHA as shown in the image below.
To address this issue, you can consider the following approaches:
- Throttling Requests: Reduce the frequency of your requests or introduce delays between them to avoid triggering CAPTCHA.
- Rotating Proxies: Utilize rotating proxies to distribute requests across different IP addresses (both of these approaches are sketched after this list).
- CAPTCHA Solving Services: Integrate a CAPTCHA-solving service to automatically handle CAPTCHA challenges when they arise.
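Here's a hedged sketch combining the first two ideas with the stealth setup from earlier: random delays between requests and a proxy passed at launch time. The proxy endpoint and credentials are placeholders you would replace with your own provider's details:
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());
// Simple helper: wait a random amount of time between requests
const randomDelay = (minMs, maxMs) =>
  new Promise(resolve => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));
(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    // Placeholder proxy endpoint; most rotating-proxy providers expose one like this
    args: ["--proxy-server=http://proxy.example.com:8080"],
  });
  const page = await browser.newPage();
  // If your proxy requires credentials, authenticate first (placeholder values)
  await page.authenticate({ username: "PROXY_USER", password: "PROXY_PASS" });
  const urls = [
    "https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT",
    // ...more product URLs
  ];
  for (const url of urls) {
    await page.goto(url);
    // Throttle: wait 3-8 seconds before the next request
    await randomDelay(3000, 8000);
  }
  await browser.close();
})();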
While the strategies mentioned can help reduce detection, advanced anti-bot solutions such as Cloudflare and DataDome are continually evolving. These systems may still block your requests despite the use of Puppeteer Stealth or other evasion techniques.
Given these challenges, it’s important to explore alternative solutions to Puppeteer Stealth for more robust scraping operations. Let’s examine some of these alternatives next.
Alternatives to puppeteer stealth
When dealing with advanced anti-bot systems, relying solely on Puppeteer Stealth may not be enough. Luckily, several alternatives offer enhanced capabilities for bypassing detection mechanisms . Here are some of the most effective options:
1. Selenium Stealth
Anti-bot systems can easily detect Selenium's built-in automation features and command-line flags. This is where the Selenium Stealth plugin becomes useful. Selenium Stealth helps you bypass many common anti-bot detection mechanisms by masking various detection leaks.
2. Undetected ChromeDriver
The Selenium Undetected ChromeDriver is an enhanced version of the standard ChromeDriver, designed to bypass anti-bot services such as Cloudflare, Distil Networks, Imperva, and DataDome. It patches most of the ways anti-bot systems can detect a Selenium bot or scraper, making your automation appear more human-like than a regular Selenium setup.
3. Nodriver
Nodriver is a Python library derived from Undetected ChromeDriver, designed specifically to bypass CAPTCHAs and Web Application Firewalls (WAFs) like Cloudflare. Nodriver is an asynchronous tool that replaces traditional components such as Selenium or webdriver binaries, providing direct communication with browsers. This approach not only reduces the detection rate by most anti-bot solutions but also significantly improves the tool's performance.
4. Cloudscraper
Cloudscraper is a Python library built on top of the Requests library, created specifically to scrape data from websites protected by Cloudflare. Cloudflare uses various browser fingerprinting challenges and checks to distinguish genuine users from scrapers and bots, and Cloudscraper circumvents these challenges by mimicking the behavior of a real web browser.
One major challenge with open-source packages, such as the ones mentioned, is that anti-bot companies can detect their methods for bypassing anti-bot protection systems and easily fix the vulnerabilities they exploit. This leads to an ongoing arms race where these packages come up with new workarounds, and anti-bot companies patch these workarounds as well. Therefore, it's important to find a long-term effective solution.
There's an ultimate solution: with ScrapingBee, you can bypass any anti-bot system, regardless of its complexity or how frequently it's updated.
Scraping undetected with ScrapingBee
ScrapingBee is an all-in-one web scraping solution that handles all anti-bot bypasses for you, allowing you to focus on getting the data you want.
To start, sign up for a free ScrapingBee trial (no credit card needed), and you'll receive 1,000 credits to begin. Each request costs approximately 25 credits.
Next, install the ScrapingBee Node SDK with npm:
npm install scrapingbee
Then, import the ScrapingBee client and create a new instance using your API key:
import scrapingbee from "scrapingbee";
const client = new scrapingbee.ScrapingBeeClient("YOUR_SCRAPINGBEE_API_KEY");
Here’s the quick start code:
import scrapingbee from "scrapingbee";
(async () => {
const client = new scrapingbee.ScrapingBeeClient("YOUR_SCRAPINGBEE_API_KEY");
try {
const response = await client.get({
url: "https://www.g2.com/products/anaconda/reviews",
params: {
"stealth_proxy": true,
"country_code": "gb",
"block_resources": true,
"device": "desktop",
"wait": "1500",
},
});
console.log("Status Code:", response.status);
const htmlSnippet = response.data.toString("utf-8");
console.log("HTML Snippet:", htmlSnippet);
} catch (error) {
console.error("A problem occurred: ", error.response?.data || error.message);
}
})();
Here’s our result:
Yes, it’s that simple! The status code 200 indicates that the G2 anti-bot has been bypassed.
Wrapping Up
You have learned about the challenges Puppeteer faces with bot detection and how to address them. Puppeteer Extra lets you expand Puppeteer's capabilities with various plugins, and its Stealth plugin bypasses many common bot detection mechanisms.
However, even with a sophisticated Puppeteer Extra script, advanced anti-bot technologies like Cloudflare can still detect and block your scripts. In such cases, consider simple solutions like ScrapingBee 🚀