Puppeteer Stealth Tutorial: How to Set Up & Use (+ Working Alternatives)

29 August 2024 | 13 min read

Puppeteer is a robust headless browser library created mainly to automate user interactions. However, it can be easily detected and blocked by anti-scraping measures due to its lack of built-in stealth capabilities. This is where Puppeteer Extra comes in, offering plugins like Stealth to address this limitation.

This tutorial explores how to use Puppeteer Stealth to evade detection while scraping websites. We also cover solutions and alternatives for bypassing the latest cutting-edge anti-bot tech, which Puppeteer Stealth sometimes struggles to evade.

Without further ado, let’s get started!

What is puppeteer stealth and how does it work?

Puppeteer Stealth, also known as puppeteer-extra-plugin-stealth, is a powerful extension for Puppeteer, built on top of Puppeteer Extra. It is designed to address browser fingerprinting and evasion issues, enhancing Puppeteer's ability to avoid detection by anti-scraping mechanisms used by many websites. With 17 evasion modules, this plugin tackles various fingerprinting discrepancies that can lead to bot detection.

Key Features and Mechanisms

  • Browser Fingerprint Modification: Puppeteer Stealth modifies key browser properties, or "fingerprints", that websites use to detect and block automated requests. It masks telltale headless signals, such as the HeadlessChrome token in the user agent string and the navigator.webdriver flag, making the browser appear more like a regular browser. It also changes other properties and behaviors that might reveal automation.
  • Built-In Evasion Modules: This plugin comes packed with several evasion modules that address various detection techniques. For instance, it hides the sourceurl attribute and alters the navigator.languages property to simulate a standard Chrome browser. You can find a complete list of these modules in the Puppeteer Stealth documentation.
  • Human-Like Defaults: To further avoid triggering anti-scraping scripts, Puppeteer Stealth normalizes automation-specific quirks, such as an empty navigator.plugins list and inconsistent permission responses, so an automated session looks closer to a real user's browser.

Puppeteer Stealth is effective at bypassing many anti-scraping mechanisms, but it's not foolproof. According to the official documentation, some advanced detection systems may still be able to detect automated activity. However, the goal of the project is to make detection so challenging and resource-intensive that it becomes impractical.
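To see the fingerprint patching in action, here's a minimal sketch (using the packages installed in Step 1 below) that checks navigator.webdriver, one of the most common automation giveaways:

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Vanilla headless Puppeteer exposes navigator.webdriver as true;
    // with the stealth plugin active, this should come back false
    // (or undefined, depending on the plugin version)
    const webdriver = await page.evaluate(() => navigator.webdriver);
    console.log("navigator.webdriver:", webdriver);

    await browser.close();
})();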

How to scrape with puppeteer stealth enabled

Let's see how to integrate Puppeteer Stealth into a Puppeteer scraping script to avoid getting blocked. Before diving into stealth mode, let's run two tests to see how easily the default Puppeteer setup is detected.

Initial tests without stealth mode

Test 1: Detecting Headless Mode

First, we'll check how a website detects a headless browser. We'll use the bot detection test by Antoine Vastel. If you open this page in a regular browser, you'll see a message confirming that you are not using a headless browser, which is expected.

not chrome headless

Now, let’s try visiting the same site using Puppeteer with the following script:

import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    
    await page.goto('https://arh.antoinevastel.com/bots/areyouheadless');
    await page.screenshot({ path: 'screenshot.png' });
    
    await browser.close();
})();

When you run this script, the website detects that you are using a headless browser, and the result looks like this:

chrome headless

This confirms that the test failed, as the page successfully detected the headless browser.

Test 2: Detecting Browser Automation

Next, let's examine another bot detection test on Sannysoft. We'll use the following script to test whether Puppeteer's default setup can bypass the detection.

import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    
    await page.goto('https://bot.sannysoft.com/');
    await page.screenshot({ path: 'screenshot.png', fullPage: true });
    
    await browser.close();
})();

The screenshot below shows red bars, indicating that bot fingerprints were detected and highlighting how the default Puppeteer setup fails the browser fingerprinting tests.

fingerprints detected

By default, Puppeteer has limited ability to bypass bot detection. To work around these detection mechanisms, you would typically need to manually tweak and override its default settings.
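To illustrate, a hand-rolled (and far from complete) version of such patching might look like the sketch below; it strips Chromium's automation flag at launch and overrides navigator.webdriver before any page script runs, but it only covers two of the many leaks a real fingerprinting suite checks:

import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch({
        // Disable Chromium's built-in automation signal
        args: ['--disable-blink-features=AutomationControlled'],
    });
    const page = await browser.newPage();

    // Patch navigator.webdriver before any page script runs
    await page.evaluateOnNewDocument(() => {
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    });

    await page.goto('https://bot.sannysoft.com/');
    await page.screenshot({ path: 'manual-patch.png', fullPage: true });
    await browser.close();
})();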

Rather than maintaining such patches by hand, let's do this using the Puppeteer Stealth plugin from Puppeteer Extra.

Step 1: Installing puppeteer stealth

First, you need to install Puppeteer Extra and the Puppeteer Stealth plugin. puppeteer-extra is required because it allows you to add plugins like puppeteer-extra-plugin-stealth to extend Puppeteer's capabilities. Install both packages using the following command:

npm install puppeteer-extra puppeteer-extra-plugin-stealth
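Note that puppeteer-extra doesn't bundle Puppeteer itself; it drives whatever Puppeteer installation it finds. If you don't already have Puppeteer installed from the earlier tests, add it as well:

npm install puppeteer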

Step 2: Configuring puppeteer stealth

To configure Puppeteer Stealth, start by importing Puppeteer Extra:

import puppeteer from "puppeteer-extra"

Next, import the StealthPlugin from puppeteer-extra-plugin-stealth:

import StealthPlugin from "puppeteer-extra-plugin-stealth";

If you’re using CommonJS, use the following instead:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Add the Stealth plugin and apply it using the default settings, which include all evasion techniques:

// Add the Stealth plugin and use default settings (all evasion techniques)
puppeteer.use(StealthPlugin());

Note: If you want to customize the evasion modules, refer to this README file.
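For example, the plugin instance exposes its active evasions as a set that you can inspect and prune before registering it, a pattern shown in the plugin's README:

// Create the plugin instance without registering it yet
const stealth = StealthPlugin();

// List the evasion modules that are currently active
console.log(stealth.enabledEvasions);

// Drop a specific evasion, then register the trimmed-down plugin
stealth.enabledEvasions.delete("chrome.runtime");
puppeteer.use(stealth);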

Next, launch Puppeteer Stealth with the headless option and open an async function to write your code. Note that headless: "new" selects Chrome's new headless mode in Puppeteer v19–21; from v22 onwards, headless: true launches the same mode by default, so both spellings used in this tutorial behave identically on current versions.

(async () => {
    // Launch Puppeteer Stealth with the headless option
    const browser = await puppeteer.launch({ headless: "new" });
    const page = await browser.newPage();
})();

Step 3: Navigating to the target website

Let’s set the viewport size and navigate to the target website.

(async () => {
    // ...
    
    // Set the viewport size
    await page.setViewport({ width: 1280, height: 720 });

    // Navigate to the target website
    await page.goto("https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT");
})();
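By default, page.goto resolves on the page's load event, which can fire before late-loading content has rendered. If your screenshots come back half-finished, one option is to wait until the network is mostly idle:

// Resolve once there are no more than 2 network connections for 500 ms
await page.goto(
    "https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT",
    { waitUntil: "networkidle2" }
);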

Next, capture a screenshot and close the browser instance.

// Take a screenshot of the page
await page.screenshot({ path: "screenshot.png" });

// Close the browser
await browser.close();

Putting it all together, here’s the code so far:

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({ headless: "new" });
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto("https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT");
    await page.screenshot({ path: "screenshot.png" });
    await browser.close();
})();

Here's what our result looks like:

product home page

Great! You've successfully avoided bot detection using Puppeteer Extra Stealth. Now, let's take it a step further and scrape the page.

Step 4: Scraping the page

In this step, we’ll extract key product data such as the title, price, ratings, reviews, product description, and image URL.

Extracting the product title:

The product title is located in a span element with the ID "productTitle".

product title

To extract it, use the following code:

// Extract the product title
const title = await page.$eval("#productTitle", el => el.textContent.trim());

The $eval method selects the element with the ID productTitle and retrieves its text content, trimming any extra whitespace.
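One caveat: page.$eval throws if nothing matches the selector, and Amazon serves several different page layouts. A small helper of our own (safeEval is our name, not a Puppeteer API) keeps one missing field from crashing the whole run:

// Hypothetical helper: returns null instead of throwing when a selector is missing
const safeEval = async (selector, extract) => {
    try {
        return await page.$eval(selector, extract);
    } catch {
        return null;
    }
};

const title = await safeEval("#productTitle", el => el.textContent.trim());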

Extracting the product price:

The product price can be found in a span element with the class a-offscreen, typically located near the Buy Now box.

product price

Use the following code to extract it:

// Extract the product price
const price = await page.$eval("span.a-offscreen", el => el.textContent.trim());

This code selects the first element with the class a-offscreen and retrieves its text content, trimming any surrounding whitespace.
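Be aware that a-offscreen appears in many places on an Amazon product page, and $eval returns the first match, which is usually, but not always, the main price. If you pick up the wrong value, scoping the selector to the price block can help; the container ID below is an assumption about Amazon's current markup and may change:

// Scope the selector to the main price container (ID is an assumption)
const price = await page.$eval(
    "#corePrice_feature_div span.a-offscreen",
    el => el.textContent.trim()
);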

Extracting product ratings and reviews:

To scrape the product rating and number of reviews:

product ratings and reviews

Here’s the code:

// Extract the product rating
const rating = await page.$eval("#acrPopover", el => {
    const title = el.getAttribute("title");
    return title ? title.replace("out of 5 stars", "").trim() : null;
});

// Extract the number of reviews
const reviews = await page.$eval("#acrCustomerReviewText", el => el.textContent.trim());

The rating is retrieved from the title attribute of the element with the ID acrPopover, while the number of reviews is obtained from the span element with the ID acrCustomerReviewText.
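Both values come back as strings, for example "4.6" and "1,234 ratings" (exact formats vary). If you plan to sort or aggregate on them, a small post-processing step converts them to numbers:

// Convert "4.6" to a float and "1,234 ratings" to an integer
const ratingValue = rating ? parseFloat(rating) : null;
const reviewCount = reviews ? parseInt(reviews.replace(/[^0-9]/g, ""), 10) : null;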

Extracting the product image URL:

The URL of the product image is found in an element with the ID landingImage.

product image url

Use the following code to get it:

// Extract the product image URL
const imageUrl = await page.$eval("#landingImage", el => el.getAttribute("src"));

This code retrieves the URL from the src attribute of the image element.

Extracting the product description:

The product description is contained in an element with the ID "productDescription".

product description

Here’s how to extract it:

// Extract the product description
const description = await page.$eval("#productDescription", el => el.textContent.trim());

Creating a product data object:

Store all the extracted information in an object:

const productData = {
    "Title": title,
    "Price": price,
    "Description": description,
    "Image Link": imageUrl,
    "Rating": rating,
    "Number of Reviews": reviews,
};

Saving data to a JSON file:

Save the scraped data to a JSON file using the fs/promises module:

import { writeFile } from 'fs/promises';

await writeFile("product_data.json", JSON.stringify(productData, null, 2), "utf-8");

Closing the browser:

Finally, close the browser to free up resources:

await browser.close();

Putting it all together:

Here’s the full code for extracting and saving product data:

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { writeFile } from 'fs/promises';

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Navigate to the Amazon product page
    await page.goto('https://www.amazon.com/Oculus-Quest-Advanced-All-One-2/dp/B09P4F68WT');

    // Extract product details
    const title = await page.$eval("#productTitle", el => el.textContent.trim());
    const price = await page.$eval("span.a-offscreen", el => el.textContent.trim());
    const description = await page.$eval("#productDescription", el => el.textContent.trim());
    const imageUrl = await page.$eval("#landingImage", el => el.getAttribute("src"));
    const rating = await page.$eval("#acrPopover", el => {
        const title = el.getAttribute("title");
        return title ? title.replace("out of 5 stars", "").trim() : null;
    });
    const reviews = await page.$eval("#acrCustomerReviewText", el => el.textContent.trim());

    // Create a product data object
    const productData = {
        "Title": title,
        "Price": price,
        "Description": description,
        "Image Link": imageUrl,
        "Rating": rating,
        "Number of Reviews": reviews,
    };

    // Save the product data to a JSON file
    try {
        await writeFile("product_data.json", JSON.stringify(productData, null, 2), "utf-8");
        console.log("Product data saved to product_data.json");
    } catch (err) {
        console.error("Error writing file:", err);
    }

    // Close the browser
    await browser.close();
})();
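Because the script uses ES module imports, save it with an .mjs extension (the filename scraper.mjs below is just an example) or set "type": "module" in your package.json, then run it with Node:

node scraper.mjs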

Once executed, this code will save all the extracted product data to a JSON file.

json output

Awesome! You’ve successfully scraped all the desired data from Amazon and stored it in a JSON file. The data looks clean and readable, ready for further analysis or processing.

Verifying stealth mode with puppeteer

Test 1: Detecting Headless Mode

In this test, we'll re-run Antoine Vastel's headless detection test, this time with Puppeteer Stealth enabled, to check whether our browser automation is still detectable.

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto("https://arh.antoinevastel.com/bots/areyouheadless");
    await page.screenshot({ path: "screenshot.png" });

    await browser.close();
})();

The result is:

not chrome headless

The browser is not detected as headless, indicating that the stealth mode is functioning correctly.

Test 2: Detecting Browser Automation

In the second test, we'll re-run the Sannysoft fingerprinting test to further verify that our browser automation is undetectable.

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    await page.goto("https://bot.sannysoft.com/");
    await page.screenshot({ path: "screenshot.png", fullPage: true });

    await browser.close();
})();

The result is:

fingerprints not detected

The stealth mode successfully prevents detection, as shown in the screenshot above.

Both tests passed successfully, demonstrating that Puppeteer’s Stealth Plugin effectively masks browser automation, making it undetectable by most fingerprinting techniques.

Drawbacks of puppeteer stealth

The puppeteer-extra-plugin-stealth package is designed to make headless Puppeteer scripts less detectable by anti-bot mechanisms. However, it has certain drawbacks:

  1. Challenges with Advanced Anti-Bot Systems: Even with the stealth plugin, sophisticated anti-bot services like Cloudflare and DataDome can still detect and block Puppeteer scripts.
  2. Performance Overheads: Using the stealth plugin can introduce additional overhead, potentially slowing down the execution of scripts.
  3. Maintenance Complexity: The plugin needs to be configured correctly, and keeping it up to date with the latest anti-bot detection methods can be time-consuming.

When scraping data from Amazon, making too many requests in a short amount of time can trigger anti-bot mechanisms. This can lead to challenges such as having to complete CAPTCHA verifications. For example, when attempting to run a web scraper multiple times, I encountered Amazon's CAPTCHA as shown in the image below.

amazon captcha challenge

To address this issue, you can consider the following approaches:

  • Throttling Requests: Reduce the frequency of your requests or introduce delays between them to avoid triggering CAPTCHA (a minimal sketch follows this list).
  • Rotating Proxies: Utilize rotating proxies to distribute requests across different IP addresses.
  • CAPTCHA Solving Services: Integrate a CAPTCHA-solving service to automatically handle CAPTCHA challenges when they arise.
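As a sketch of the first two ideas, the snippet below adds a random delay between requests and routes traffic through a proxy configured at launch. The proxy URL is a placeholder; a genuinely rotating setup would use your proxy provider's endpoint:

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

// Simple helper to pause between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
    const browser = await puppeteer.launch({
        // Placeholder endpoint; substitute your proxy provider's address
        args: ["--proxy-server=http://your-rotating-proxy:8080"],
    });
    const page = await browser.newPage();

    const urls = [/* ... product URLs ... */];
    for (const url of urls) {
        await page.goto(url);
        // ... extract data here ...

        // Wait a random 2-5 seconds before the next request
        await sleep(2000 + Math.random() * 3000);
    }

    await browser.close();
})();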

While the strategies mentioned can help reduce detection, advanced anti-bot solutions such as Cloudflare and DataDome are continually evolving. These systems may still block your requests despite the use of Puppeteer Stealth or other evasion techniques.

Given these challenges, it’s important to explore alternative solutions to Puppeteer Stealth for more robust scraping operations. Let’s examine some of these alternatives next.

Alternatives to puppeteer stealth

When dealing with advanced anti-bot systems, relying solely on Puppeteer Stealth may not be enough. Luckily, several alternatives offer enhanced capabilities for bypassing detection mechanisms. Here are some of the most effective options:

1. Selenium Stealth

Anti-bot systems can easily detect Selenium's built-in automation features and command-line flags. The Selenium Stealth plugin addresses this by masking various detection leaks, helping you bypass many common anti-bot detection mechanisms.

2. Undetected ChromeDriver

The Selenium Undetected ChromeDriver is an enhanced version of the standard ChromeDriver, designed to bypass anti-bot services such as Cloudflare, Distil Networks, Imperva, and DataDome. It patches most of the ways anti-bot systems can detect a Selenium bot or scraper, making your automation appear far more human-like than a stock setup.

3. Nodriver

Nodriver is a Python library and the successor to Undetected ChromeDriver, designed specifically to bypass CAPTCHAs and Web Application Firewalls (WAFs) like Cloudflare. Nodriver is fully asynchronous and replaces traditional components such as Selenium and webdriver binaries with direct communication with the browser. This approach not only reduces the detection rate of most anti-bot solutions but also significantly improves performance.

4. Cloudscraper

Cloudscraper is a Python library built on top of the Requests library, designed to bypass Cloudflare's anti-bot challenges. It is specifically created to scrape data from websites protected by Cloudflare. Cloudflare uses various browser fingerprinting challenges and checks to distinguish between genuine users and scrapers/bots, and Cloudscraper circumvents these challenges by mimicking the behavior of a real web browser.

One major challenge with open-source packages such as these is that anti-bot companies can detect their bypass methods and patch the vulnerabilities they exploit. This leads to an ongoing arms race: the packages ship new workarounds, and anti-bot companies counter them in turn. It's therefore important to find a solution that stays effective in the long term.

There's a longer-term solution: with ScrapingBee, you can bypass anti-bot systems regardless of their complexity or how frequently they update.

Scraping undetected with ScrapingBee

ScrapingBee is an all-in-one web scraping solution that handles all anti-bot bypasses for you, allowing you to focus on getting the data you want.

To start, sign up for a free ScrapingBee trial; no credit card is needed, and you'll receive 1,000 credits to begin. Each request costs approximately 25 credits.

Next, install the ScrapingBee Node SDK with npm:

npm install scrapingbee

Then, import the ScrapingBee client and create a new instance using your API key:

import scrapingbee from "scrapingbee";
const client = new scrapingbee.ScrapingBeeClient("YOUR_SCRAPINGBEE_API_KEY");

Here’s the quick start code:

import scrapingbee from "scrapingbee";

(async () => {
    const client = new scrapingbee.ScrapingBeeClient("YOUR_SCRAPINGBEE_API_KEY");

    try {
        const response = await client.get({
            url: "https://www.g2.com/products/anaconda/reviews",
            params: {
                "stealth_proxy": true,
                "country_code": "gb",
                "block_resources": true,
                "device": "desktop",
                "wait": "1500",
            },
        });

        console.log("Status Code:", response.status);

        const htmlSnippet = response.data.toString("utf-8");
        console.log("HTML Snippet:", htmlSnippet);

    } catch (error) {
        console.error("A problem occurred: ", error.response?.data || error.message);
    }
})();

Here’s our result:

html of G2

Yes, it’s that simple! The status code 200 indicates that G2's anti-bot protection has been bypassed.

Wrapping Up

You have learned about the challenges Puppeteer faces with bot detection and how to address them. Puppeteer Extra lets you extend Puppeteer's capabilities with various plugins, and its Stealth plugin effectively bypasses many common bot detection mechanisms.

However, even with a sophisticated Puppeteer Extra script, advanced anti-bot technologies like Cloudflare can still detect and block your scripts. In such cases, consider simple solutions like ScrapingBee 🚀

Satyam Tripathi

Satyam is a senior technical writer who is passionate about web scraping, automation, and data engineering. He has delivered over 130 blog posts since 2021.