Over 7.59 million active websites use Cloudflare. There's a high chance that the website you intend to scrape might be protected by it. Websites protected by services like Cloudflare can be challenging to scrape due to the various anti-bot measures they implement. If you've tried scraping such websites, you're likely already aware of the difficulty in bypassing Cloudflare's bot detection system.
In this article, you’ll learn how to use Cloudscraper, an open-source Python library, to scrape Cloudflare-protected websites, including some of its advanced features such as CAPTCHA bypass and user-agent manipulation. We’ll also discuss the limitations of Cloudscraper and suggest the most effective alternative.
Without further ado, let’s get started!
TL;DR Cloudscraper Python example code
If you're in a hurry, here's the code we'll write in this tutorial. We'll scrape the domain names and their prices from the Cloudflare-protected website namecheap.com.
If your target website is protected by a CAPTCHA, remember to swap out "YOUR_2CAPTCHA_API_KEY" with your actual 2Captcha API key. First, sign up on 2Captcha (a paid service) to get an API key. If your target website isn't CAPTCHA protected, just comment out the captcha argument.
import cloudscraper
from bs4 import BeautifulSoup
import json


def fetch_html_content(url):
    """
    Fetches HTML content from the specified URL using Cloudscraper.

    Args:
        url (str): The URL to fetch HTML content from.

    Returns:
        str: The HTML content of the webpage, or None if an error occurs.
    """
    try:
        # Create a Cloudscraper instance to bypass bot protection
        scraper = cloudscraper.create_scraper(
            interpreter="nodejs",
            delay=10,
            browser={
                "browser": "chrome",
                "platform": "ios",
                "desktop": False,
            },
            captcha={
                "provider": "2captcha",
                "api_key": "YOUR_2CAPTCHA_API_KEY",
            },
        )
        # Send a GET request to the URL
        response = scraper.get(url)
        # Check if the request was successful
        if response.status_code == 200:
            # Return the HTML content of the webpage
            return response.text
        else:
            # Print an error message if the request fails
            print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
            return None
    except Exception as e:
        # Print an error message if an exception occurs during the request
        print(f"An error occurred while fetching URL: {url}. Error: {str(e)}")
        return None


def extract_domain_prices(html_content):
    """
    Extracts domain names and their prices from HTML content.

    Args:
        html_content (str): The HTML content of the webpage.

    Returns:
        dict: A dictionary containing domain names as keys and their prices as values.
    """
    try:
        # Create a BeautifulSoup object to parse HTML content
        soup = BeautifulSoup(html_content, "html.parser")
        # Initialize an empty dictionary to store domain names and prices
        domain_prices = {}
        # Find all product card elements on the webpage
        product_cards = soup.find_all(class_="nchp-product-card")
        # Iterate over each product card
        for card in product_cards:
            # Find the element containing the domain name
            domain_name_elem = card.find(class_="nchp-product-card__name")
            # Extract the text of the domain name element and strip whitespace
            domain_name = domain_name_elem.text.strip() if domain_name_elem else None
            # Find the element containing the price information
            price_elem = card.find(class_="nchp-product-card__price-text")
            # Extract the text of the price element and strip whitespace
            price_text = price_elem.text.strip() if price_elem else None
            # Keep only entries whose name starts with a dot (i.e., domains)
            if domain_name and domain_name.startswith("."):
                # Add the domain name and its corresponding price to the dictionary
                domain_prices[domain_name] = price_text
        # Return the dictionary containing domain names and prices
        return domain_prices
    except Exception as e:
        # Print an error message if an exception occurs during the extraction
        print(f"An error occurred while extracting domain prices. Error: {str(e)}")
        # Return an empty dictionary to indicate that no domain prices were extracted
        return {}


if __name__ == "__main__":
    # Define the URL of the webpage to scrape
    url = "https://www.namecheap.com/"
    # Fetch HTML content from the specified URL
    html_content = fetch_html_content(url)
    # Check if HTML content was successfully retrieved
    if html_content:
        # Extract domain prices from the HTML content
        domain_prices = extract_domain_prices(html_content)
        # Check if domain prices were successfully extracted
        if domain_prices:
            # Save domain prices to a JSON file
            with open("domain_prices.json", "w") as json_file:
                json.dump(domain_prices, json_file, indent=4)
            print("Domain prices have been saved to domain_prices.json.")
        else:
            # Print a message if no domain prices were extracted
            print("No domain prices were extracted.")
    else:
        # Print a message if HTML content retrieval failed
        print("Failed to retrieve HTML content. Please check your internet connection or try again later.")
The result is a domain_prices.json file mapping each domain name to its price.
What Is Cloudscraper and How Does It Work
Cloudscraper is an open-source Python library built on top of the Requests library and designed to bypass Cloudflare's anti-bot challenges, so you can scrape data from websites protected by Cloudflare.
When you frequently request websites protected by Cloudflare, the service may suspect you're a bot and present a challenge to verify you're a real user. This process can take up to 5 seconds. If you fail the challenge, you'll be blocked from accessing the website.
Cloudflare uses various browser fingerprinting challenges and checks to distinguish between genuine users and scrapers/bots. Cloudscraper circumvents these challenges by mimicking the behavior of a real web browser.
This tool leverages a JavaScript engine/interpreter to overcome JavaScript challenges without requiring explicit deobfuscation and parsing of Cloudflare's JavaScript code.
Note: Cloudscraper is an actively maintained open-source project, which is a rarity for open-source Cloudflare tech. It has more than 3.8k stars on GitHub.
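At its simplest, Cloudscraper is a drop-in replacement for a Requests session. Here's a minimal sketch, using example.com as a stand-in target:
import cloudscraper

# create_scraper() returns a Requests-like session that solves
# Cloudflare's JavaScript challenges behind the scenes
scraper = cloudscraper.create_scraper()
response = scraper.get("https://example.com/")
print(response.status_code)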
How to Scrape with Cloudscraper
Let's take a step-by-step look at scraping with the Python cloudscraper library. We'll extract the domain names and prices from namecheap.com.
While extracting data, you may come across unrelated cards such as those offering information about discounted ‘.COM’ and ‘.INC’ domains. We’ll simply ignore these cards.
Step 1. Prerequisites
Before you start, make sure you meet all the following requirements:
- Python: Ensure you have Python installed by downloading the latest version from the official website. For this blog, we’re using Python 3.12.2.
- Code Editor: Choose a code editor, such as Visual Studio Code, PyCharm, or Jupyter Notebook, according to your preference.
- Libraries: Create a virtual environment using virtualenv (a sample setup follows the install command below). Once inside the virtual environment, run the following command in your terminal to install the required libraries.
pip install cloudscraper==1.2.71 beautifulsoup4==4.12.3
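If you haven't used virtualenv before, a typical setup looks something like this (the environment name venv is arbitrary, and the activation command differs on Windows):
pip install virtualenv
virtualenv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate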
Step 2. Define the HTML Fetching Function
Define a function that scrapes a Cloudflare-protected website and fetches the HTML content from the provided URL using the cloudscraper library.
def fetch_html_content(url):
Step 3. Create a Cloudscraper Instance
To bypass bot protection, create a Cloudscraper instance with advanced parameters including browser user agent, captcha handling, and JavaScript interpreter.
- JavaScript Interpreters and Engines: The interpreter parameter specifies the JavaScript engine used to solve the anti-bot challenges. In our case, "nodejs" is used.
- Browser/User-Agent Filtering: The browser parameter controls how and which "User-Agent" is randomly selected. We’re using mobile Chrome User-Agents on iOS in this example. Other options are available, such as mobile Chrome User-Agents on Android, desktop Chrome User-Agents on Windows, or any custom user agent.
- 3rd-Party Captcha Solvers: Cloudscraper currently supports several third-party captcha solvers: 2captcha, AntiCaptcha, CapSolver, CapMonster Cloud, DeathByCaptcha, and 9kw. This example uses 2Captcha as the captcha provider; make sure you’ve signed up on 2Captcha and obtained the required API key.
- Delay: The Cloudflare IUAM challenge requires the browser to wait for approximately five seconds before submitting the challenge answer. If you want to override this delay, you can do so using the delay parameter.
Here’s the sample code snippet:
scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    },
    captcha={
        "provider": "2captcha",
        "api_key": "YOUR_2CAPTCHA_API_KEY",
    },
)
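Cloudscraper's browser filtering is flexible. For instance, to impersonate desktop Chrome on Windows instead of mobile Chrome on iOS, a configuration along these lines should work (values here are illustrative):
import cloudscraper

# Illustrative alternative: desktop Chrome on Windows
scraper = cloudscraper.create_scraper(
    browser={
        "browser": "chrome",
        "platform": "windows",
        "desktop": True,   # allow desktop User-Agents
        "mobile": False,   # exclude mobile User-Agents
    }
)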
Step 4. Make a Request to the Website
To use Cloudscraper, you can follow the same steps as you would with Requests. Cloudscraper works just like a Requests Session object: instead of calling requests.get() or requests.post(), you call scraper.get() or scraper.post().
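For example, a POST request looks just like it would with Requests. The endpoint and payload below are placeholders:
# A hypothetical form submission; endpoint and payload are placeholders
response = scraper.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"},
)
print(response.status_code)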
Once you’ve made the request, check the response status code. If it's 200, the site has returned the page and you can scrape the data.
response = scraper.get(url)

if response.status_code == 200:
    return response.text
else:
    print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
    return None
Step 5. Extract Domain Names and Prices
This function extracts domain names and their corresponding prices from the provided HTML content.
def extract_domain_prices(html_content):
Step 6. Parsing HTML Content with BeautifulSoup
Create a BeautifulSoup object to parse the HTML content. html_content contains the raw HTML of the page, and html.parser specifies the parser to use.
soup = BeautifulSoup(html_content, "html.parser")
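html.parser ships with Python. If you have lxml installed, you can optionally swap it in for faster parsing:
# Optional: requires `pip install lxml`
soup = BeautifulSoup(html_content, "lxml")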
Step 7. Find Listing Elements
The next step is to locate all the product card elements present on the webpage, as these elements hold the information we need to scrape.
product_cards = soup.find_all(class_="nchp-product-card")
Use BeautifulSoup's find_all() method to locate all HTML elements with the class name nchp-product-card.
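As an optional sanity check, you can print how many cards the selector matched before parsing them, since class names on live sites change over time:
# If this prints 0, the class name may have changed on the live site
print(f"Matched {len(product_cards)} product cards")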
Step 8. Extract Information from Each Listing
To extract the necessary data, you need to iterate through each listing element. For each piece of data, use a conditional check to see whether the element exists; if it doesn't, assign None as the value.
for card in product_cards:
Extract Domain Names
First, locate the element with the nchp-product-card__name class. Then, extract and trim the text from the element.
domain_name_elem = card.find(class_="nchp-product-card__name")
domain_name = domain_name_elem.text.strip() if domain_name_elem else None
Extract Domain Prices
You can then use the nchp-product-card__price-text class to find the element containing the price information. Extract the text from the price element and remove leading and trailing whitespace.
price_elem = card.find(class_="nchp-product-card__price-text")
price_text = price_elem.text.strip() if price_elem else None
Step 9. Final Check Condition
We need to check whether the domain name starts with a dot. This is important because the class-based extraction also picks up cards for other Namecheap products, such as starter email, private email, or font maker, besides the domains.
Apply the following check condition to keep only the domain names and their corresponding prices.
if domain_name and domain_name.startswith("."):
    domain_prices[domain_name] = price_text
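The prices come back as display strings (for instance, something like "$9.98/yr"). If you'd rather store numeric values, a small helper like the one below could be added; the exact price format is an assumption, so adjust the pattern to match what the page actually renders:
import re

def parse_price(price_text):
    """Extract the first numeric value from a display string like '$9.98/yr'."""
    if not price_text:
        return None
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None

# Example: store numeric prices instead of raw strings
# domain_prices[domain_name] = parse_price(price_text)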
Step 10. Save Data to a JSON File
Once you’ve extracted the data from the listings, save it in a JSON file.
with open("domain_prices.json", "w") as json_file:
json.dump(domain_prices, json_file, indent=4)
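To double-check the output, you can optionally load the file back:
with open("domain_prices.json") as json_file:
    print(json.load(json_file))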
Step 11. Run the Scraper
Lastly, call the required functions to run the scraper and store data in a JSON file.
if __name__ == "__main__":
    url = "https://www.namecheap.com/"
    html_content = fetch_html_content(url)

    if html_content:
        domain_prices = extract_domain_prices(html_content)
        if domain_prices:
            # Store data in a JSON file (see Step 10)
            pass
The final output is a domain_prices.json file containing the extracted domain names and prices.
Drawbacks of Using Cloudscraper
It's important to note that Cloudscraper can't access every Cloudflare-protected website; if you try, you may receive a 403 status code. As Cloudflare evolves, open-source solutions like Cloudscraper become outdated and stop working.
Currently, Cloudscraper cannot scrape websites protected by the newer version of Cloudflare, and a significant number of websites already use this updated version. One such example is the author.today website.
When visiting the website, we’re automatically redirected to the Cloudflare waiting room for a security check.
Cloudflare accepts our connection and redirects us to the homepage since we’re sending the request from a real browser.
Let's try to access the website with Cloudscraper.
import cloudscraper
url = "https://author.today/"
scraper = cloudscraper.create_scraper()
info = scraper.get(url)
print(info.status_code)
print(info.text)
Instead of the page content, we get an error: Cloudscraper can't get past the Cloudflare challenge page, which blocks access to the website. Clearly, Cloudscraper isn't working against newer Cloudflare versions. In the next section, we’ll discuss an alternative that solves this problem.
Alternative to Cloudscraper
If you've been facing issues with newer versions of Cloudflare, it’s time to switch to an alternative solution. The ScrapingBee API is a great option to consider: it maintains the scraping infrastructure for you and keeps pace with Cloudflare's latest updates.
ScrapingBee offers a fresh pool of proxies that can handle even the most challenging websites. To use this pool, you simply need to add stealth_proxy=True to your API calls. The ScrapingBee Python SDK makes it easy to interact with the API.
Don't forget to replace "Your_ScrapingBee_API_Key" with your actual API key, which you can find in your ScrapingBee dashboard after signing up.
# Install the Python ScrapingBee library:
# pip install scrapingbee

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(
    api_key="Your_ScrapingBee_API_Key"
)

response = client.get(
    "https://author.today/",
    params={
        "stealth_proxy": "True",
    },
)

print("Response HTTP Status Code: ", response.status_code)

with open("response.html", "wb") as f:
    f.write(response.content)
This time, the request succeeds with a 200 status code. We did it!
We’ve successfully fetched the HTML, which is stored in a file named "response.html". Opening this file confirms that we’ve bypassed the Cloudflare-protected website and retrieved its HTML.
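One practical note: heavily protected sites can occasionally fail on the first attempt even with stealth proxies, so a simple retry loop can help. This is a generic sketch rather than an official ScrapingBee recommendation; the three attempts and 2-second pause are arbitrary choices:
import time

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key="Your_ScrapingBee_API_Key")

def fetch_with_retries(url, retries=3):
    """Retry the ScrapingBee call a few times before giving up."""
    for attempt in range(1, retries + 1):
        response = client.get(url, params={"stealth_proxy": "True"})
        if response.status_code == 200:
            return response
        print(f"Attempt {attempt} failed with status {response.status_code}")
        time.sleep(2)  # brief pause before the next attempt
    return None

response = fetch_with_retries("https://author.today/")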
Wrapping Up
You've learned how to scrape Cloudflare-protected websites using Cloudscraper, an open-source Python library specifically designed to bypass Cloudflare, along with some of its advanced features, such as captcha handling and user-agent control. Finally, you've explored the limitations of Cloudscraper and an effective alternative for the websites it can't handle.