Over 7.59 million active websites use Cloudflare. The website you intend to scrape might be protected by it. Websites protected by services like Cloudflare can be challenging to scrape due to the various anti-bot measures they implement. If you've tried scraping such websites, you're likely already aware of the difficulty of bypassing Cloudflare's bot detection system.
Bypassing Cloudflare becomes a near-necessity for large-scale projects or scraping popular websites. There are various methods to bypass Cloudflare, each with its pros and cons. In this guide, we'll explore each method in detail, allowing you to choose the one that best suits your needs.
Without further ado, let’s get started!
Cloudflare blocking errors and response codes
If you’ve tried scraping a site protected by Cloudflare, you may have encountered some of the following Cloudflare 1XXX errors. These errors appear in the HTML body of the response.
Cloudflare Error 1005
Cloudflare Error 1005, commonly seen as "Access Denied: You have been banned," occurs when a website owner or Cloudflare blocks your IP address. This typically happens due to security measures designed to prevent malicious activities like web scraping, DDoS attacks, or other suspicious behaviors. Your IP may be flagged if it's identified as engaging in scraping activities or violating the website’s policies. Overcoming this error often involves changing your IP address using residential proxies or by using a web scraping API.
Cloudflare Error 1015
Cloudflare Error 1015 refers to a rate-limiting error. It occurs when a website owner restricts the number of requests allowed in a specific timeframe, and you exceed this limit. This happens when you send too many requests in a short period.
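A common way to stay under these limits, and to recover gracefully once you hit a 1015, is to space out your requests and back off exponentially between retries. Below is a minimal sketch of such a delay generator; the retry count, base delay, and cap are illustrative values, not Cloudflare-specific thresholds:

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponentially growing, jittered delays to sleep between
    retries after a rate-limit response (HTTP 429 / Cloudflare 1015)."""
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        # full jitter avoids many clients retrying in lockstep
        yield delay * random.uniform(0.5, 1.0)

# usage sketch: sleep between retries until the request succeeds
# for delay in backoff_delays():
#     if try_request():  # hypothetical request function
#         break
#     time.sleep(delay)
```

The jitter matters as much as the exponential growth: without it, a fleet of scrapers that got rate-limited together will retry together and get rate-limited again.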
Cloudflare Error 1009
Cloudflare Error 1009 refers to the 'Access Denied: Country or Region Banned' error. It occurs when the website owner has banned the country or region where your IP address originates.
Cloudflare Error 1020
Cloudflare Error 1020, known as the Access Denied error, occurs when you violate a firewall rule set up by the Cloudflare-protected website. This violation can occur due to various reasons, including sending too many requests to the website.
Cloudflare Error 1010
Cloudflare Error 1010 means that the website owner has banned your access based on your browser's signature. This can occur when you attempt to scrape a website using automated tools like Selenium, Puppeteer, or Playwright. These tools often leave a distinct fingerprint that Cloudflare's JavaScript detection can easily identify.
Cloudflare uses different types of technology to accurately classify bots and safeguard websites from malicious bot activity. Below are some of the key technologies and features used by Cloudflare:
Web Application Firewall (WAF)
The Cloudflare WAF uses threat intelligence and machine learning to automatically block emerging threats. It sits in front of web applications to stop a wide range of attacks using powerful rulesets, advanced rate limiting, and other security measures.
Distributed denial-of-service (DDoS) protection
DDoS attacks can slow down or shut down services. Cloudflare's autonomous DDoS protection systems automatically detect and mitigate these threats.
Under Attack Mode
Cloudflare's I'm Under Attack Mode performs additional security checks to mitigate DDoS attacks. When enabled, visitors encounter a CAPTCHA challenge on an interstitial page, filtering out bots and ensuring only valid users access the site.
Bot Management
Cloudflare Bot Management stops bad bots while allowing good bots like search engine crawlers, with minimal latency and rich analytics. It uses machine learning, behavioral analysis, and fingerprinting to classify bots accurately.
How to bypass Cloudflare anti-bot protection when scraping in 2024
While Cloudflare uses various methods to detect web scrapers, there are a few tried-and-tested techniques you can use to bypass its bot protection. Combining several of these techniques can significantly improve your success rate.
- Method 1: Fortified Headless Browsers
- Method 2: Scrape a Cached Version
- Method 3: Backward Engineer Cloudflare's Anti-Bot Detection Measures
- Method 4: Use a Web Scraping API
- Method 5: Get Premium Proxies
Fortified Headless Browsers
Most headless browsers are designed for testing website functionality and automating actions, not web scraping. As such, they have several traits that make bypassing Cloudflare's anti-bot protection difficult.
For example, a commonly known leak in headless browsers like Puppeteer, Playwright, and Selenium is the value of the `navigator.webdriver` property. In regular browsers, this property is set to `false`. However, in unfortified headless browsers, it is set to `true`.
Don't worry, there are solutions for each of the most popular headless browsers:
- Selenium: Try using the undetected_chromedriver as it optimizes and patches the base Selenium library, making it more effective at bypassing Cloudflare. If this option isn't working, you can also consider using the NoDriver library, which is the official successor of undetected_chromedriver.
- Puppeteer: Puppeteer Extra (an open-source library) extends Puppeteer functionality with useful plugins like stealth (`puppeteer-extra-plugin-stealth`). It also provides other plugins such as `adblocker`, `recaptcha`, `anonymize-ua`, etc.
- Playwright: The Stealth plugin is also available for Playwright. Here are some Puppeteer Extra plugins compatible with Playwright at the time of writing: `puppeteer-extra-plugin-stealth`, `puppeteer-extra-plugin-recaptcha`, and `plugin-proxy-router`.
There are other popular Cloudflare solvers such as Cloudscraper and FlareSolverr. FlareSolverr is a proxy server used to bypass Cloudflare and DDoS-GUARD protection. Cloudscraper is a simple Python module designed to bypass Cloudflare's anti-bot page, also known as "I'm Under Attack Mode" or IUAM.
These are all great options, but they come with various downsides. One of the major issues with open-source packages such as NoDriver is that anti-bot companies can detect how these packages bypass their anti-bot protection systems and easily fix the issues they exploit.
This leads to an ongoing arms race where these packages come up with new workarounds, and anti-bot companies patch these workarounds as well.
Scrape a Cached Version
You can consider scraping data from Google's Cached Version instead of the actual website. This is helpful when dealing with highly protected websites or when you need data that doesn't change often.
Most websites protected by Cloudflare allow Google to crawl their websites. When Google crawls the web to index web pages, it creates a cache of the data it finds.
Google provides a cached version of most websites, which you can access by using the URL `https://webcache.googleusercontent.com/search?q=cache:YOUR_WEBSITE_URL`. Alternatively, you can use the Internet Archive Wayback Machine.
If you would like to scrape `https://www.g2.com/`, the URL for the Google cache version would be:
`https://webcache.googleusercontent.com/search?q=cache:https://www.g2.com/`
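Building the cache URL is just string concatenation, so a tiny helper is enough (assuming the cache endpoint is still available for your target site):

```python
def google_cache_url(target_url: str) -> str:
    """Prefix a URL with Google's web cache endpoint."""
    return "https://webcache.googleusercontent.com/search?q=cache:" + target_url

print(google_cache_url("https://www.g2.com/"))
# https://webcache.googleusercontent.com/search?q=cache:https://www.g2.com/
```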
How often Google updates its cache depends on the site's popularity and how often its content changes. For less popular sites or sites that are rarely updated, the cache could be quite old.
Additionally, some sites, like LinkedIn, do not allow Google to store their data. Make sure to check if a cached version is available and meets your needs.
Backward Engineer Cloudflare's Anti-Bot Detection Measures
The most complex way to bypass Cloudflare's anti-bot measures is to reverse engineer their protection system and develop a solution that evades all their checks.
As web technologies evolve, so do the strategies websites use to detect and block automated scraping. These anti-bot technologies can involve complex algorithms designed to analyze user behavior, request patterns, and other hallmarks of automated activity.
Reverse engineering these technologies to understand or bypass them isn't straightforward. It requires a deep understanding of both network security and web technologies.
This method is effective (commonly used by many advanced proxy solutions), but it's challenging to implement and requires significant technical expertise.
The bot detection methods used by Cloudflare can generally be classified into two categories:
- Passive: These bot detection techniques consist of fingerprinting checks performed on the backend server.
- Active: These detection techniques rely on checks performed on the user's browser (client-side).
Let's dive into a few examples from each category together.
Passing Cloudflare Passive Bot Detection Techniques
Here's a list of some passive bot detection techniques that Cloudflare uses on the server side, along with how to potentially bypass them:
IP Address Reputation: Cloudflare assigns an IP reputation score (also known as a risk or fraud score) to your request based on factors like location, internet provider, and past activity. Residential or mobile proxies generally have higher reputation scores than data center proxies or those linked to VPNs, so using those will improve your IP's reputation with Cloudflare.
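In practice, this usually means rotating requests across a pool of residential or mobile proxies so that no single IP accumulates a bad history. A minimal rotation helper might look like the sketch below; the proxy endpoints are placeholders, not real servers, and you'd substitute your provider's credentials:

```python
import itertools

# Placeholder residential proxy endpoints -- substitute your provider's.
PROXY_POOL = [
    "http://user:pass@res-proxy-1.example.com:8000",
    "http://user:pass@res-proxy-2.example.com:8000",
    "http://user:pass@res-proxy-3.example.com:8000",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxy() -> dict:
    """Return a requests-style proxies dict, cycling through the pool
    so that consecutive requests come from different IP addresses."""
    endpoint = next(_proxy_cycle)
    return {"http": endpoint, "https": endpoint}

# usage sketch: requests.get(url, proxies=next_proxy())
```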
HTTP Request Headers:
Cloudflare analyzes HTTP request headers to distinguish between real browsers and automated bots. If your scraper uses a non-browser user agent like `python-requests/2.22.0`, it will likely be flagged as a bot.
Cloudflare can block a bot if the request it sends lacks certain headers that a normal browser would include. Browsers typically send a set of headers, while some bots might lack certain ones.
Additionally, Cloudflare checks for inconsistencies between headers and the User-Agent string. For instance, if the user-agent string claims the client is Firefox but includes a header (`sec-ch-ua-full-version-list`) that Firefox wouldn't send, Cloudflare can recognize this as suspicious and block the request.
To bypass Cloudflare's bot detection mechanisms, you need to make your scraper's requests appear as if they are coming from a real browser. One way to achieve this is to use a complete set of browser headers and a user-agent that match the browser you want to appear as. These can be extracted from the latest versions of browsers like Google Chrome, Safari, or Firefox.
Alternatively, you can find popular user agents on the following resource. Finally, you can use solutions like Fake-Useragent (for Python), User Agents (for JS), Faker (for Ruby), and Fake User Agent (for Rust) that can generate a user-agent string for you.
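A simple version of this is to keep a full, internally consistent browser header set and attach it to every request. The headers below are modeled on a real Chrome-on-Windows profile; treat the exact version numbers as examples to refresh periodically rather than canonical values:

```python
# Example Chrome-on-Windows header set. Keep the fields consistent with
# each other: a Chrome User-Agent should ship Chrome's sec-ch-ua headers.
CHROME_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/126.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Ch-Ua": '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Upgrade-Insecure-Requests": "1",
}

def browser_headers() -> dict:
    """Return a fresh copy of the browser header set for one request."""
    return dict(CHROME_HEADERS)

# usage sketch: requests.get(url, headers=browser_headers())
```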
TLS & HTTP/2 Fingerprints:
Cloudflare uses several techniques to detect bots, with TLS (Transport Layer Security) and HTTP/2 fingerprinting being among the most sophisticated.
When a client makes an HTTP request, it sends various headers that indicate which browser it claims to be (e.g., User-Agent string). Cloudflare checks if the TLS and HTTP/2 fingerprints match the declared browser headers. If there’s a mismatch, it suggests that the request might be coming from a bot trying to spoof a real browser.
Different versions of browsers and HTTP clients have distinct TLS and HTTP/2 fingerprints, which Cloudflare compares to the declared browser headers to verify authenticity. For example, a Chrome browser on Windows (version 104) would have a different fingerprint than:
- A Chrome browser on Windows (version 87)
- A Chrome browser on an Android device
- A Firefox browser
- The Python HTTP requests library
To bypass Cloudflare's fingerprinting tests, ensure your browser headers, TLS fingerprint, and HTTP/2 fingerprint are all consistent and indicate that the request is coming from a real browser.
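To see why spoofing only the User-Agent isn't enough, consider how a JA3-style TLS fingerprint is derived: the fields of the TLS ClientHello are joined into a string and hashed, so a client whose cipher list or extension order differs from real Chrome produces a different fingerprint no matter what its headers claim. Here's a toy version of that derivation; the numeric field values below are made up for illustration:

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """JA3-style fingerprint: join the ClientHello fields with commas
    (and each field's values with dashes), then MD5-hash the result."""
    raw = ",".join([
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ])
    return hashlib.md5(raw.encode()).hexdigest()

# the same fields in a different cipher order hash differently --
# exactly the kind of mismatch Cloudflare can detect
chrome_like = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23], [0])
reordered = ja3_fingerprint(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23], [0])
```

On the request side, rather than constructing a fingerprint by hand, libraries such as curl_cffi can impersonate a real browser's TLS fingerprint for you.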
Passing Cloudflare Active Bot Detection Techniques
Now you need to deal with Cloudflare client-side verification tests! When you visit a Cloudflare-protected website, several checks constantly run on the client-side to determine if you're a bot.
Here are the main client-side bot fingerprinting techniques that Cloudflare performs in users’ browsers, which you will need to bypass.
Turnstile CAPTCHAs:
Since September 2023, Cloudflare has replaced all its CAPTCHAs with Turnstile, a tool that provides a frustration-free web experience for visitors. However, unlike traditional CAPTCHAs, Turnstile usually operates in the background, which means it verifies users without requiring them to solve puzzles or challenges.
Below is an example of a Cloudflare Turnstile widget appearing on a Cloudflare-protected site:
When you try to access a Turnstile-protected website, Cloudflare runs a series of small, non-interactive JavaScript challenges in the background. These challenges include proof-of-work, proof-of-space, probing web API access, and checking for browser quirks, and together they analyze your browser environment to distinguish real users from bots.
Cloudflare Turnstile dynamically adjusts challenge difficulty based on user behavior, thwarting automated scraping scripts.
Canvas Fingerprinting: Another technique Cloudflare uses to detect scrapers is canvas fingerprinting. It’s surprising how much information your browser reveals just by visiting a website. From your operating system and screen resolution to your browser version and installed fonts, these seemingly generic details combine to create a unique fingerprint that identifies you with an accuracy of around 90-99%.
When a user visits a site, a specific JavaScript code instructs the browser to draw a hidden layer of text or graphics on the canvas, which is then turned into either a token or a hash. This hash allows the website to track your browser's activity, and if it detects scraper-like behavior, it can take action.
In bot detection, this is useful because bots often lie about their technology using their user-agent header. Cloudflare maintains a large dataset of legitimate canvas fingerprints paired with user agents.
If a request claims to be from a Chrome browser on a Linux machine, but the canvas fingerprint reveals that it's actually coming from Firefox on a Windows machine, this discrepancy will make Cloudflare challenge or block the request.
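The hashing step that produces the fingerprint is simple; what makes the technique powerful is that the rendered pixels differ subtly across GPU, driver, OS, and font stacks. A toy version of the token generation, using fake byte strings in place of a real canvas readback, looks like this:

```python
import hashlib

def canvas_token(pixel_bytes: bytes) -> str:
    """Hash a rendered pixel buffer (e.g. the bytes behind
    canvas.toDataURL()) into a short identifying token."""
    return hashlib.sha256(pixel_bytes).hexdigest()[:16]

# two machines rendering the "same" text rarely produce identical pixel
# bytes, so their tokens differ (these byte strings are stand-ins)
token_a = canvas_token(b"machine-A-render")
token_b = canvas_token(b"machine-B-render")
```

Because the hash is deterministic for a given machine, the same browser produces the same token on every visit, which is what lets Cloudflare pair tokens with user agents at scale.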
Note: Cloudflare uses Google's Picasso fingerprinting to generate canvas fingerprints.
You can use BrowserLeaks Live Demo to see your browser's canvas fingerprint.
One important element of the BrowserLeaks report is the Signature value, which is derived from the rendered image data.
You could disable the canvas API to avoid getting blocked while scraping, though this has trade-offs. Another approach is to disable JavaScript, but most websites depend on it to display content.
The two popular methods to bypass canvas fingerprinting are using fortified headless browsers (such as Puppeteer with the Stealth plugin) and enabling anti-canvas extensions in a headless browser.
Event Tracking:
Another challenge you may face while interacting with web pages is Cloudflare event tracking. Cloudflare adds event listeners to monitor actions such as mouse movements, clicks, and key presses. Real users typically interact with a page using their mouse or keyboard. If Cloudflare detects no mouse movement, it may assume the request is coming from an automated browser.
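One common countermeasure is to replay believable input events instead of teleporting the cursor straight to its target. The sketch below generates a curved, jittered mouse path between two points (a quadratic Bezier with noise) that an automation tool's mouse-move API can step through; the step count and noise ranges are arbitrary tuning values:

```python
import random

def human_mouse_path(start, end, steps=25):
    """Return a list of (x, y) points forming a curved, slightly noisy
    trajectory from start to end, instead of a single straight jump."""
    (x0, y0), (x1, y1) = start, end
    # a random control point bends the path off the straight line
    cx = (x0 + x1) / 2 + random.uniform(-80, 80)
    cy = (y0 + y1) / 2 + random.uniform(-80, 80)
    path = []
    for i in range(steps + 1):
        t = i / steps
        # quadratic Bezier interpolation plus small per-point jitter
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        path.append((x + random.uniform(-1, 1), y + random.uniform(-1, 1)))
    return path
```

Feeding these points one at a time into, say, Playwright's `page.mouse.move` produces movement events with the curvature and timing variation that a straight-line jump lacks.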
Reverse engineering a bypass for Cloudflare's anti-bot system is a bit challenging. It requires understanding the algorithm responsible for generating the challenge and validating the response. This involves:
- Intercepting Cloudflare network requests during the Waiting Room page load.
- Debugging and deobfuscating the Cloudflare JavaScript challenge script.
- Analyzing the deobfuscated script to solve the JavaScript challenges and return the correct result.
The downside of this approach is that you'll need to delve into Cloudflare's obfuscated anti-bot system and employ trial-and-error to bypass its verification. Maintaining this system becomes increasingly difficult as Cloudflare continues to develop its anti-bot protection.
While possible, I only recommend this for those interested in the intellectual challenge of reverse engineering a complex system or those who gain economic benefit from building and maintaining such a solution.
At ScrapingBee, we've built our tools with an intimate understanding of these anti-bot measures. Our service handles the intricacies of headers, proxies, and even JavaScript rendering, making your scraping efforts more human-like and less detectable.
Using ScrapingBee means you have a sophisticated ally designed to navigate complex website defenses. We stay updated with the latest in web security trends, so you don't have to continually adjust your strategies.
Remember, the key isn't just about scraping data—it's about doing it in a way that maintains your operations under the radar and in compliance with web standards. With ScrapingBee, you're equipped to adapt to and overcome modern web defenses.
Use a Web Scraping API
Web scraping APIs such as ScrapingBee are designed to tackle the challenges posed by Cloudflare. This all-in-one tool is capable of bypassing anti-bot measures and is user-friendly.
Let's use ScrapingBee to extract data from the MySQL reviews page on G2 , which is protected by Cloudflare. To get started, simply sign up to get your free API key.
You’ll be redirected to the ScrapingBee dashboard where you can see your API key, API credits consumed, etc.
Navigate to the HTML API Request Builder page. Select "Python" (although ScrapingBee supports multiple languages) and enter the target URL. Then, check the boxes for "Block Ads", "Stealth Proxy", "Block Resources", and "JavaScript Rendering" to enable them.
Note: You can directly test this tool on the Request Builder page by clicking the "Try It" button.
Now, copy the code snippet into your code editor. Then, install the Python Requests library using the following command:
```
pip install requests
```
The code has been modified to save the screenshot in your directory. We've set the "screenshot" parameter to True. You can also enable this option by checking the box on the request builder page.
We’re using `stealth_proxy=True` but not the premium proxy. This is because even with `premium_proxy=True`, some websites are difficult to scrape. For these challenging websites, ScrapingBee provides a new pool of proxies specifically designed to bypass strong anti-scraping measures. To use this pool, simply add `stealth_proxy=True` to your API calls.
```python
import requests

def send_request():
    response = requests.get(
        url="https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": "YOUR_API_KEY",
            "url": "https://www.g2.com/products/mysql/reviews",
            "block_ads": "true",
            "stealth_proxy": "true",
            "screenshot": "true",
        },
    )
    print("Response HTTP Status Code: ", response.status_code)
    if response.ok:
        with open("screenshot.png", "wb") as f:
            f.write(response.content)
        print("Successfully saved the screenshot!")

send_request()
```
The output is:
Great, you've bypassed Cloudflare with Python and ScrapingBee!
Using a web scraping API like ScrapingBee saves you from dealing with various anti-scraping measures, making your data collection efficient and less prone to blocks.
Get Premium Proxies
Web scraping proxies act as intermediaries between you and a target server, allowing you to route requests through various IP addresses. Free proxies are readily available but come with drawbacks like slower speeds, high detection risk, and high failure rates.
Premium proxies, in contrast, offer reliability, faster speeds, and a higher chance of avoiding detection. Residential proxies use real user IP addresses from real ISPs, making them appear like genuine users and reducing the likelihood of being blocked by websites.
For tasks that require higher security and less detectability, consider using stealth proxies, which are designed to be undetectable as proxies. These are especially useful for scraping websites with aggressive anti-scraping measures.
ScrapingBee offers a proxy mode, essentially a front-end to its core API. This mode handles requests like any standard API call. It bypasses anti-bot measures using proxy rotation, anti-CAPTCHA solutions, and headless browsers – all with a single API call. Just remember to tick "Block Resources", or you'll make a separate proxy call for every resource on the page.
Offering both API and proxy modes, ScrapingBee integrates seamlessly into any scraping project.
Navigate to the request builder and select proxy mode.
Here’s the code using proxy mode:
```python
import requests

def send_request():
    proxies = {
        "http": "http://YOUR_API_KEY:render_js=False&stealth_proxy=True@proxy.scrapingbee.com:8886",
        "https": "https://YOUR_API_KEY:render_js=False&stealth_proxy=True@proxy.scrapingbee.com:8887",
    }
    response = requests.get(
        url="https://author.today/", proxies=proxies, verify=False)
    print("Response HTTP Status Code: ", response.status_code)

send_request()
```
The output shows that the code has successfully bypassed Cloudflare.
Conclusion
Most modern websites use Cloudflare, and its bot detection mechanisms make web scraping difficult. We discussed various methods to bypass this protection, which can help you overcome these challenges.
While these methods can bring success in some cases, they still pose certain risks and limitations. In such cases, ScrapingBee can be a great alternative. It helps you scrape any website's data with just a few lines of code and ensures you achieve your scraping goals effortlessly.