In this article, we examine how to use the Python Requests library behind a proxy server. Developers use proxies for anonymity and security, and will sometimes even use more than one to prevent websites from banning their IP addresses. Proxies also carry several other benefits, such as bypassing filters and censorship. Feel free to learn more about rotating proxies before continuing, but let's get started!
ScrapingBee and proxies
Did you know that ScrapingBee has a native proxy mode? You just authenticate with your API key and optional request parameters, and ScrapingBee takes care of everything else. Sign up for a free account and enjoy the first 1,000 scraping requests completely on the house.
Prerequisites & Installation
This article is intended for those who would like to scrape behind a proxy in Python. To get the most out of the material, it is beneficial to:
- Have experience with Python 3.
- Have Python 3 installed on your local machine.
Check if the python-requests package is installed by opening the terminal and typing:
$ pip freeze
pip freeze will display all your current Python packages and their versions, so go ahead and check if it is present. If not, install it by running:
$ pip install requests
How to Use a Proxy with Requests
Routing your requests via a proxy is quite straightforward with Requests. You simply pass a Python dictionary with the relevant proxy addresses to the usual request methods, and Requests does all the rest:
- To use a proxy in Python, first import the requests package.
- Next, create a proxies dictionary that defines the HTTP and HTTPS connections. This variable should be a dictionary that maps each protocol to the proxy URL. Additionally, declare a url variable set to the webpage you're scraping from. Notice in the example below, the dictionary defines proxy URLs for two separate protocols, HTTP and HTTPS. Each protocol gets its own entry, but that does not mean the two cannot point to the same proxy address.
- Lastly, run your Requests call and save the result in the response variable. The important part here is to pass the proxies dictionary to the request method.
import requests
proxies = {
'http': 'http://proxy.example.com:8080',
'https': 'http://secureproxy.example.com:8090',
}
url = 'http://mywebsite.com/example'
response = requests.post(url, proxies=proxies)
With this code, a POST request is sent to the specified URL, using the proxy from the proxies dictionary that matches the protocol scheme of our URL. As we passed an HTTP URL, http://proxy.example.com:8080 will be picked.
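Conversely, had we passed an https:// URL, the 'https' entry of the dictionary would have been chosen instead; a quick illustration (the URL here is just an example):
# An https:// URL matches the 'https' entry, so secureproxy.example.com:8090 is used
response = requests.post('https://mywebsite.com/example', proxies=proxies)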
Setting Proxies with Environment Variables
In addition to configuring proxies with a Python dictionary, Requests also supports the standard proxy environment variables HTTP_PROXY and HTTPS_PROXY. This comes in particularly handy when you have a number of different Python scripts and want to set a proxy globally, without having to touch each script individually.
Simply set the following environment variables (as with the dictionary setup, HTTP and HTTPS are configured separately) and Requests will automatically route your requests via these proxies.
HTTP_PROXY='http://10.10.10.10:8000'
HTTPS_PROXY='http://10.10.10.10:1212'
Don't forget to use export when on Unix.
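For instance, with the variables exported (or, purely for illustration, set from within Python via os.environ), the proxies argument can be dropped entirely and Requests will pick the values up from the environment:
import os
import requests

# Proxy values are taken from the environment; no proxies argument is needed
os.environ['HTTP_PROXY'] = 'http://10.10.10.10:8000'
os.environ['HTTPS_PROXY'] = 'http://10.10.10.10:1212'

response = requests.get('http://mywebsite.com/example')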
Proxy Authentication
Some proxies (especially paid services) require you to provide proxy authentication credentials using Basic Authentication. As indicated by RFC 1738, you simply specify the relevant credentials in the URL before the hostname:
http://[USERNAME]:[PASSWORD]@[HOST]
For example, extending our previous Python dictionary to authenticate as "bob" with the password "alice" would give us this:
proxies = {
'http': 'http://bob:alice@proxy.example.com:8080',
'https': 'http://bob:alice@secureproxy.example.com:8090',
}
The same syntax also applies to environment variables:
HTTP_PROXY='http://bob:alice@10.10.10.10:8000'
HTTPS_PROXY='http://bob:alice@10.10.10.10:1212'
Reading Responses
While not specific to proxies, it's always good to know how to obtain the data you actually requested.
If you would like to read your data as plain text, that's rather straightforward with the text property of the response object:
response = requests.get(url)
text_resp = response.text
If you have a JSON response, you can also use the json() method to get a pre-parsed Python object with the data from the response:
response = requests.get(url)
json_resp = response.json()
If it's binary data, you can use content to get hold of the body as a byte stream:
response = requests.get(url)
response.content
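For instance, a quick sketch of saving a downloaded file to disk (the file name here is only a placeholder):
# Write the raw response body to a file; the file name is just an example
response = requests.get(url)
with open('downloaded_file.bin', 'wb') as f:
    f.write(response.content)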
Requests Session with Proxies
Individual web requests alone often don't cut it and you also need to keep track of sessions (especially when a site requires a login). That's where Requests' Session class comes to the rescue, allowing you to reuse network connections and retain session cookies.
Enabling a session object to use proxies is quite straightforward: simply set its proxies field, as the following example shows:
import requests
import time
HN_USER = 'YOUR_USERNAME'
HN_PASS = 'YOUR_PASSWORD'
session = requests.Session()
# Route every request in this session through our proxies
session.proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://secureproxy.example.com:8090',
}
response = session.get('https://news.ycombinator.com/submit')
print("Login prompt present: {}".format("logged in" in response.text))
session.post('https://news.ycombinator.com/login', {'goto': 'news', 'acct': HN_USER, 'pw': HN_PASS})
# wait a few seconds
time.sleep(5)
response = session.get('https://news.ycombinator.com/submit')
print("Login prompt present: {}".format("logged in" in response.text))
Here, we once more use our favorite site, Hacker News, for a login example and first request the submission page (which requires a valid user session). Because we are not logged in yet, it tells us to do so and shows a sign-in screen. That's exactly what we handle with the next call to the session's post() method, where we provide the credentials we configured earlier. Once post() returns, we should have a valid session cookie in our session object; when we now request the submission page again, we should not be prompted any more.
When running our code we should get the following output:
Login prompt present: True
Login prompt present: False
Rotating Proxies with Requests
Remember how we said some developers use more than one proxy? Well, now you can too!
Anytime you find yourself scraping a webpage repeatedly, it's good practice to use more than one proxy, because, should the proxy get blocked, you'll be back to square one and face the same fundamental issue as if you hadn't used a proxy to begin with. The scraping cancel culture is real! So, to avoid being canceled, it's best to regularly rotate your list of proxies.
To rotate proxies, you first need to have a pool of proxy IPs available. You can use free proxies found on the internet or commercial solutions. In most cases, if your service relies on scraped data, a free proxy will most likely not be enough.
How to Rotate IPs with Requests
In order to start rotating proxy IP addresses, you need a list of free proxies. If free proxies do fit your scraping needs, here you can find a list of free proxies. Today you'll be writing a script that chooses and rotates through proxies.
- First, import the Requests, BeautifulSoup, and choice libraries.
- Next, define a method get_proxy() that will be responsible for retrieving IP addresses for you to use. In this method, define your url as whatever proxy list resource you choose to use. After sending the request, convert the response into a Beautiful Soup object to make extraction easier. Use the html5lib parser library to parse the website's HTML, as you would for a browser. Create a proxy variable that uses choice to randomly pick an IP address from the list of proxies generated by soup. Within the map function, you can use a lambda function to convert the HTML elements into text for both the retrieved IP addresses and port numbers.
- Create a proxy_request method that takes in 3 arguments: the request_type, the url, and **kwargs. Inside this method, define your proxies dictionary as the proxy returned from the get_proxy method. Similar to before, you'll send the request with requests.request, passing in your arguments.
import requests
import random

# Pool of proxy addresses to rotate through
ip_addresses = [
    "http://mysuperproxy.com:5000",
    "http://mysuperproxy.com:5001",
    "http://mysuperproxy.com:5100",
    "http://mysuperproxy.com:5010",
    "http://mysuperproxy.com:5050",
    "http://mysuperproxy.com:8080",
    "http://mysuperproxy.com:8001",
    "http://mysuperproxy.com:8000",
    "http://mysuperproxy.com:8050"
]

def proxy_request(request_type, url, **kwargs):
    # Keep trying random proxies from the pool until one of them works
    while True:
        try:
            proxy = random.choice(ip_addresses)
            proxies = {"http": proxy, "https": proxy}
            response = requests.request(request_type, url, proxies=proxies, timeout=5, **kwargs)
            print(f"Proxy currently being used: {proxies['https']}")
            break
        except Exception:
            print("Error, looking for another proxy")
    return response

proxy_request('GET', 'http://example.com')
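Note that the snippet above keeps things simple by drawing from a hard-coded pool of proxy addresses, while the steps before it describe pulling fresh proxies from a free proxy-list page with Beautiful Soup. Below is a minimal sketch of that described variant; the proxy-list URL and the assumed table layout (IP address in the first column, port in the second) are illustrative only, so adapt the selectors to whatever list you actually use.
# Minimal sketch of the get_proxy() approach described in the steps above.
# Requires: pip install beautifulsoup4 html5lib
# The URL and table layout below are assumptions, not a guaranteed structure.
import requests
from bs4 import BeautifulSoup
from random import choice

def get_proxy():
    url = 'https://free-proxy-list.net/'  # assumed free proxy-list resource
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html5lib')
    rows = soup.find('table').find_all('tr')
    # Keep only rows that actually contain data cells (skips header/footer rows)
    data_rows = [row for row in rows if len(row.find_all('td')) >= 2]
    # Build "ip:port" strings from the first two cells of each row
    proxies = list(map(
        lambda row: row.find_all('td')[0].text + ':' + row.find_all('td')[1].text,
        data_rows
    ))
    return 'http://' + choice(proxies)

def proxy_request(request_type, url, **kwargs):
    # Pick a fresh proxy for every call and hand it to Requests
    proxy = get_proxy()
    proxies = {'http': proxy, 'https': proxy}
    return requests.request(request_type, url, proxies=proxies, timeout=5, **kwargs)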
You can now scrape and rotate all at once!
Use ScrapingBee's Proxy Mode
Believe it or not, there is another free alternative that makes scraping behind a proxy even easier! That alternative is ScrapingBee's Proxy Mode, a proxy interface to the ScrapingBee scraping API.
Create a free account on ScrapingBee. Once logged in, you can see your account information, including your API key. And, not to mention, 1,000 free API credits!
Run the following script, passing your API key as the proxy username and the API parameters as the proxy password. You can omit the proxy password if the default API parameters suit your needs:
# Install the Python Requests library:
# pip install requests
import requests
def send_request():
proxies = {
"http": "http://YOUR_SCRAPINGBEE_API_KEY:render_js=False&premium_proxy=True@proxy.scrapingbee.com:8886",
"https": "https://YOUR_SCRAPINGBEE_API_KEY:render_js=False&premium_proxy=True@proxy.scrapingbee.com:8887"
}
response = requests.get(
url="http://httpbin.org/headers?json",
proxies=proxies,
verify=False
)
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)
send_request()
Remember that if you want to use Proxy Mode, your code must be configured not to verify the SSL certificate. In this case, that means passing verify=False, since you are working with Python Requests.
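Keep in mind that with verify=False, urllib3 (which Requests builds on) will emit an InsecureRequestWarning on every call. If that gets noisy, you can silence it; a small sketch:
import urllib3

# Suppress the InsecureRequestWarning emitted when verify=False is used
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)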
That's all there is to sending successful HTTP requests! When you use ScrapingBee's Proxy Mode, you no longer need to deal with proxy rotation manually; we take care of everything for you.
Conclusion
While it might be tempting to start scraping right away with your fancy new proxies, there are still a few key things you should know. For starters, not all proxies are the same. There are actually different types, the three main ones being:
- transparent proxies
- anonymous proxies
- elite proxies
The difference between these proxy types is basically how well they shield the fact that you are using a proxy, or whether they are transparent about it. As such, a transparent proxy will be very upfront and forward your IP address to the site you are scraping. That's of course not ideal. Anonymous proxies are already a notch stealthier and do not divulge your original address, but they still send proxy headers to the site, which makes it obvious that a proxy is involved. Elite proxies provide the highest level of anonymity and completely hide the fact that you are using a proxy.
Proxies can help a lot with possible IP restrictions, but you still need to pay attention to request throttling, user agent management, and anti-bot measures. If you prefer to focus more on the content than on that bureaucracy, please also check out ScrapingBee's scraping platform, as we designed it with all these obstacles in mind and strive to provide as seamless a scraping experience as possible. The first 1,000 requests are always free.
Now that we have that all cleared up, it's time to start web scraping with a proxy in Python. So, get on out there and make all the requests you can dream up!