Introduction
In this article, you will learn how to scrape product information from Amazon, the biggest online shopping website in the world. You will learn about the techniques and Python libraries needed to extract that data effectively.
You might want to scrape Amazon to:
- get notified when a popular item gets restocked
- keep tabs on your competitors' prices if you are a seller
- monitor new products and trending products in particular niches
This is what a typical product page on Amazon looks like. This exact product can be accessed here.
Fetching Amazon.com product page
Let's start by creating a new folder called amazon_scraper where all the code will be stored. Afterward, create a scraper.py file in it:
$ mkdir amazon_scraper
$ touch amazon_scraper/scraper.py
If you're an absolute beginner in Python, you can read our full Python web scraping tutorial; it'll teach you everything you need to know to get started!
You will be making use of three libraries: re (regular expressions), Requests, and BeautifulSoup. Requests will help you download the product page HTML using Python, BeautifulSoup will help you parse the HTML and extract data from it, and regular expressions will help with data that is not straightforward to extract using BeautifulSoup.
re comes by default with Python, and you can install the other two libraries using pip. Run this command in the terminal to do so:
$ pip install beautifulsoup4 requests
A typical website is made up of HTML (HyperText Markup Language), and this is what a server responds with when you type a URL into your browser. You can download the same HTML in Python with the help of the requests library. Open up the scraper.py file and type in the following code to do so:
import requests
url = "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/"
html = requests.get(url)
print(html.text)
Running this code should print some HTML in the terminal. If you are lucky, you will see the real product page HTML, but in most cases you will be greeted with a captcha page. This is because Amazon has defenses in place to discourage web scraping: it can easily distinguish between a human-triggered request and an automated one. You can tell that you have received a captcha page if the HTML contains this string:
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
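You can automate this check with a small helper (is_captcha_page below is just an illustrative name). A minimal sketch, assuming the marker string above stays stable; Amazon may change it at any time:
import requests

def is_captcha_page(page_html):
    # Heuristic: Amazon's captcha interstitial mentions this support email.
    # The marker may change over time, so treat this as a best-effort check.
    return "api-services-support@amazon.com" in page_html

url = "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/"
html = requests.get(url)
if is_captcha_page(html.text):
    print("Blocked: Amazon served its captcha page")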
There are a few basic methods we can employ to defeat Amazon's bot detection. The very first is to use a valid User-Agent string. This string is sent as part of the headers with a web request and identifies where the request came from. You can mimic a real browser by copying the User-Agent string a browser sends and attaching the same string to the automated request made with requests.
You can look at the headers a browser sends with a request by opening up the developer tools and inspecting the network requests. This is what it looks like in Firefox for a request sent to amazon.com:
You can send additional headers with a request like so:
import requests

HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5'
}
url = "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/"
html = requests.get(url, headers=HEADERS)
print(html.text)
Try running this code and you should hopefully receive the real product page as a response. With the HTML downloading sorted out, let's discuss how you can extract the required data from it.
How to extract Amazon.com product information
It is important to finalize what data needs to be extracted. We will focus on extracting the following data about a product in this tutorial:
- Name
- Price
- Rating
- Images
- Description
The screenshots below highlight where this information is located on the product page:
You will mainly be using the BeautifulSoup library for data extraction. It is not the only library for the job, but it is one of the most widely used, thanks to its powerful parsing backend, easy-to-use API, and ability to handle most data-extraction use cases well.
If you followed the installation instructions at the beginning, you should be good to go. However, before you can use it, you need to explore the HTML sent back by Amazon, figure out its structure, and pinpoint exactly where the information you need is located in the HTML.
There are various workflows people use for this step; I will share the most common one. Go to the product page in the browser, right-click on the data you want to extract, and click on "Inspect". This will open the developer tools window. This tool is available in most popular web browsers and is indispensable for web scraping. It helps you figure out the closest tags (often uniquely identifiable by an id) that can be used to extract the required information.
Let's start with the product name and see how it works. Right-click on the name and click on "Inspect". I am using Firefox for this demonstration:
As the image shows, you can extract the product name by searching for the span with the id of productTitle. Go back to scraper.py and write some code to parse the HTML using BeautifulSoup and extract the product title:
import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/"
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5'
}
html = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(html.text, 'html.parser')
title = soup.find('span', {'id': "productTitle"}).text.strip()
print(title)
If you run this script, it should print the product title in the terminal:
'Apple AirPods Pro (Renewed)'
The magic happens in the second-to-last line:
title = soup.find('span', {'id':"productTitle"}).text.strip()
Here, soup.find searches for the first span in the HTML that has an id of productTitle. .text extracts all the visible text from the span (including any nested child tags), and .strip() removes any leading or trailing whitespace. Without .strip(), the output would have been:
' Apple AirPods Pro (Renewed) '
The rest of the product information can be extracted in the same way. Let's focus on the price next. It is visible at two different locations on the page, but we will focus on the one displayed in the sidebar. Repeat the same inspect workflow and try to pinpoint the closest tag you can use:
Turns out, you can make use of the price_inside_buybox id to extract the price. Go ahead and update the scraper.py file by appending the following code to it:
price = soup.find('span', {'id':"price_inside_buybox"}).text.strip()
print(price)
Running the script again should print the product price in the terminal:
$139.97
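The extracted price is a plain string like $139.97. If you want to compare prices or track changes over time, it helps to convert it to a number first. A minimal sketch, assuming a simple dollar amount (other currencies and formats would need extra handling):
# Convert the scraped price string, e.g. "$139.97", into a float.
# Assumes a leading "$" and optional thousands separators.
price_value = float(price.lstrip('$').replace(',', ''))
print(price_value)  # 139.97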
Ratings can also be extracted similarly. You can target an i tag with the class of a-icon a-icon-star a-star-4-5, as is evident in the picture below. Note that the displayed star value is encoded in the class name itself (a-star-4-5 corresponds to a 4.5-star badge), so this exact selector is specific to products showing that rating:
Add the following code to scraper.py to extract the textual rating:
rating = soup.find("i", attrs={'class': 'a-icon a-icon-star a-star-4-5'}).text.strip()
Running the updated code should output the product rating:
'4.4 out of 5 stars'
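If you need the numeric value instead of the full sentence, you can pull it out with a regular expression. A small sketch, assuming the rating always follows the "X out of 5 stars" format:
import re

# Extract the leading number from a string like "4.4 out of 5 stars".
match = re.search(r'([\d.]+) out of 5', rating)
rating_value = float(match.group(1)) if match else None
print(rating_value)  # 4.4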
Similarly, the product description can be extracted by filtering for the div with the id of productDescription:
Add the following code to the Python file to scrape the product description:
description = soup.find('div', {'id': "productDescription"}).text.strip()
The product images are a little tricky to extract. Go ahead and inspect the images using developer tools:
Based on the image above, it seems like you can filter for the divs that have a class of imgTagWrapper, but if you try this filter in the terminal, you will soon realize that it doesn't return all the product images:
>>> len(soup.findAll("div", attrs={'class': 'imgTagWrapper'}))
3
It should return a list of length 6 (for a total of 6 product images), but instead it returns a list of length 3. This is because Amazon uses JavaScript to populate the rest of the image tags on the page, and our requests + BeautifulSoup combo does not execute JavaScript. This is where regular expressions can be really handy. If you copy the URL of any high-definition product image and search for it in the HTML returned by Amazon, you will see that it is stored inside a script tag as part of a JSON object. The high-definition URLs are all assigned to hiRes keys, so you can extract all the high-resolution images by extracting the values assigned to those keys.
There are some posts on StackOverflow that discourage the use of regex to parse HTML, but in my experience, it can be really handy for certain situations such as this one. You can use the following regex to extract the images:
import re
# ...
images = re.findall('"hiRes":"(.+?)"', html.text)
This should return a list comprising 6 URLs.
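Once you have the URLs, downloading the actual images is straightforward with requests. A minimal sketch that saves them to the current directory (the file names are made up from the loop index, and the .jpg extension assumes Amazon's usual image format):
# Download each high-resolution image and write it to disk.
for index, image_url in enumerate(images):
    response = requests.get(image_url, headers=HEADERS)
    with open(f"product_image_{index}.jpg", "wb") as f:
        f.write(response.content)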
At this point, your scraper.py file should resemble this:
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/"
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5'
}
html = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(html.text, 'html.parser')
title = soup.find('span', {'id': "productTitle"}).text.strip()
price = soup.find('span', {'id': "price_inside_buybox"}).text.strip()
rating = soup.find("i", attrs={'class': 'a-icon a-icon-star a-star-4-5'}).text.strip()
images = re.findall('"hiRes":"(.+?)"', html.text)
description = soup.find('div', {'id': "productDescription"}).text.strip()
print(title, "\n", price, "\n", images, "\n", description)
You can make the required changes and run this script to extract data from whichever product page you want.
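One caveat worth knowing: soup.find returns None when a tag is missing (for example, if Amazon serves a slightly different page layout), and calling .text on None raises an AttributeError. A small defensive sketch; the safe_find_text helper is not part of the script above, just an illustration:
def safe_find_text(soup, tag, attrs):
    # Return the stripped text of the first matching tag,
    # or None if the tag is absent from the page.
    element = soup.find(tag, attrs)
    return element.text.strip() if element else None

title = safe_find_text(soup, 'span', {'id': 'productTitle'})
price = safe_find_text(soup, 'span', {'id': 'price_inside_buybox'})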
If you wish to learn more about BeautifulSoup, check out our BeautifulSoup tutorial.
Cycling the User-Agent
You can also rotate between multiple User-Agent strings to make it even harder for Amazon to detect that it is receiving bot traffic from your IP. This works by sending a different User-Agent string with each successive request to Amazon. It is simple to implement and can be done like this:
import random
import re
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/",
    "https://www.amazon.com/Practical-Malware-Analysis-Hands-Dissecting/dp/1593272901/",
    "https://www.amazon.com/Hacking-APIs-Application-Programming-Interfaces/dp/1718502443/"
]
UA_STRINGS = [
    ("Mozilla/5.0 (Linux; Android 12; SM-S906N Build/QP1A.190711.020; wv) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.119 Mobile Safari/537.36"),
    ("Mozilla/5.0 (Linux; Android 10; SM-G996U Build/QP1A.190711.020; wv) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Version/4.0 Mobile Safari/537.36"),
    ("Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 "
     "(KHTML, like Gecko) CriOS/69.0.3497.105 Mobile/15E148 Safari/605.1")
]

for url in urls:
    HEADERS = {
        'User-Agent': random.choice(UA_STRINGS),
        'Accept-Language': 'en-US, en;q=0.5'
    }
    html = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(html.text, 'html.parser')
    title = soup.find('span', {'id': "productTitle"}).text.strip()
    price = soup.find('span', {'id': "price_inside_buybox"}).text.strip()
    images = re.findall('"hiRes":"(.+?)"', html.text)
    description = soup.find('div', {'id': "productDescription"}).text.strip()
    print(title, "\n", price, "\n", images, "\n", description)
If you run the code above, random.choice will pick a random User-Agent string from the UA_STRINGS list on each loop iteration and send it with the request.
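Another simple measure that pairs well with User-Agent rotation is pausing between requests so the traffic pattern looks less mechanical. A small sketch; the 2-8 second window is an arbitrary choice, not a documented threshold:
import random
import time

# Sleep for a random interval between requests (inside the loop above).
time.sleep(random.uniform(2, 8))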
Avoid getting blocked with ScrapingBee
There are a few caveats I didn't discuss in detail. The biggest one is that if you run your scraper too often, Amazon will block it. They have services in place to figure out when a request is made by a script, and simply setting an appropriate User-Agent string is not going to help you bypass that. You will have to use rotating proxies and automated captcha-solving services. This can be too much to handle on your own, and luckily there is a service to help with that: ScrapingBee.
You can use ScrapingBee to extract information from whichever product page you want and ScrapingBee will make sure that it uses rotating proxies and solves captchas all on its own. This will let you focus on the business logic (data extraction) and let ScrapingBee deal with all the grunt work.
Let's look at a quick example of how you can use ScrapingBee. First, go to the terminal and install the ScrapingBee Python SDK:
$ pip install scrapingbee
Next, go to the ScrapingBee website and sign up for an account:
After successful signup, you will be greeted with the default dashboard. Copy your API key from this page and start writing some code in a new Python file:
I will show you the code and then explain what is happening:
import re
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
url = "https://www.amazon.com/Apple-MWP22AM-A-cr-AirPods-Renewed/dp/B0828BJGD2/"
response = client.get(
    url,
    params={
        'extract_rules': {
            "name": {
                "selector": "span[id='productTitle']",
                "output": "text",
            },
            "price": {
                "selector": "span[class='a-price a-text-price a-size-medium apexPriceToPay'] > span",
                "output": "text",
            },
            "rating": {
                "selector": "i[class='a-icon a-icon-star a-star-4-5'] > span",
                "output": "text",
            },
            "description": {
                "selector": "div[id='productDescription']",
                "output": "text",
            },
            "full_html": {
                "selector": "html",
                "output": "html",
            },
        }
    }
)
if response.ok:
    scraped_data = response.json()
    images = re.findall('"hiRes":"(.+?)"', scraped_data['full_html'])
    print(scraped_data['name'])
    print(scraped_data['price'])
    print(scraped_data['rating'])
    print(scraped_data['description'])
    print(images)
Don't forget to replace YOUR_API_KEY with your API key from ScrapingBee. The code is similar to what you wrote using requests and BeautifulSoup. This code, however, makes use of ScrapingBee's powerful extract rules, which let you state the tags and selectors you want to extract data from; ScrapingBee then returns the scraped data.
In the code above, you ask ScrapingBee to directly give you the information that sits in typical HTML tags, and you additionally ask it to return the complete HTML as well. The selectors are a bit different here because, unlike the requests + BeautifulSoup combo, ScrapingBee also executes the JavaScript on the page before extracting the data. This can be extremely useful when you need to extract data from a page that makes heavy use of JavaScript, such as SPAs (Single-Page Applications).
ScrapingBee will make sure that you are charged only for successful responses, which makes it a really good deal.
Conclusion
I just scratched the surface of what is possible with Python and showed you a single approach to leveraging all the different Python packages to extract data from Amazon. If your project grows in size and you want to extract a lot more data and automate even more, you should look into Scrapy. It is a full-fledged Python web scraping framework that features pause/resume, data filtering, proxy rotation, multiple output formats, remote operation, and a whole load of other features.
You can wire up ScrapingBee with Scrapy to utilize the power of both and make sure your scraping is not affected by websites that continuously throw captchas.
I hope you learned something new today. If you have any questions please do not hesitate to reach out. We would love to take care of all of your web scraping needs and assist you in whatever way possible!
Yasoob is a renowned author, blogger, and tech speaker. He has authored the Intermediate Python and Practical Python Projects books and writes regularly. He is currently working on Azure at Microsoft.