How to Easily Scrape Shopify Stores With AI

16 December 2024 | 8 min read

Scraping Shopify stores can be a challenging task because each store uses a unique theme and layout, making traditional scrapers with rigid selectors unreliable. That’s why we'll be showing you how to leverage an AI-powered web scraper that easily adapts to any page structure, effortlessly extracting Shopify e-commerce data no matter how the store is designed.

In this tutorial, we’ll be using our Python ScrapingBee client to scrape one of the most successful Shopify stores on the planet, gymshark.com, to obtain all the product page URLs and the corresponding product details from each product page. We’ve previously written blogs about scraping product listing pages using Scrapy or using schema.org metadata. We’ll also be using our AI query feature to extract structured data from each product page without parsing any HTML. Please note that we’re using Python only for demonstration; this technique and our API will work with any programming language.

A screenshot showing the Gymshark home page, the e-commerce website we will be scraping.

Getting all the Product Page URLs

As we saw in the screenshot of the home page above, Gymshark has three categories: Women, Men, and Accessories. Each of these categories has subcategories, and the list of products on those pages may also be paginated. Additionally, the categories change with each Shopify site. Obtaining all the product URLs starting from the homepage would be cumbersome.

Luckily, Shopify sites publish an XML sitemap with all the product URLs. In fact, this is what search engines use to discover, crawl, and index all the URLs on the website. The products sitemap for Shopify sites usually has the URL https://<domain-name>/sitemap_products_1.xml. There might also be multiple sitemaps, which can be checked using the sitemap index, usually located at https://<domain-name>/sitemap.xml. For additional reference, we’ve also covered the general process of finding all URLs for a given website in another blog.
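If you want to check for multiple product sitemaps programmatically, here is a minimal sketch using requests and xmltodict; it assumes the index follows the standard <sitemapindex>/<sitemap>/<loc> structure:

import requests
import xmltodict

SITEMAP_INDEX_URL = 'https://www.gymshark.com/sitemap.xml'

# FETCH AND PARSE THE SITEMAP INDEX
r = requests.get(SITEMAP_INDEX_URL)
index = xmltodict.parse(r.text)

# xmltodict RETURNS A DICT FOR A SINGLE <sitemap> ENTRY AND A LIST FOR SEVERAL
sitemaps = index['sitemapindex']['sitemap']
if isinstance(sitemaps, dict):
    sitemaps = [sitemaps]

# KEEP ONLY THE PRODUCT SITEMAPS
product_sitemaps = [s['loc'] for s in sitemaps if 'sitemap_products' in s['loc']]
print(product_sitemaps)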

For Gymshark, the product sitemap URL is https://www.gymshark.com/sitemap_products_1.xml. Let’s see what this file looks like:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
	<url>
		<loc>https://www.gymshark.com/products/gift-card</loc>
		<changefreq>daily</changefreq>
		<image:image>
			<image:loc>https://cdn.shopify.com/s/files/1/0156/6146/files/gift-card-2023.jpg?v=1697471488</image:loc>
			<image:title>eGift Card</image:title>
		</image:image>
	</url>
	<url>
		<loc>https://www.gymshark.com/products/gymshark-fit-cropped-legging-black-white</loc>
		<changefreq>daily</changefreq>
		<image:image>
			<image:loc>https://cdn.shopify.com/s/files/1/0156/6146/products/FIT_CROPPED_LEGGINGS_BLACK_WHITE.A-TEST.jpg?v=1580466124</image:loc>
			<image:title>Gymshark Fit Seamless Cropped Leggings - Black/White</image:title>
		</image:image>
	</url>
  <!-- more urls -->
</urlset>

The first step of our scraping workflow is to fetch all the URLs from this sitemap, which are enclosed in <loc> tags. Let’s see how we can accomplish this using the requests and xmltodict Python modules:

import requests
import xmltodict

SITEMAP_URL = 'https://www.gymshark.com/sitemap_products_1.xml'
SITE_NAME = 'gymshark'

# GET THE SITEMAP CONTENT
r = requests.get(SITEMAP_URL)

# PARSE XML INTO A DICT
sitemap_data = xmltodict.parse(r.text)

# ADD ALL THE URLS INTO A LIST
urls = []
for urlobj in sitemap_data['urlset']['url']:
    urls.append(urlobj['loc'])

# SAVE URLS TO DISK
with open(f'{SITE_NAME}_urls.txt', 'w') as f:
    f.write('\n'.join(urls))

In the above code, we used requests to fetch the text content of the XML sitemap and xmltodict to parse the XML tree into a dict and extract the URLs from that. We then wrote all the URLs to a text file, gymshark_urls.txt. At the time of writing, there were 2266 product URLs on this sitemap.

Shopify AI Web Scraper

In the previous step, we gathered all the product page URLs to be scraped. Next, let’s scrape each product URL using our API. We now have an AI query feature that lets us extract structured data using natural language queries, saving us the trouble of digging through the HTML to write parsing logic. All we have to do is specify the fields we need in our output and a prompt for each field. We also provide a Python package that wraps the API, which is what we’ll use here. Let’s see the code:

from scrapingbee import ScrapingBeeClient
import json
import os

# MAKE A DIRECTORY FOR OUTPUT
os.makedirs('product-data', exist_ok=True)

# INITIALIZE SCRAPINGBEE CLIENT
SB_API_KEY = '<YOUR_SCRAPINGBEE_API_KEY>'
sb_client = ScrapingBeeClient(api_key=SB_API_KEY)

# READ THE URLS LIST WE SAVED TO DISK EARLIER
urls = open('gymshark_urls.txt').read().split('\n')

# GET PRODUCT DATA FOR EACH URL
for url in urls:
    # SCRAPE USING SCRAPINGBEE AI QUERY
    response = sb_client.get(
        url,
        params={
            'ai_selector': "main:not([data-locator-id='collections-page'])",
            'ai_extract_rules': json.dumps({
                "name": {
                    "description": "Name of the Product",
                    "type": "string"
                },
                "description": {
                    "description": "Description of the product",
                    "type": "string"
                },
                "original_price": {
                    "description": "Original price of the Product, before offers are applied",
                    "type": "number"
                },
                "offer_price": {
                    "description": "Final price of the product after applying offers. If no offers are running, return the original price.",
                    "type": "number",
                },
                "in_stock": {
                    "description": "Is the item in stock?",
                    "type": "string",
                },
                "available_sizes": {
                    "description": "Available sizes for the item",
                    "type": "list",
                }
            })
        }
    )
    
    # SAVE THIS PRODUCT'S DATA TO DISK
    slug = url.split("/")[-1].split("?")[0]
    data = {
        'slug': slug,
        'url': url,

        # STATUS CODE WILL NOT BE 200 IF URL HAS BEEN REMOVED
        'status_code': response.headers['spb-initial-status-code'],

        # RESOLVED URL IS DIFFERENT FROM ORIGINAL URL IF THERE IS A REDIRECT
        'resolved_url': response.headers['spb-resolved-url'],

        # THE JSON OUTPUT
        'response': response.text
    }

    with open(f'product-data/{slug}.json', 'w') as f:
        f.write(json.dumps(data, indent=2))
    

In the above code, we iterate over each product URL and use the API to extract the following fields:

  • Name of the product (name)
  • Description of the product (description)
  • The original price of the product before applying discounts or offers (original_price)
  • The final price of the product after offers are applied (offer_price)
  • Whether the item is in stock (in_stock). This is not a boolean type because multiple sizes could have different stock statuses.
  • The available sizes (available_sizes). This is relevant here as this is a clothing store.

For each product, we get a response that looks like the example response shown below:

{
	"name": "GS Power Crop Tank", 
	"description": "Power your pursuit of perfection in a range that is truly made for lifting. Hit big numbers with 0 distraction designs, pre- and post- workout cover ups and supportive fits.\n\n• Power print to upper half of tank\n• Cut-outs to front and back for breathability\n• Side panelling for less distraction and better fit\n• Logo to front left hand side\n\nSIZE & FIT\n• Cropped length\n• Model is 5'9\" and wears size S\n\nMATERIALS & CARE\n• 78% Polyester, 22% Elastane\n\nSKU: B4A7N-EB6N", 
	"original_price": 36, 
	"offer_price": 18, 
	"in_stock": "XS,XXL,L", 
	"available_sizes": ["XS", "S", "M", "L", "XL", "XXL"]
}

Sometimes a product page may no longer exist: it may have been removed, or it may redirect to a collections page instead. In that case, the JSON contains mostly null values. We can detect this by checking the HTTP status code our backend received when it hit the URL, which is exposed in the spb-initial-status-code header of the API response. If the status code is 200, the product listing is live; when the listing has been removed, we typically get a 302 (Redirect) or a 404 (Page Not Found).
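In this tutorial we deliberately save every response and filter afterwards, but if you prefer to skip removed listings while scraping, you could check that header up front. Here is a minimal sketch with a hypothetical helper, is_live(), that you could call inside the scraping loop above (e.g. if not is_live(response): continue):

# HYPOTHETICAL HELPER: RETURNS TRUE ONLY IF THE UPSTREAM PAGE RETURNED 200
def is_live(response):
    initial_status = int(response.headers.get('spb-initial-status-code', 0))
    return initial_status == 200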

Finally, for each product, we save the status code, resolved URL, and response to disk.

Consolidating The Scraped Data

After the previous step, we now have 2266 JSON files in the product-data/ directory, one file for each product URL. Let’s go over each of these files, determine which ones resolved to live product URLs, and aggregate all the data we have:

import json
import os

DIR = 'product-data/'
products = []

for filename in os.listdir(DIR):
    with open(DIR + filename) as f:
        data = json.loads(f.read())
        
    # SKIP IF PRODUCT URL IS NOT LIVE    
    status_code = int(data['status_code'])
    if status_code != 200:
        continue
    if data['url'] != data['resolved_url']:
        continue
        
    # REMOVING ESCAPE SEQUENCES CAUSING JSON ERRORS
    data['response'] = data['response'].replace("\\.", '.')
    data['response'] = data['response'].replace("\\-", "-")
    data['response'] = data['response'].replace("\\*", "-")
    data['response'] = data['response'].replace("\\)", ")")
    data['response'] = data['response'].replace("\\+", "+")
    data['response'] = data['response'].replace(" \\&", " &")
    data['response'] = data['response'].replace("C\\&S", "C&S")
    
    # REMOVING MARKDOWN FORMATTING FROM LLM OUTPUT
    data['response'] = data['response'].replace("```json", '')
    data['response'] = data['response'].replace("```", '')

    product = json.loads(data['response'])
    product['url'] = data['url']
    products.append(product)

print(len(products))
# OUTPUT: 1838

with open('all-products.json', 'w') as f:
    f.write(json.dumps(products, indent=2))

In the above code, we iterate over each JSON file, checking whether the status code was 200 and whether the original URL matches the resolved URL. If both conditions are met, we assume the product page was live and data was obtained. Next, we parse the data from the API response and add it to the products list. Finally, we save the list to disk. At the time of writing, there were 1838 live products on this site.
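With everything consolidated into all-products.json, follow-up analysis is straightforward. As a quick illustration (a sketch assuming the fields we extracted earlier), here is how you could list the products currently on offer:

import json

# LOAD THE CONSOLIDATED PRODUCT DATA
with open('all-products.json') as f:
    products = json.loads(f.read())

# LIST PRODUCTS WHOSE OFFER PRICE IS LOWER THAN THEIR ORIGINAL PRICE
for p in products:
    if p.get('offer_price') and p.get('original_price') and p['offer_price'] < p['original_price']:
        print(f"{p['name']}: {p['original_price']} -> {p['offer_price']} ({p['url']})")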

Summary

In this tutorial, we looked at scraping a popular Shopify website, Gymshark, to obtain structured data about all the product listings on the site. We obtained 2266 product URLs, of which 1838 product pages were actually live and had data. Since all Shopify websites share the same backend and a similar site structure, this technique will work on virtually any of the millions of Shopify websites. One can further extend the utility of this scraper to monitor price drops and offers.

We also used our AI query feature, which returned structured data from simple text-based prompts. We did not have to look at the HTML code or build any parsing logic based on CSS IDs, class names, etc. The AI-based scraper is also more robust, as it won't break immediately if the site’s HTML structure changes.

We hope you enjoyed following along with this tutorial. For a more comprehensive guide about scraping eCommerce websites, you can also refer to our Guide To Scraping eCommerce Websites.

Karthik Devan

I work freelance on full-stack development of apps and websites, and I'm also trying to work on a SaaS product. When I'm not working, I like to travel, play board games, hike and climb rocks.