How to find all URLs on a domain's website
Today, we're diving into a topic that might seem daunting at first: how to find all URLs on a domain. If you've ever felt overwhelmed by the thought of tracking down every last one of those sneaky links, you're not alone. But fear not! We're here to guide you through the process, making it as simple and stress-free as possible. From leveraging the power of Google search to exploring tools like ScreamingFrog, and even crafting our own Python script, we've got you covered.
So, let's embark on this journey together. By the end of this article, you'll be equipped with the knowledge and tools to tackle this task with confidence.
The source code for this article can be found on GitHub.
Why get all URLs from a website?
Ever wondered why finding all the URLs on a website might be a good idea? Here are some reasons why this task can be super handy:
- Scrape all the website's content: before analyzing a website's content, knowing what pages exist is crucial. That's where the URL hunt kicks in.
- Fixing broken links: broken links can hinder user experience and SEO. Identifying and fixing them is essential.
- Making sure Google notices you: slow-loading or non-mobile-friendly pages can affect your Google ranking. A thorough check can reveal these issues for SEO improvement.
- Catching hidden pages: sometimes, Google might overlook pages due to duplicate content or other issues. Regular checks can help catch these.
- Finding pages Google shouldn't see: pages under construction or meant for admin eyes might accidentally appear in search results. Monitoring your site helps prevent this.
- Refreshing outdated content: Google values fresh content. Knowing all your pages helps you update and strategize effectively.
- Improving site navigation: identifying orphan pages can enhance site navigation and credibility.
- Spying on competitors: analyzing competitors' websites can offer insights for your own site.
- Prepping for a website redesign: understanding all your pages facilitates a smoother redesign process.
There are various ways to uncover every page on a website, each with its own set of advantages and challenges. Let's explore how to tackle this task in a friendly and efficient way.
How to find all webpages on a website?
In this section, we'll explore some effective ways to find all URLs on a domain. Detailed instructions will follow, helping you master each method.
1. Google search
One of the simplest methods is to use Google search. Just type in a specific search query to find all the website's pages. However, remember this approach might not catch everything. Some pages could be missing, and sometimes outdated pages might show up in your results.
2. Sitemap and robots.txt
For those who don't mind diving into a bit more technical detail, checking the website's sitemap and robots.txt file can be quite revealing. This method can be more accurate, but it's also trickier. If the website isn't set up correctly, finding and using this information might range from challenging to downright impossible.
3. SEO spider tools
If you're looking for a straightforward solution without much hassle, consider using an SEO spider tool. There are several out there, each with its own set of features. While many of these tools are user-friendly and offer detailed insights, the catch is that they often come at a cost, especially for extensive use.
4. Custom scripting
For those with specific needs and a bit of coding knowledge, creating a custom script might be the best route. This is the most involved method but allows for customization and potentially the most thorough results. If you have the time and skills, a DIY script can be a rewarding project that perfectly fits your requirements.
Each of these methods offers a different balance of ease, accuracy, and depth of information. Whether you prefer a quick overview or a deep dive, there's an approach here to suit your needs.
How to use Google Search to estimate your website's page count
Discovering how many pages your website has can be straightforward with Google. Here's a basic method to get a ballpark figure of your site's content.
First, head to google.com. In the search bar, type site:DOMAIN, replacing DOMAIN with your site's domain name, but leave off the https:// or http:// part. For example, site:www.scrapingbee.com.
You'll see a list of pages that Google has found on your website. Pretty easy, right?
Note, however, that for this example Google reports "about 413 results": in many cases this number is just an estimate. This method isn't always spot-on because Google only shows pages it has indexed. If a page is too new, or if it's seen as a duplicate or of low quality, Google might not list it. And sometimes, Google lists pages that aren't even on your site anymore.
So, while using Google is a simple way to get a general idea of your website's size, it's not the most accurate. It's best used for quick, rough estimates.
Using ScrapingBee to scrape Google search results
Using Google search is easy, but when it comes to sorting through the results for something specific, things can get a bit tricky. Copying and pasting each link by hand? That can eat up a lot of time. That's where ScrapingBee comes in to save the day! It's a tool designed to make your life easier with a Google request builder that's a breeze to use. It neatly organizes search results in a straightforward, easy-to-handle format. And don't worry about hitting those pesky rate limits – ScrapingBee has got your back, ensuring everything runs smoothly.
If you haven't already done so, register for free and then proceed to the Google API Request Builder:
Pop your search query into the Search field, and press Try it:
You'll receive your results in a clean JSON format. Here's the example with the relevant fields. Note the url keys that contain the actual page links:
"organic_results": [
    {
        "url": "https://bodrovis.tech/",
        "displayed_url": "https://bodrovis.tech"
    },
    {
        "url": "https://bodrovis.tech/en/teaching",
        "displayed_url": "https://bodrovis.tech › teaching"
    },
    {
        "url": "https://bodrovis.tech/ru/blog",
        "displayed_url": "https://bodrovis.tech › blog"
    }
]
Now you can simply download this JSON document and use it for your needs. Please refer to my other tutorial to learn more about scraping Google search results.
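Once the JSON is saved, pulling out the links takes only a few lines. The sketch below assumes the response has the shape shown above (an organic_results list whose items carry a url key); the SAMPLE_RESPONSE string and the extract_result_urls name are made up for illustration.

```python
import json

# Hypothetical sample that mirrors the structure of the response shown above.
SAMPLE_RESPONSE = """
{
  "organic_results": [
    {"url": "https://bodrovis.tech/", "displayed_url": "https://bodrovis.tech"},
    {"url": "https://bodrovis.tech/en/teaching", "displayed_url": "https://bodrovis.tech › teaching"}
  ]
}
"""


def extract_result_urls(raw_json):
    """Return the "url" value of every entry under "organic_results"."""
    data = json.loads(raw_json)
    # Entries without a "url" key are skipped instead of raising KeyError.
    return [item["url"] for item in data.get("organic_results", []) if "url" in item]


if __name__ == "__main__":
    for link in extract_result_urls(SAMPLE_RESPONSE):
        print(link)
```

In a real run you would read the downloaded file instead of the inline sample and pass its contents to the same function.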
Finding all URLs with sitemaps and robots.txt
This approach is a bit more technical but can give you more detailed results. We'll dive into how sitemaps and robots.txt files can help you uncover all the URLs of a website.
Sitemaps
Webmasters use XML files known as "sitemaps" to help search engines better understand and index their websites. Think of sitemaps as roadmaps that provide valuable insights into the website's layout and content.
Here's what a standard sitemap looks like:
<urlset xsi:schemaLocation="...">
<url>
<loc>https://example.com</loc>
<lastmod>2024-02-18T13:13:49+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>https://example.com/some/path</loc>
<lastmod>2024-02-18T13:13:49+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
</urlset>
This XML structure shows two URLs under the url tag, with each loc tag revealing the URL's location. Additional details like the last modification date and frequency of change are mainly for search engines.
For smaller sitemaps, you can manually copy the URLs from each loc tag. To simplify the process for larger sitemaps, consider using an online tool to convert XML to a more manageable format like CSV.
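If you'd rather not copy the links by hand, the standard library can read the loc values for you. This is a minimal sketch, assuming the sitemap declares the usual sitemaps.org namespace on its root element; the sample document and the extract_locs name are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be declared by the sitemap).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap used for the demonstration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com</loc></url>
  <url><loc>https://example.com/some/path</loc></url>
</urlset>"""


def extract_locs(xml_text):
    """Return the text of every <loc> element found in the sitemap."""
    root = ET.fromstring(xml_text)
    # strip() removes the whitespace that appears when <loc> spans several lines.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


if __name__ == "__main__":
    print("\n".join(extract_locs(SAMPLE_SITEMAP)))
```

The resulting list can then be written out with the csv module or pasted into a spreadsheet.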
Be aware that large websites may use multiple sitemaps. Often, there's a main sitemap that points to other, more specific sitemaps:
<sitemapindex xsi:schemaLocation="...">
<sitemap>
<loc>
https://examples.com/sitemaps/english.xml
</loc>
<lastmod>2024-02-18T13:13:49+00:00</lastmod>
</sitemap>
<sitemap>
<loc>
https://examples.com/sitemaps/french.xml
</loc>
<lastmod>2024-02-18T13:13:50+00:00</lastmod>
</sitemap>
</sitemapindex>
If you study this file for a few moments, it becomes obvious that the website has two sitemaps: one for English and one for French. You can then simply visit each location to view the corresponding contents.
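When you process sitemaps in code, it helps to know which of the two kinds of file you are looking at before reading it. One possible check, sketched here with the standard library, is to inspect the root element; the sitemap_kind helper and the sample string are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap index used for the demonstration.
INDEX_SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://examples.com/sitemaps/english.xml</loc></sitemap>
</sitemapindex>"""


def sitemap_kind(xml_text):
    """Return "index", "urlset", or "unknown" depending on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as "{namespace}name"; keep only the name.
    tag = root.tag.rsplit("}", 1)[-1]
    return {"sitemapindex": "index", "urlset": "urlset"}.get(tag, "unknown")


if __name__ == "__main__":
    print(sitemap_kind(INDEX_SAMPLE))
```

An "index" result means the loc values point to further sitemaps, while "urlset" means they are the page URLs themselves.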
How to find sitemaps?
Not sure where to find a sitemap? Try checking /sitemap.xml on the website, like https://example.com/sitemap.xml. The robots.txt file, discussed next, often includes a sitemap link.
Other common sitemap locations include:
/sitemap.xml.gz
/sitemap_index.xml
/sitemap_index.xml.gz
/sitemap.php
/sitemapindex.xml
/sitemap.gz
Another effective method involves leveraging our old friend Google. Simply navigate to the Google search bar and enter site:DOMAIN filetype:xml, remembering to replace DOMAIN with your actual website's domain. This neat trick is designed to unearth a wealth of indexed XML files associated with your site, notably including those all-important sitemaps.
Keep in mind, if your website is particularly rich in XML files, you might find yourself doing a bit of extra legwork to filter through everything. But not to worry—consider it a small, intriguing part of the journey!
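The common locations listed earlier can also be turned into a list of candidate URLs for any domain. The snippet below is only a sketch of that idea: the candidate_sitemap_urls helper is a made-up name, and it builds the URLs without checking whether they actually exist.

```python
from urllib.parse import urljoin

# The well-known sitemap paths mentioned in this section.
COMMON_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap.xml.gz",
    "/sitemap_index.xml",
    "/sitemap_index.xml.gz",
    "/sitemap.php",
    "/sitemapindex.xml",
    "/sitemap.gz",
]


def candidate_sitemap_urls(base_url):
    """Combine the site's base URL with each common sitemap path."""
    return [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]


if __name__ == "__main__":
    for candidate in candidate_sitemap_urls("https://example.com"):
        print(candidate)
```

Each candidate could then be requested in turn, keeping the ones that answer with a 200 status code.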
Using robots.txt
robots.txt is yet another file created for search engines. Typically, it says where the sitemap is located and which paths crawlers may or may not visit. According to current standards, this file should be available under the /robots.txt path.
Here's an example of the robots.txt file:
User-agent: *
Sitemap: https://example.com/sitemap.xml
Disallow: /login
Disallow: /internal
Disallow: /dashboard
Disallow: /admin
Host: example.com.
In the example above, we can see where the sitemap is located. There are also a few routes that have been disallowed for crawlers, which suggests that they are indeed present on the site.
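Reading these two kinds of lines programmatically is straightforward. The following is a deliberately simplified sketch: it ignores User-agent groups and treats every Sitemap and Disallow line the same way, and the parse_robots name and sample text are assumptions made for the example.

```python
# Hypothetical robots.txt content modelled on the example above.
SAMPLE_ROBOTS = """\
User-agent: *
Sitemap: https://example.com/sitemap.xml
Disallow: /login
Disallow: /admin
"""


def parse_robots(text):
    """Return (sitemap_urls, disallowed_paths) found in a robots.txt body."""
    sitemaps, disallowed = [], []
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so "https://..." values stay intact.
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap" and value:
            sitemaps.append(value)
        elif field == "disallow" and value:
            disallowed.append(value)
    return sitemaps, disallowed


if __name__ == "__main__":
    found_sitemaps, hidden_paths = parse_robots(SAMPLE_ROBOTS)
    print(found_sitemaps, hidden_paths)
```

For anything beyond a quick look, Python's built-in urllib.robotparser module is worth considering; since Python 3.8 its RobotFileParser also exposes the listed sitemaps through site_maps().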
Use ScreamingFrog to crawl the website
Now let's see how to use an SEO spider to find all the website pages. We'll take advantage of the tool called ScreamingFrog. Ready to give it a whirl? Grab it from their official website to get started. They offer a free version that's perfect for smaller sites, allowing you to uncover up to 500 pages.
Once downloaded, launch the tool (in crawl mode), pop your website's URL into the top text field, and hit Start:
Give it a little time—especially for those more intricate websites—and soon you'll have a complete list of URLs right before your eyes:
It lists everything by default, including images, JS, and CSS files. If you're only after the main HTML pages, just tweak the Filter option to narrow it down:
Also use tabs on top to choose the data to display. For example, by using this tool you can easily find broken links on your website.
Getting started with this tool is a breeze. But, be aware that sometimes a site might block your efforts for various reasons. If that happens, there are a few tricks you can try, like changing the user agent or reducing the number of threads in use. Head over to the Configuration menu to adjust these settings:
You'll mostly be interested in tweaking the Speed, User-Agent, and HTTP Header settings. Although, keep in mind, some of these advanced features are part of the paid version. Setting your user agent to "Googlebot (Smart Phone)" can often help, but the right speed setting might take some experimentation to get just right, as different sites have their own ways of detecting and blocking scrapers.
Also in the "Crawl Config" it's worth unticking "External links" as we only want the links on the target website.
Create a script to find all URLs on a domain
In this section, I'll guide you through crafting a custom Python 3 script to get all URLs from a website.
First off, let's kickstart a new project using Poetry:
poetry new link_finder
Now, beef up your project's dependencies by adding the following lines to the pyproject.toml file:
[tool.poetry.dependencies]
beautifulsoup4 = "^4.12"
requests = "^2.31"
Once done, hit the command:
poetry install
If you don't use Poetry, simply install these libraries with pip:
pip install beautifulsoup4 requests
Next, let's open the link_finder/link_finder.py file and import the necessary dependencies:
import requests
from bs4 import BeautifulSoup as Soup
import os
import csv
Send the request:
# ...
def parse_sitemap(url):
    """Parse the sitemap at the given URL and append the data to a CSV file."""
    # Return False if the URL is not provided.
    if not url:
        return False

    # Attempt to get the content from the URL.
    response = requests.get(url)
    # Return False if the response status code is not 200 (OK).
    if response.status_code != 200:
        return False
Now let's create a BeautifulSoup parser:
def parse_sitemap(url):
    # ...
    # Parse the XML content of the response.
    soup = Soup(response.content, "xml")
If after running this script you get an error saying that the parser cannot be found, make sure to install the lxml library:
lxml = "^5.1"
Remember, a sitemap file might point to more sitemaps we need to handle. Let's tackle that with a recursive call:
def parse_sitemap(url):
    # ...
    # Recursively parse nested sitemaps.
    for sitemap in soup.find_all("sitemap"):
        loc = sitemap.find("loc").text
        parse_sitemap(loc)
Next, we'll find all the page URLs and prepare the project root (as we'll save the URLs into a CSV file later):
def parse_sitemap(url):
    # ...
    # Define the root directory for saving the CSV file.
    root = os.path.dirname(os.path.abspath(__file__))
    # Find all URL entries in the sitemap.
    urls = soup.find_all("url")
Now, simply cycle through the URLs and stash the data into a CSV file:
def parse_sitemap(url):
    # ...
    rows = []
    for url in urls:
        row = []
        for attr in ATTRS:
            found_attr = url.find(attr)
            # Use "n/a" if the attribute is not found, otherwise get its text.
            row.append(found_attr.text if found_attr else "n/a")
        rows.append(row)

    # Append the data to the CSV file.
    with open(os.path.join(root, "data.csv"), "a+", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(rows)
Here I'm using the ATTRS constant, so let's create it:
# Constants for the attributes to be extracted from the sitemap.
ATTRS = ["loc", "lastmod", "priority"]

def parse_sitemap(url):
    # ...
This constant simply says which attributes should be extracted to the CSV file.
Now, to put our function to the test:
parse_sitemap("https://bodrovis.tech/sitemap.xml")
And voila! Here's the final cut of our script:
import requests
from bs4 import BeautifulSoup as Soup
import os
import csv

# Constants for the attributes to be extracted from the sitemap.
ATTRS = ["loc", "lastmod", "priority"]


def parse_sitemap(url):
    """Parse the sitemap at the given URL and append the data to a CSV file."""
    # Return False if the URL is not provided.
    if not url:
        return False

    # Attempt to get the content from the URL.
    response = requests.get(url)
    # Return False if the response status code is not 200 (OK).
    if response.status_code != 200:
        return False

    # Parse the XML content of the response.
    soup = Soup(response.content, "xml")

    # Recursively parse nested sitemaps.
    for sitemap in soup.find_all("sitemap"):
        loc = sitemap.find("loc").text
        parse_sitemap(loc)

    # Define the root directory for saving the CSV file.
    root = os.path.dirname(os.path.abspath(__file__))

    # Find all URL entries in the sitemap.
    urls = soup.find_all("url")

    rows = []
    for url in urls:
        row = []
        for attr in ATTRS:
            found_attr = url.find(attr)
            # Use "n/a" if the attribute is not found, otherwise get its text.
            row.append(found_attr.text if found_attr else "n/a")
        rows.append(row)

    # Append the data to the CSV file.
    with open(os.path.join(root, "data.csv"), "a+", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(rows)


# Example usage
parse_sitemap("https://bodrovis.tech/sitemap.xml")
To run it, simply call:
poetry run python link_finder/link_finder.py
Please find the final project version on GitHub.
What if there's no sitemap on the website?
Sometimes, you might stumble upon a website that doesn't have a sitemap (although this is quite uncommon nowadays). But don't worry, there's still a workaround! Here's what you can do:
Instead of relying on a sitemap, we can scan the main page of the website. By doing this, we can identify all the internal links present there. We then add these links to a queue, visit each one, and repeat the process. Once we've explored all the links, we've effectively mapped out the entire website.
While this method isn't foolproof (some pages might not be linked to from anywhere), it's still a pretty robust solution. If you want to dive deeper into this technique, we have another tutorial on our website that covers it in more detail.
The code snippet below provides a simple solution using basic libraries:
from urllib.parse import urldefrag, urljoin, urlparse  # URL resolution and parsing
import requests  # Sending HTTP requests
from bs4 import BeautifulSoup  # Parsing HTML


class Crawler:
    def __init__(self, urls=None):
        """
        Constructor for the Crawler class.
        """
        self.visited_urls = []  # URLs that have already been crawled
        # Copy the input so a shared default or the caller's list is never mutated
        self.urls_to_visit = list(urls or [])

    def download_url(self, url):
        """
        Download the content of a webpage given its URL.
        """
        # The timeout keeps one unresponsive page from stalling the whole crawl
        return requests.get(url, timeout=10).text

    def get_linked_urls(self, url, html):
        """
        Yield the internal URLs linked from the HTML content of a webpage.
        """
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):  # Every <a> tag in the HTML
            path = link.get('href')
            if not path:  # Skip <a> tags without an href
                continue
            # Resolve relative links against the current page and drop "#fragment" parts
            absolute = urldefrag(urljoin(url, path)).url
            # Keep only links that stay on the same host as the current page
            if urlparse(absolute).netloc == urlparse(url).netloc:
                yield absolute

    def add_url_to_visit(self, url):
        """
        Add a URL to the list of URLs to visit if it has not been seen before.
        """
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        """
        Crawl a webpage by downloading its content and queueing its linked URLs.
        """
        html = self.download_url(url)
        for linked_url in self.get_linked_urls(url, html):
            self.add_url_to_visit(linked_url)

    def run(self):
        """
        Start the crawling process.
        """
        while self.urls_to_visit:  # Loop while there are URLs left to visit
            url = self.urls_to_visit.pop(0)  # Take the next URL from the front of the list
            try:
                self.crawl(url)
            except Exception:
                print(f'Failed to crawl: {url}')  # Report the failure and carry on
            finally:
                self.visited_urls.append(url)  # Mark the URL as visited either way


if __name__ == '__main__':
    # Start crawling from IMDb's homepage
    Crawler(urls=['https://www.imdb.com/']).run()
Here's how it works:
- We keep track of the URLs we need to visit in an array called urls_to_visit.
- We identify all the hrefs on the page.
- If a URL hasn't been visited yet, we add it to the array.
- We continue running the script until there are no more URLs left to visit.
This code serves as a great starting point. However, if you're after a more robust solution, please refer to our tutorial on Scrapy.
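On larger sites, the list-based bookkeeping gets slow, because both pop(0) and the "not in" checks scan the whole list. One possible variation, sketched below, keeps the same visit-queue-repeat idea but uses a deque for the queue and a set for the seen URLs. The crawl_all function and its get_links parameter are hypothetical; get_links stands in for "download the page and return its links", so the traversal can be tried on an in-memory site without sending any requests.

```python
from collections import deque


def crawl_all(start_url, get_links):
    """Breadth-first traversal; get_links(url) returns the URLs linked from url."""
    seen = {start_url}          # Every URL that was ever queued
    queue = deque([start_url])  # URLs waiting to be visited
    order = []                  # URLs in the order they were visited
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    # Hypothetical in-memory site used instead of real HTTP requests.
    fake_site = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": ["/c"], "/c": []}
    print(crawl_all("/", lambda url: fake_site.get(url, [])))
```

In a real crawler, get_links would wrap the download and link-extraction steps from the class above.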
Using ScrapingBee to send requests
Actually, that's where ScrapingBee can also assist you because we provide a dedicated Python client to send HTTP requests. This client enables you to use proxies, make screenshots of the HTML pages, adjust cookies, headers, and more.
To get started, install the client by running pip install scrapingbee or add it to your pyproject.toml:
scrapingbee = "^2.0"
Then import it in your script and create a client:
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(
    api_key="YOUR_API_KEY"
)
Now simply send the request with the client and adjust the params as necessary:
response = client.get(
    url,
    params={
        # Enable proxy:
        # 'premium_proxy': 'true',
        # 'country_code': 'gb',
        "block_resources": "false",
        "wait": "1500",  # Waiting for the content to load (1.5 seconds)
        # Make screenshot if needed:
        # "screenshot": True,
        # "screenshot_full_page": True,
    }
)
Now you can pass the response.content to BeautifulSoup as before and use it to find all the loc tags inside.
What to do once you've found all the URLs
So, you've gathered a whole bunch of URLs—what's next? Well, the path you take from here largely hinges on what you're aiming to achieve. If you're thinking about scraping data from these pages, you're in luck because there are a plethora of resources out there to guide you. Here are some enlightening articles that not only offer a wealth of information but might also spark some innovative ideas:
- How to web scrape data from any website: quick start code to get you up and running mining the internet for data.
- The Best Web Scraping Tools for 2024: discover the top tools that can empower your web scraping projects.
- Web Scraping with Python: Everything you need to know: master the art of web scraping efficiently using Python.
- Easy web scraping with Scrapy: a guide to leveraging Scrapy for Python-powered web scraping.
- Web Scraping without getting blocked: strategies to scrape the web while dodging blocks and bans.
- Web crawling with Python: step-by-step instructions on building a Python crawler from the ground up.
- Web Scraping with JavaScript and NodeJS: learn how to implement web scraping using JavaScript and NodeJS.
Lastly, don't forget to dive deeper into the ScrapingBee API! With ScrapingBee, you'll sidestep the hassles of managing numerous headless browsers, navigating rate limits, juggling proxies, or tackling those pesky captchas. It's all about focusing on what really matters: the data itself. Let us handle the nitty-gritty of data retrieval, so you can channel your energy into making the most of the information at your fingertips.
Conclusion
And there you have it! We've explored some straightforward methods to find all webpages on a website. I hope you've found the guidance helpful and feel equipped to put these tips into action.
Thank you so much for joining me on this journey. Wishing you all the best with your scraping adventures. Happy data hunting!
Ilya is an IT tutor and author, web developer, and ex-Microsoft/Cisco specialist. His primary programming languages are Ruby, JavaScript, Python, and Elixir. He enjoys coding, teaching people and learning new things. In his free time he writes educational posts, participates in OpenSource projects, tweets, goes in for sports and plays music.