You can ignore non-HTML URLs when web crawling in two ways:
- Check the URL suffix for unwanted file extensions
Here is some sample code that filters out image URLs based on their extension:

```python
import os

IMAGE_EXTENSIONS = [
    'mng', 'pct', 'bmp', 'gif', 'jpg',
    'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps',
    'ps', 'svg', 'cdr', 'ico',
]

url = "https://scrapingbee.com/logo.png"

# os.path.splitext() splits off the extension ('.png');
# [1:] drops the leading dot before the membership test
if os.path.splitext(url)[-1][1:].lower() in IMAGE_EXTENSIONS:
    print("Abort the request")
else:
    print("Continue the request")
```
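One caveat worth noting (my addition, not from the original sample): real-world URLs often carry query strings or fragments, so calling `os.path.splitext` on the raw URL yields something like `.png?width=200`, which the check above would miss. A minimal sketch that first isolates the path with `urllib.parse.urlparse` — the `is_image_url` helper name is hypothetical:

```python
import os
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp',
    'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
}

def is_image_url(url):
    """Return True if the URL's path ends in a known image extension."""
    path = urlparse(url).path          # strips the query string and fragment
    ext = os.path.splitext(path)[1][1:].lower()
    return ext in IMAGE_EXTENSIONS

print(is_image_url("https://scrapingbee.com/logo.png?width=200"))  # True
print(is_image_url("https://scrapingbee.com/blog"))                # False
```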
- Perform a HEAD request to the URL and investigate the response headers
A HEAD request does not download the whole response; it asks the server only for the response metadata. The most useful piece of that metadata here is the Content-Type header, which tells you the media type of the resource behind a URL. If the HEAD request returns a non-HTML Content-Type, you can skip the full GET request entirely. Here is some sample code for making a HEAD request and inspecting the response type:
```python
import requests

response = requests.head("https://scrapingbee.com")
print(response.headers['Content-Type'])
# Output: 'text/html; charset=utf-8'

if "text/html" in response.headers['Content-Type']:
    print("You can now make the complete GET request")
else:
    print("Abort the request")

# This request would have failed the above check:
# response = requests.head("https://practicalpython.yasoob.me/_static/images/book-cover.png")
# print(response.headers['Content-Type'])
# Output: image/png
```
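The two checks compose naturally: run the cheap extension test first, and spend a HEAD request only when the extension is inconclusive. Here is a rough sketch of such a helper — the `should_crawl` name, the `session` parameter (which lets you inject a `requests.Session`), and the error handling are my own choices, not part of any library:

```python
import os
from urllib.parse import urlparse

import requests

IMAGE_EXTENSIONS = {
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp',
    'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
}

def should_crawl(url, session=None):
    """Return True if `url` looks like an HTML page worth a full GET.

    `session` may be a requests.Session (or a stub in tests);
    it defaults to the requests module itself.
    """
    # Cheap check first: skip URLs whose path ends in an image extension.
    ext = os.path.splitext(urlparse(url).path)[1][1:].lower()
    if ext in IMAGE_EXTENSIONS:
        return False

    # Otherwise spend one HEAD request to inspect the Content-Type.
    http = session or requests
    try:
        response = http.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False  # an unreachable URL is not worth a GET either
    return response.headers.get("Content-Type", "").startswith("text/html")

# should_crawl("https://scrapingbee.com/logo.png") -> False without any network I/O
```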