The internet is full of useful information and data. In 2021, an astonishing 2.5 quintillion bytes of data was created daily. This data can be analyzed and used to make better business decisions. However, most of the data is not structured and isn’t readily available for processing. That’s where web scraping comes in.
Web scraping enables you to retrieve data from a web page and store it in a format useful for further processing. But, as you probably know, web pages are written in a markup language called HTML. So, in order for you to extract the data from a web page, you must first parse its HTML. In doing so, you transform the HTML into a tree of objects.
There are numerous HTML parsers on the market, and choosing which one to go with can be confusing. In this roundup, you’ll review some of the best Python HTML parsers out there. Python is one of the most popular languages when it comes to scraping data, so it’s not surprising that there are quite a few options to consider. The parsers in this roundup were chosen for discussion based on the following factors:
- Open source
- Actively maintained
- Lightweight
- Good community support
- High performance
BeautifulSoup
BeautifulSoup is an HTML parser, but it’s also much more than that—it’s a lightweight and versatile library that allows you to extract data and information from both HTML and XML files. You can use it for web scraping, but also for modifying HTML files. While it’s relatively simple and easy to learn, it’s also very powerful: even complex web scraping projects can be completed with BeautifulSoup as the only web scraping library.
The BeautifulSoup Python package is not built in, so you need to install it before using it. Fortunately, it’s very user-friendly and easy to set up: simply install it by running pip install beautifulsoup4. Once that’s done, you only need to pass in the HTML you want to parse. You can then use the numerous functions of BeautifulSoup to find and scrape the data you want, all in just a few lines of code.
Let’s take a look at an example based on this simple HTML from BeautifulSoup’s documentation. Note that if you wanted to scrape a real web page instead, you’d first need to use something like the Requests module in order to access the web page.
Here’s the HTML:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Before parsing this, you’ll need to import the BeautifulSoup library using the following code:
from bs4 import BeautifulSoup
Finally, you can parse the HTML with this simple line of code:
soup = BeautifulSoup(html_doc, 'html.parser')
With this line, you simply pass the document you want to parse (in this case, html_doc) into the BeautifulSoup constructor, which then takes care of the parsing. You’ve probably noticed that, apart from the document you want to parse, the BeautifulSoup constructor takes a second argument. That’s where you pass the parser you want to use. If you prefer lxml to html.parser, you can simply run the following:
soup = BeautifulSoup(html_doc, 'lxml')
Keep in mind that different parsers can produce different results, especially when the HTML is malformed.
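As a quick illustration (adapted from BeautifulSoup’s own documentation on parser differences), the same broken fragment comes out differently depending on the parser you choose:
from bs4 import BeautifulSoup

broken = "<a></p>"
print(BeautifulSoup(broken, "html.parser"))  # html.parser keeps the fragment bare: <a></a>
print(BeautifulSoup(broken, "lxml"))         # lxml wraps it in <html><body> and drops the stray </p>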
Now that the HTML is parsed, you can find and extract any elements you’re looking for. For example, finding the title element is as easy as running soup.title, and you can get the actual title text with soup.title.string. Some popular functions BeautifulSoup offers include find() and especially find_all(), which make it extremely easy to find any elements you want. For instance, finding all the paragraphs or links in an HTML document is as simple as running soup.find_all('p') or soup.find_all('a').
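Putting that together, here’s a short sketch of what those lookups return for the html_doc parsed above:
print(soup.title)         # <title>The Dormouse's story</title>
print(soup.title.string)  # The Dormouse's story

# find_all() returns a list of every matching tag
for link in soup.find_all('a'):
    print(link.get('href'), link.string)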
In a nutshell, BeautifulSoup is a powerful, flexible, and easy-to-use web scraping library. It’s great for everyone, but thanks to its extensive documentation and straightforward usage, it’s especially suitable for beginners. To learn more about all the options BeautifulSoup provides, check out the documentation.
💡 Love BeautifulSoup? Check out our awesome guide to improving scraping speed performance with BS4.
lxml
Another high-performance HTML parser is lxml. You’ve already seen it in the previous section: BeautifulSoup supports using the lxml parser by simply passing lxml as the second argument to the BeautifulSoup constructor. Previously, lxml was known for its speed, while BeautifulSoup was known for its ability to handle messy HTML. Now that they can be used together, you get both the speed and the ability to handle messy HTML in a single package.
With lxml, you can extract data from both XML and broken HTML. It’s very fast, safe, and easy to use—thanks to the Pythonic API—and it requires no manual memory management. It also comes with a dedicated package for parsing HTML.
As with BeautifulSoup, to use lxml you first need to install it, which is easily done with pip install lxml. Once it’s installed, there are a couple of different functions for parsing HTML, such as parse() and fromstring(). For example, the same html_doc from the BeautifulSoup example can be parsed with the following piece of code:
from lxml import etree
from io import StringIO

# Parse the html_doc string (from the BeautifulSoup example) into an element tree
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html_doc), parser)
Now the parsed HTML is contained in tree. From there, you can extract the data you need using a variety of methods, such as XPath expressions or CSS selectors. You can learn how to do this in detail from lxml’s extensive documentation.
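For example, a minimal XPath sketch against the tree built above might look like this:
# Pull the title text and every "sister" link out of the parsed tree
title = tree.xpath('//title/text()')[0]
links = tree.xpath('//a[@class="sister"]/@href')

print(title)  # The Dormouse's story
print(links)  # ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']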
lxml is a lightweight, actively maintained parser with great documentation. It’s similar to BeautifulSoup in many aspects, so if you know one of them, picking up the other should be pretty straightforward.
pyquery
If you enjoy the jQuery API and would like it in Python, then pyquery is for you. It’s a library that offers both XML and HTML parsing, with high speeds and an API very similar to jQuery.
Just like the other HTML parsers in this roundup, pyquery allows you to traverse and scrape information from the XML or HTML file. It also allows you to manipulate the HTML file, including adding, inserting, or removing elements. You can create document trees from simple HTML and then manipulate and extract data from them. Also, since pyquery includes many helper functions, it helps you save time by not having to write as much code yourself.
In order to use pyquery, you first need to install it with pip install pyquery and then import it with from pyquery import PyQuery as pq. Once that’s done, you can load documents from strings, URLs, or even lxml objects. Here’s the code that will enable you to do so:
d = pq(html_doc) # loads the html_doc we introduced previously
d = pq(url="https://www.scrapingbee.com/") # loads the document from a URL
pyquery does essentially the same thing as lxml and BeautifulSoup. The main distinction is in its syntax, which is intentionally very similar to that of jQuery: in the code above, d plays the role of jQuery’s $. If you’re familiar with jQuery, this might be a particularly helpful option for you.
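For instance, assuming html_doc has already been loaded into d as shown above, a few jQuery-style lookups might look like this:
print(d('title').text())  # The Dormouse's story

# iterate over every element matching the CSS selector
for link in d('a.sister').items():
    print(link.attr('href'), link.text())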
However, pyquery is not as popular as BeautifulSoup or lxml, so its community support is comparatively lacking. Still, it’s lightweight, actively maintained, and has great documentation.
jusText
While not as powerful as the other parsers discussed in this roundup, jusText is a tool that can be very useful in specific instances, such as when you want to keep only the full sentences on a given web page. It has one main goal: to remove all boilerplate content from a web page, and leave only the main content. So, if you were to pass a web page to jusText, it would remove the contents from the header, main menu, footer, sidebar, and everything else deemed not to be important, leaving you with only the text on the main part of the page.
You can give it a try using this jusText demo. Simply input the URL of any web page, and it will strip out all unnecessary content, leaving only what it considers important. If this is something you would find useful, you can learn more about jusText here.
In order to use jusText, you first need to install it using pip install justext. You’ll also need the Requests module installed so that you can fetch the web page from which you want to remove the boilerplate content. Then, you can use the following piece of code to get the main content of any web page:
import requests
import justext

url = "https://www.scrapingbee.com/"
response = requests.get(url)

# Classify each paragraph on the page and keep only the non-boilerplate ones
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)
This piece of code accesses the ScrapingBee homepage and returns only its main content. All you need to change is the URL, though keep in mind that this example uses the English stop list, so it’s suited to English-language pages.
Similar to pyquery, jusText is another package that’s not as popular as BeautifulSoup, and its main use is limited to one particular task. Compared to lxml and BeautifulSoup, it doesn’t come with extensive documentation, nor is it as actively maintained. Still, it’s a lightweight parser that is great for the task of removing boilerplate content.
Scrapy
Finally, BeautifulSoup may be great for parsing HTML and extracting data from it, but when it comes to crawling and scraping an entire website, there are better choices available. If you’re working on a highly complex web scraping project, you might be better off with a framework like Scrapy. It’s more complicated and has a steeper learning curve than BeautifulSoup, but it’s also much more powerful.
To say that Scrapy is an HTML parser is a huge understatement, since parsing HTML is a minuscule part of what Scrapy is capable of. Scrapy is a complete Python web scraping framework. It has all the features you may require for web scraping, such as crawling an entire website from a single URL, exporting and storing the data in various formats and databases, limiting the crawl rate, and more. It’s very powerful, efficient, and customizable.
Among its many features, Scrapy offers methods for parsing HTML. First, you perform a request for the URL you want parsed, which you can do using the start_requests method. Once that’s done, the web page you get as a response is parsed by the parse method, which extracts the data and returns an object. The parse method allows the extracted information to be returned as different kinds of objects, such as Item objects, Request objects, and dictionaries. Once the data is extracted, the yield keyword sends it to an Item Pipeline.
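To make that flow more concrete, here’s a minimal spider sketch; the spider name, URL, and CSS selectors are purely illustrative (they target the quotes.toscrape.com sandbox site), not a prescribed setup:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # request the page(s) you want parsed
        yield scrapy.Request("https://quotes.toscrape.com/", callback=self.parse)

    def parse(self, response):
        # extract the data and yield it as dictionaries;
        # yielded items are handed on to the Item Pipeline
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }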
The only downside to Scrapy is its steep learning curve; unlike the other items in this roundup, Scrapy is not suitable for beginners. Learning it can be a challenge, since there are so many features you’ll need to be familiar with.
However, by learning enough Scrapy, you can extract any piece of data available on a website. If you really want to step up your web scraping abilities, there’s no better tool. Because of its many features, Scrapy is not as lightweight as the other parsers mentioned here. Still, it’s actively maintained and offers very extensive documentation and great community support—it’s one of the most popular Python packages available. If you want to learn how to build your first Scrapy spider, check out this Scrapy beginner’s guide.
💡 Love web scraping in Python? Check out our expert list of the Best Python web scraping libraries.
Conclusion
The amount of data on the internet increases by the day, and the need to handle this data and turn it into something useful grows in turn. In order to gather the data efficiently, you must first parse a web page’s HTML and extract the data from it. To parse the HTML, you need an HTML parser.
This article discussed several Python HTML parsers, reviewed based on whether they’re open source, lightweight, and actively maintained, as well as whether they offer good performance and community support.
For your next web scraping project, consider using one of these parsers. For most use cases, BeautifulSoup is likely the best place to start, with lxml being a viable alternative. If you prefer the jQuery API, you can opt for pyquery. When all you need is the main content of a web page, jusText is a great option. Finally, if you’re scraping an entire website with thousands of pages, Scrapy is probably your best bet.
Alen is a data scientist working in finance. He also freelances and writes about data science and machine learning.