Web scraping is essential when you need to retrieve large amounts of data from the internet, and the most crucial part of the process is HTML parsing, that is, extracting the data you need from HTML code.
If you need an HTML parser, you may be overwhelmed by the many choices. These are the basic criteria to keep in mind:
It must be open-source and free.
It must offer reasonable documentation.
The library must be actively maintained.
This article analyzes five HTML parsing libraries for C# against these criteria so that you can consider them for your projects.
Note: Some of the libraries mentioned aren’t strictly used for HTML parsing; they may be used for web automation or as a parser for any hypertext language in general (e.g., HTML, XML, MathML).
Html Agility Pack
Html Agility Pack (HAP) is an essential library for parsing HTML using C#. It’s also a dependency for one of the other libraries mentioned in this article, Fizzler. HAP is more versatile than other libraries, allowing you to scrape websites and parse them directly using the same library.
What HAP offers:
It can automatically get the HTML source file through an HTTP request, while other libraries rarely have this functionality built-in.
It can return cleaner results than other libraries. A node’s InnerText property strips the inline HTML tags, giving you plain text.
HAP is available for new versions of .NET, including .NET Core 3.1 and .NET 5.
Pros
HAP is one of the fastest HTML parsers in C#, achieving first place when benchmarked against other libraries. The result includes retrieving the HTML source from a supplied URL.
It saves you the trouble of removing unused inline HTML tags.
It’s extensible with Fizzler, which adds CSS selector support to the library.
It’s actively maintained, with frequent updates and good documentation.
It’s more straightforward than other libraries, providing an excellent experience for developers who want to work on something fast.
Cons
Out of the box, HAP only lets you query with XPath and HTML tags. It doesn’t support CSS selectors unless you add an extension such as Fizzler.
If you need data from the inline HTML tags themselves, Html Agility Pack’s cleaned-up text output can make that data harder to reach.
You can install Html Agility Pack using the NuGet package manager.
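To make the XPath workflow concrete, here is a minimal sketch that assumes a .NET 6 or later console project with top-level statements and the HtmlAgilityPack NuGet package installed. The sample markup, the URL in the comment, and the variable names are illustrative assumptions rather than anything prescribed by the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical markup standing in for a downloaded page.
const string html = @"
<html><body>
  <ul>
    <li><a href=""/items/1"">First <b>item</b></a></li>
    <li><a href=""/items/2"">Second item</a></li>
  </ul>
</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);
// To fetch a live page instead, HtmlWeb can download and parse in one step:
// var doc = new HtmlWeb().Load("https://example.com");

// SelectNodes takes an XPath expression and returns null when nothing matches.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");

var links = new List<(string Href, string Text)>();
if (anchors != null)
{
    foreach (var a in anchors)
    {
        // InnerText drops nested tags such as <b> and keeps only the text.
        links.Add((a.GetAttributeValue("href", ""), a.InnerText.Trim()));
    }
}

foreach (var (href, text) in links)
{
    Console.WriteLine($"{text} -> {href}");
}
```

The null check matters because SelectNodes does not return an empty collection when the XPath expression matches nothing.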
AngleSharp
AngleSharp lets you parse hypertext source documents and can serve as a de facto headless browser because it builds a standards-based DOM much like the one modern browsers produce. The library has been maintained continuously since 2013.
What AngleSharp offers:
AngleSharp returns the parsed HTML unaltered, which is useful if you prefer an HTML parser that doesn’t change the results.
AngleSharp is one of the most extensible libraries out there, with many companion packages you can use to make it more capable (AngleSharp.XPath and AngleSharp.Css, to name a few).
AngleSharp is one of the more popular HTML parser libraries along with Html Agility Pack, and it’s also one of the best maintained. It’s available for .NET Framework and .NET Standard.
Pros
AngleSharp is one of the fastest C# HTML parser libraries out there, second only to Html Agility Pack when benchmarked. The benchmark includes the HTTP request to retrieve the HTML source.
It returns a raw HTML source rather than an altered one, making it easier for you to retrieve all kinds of data from within the HTML tags.
There are many extension packages for AngleSharp, including AngleSharp.XPath for XPath support and AngleSharp.Css for extended CSS support.
It’s actively maintained with frequent updates and good documentation.
It’s an old library that has stood the test of time.
It offers a broad toolset, giving developers more freedom to do what they want.
Cons
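AngleSharp’s core package supports CSS selectors through QuerySelector and QuerySelectorAll, but it doesn’t support XPath. You’d need to install a separate package, AngleSharp.XPath, to query using XPath.
It doesn’t retrieve the HTML source for you by default. You have to either configure a browsing context with a document loader or make the HTTP request yourself using an HTTP client.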
You can install AngleSharp, including AngleSharp.XPath, by using NuGet.
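As a rough illustration of how AngleSharp keeps the markup intact, the sketch below parses an HTML string with the core AngleSharp package and queries it with QuerySelectorAll. It assumes a .NET 6 or later console project with top-level statements; the sample markup and variable names are made up for the example.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

// Hypothetical markup standing in for a page you already downloaded.
const string html = @"
<html><body>
  <ul class=""products"">
    <li><a href=""/items/1"">First <b>item</b></a></li>
    <li><a href=""/items/2"">Second item</a></li>
  </ul>
  <a href=""/about"">About</a>
</body></html>";

var parser = new HtmlParser();
var document = parser.ParseDocument(html);

// Select only the links inside the product list.
var productLinks = document
    .QuerySelectorAll("ul.products a")
    .Select(a => (Href: a.GetAttribute("href"), Text: a.TextContent.Trim()))
    .ToList();

foreach (var (href, text) in productLinks)
{
    Console.WriteLine($"{text} -> {href}");
}

// InnerHtml keeps the nested markup, so inline tags remain available.
var firstMarkup = document.QuerySelector("ul.products a")?.InnerHtml;
Console.WriteLine(firstMarkup);
```

TextContent gives the plain text of an element, while InnerHtml preserves inline tags such as the nested b element, so you can choose whichever form your data needs.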
Fizzler
Fizzler is a CSS selector engine built on top of Html Agility Pack. It serves as an extension to Html Agility Pack and, like AngleSharp, names its selector methods after their JavaScript counterparts (e.g., QuerySelector).
Fizzler receives the least support on this list. As of this writing, the last significant update was about ten months ago. The library is still actively maintained, though.
Fizzler offers similar functions to those of Html Agility Pack and shares its processing speed. Fizzler is a .NET Standard 1.0 library.
Pros
Fizzler offers the benefits of the regular Html Agility Pack library plus increased CSS selector support.
It enables you to get and parse HTML using a URL supplied and retrieved using Html Agility Pack.
It’s a literal CSS selector extension to Html Agility Pack, giving developers a tool to use CSS selectors without needing to change libraries.
Cons
Fizzler is actively maintained, but it is one of the least active projects compared to the other libraries.
It isn’t that well documented, with the only example provided in its README.md file in the Fizzler main Git repository.
You can install Fizzler using NuGet.
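The following sketch shows roughly what Fizzler adds on top of Html Agility Pack. It assumes a .NET 6 or later console project with top-level statements and the Fizzler.Systems.HtmlAgilityPack NuGet package, which brings in Html Agility Pack as a dependency. The markup, class names, and variable names are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

// Hypothetical product listing used only for illustration.
const string html = @"
<html><body>
  <div class=""product""><span class=""name"">Keyboard</span><span class=""price"">$49</span></div>
  <div class=""product""><span class=""name"">Mouse</span><span class=""price"">$19</span></div>
</body></html>";

// The document is loaded with Html Agility Pack as usual.
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Fizzler contributes QuerySelectorAll and QuerySelector as extension
// methods on HtmlNode, so CSS selectors work on the same document.
var prices = doc.DocumentNode
    .QuerySelectorAll("div.product > span.price")
    .Select(node => node.InnerText)
    .ToList();

var firstName = doc.DocumentNode
    .QuerySelector("div.product span.name")?.InnerText;

Console.WriteLine($"First product: {firstName}");
Console.WriteLine($"Prices: {string.Join(", ", prices)}");
```

Because the selectors operate on ordinary HtmlNode objects, you can mix CSS queries with Html Agility Pack’s XPath queries in the same code.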
CefSharp
CefSharp, short for Chromium Embedded Framework C#, is a .NET wrapper around the Chromium Embedded Framework that lets you embed and control a Chromium browser from C#. The browser can be used either as a headless browser or as part of an application’s UI inside a Windows Forms app.
CefSharp is similar to a web automation tool like Selenium. But unlike Selenium, CefSharp doesn’t have a built-in HTML parser. You’d need to use an external HTML parser library such as AngleSharp or Html Agility Pack.
There are many CefSharp libraries, but CefSharp.OffScreen is the one that behaves like a headless browser, running Chromium when you initialize it. Compared to Selenium’s web driver, CefSharp is more convenient to set up because you don’t have to define the path of a Chromium driver binary to make it work. CefSharp automatically sets up Chromium for you in your project’s dependencies.
Pros
CefSharp can run Chromium off-screen, making it a headless browser capable of scraping large amounts of data from complicated websites.
It’s the oldest maintained library in this list and is still popular among the .NET community (mainly for its capabilities as an embedded browser inside WinForms).
Cons
CefSharp can only be used to scrape websites and not to parse them. You need to combine it with an external HTML parsing library such as Html Agility Pack.
To use CefSharp for your .NET Core project, you can install CefSharp.OffScreen.NETCore using NuGet.
Selenium WebDriver
Selenium is a suite of tools for web automation, used most prominently for quality assurance testing. You can use Selenium tools to open and interact with web browsers through your code and retrieve the responses.
Like CefSharp, Selenium can drive a headless browser from your C# code. While Selenium works well as a general quality assurance testing tool, it isn’t designed specifically for HTML parsing and may be more than you need for that task.
Pros
Selenium can run a browser in headless mode under the hood, making it capable of scraping data from websites.
It’s arguably the most famous quality assurance testing library, with plenty of tutorials and documentation available.
Selenium has built-in element locators that can use XPath to retrieve data from a loaded page.
Selenium enables developers to have a granular level of control.
Cons
Selenium requires a browser driver such as ChromeDriver to be installed, making setup more complicated than with the other libraries.
Using Selenium only to parse HTML can seem excessive, because the library is more frequently used for quality assurance.
Because it relies on a headless browser, it’s the slowest HTML parsing tool on this list.
You need to install two of its libraries: Selenium.WebDriver and Selenium.WebDriver.ChromeDriver.
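As a hedged sketch of what querying a page through Selenium might look like, the example below starts Chrome in headless mode and reads links with an XPath locator. It assumes a .NET 6 or later console project with top-level statements, a local Chrome installation, and the two packages named above. The URL is a placeholder, not a recommended target.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Run Chrome without a visible window.
var options = new ChromeOptions();
options.AddArgument("--headless");

// Disposing the driver at the end of the program closes the browser.
using var driver = new ChromeDriver(options);

// Placeholder URL; replace it with the page you want to inspect.
driver.Navigate().GoToUrl("https://example.com");
Console.WriteLine($"Title: {driver.Title}");

// FindElements returns an empty collection when nothing matches.
foreach (var link in driver.FindElements(By.XPath("//a[@href]")))
{
    var href = link.GetAttribute("href");
    Console.WriteLine($"{link.Text} -> {href}");
}
```

Starting a full browser for every page is what makes this approach slower than the pure parsers, so it tends to pay off mainly when the page has to be rendered or interacted with before the data appears.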
Conclusion
All the libraries listed above provide features and functions that may appeal to you. Generally, the most popular option is Selenium, followed by AngleSharp, CefSharp, Html Agility Pack, and Fizzler. For developer experience, Html Agility Pack seems to be the best option, with Fizzler providing an excellent add-on to the library’s functionality.
If you are looking to parse an HTML file directly, consider using AngleSharp, Html Agility Pack, or Fizzler. To parse web URLs, Html Agility Pack with Fizzler is a good choice because together they can fetch and parse HTML from a URL. If you need to simulate browser actions to get to the page you want to parse, consider using Selenium or CefSharp.
Each library has its pros and cons, but you may find that all of them are worth trying, depending on your use case.