Haskell Web Scraping
Even though web scraping is commonly done with languages like Python and JavaScript, a statically typed functional programming language like Haskell can provide extra benefits. The type system helps ensure that your scripts do what you intend and that the scraped data conforms to your requirements.
In this article, you'll learn how to do web scraping in Haskell with libraries such as Scalpel and webdriver.
Basic Scraping
Scraping a static website can be done with any language that has libraries for an HTTP client and HTML parsing, and Haskell is no different. It even has a dedicated high-level scraping library called Scalpel, which gives it an edge over comparable languages like Rust.
In the first part of this tutorial, you'll use Scalpel to scrape the list of largest cities from Wikipedia.
Setup
To use Scalpel, you need GHC (the Haskell compiler) and Stack (the Haskell package manager) on your machine. If you don't have them yet, install them via GHCup, which is the currently recommended installation method.
Next, create a new Haskell project using the following command:
stack new scraper
The command creates a new folder with everything you need for a Haskell project that uses Stack. Move into that folder:
cd scraper
After that, add Scalpel as a dependency. With Stack, you don't install libraries with a separate command; instead, you declare them in package.yaml and Stack downloads and builds them the next time you build the project. Add Scalpel and the text library to the dependencies in the package.yaml file:
dependencies:
- base >= 4.7 && < 5
- scalpel
- text
In addition to Scalpel, the "text" library will be used to handle strings.
Then, open app/Main.hs in a code editor. Paste in the following code:
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import Data.Text
import Text.HTML.Scalpel
main :: IO ()
main = return ()
This code imports the libraries you'll use and enables the OverloadedStrings language extension, which Scalpel needs so that string literals can be used as Text values and selectors. The main function is just a placeholder for now; you'll replace it shortly.
During the tutorial, you'll continue working in Main.hs to build out your program and understand the opportunities that the library offers.
Scraping Elements
At the heart of Scalpel are scrapers. They find elements using selectors and return the matching DOM elements, text content, or HTML attributes.
Here's a simple example of a scraper:
heading :: Scraper Text Text
heading = text "h1"
It takes HTML as input and returns the text of the first H1 element as output.
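You can also try a scraper without touching the network by running it on an in-memory HTML string with Scalpel's scrapeStringLike function. Here's a minimal sketch (the HTML snippet is made up for illustration):
-- Runs the heading scraper directly on a hard-coded HTML string
-- instead of downloading a page.
demo :: Maybe Text
demo = scrapeStringLike "<html><body><h1>Hello</h1></body></html>" heading
  where
    heading :: Scraper Text Text
    heading = text "h1"
Evaluating demo returns Just "Hello".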
Scrapers can be run using the scrapeURL function. Given a URL and a scraper, it returns an IO action that downloads the page and runs the scraper on it; that action can then be executed from main.
Here's how you can use scrapeURL to create a function for scraping the heading of a page:
scraper :: IO (Maybe Text)
scraper = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" heading
  where
    heading :: Scraper Text Text
    heading = text "h1"
To run it, execute it in main:
main :: IO ()
main = do
  result <- scraper
  case result of
    Just x -> print x
    Nothing -> putStrLn "Didn't find the necessary items."
Here's what the full code for running a scraper on the page heading looks like:
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.Text
import Text.HTML.Scalpel

scraper :: IO (Maybe Text)
scraper = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" heading
  where
    heading :: Scraper Text Text
    heading = text "h1"

main :: IO ()
main = do
  result <- scraper
  case result of
    Just x -> print x
    Nothing -> putStrLn "Didn't find the necessary items."
You can run it with stack run. It should print out the name of the page, which is "List of largest cities".
Scrapers and Selectors in Scalpel
Scalpel offers many types of scrapers and selectors for advanced use.
If you would like to get an attribute of an element, you can use the attr scraper. It takes the name of the attribute you want to access and a selector for an element, and it returns the contents of that attribute. For example, if you want to access the class name of the h1 element, you can do it with the following code:
headingClass :: Scraper Text Text
headingClass = attr "class" "h1"
Scalpel selectors can not only target elements of one tag but also match nested elements by using the // operator. For example, the following combination of selectors works the same way as the div h1 CSS selector:
headingInDiv :: Scraper Text Text
headingInDiv = text ("div" // "h1")
It's also possible to use different types of attribute predicates. For example, if you need a div with a specific class, you can use the hasClass predicate:
navbarDiv :: Scraper Text Text
navbarDiv = text ("div" @: [hasClass "navbar"])
This is identical to the div.navbar CSS selector.
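Besides hasClass, you can match an attribute's exact value with the @= predicate. As a sketch, the following scraper selects a div by its id attribute (the id used here is only an example):
-- Selects the div whose id attribute equals "mw-content-text",
-- the same as the div#mw-content-text CSS selector.
contentDiv :: Scraper Text Text
contentDiv = text ("div" @: ["id" @= "mw-content-text"])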
If you're used to narrowing down to the final element through a series of selections, you'll find the chroot scraper useful. As shown below, it selects HTML using the selector that you provide and then passes that HTML to the next scraper, letting you chain selections together.
scraper :: IO (Maybe Text)
scraper = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" sidebar
  where
    sidebar :: Scraper Text Text
    sidebar = chroot ("table" @: [hasClass "sidebar"]) sidebarTitle

    sidebarTitle :: Scraper Text Text
    sidebarTitle = text "th"
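Note that chroot only descends into the first element matched by the selector. Its plural counterpart, chroots, runs the inner scraper on every match and collects the results in a list; you'll rely on it in the next section to iterate over table rows. A minimal sketch:
-- Collects the text of the header cell of every table row on the page.
rowHeaders :: Scraper Text [Text]
rowHeaders = chroots "tr" (text "th")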
Scraping Tables
Now you're ready to scrape the table containing data about the largest cities in the world. Here's how you can do it.
First, create a record to hold data about cities. It will have three fields: name, country, and population:
data City = City
  { name :: Text
  , country :: Text
  , population :: Text
  } deriving (Show, Eq)
Next, create a function for scraping the information by executing a series of scrapers:
allCities :: IO (Maybe [City])
allCities = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" table
  where
    table :: Scraper Text [City]
    table = chroot ("table" @: [hasClass "static-row-numbers"]) cities

    cities :: Scraper Text [City]
    cities = chroots "tr" city

    city :: Scraper Text City
    city = do
      name <- text "th"
      rows <- texts "td"
      let country = getCountry rows
      let population = getPopulation rows
      return $ City (strip name) country population

    getCountry :: [Text] -> Text
    getCountry (x : _) = strip x
    getCountry _ = "Not available"

    getPopulation :: [Text] -> Text
    getPopulation (_ : y : _) = strip y
    getPopulation _ = "Not available"
Here's what the scrapers do:
- table finds the table containing the information on the page and uses chroot to execute the cities scraper on that table.
- cities finds each row of the table and executes the city scraper on each of those rows.
- city extracts information from the row by scraping the row's header cell and its first and second data cells, then returns a record containing that information.
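Depending on how the table is marked up, header rows (which contain th cells but no td cells) may also be matched by chroots "tr" and show up as records whose fields are "Not available". If you want to drop such rows, here's a small sketch of a filter you could apply to the scraped list:
-- Keeps only records that have both a country and a population value,
-- dropping header rows and other incomplete matches.
realCities :: [City] -> [City]
realCities = filter (\c -> country c /= "Not available" && population c /= "Not available")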
Run allCities using the following main function:
main :: IO ()
main = do
  cities <- allCities
  print cities
Here's how the final code should look:
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.Text
import Text.HTML.Scalpel

data City = City
  { name :: Text
  , country :: Text
  , population :: Text
  } deriving (Show, Eq)

allCities :: IO (Maybe [City])
allCities = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" table
  where
    table :: Scraper Text [City]
    table = chroot ("table" @: [hasClass "static-row-numbers"]) cities

    cities :: Scraper Text [City]
    cities = chroots "tr" city

    city :: Scraper Text City
    city = do
      name <- text "th"
      rows <- texts "td"
      let country = getCountry rows
      let population = getPopulation rows
      return $ City (strip name) country population

    getCountry :: [Text] -> Text
    getCountry (x : _) = strip x
    getCountry _ = "Not available"

    getPopulation :: [Text] -> Text
    getPopulation (_ : y : _) = strip y
    getPopulation _ = "Not available"

main :: IO ()
main = do
  result <- allCities
  case result of
    Just x -> print x
    Nothing -> putStrLn "Didn't find the necessary items."
If you run it with the stack run command, it should print out a list of records corresponding to the cities in the Wikipedia list.
Scraping with Selenium in Haskell
Since Scalpel only parses HTML, you cannot use it to interact with pages that use JavaScript to dynamically generate their content. For these types of pages, you can use webdriver, a library with Selenium bindings that lets you programmatically control a web browser.
This section shows you how to use the webdriver library to scrape dynamic websites.
Setup
webdriver is quite an old library and needs a version 2 Selenium server to work. You can download it here.
In addition, you'll need a driver for your browser. This tutorial assumes that you use Chrome. You can download the ChromeDriver that matches your version of Chrome here. Extract the file and put it in the same location as your code and the server.
To use the library, add it to the dependencies in your package.yaml:
dependencies:
- base >= 4.7 && < 5
- scalpel
- text
- webdriver
Open a new terminal window and start the Selenium server using the following command:
java -jar .\selenium-server-standalone-2.53.1.jar
Now you're ready to connect to it and drive a browser.
Interacting with Dynamic Elements
The webdriver library enables you to drive a browser. You can test it out by trying to scrape quotes from the JS-generated version of Quotes to Scrape. Since the quotes on the page are generated by JavaScript, using a simple HTTP client library to scrape them will fail, but you can easily get the quotes by using Selenium to drive a browser.
Replace the code in Main.hs with the following:
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.Text
import Test.WebDriver

chromeConfig :: WDConfig
chromeConfig = useBrowser chrome defaultConfig

main :: IO ()
main = do
  quotes <- getQuotes
  mapM_ print quotes

getQuotes :: IO [Text]
getQuotes = runSession chromeConfig $ do
  openPage "http://quotes.toscrape.com/js/"
  quoteElems <- findElems (ByCSS "span.text")
  quotes <- traverse getText quoteElems
  closeSession
  return quotes
In the code example above:
- chromeConfig provides the configuration for connecting to a Selenium server that drives Chrome via ChromeDriver;
- getQuotes fetches the quotes from the page; and
- main executes getQuotes and prints each quote.
Running the code with stack run should result in a Chrome browser starting up, opening the website, and then closing. The quotes from the website should be printed out in your console.
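Since webdriver drives a real browser, you can also interact with the page instead of just reading it. The following sketch (reusing chromeConfig from above) clicks the "Next" pagination link and scrapes the second page of quotes as well. The li.next > a selector is an assumption about the page's markup, and for slower pages you may need the explicit waits from Test.WebDriver.Commands.Wait:
-- Scrapes the quotes on the first page, clicks the "Next" link,
-- and scrapes the quotes on the second page too.
getTwoPages :: IO [Text]
getTwoPages = runSession chromeConfig $ do
  openPage "http://quotes.toscrape.com/js/"
  firstPage <- traverse getText =<< findElems (ByCSS "span.text")
  nextLink <- findElem (ByCSS "li.next > a")
  click nextLink
  secondPage <- traverse getText =<< findElems (ByCSS "span.text")
  closeSession
  return (firstPage ++ secondPage)
You can call getTwoPages from main in place of getQuotes to print quotes from both pages.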
Conclusion
In this article, you learned how to do basic scraping of static websites in Haskell and explored advanced techniques for scraping dynamic websites using Selenium.
Web scraping with Haskell is possible, especially if you're a passionate Haskeller who wants to use it to solve everyday tasks.
However, if you prefer a hassle-free web scraping experience without dealing with rate limits, proxies, user agents, and browser fingerprints, you can check out ScrapingBee's no-code web scraping API. Did you know the first 1,000 calls are on us? Give it a try!