Web Scraping with Perl

14 April 2025 (updated) | 8 min read

Web scraping is a technique for retrieving data from web pages. While you could certainly load any site in your browser and copy-paste the relevant data manually, that approach hardly scales, which makes web scraping a task destined for automation. If you are curious why anyone would scrape the web, there are plenty of reasons:

  • Generating leads for marketing
  • Monitoring prices on a page (and purchasing when the price drops)
  • Academic research
  • Arbitrage betting

Perl is often called the "Swiss Army knife of programming", and for good reason: it excels at text processing and at handling textual input of any sort. This makes it an excellent companion for web scraping, which is inherently text-centric.

This article provides:

  • A brief introduction to web scraping
  • A discussion of the benefits of scraping
  • A demonstration of how to build a simple scraper in Perl


Applications of Web Scraping

We’ve already discussed some benefits of scraping earlier, but the list wasn’t exhaustive.

Even if a site provides an API, that API does not necessarily expose all the data the site itself has to offer. This is where web scrapers come to the rescue, as they can cover that last mile of data access. For example, the API of the lyrics site genius.com does not include the lyrics of a given song in its response, and that is exactly the gap this article will address.

Web scraping enables developers to access and extract precisely the information they need. For example, e-commerce companies often opt to scrape competitor sites to ensure their own pricing remains attractive and market-aligned. Another use case is aggregating product reviews to enable centralised market analysis.

The applications of scraping are almost endless, as the insights and value a company can derive from data retrieved from (almost) anywhere on the Internet are significant.

cURL

The easiest way to perform an HTTP request from the command line is with curl:

curl -i https://genius.com

If you'd like to learn more about cURL, you can check out: How to follow redirect using cURL?, How to send a POST request using cURL?, or Web Scraping With Linux And Bash.

In return, the server provides you with an HTTP response:

HTTP/2 200
date: Mon, 03 Mar 2025 20:06:48 GMT
content-type: text/html; charset=utf-8
cf-ray: 705b9e87f9aabc6c-DUR
accept-ranges: bytes
cache-control: public, s-maxage=180
etag: W/"1bae1849f0a7e6803d98f06c9c43e007"
set-cookie: _genius_ab_test_cohort=98; Max-Age=2147483647; Path=/
vary: X-Requested-With, Accept-Encoding
via: 1.1 vegur
cf-cache-status: HIT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
set-cookie: _genius_ab_test_primis_mobile=control; Max-Age=2147483647; Path=/
status: 200 OK
x-frame-options: SAMEORIGIN
x-runtime: 1242
server: cloudflare

<!doctype html>
<html>
  <head>
    <title>Genius | Song Lyrics &amp; Knowledge</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<!-- The rest of it goes here -->

Alongside the requested content, the response also includes HTTP headers which provide additional information and instructions, such as the type of returned content, any cookies, caching directives, and more.

The content type, alternatively referred to as MIME type, is the response data format — in our example, an HTML document.

In this example, the cache-control header is a set of directives describing how the response should be cached. In our response above, it indicates that the HTML may be cached for a maximum of three minutes (180 seconds). Cookies are short key-value pairs carrying data sent from server to client in the set-cookie header. Just as the user-agent header identifies the client, the server header identifies the server software. If you want to know more about the pieces of metadata in this response that are not discussed in detail here, you can visit MDN.

Web Scraping with Perl

The goal of the scraper we are about to build is to fetch the song lyrics for a specific song available on Genius. To do this, we first need to know about the Perl modules necessary to perform the web requests and parse the HTML content.

LWP

LWP – the World-Wide Web library for Perl – is a module that enables your Perl application to compose and send HTTP requests. It also supports document and file downloads and is used under the hood by tools such as the cpan client.

You can find the full LWP API specification on the Comprehensive Perl Archive Network (CPAN) or browse through it locally with perldoc by typing the following in the console:

perldoc LWP
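
To get a feel for the module before building the full scraper, here is a minimal sketch of an LWP request (the user agent string is arbitrary). It fetches a page, checks whether the request succeeded, and prints a couple of the headers we saw in the curl output earlier:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => "My Perl Client");
my $response = $ua->get("https://genius.com");

if ($response->is_success) {
  # HTTP::Response gives access to the headers discussed above
  print "Content type: " . $response->header("Content-Type") . "\n";
  print "Server: " . ($response->header("Server") // "unknown") . "\n";
} else {
  print "Request failed: " . $response->status_line . "\n";
}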

Parsing with TreeBuilder

HTML::TreeBuilder is a Perl module available on CPAN, designed to parse an HTML document into a DOM tree, enabling subsequent data queries and further document manipulation. It is based on the HTML::Parser and HTML::Element packages.

TreeBuilder can be installed using either cpan or cpanm, both of which are available on most major operating systems and can be set up following the instructions on the CPAN website. The following cpan command will install the package:

cpan
cpan[1]> install HTML::TreeBuilder

Alternatively, you can use cpanm – a streamlined version of cpan – to install TreeBuilder with the following command:

cpanm HTML::TreeBuilder
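
As a quick, self-contained illustration of what TreeBuilder does (independent of our scraper, using a made-up HTML snippet), the following sketch parses a short string and pulls out an element by its ID, much like we will do with the real Genius page below:

use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<html><body><div id="greeting">Hello, Perl!</div></body></html>';

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();

# look_down() searches the parsed tree for matching elements
my $div = $tree->look_down(_tag => "div", id => "greeting");
print $div->as_text, "\n";   # prints "Hello, Perl!"

$tree->delete;   # free the tree when done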

Coding the Scraper

For this example, we’re going to retrieve the song lyrics for “Six Days” by the American producer DJ Shadow.

The first step is to pull in the required modules and instantiate the runtime objects of LWP and TreeBuilder, which will allow us to perform the HTTP request and extract the song lyrics from the returned HTML document:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use Encode qw(decode_utf8);

my $ua = LWP::UserAgent->new;
$ua->agent("Genius Scraper");

my $url = "https://genius.com/DJ-Shadow-Six-Days-lyrics";
my $root = HTML::TreeBuilder->new();

# perform HTTP GET request
my $request = $ua->get($url) or die "Cannot contact Genius $!\n";

Now, $request should contain the server response and we can pass its content to the parse() method of our TreeBuilder instance. To avoid encoding issues, we pass a UTF-8 decoded version of the content using decode_utf8 from the Encode module:

if ($request->is_success) {
  $root->parse(decode_utf8 $request->content);
  $root->eof();
} else {
  # print an error message including the HTTP status
  print "Cannot display the lyrics: " . $request->status_line . "\n";
}

Once parsed, $root contains the fully parsed DOM tree and we can call the look_down() method to traverse it and extract the lyrics from the right position within the document. On Genius, the lyrics can be found as part of an HTML <div> element with the element ID lyrics-root (akin to a div#lyrics-root CSS selector):

my $data = $root->look_down(
  _tag  => "div",
  id    => "lyrics-root"
);
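
One small defensive addition worth considering (it is not part of the original script): look_down() returns undef when no matching element is found, for example if Genius changes its markup or the request is blocked, so it is worth bailing out early in that case:

die "Could not find the lyrics container on the page\n" unless defined $data;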

Although useful for debugging, TreeBuilder's standard output methods do not produce particularly readable output, so we'll use the HTML::FormatText module to print the lyrics to the console in a clean fashion. For this, we simply instantiate a formatter, call its format() method with $data, and print the result.

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

print $formatter->format($data);

Let's now just make our script executable:

chmod a+x scraper.pl

And our scraper should be ready to run with the following command:

./scraper.pl

When run, the scraper should display the lyrics in the following format:

Six Days Lyrics
---------------

[Verse 1]
At the starting of the week
At summit talks you'll hear them speak
It's only Monday
Negotiations breaking down
See those leaders start to frown
It's sword and gun day

[Hook]
Tomorrow never comes until it's too late

... etc. etc.

At this point, the scraper fetches the page and retrieves the lyrics for “Six Days”. While that shows how easily you can scrape a website in Perl, it would be even better if the scraper provided a more generic interface for fetching the lyrics of any song on Genius.

Let's try just that now with a little tweak, by requiring a command line argument:

if (@ARGV != 1) {
  die "Please provide song input\n";
}

Now we change our initial URL setup to insert the provided song slug into the URL:

my $url = "https://genius.com/$ARGV[0]";

If we now call our scraper and pass DJ-Shadow-Six-Days-lyrics as a command line argument, we should get the same content as before:

./scraper.pl DJ-Shadow-Six-Days-lyrics

However, we can now also pass The-beatles-yesterday-lyrics and get the lyrics of "Yesterday":

./scraper.pl The-beatles-yesterday-lyrics

The code for the scraper is available in a GitHub Gist. Feel free to customize it as you see fit.

Chrome automation with Perl

When a site relies heavily on JavaScript and client-side rendering, you may not be able to extract the desired data with an HTTP client alone. In such cases, a full browser instance is often necessary. That's where WWW::Mechanize::Chrome comes into play. The module provides the familiar WWW::Mechanize interface, but instead of a traditional, built-in HTTP client it drives a full-fledged Chrome instance (via the Chrome DevTools protocol).

You can install it with the following command:

cpan WWW::Mechanize::Chrome

Now you can use the module to load websites, access page information, and interact with it just like in any regular browser:

use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new();

# Load the provided URL
$mech->get('https://www.scrapingbee.com/');

# Wait for the page to have fully loaded
$mech->wait_for_page();

# Print the page title
print "Page title: " . $mech->title . "\n";

# Take a screenshot of the current page (returned as raw PNG data)
my $png = $mech->content_as_png();
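
content_as_png() only returns the image data; it does not write anything to disk. If you want to keep the screenshot, you still need to save it yourself, for instance like this (the file name screenshot.png is just an example):

# Write the PNG data to a file
open my $fh, '>:raw', 'screenshot.png' or die "Cannot write screenshot: $!";
print {$fh} $png;
close $fh;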

Conclusion

Web scraping has a tremendous number of use cases in any data-related field, be that for individual developers or large-scale corporate projects, ranging from filling gaps in API data to enhancing business intelligence. With its robust text-handling capabilities, Perl is an excellent choice for any web scraping task and will not disappoint.

This article aims to provide a first introduction to scraping the web with Perl and to show how to create a simple scraper with just a few lines of code. However, there are typically many more aspects to take into consideration, from handling sites that rely on advanced JavaScript to using HTTP proxies and making sure your requests are not blocked by anti-scraping technologies. To address these issues and more, please feel free to check out our free scraping trial and get 1,000 API calls on the house.

Alexander M

Alexander is a software engineer and technical writer with a passion for everything network-related.