Introduction to Web Scraping With Java

Kevin Sahin | 25 November 2024 (updated) | 13 min read


Is there a website you'd like to scrape data from regularly and in a structured fashion, but which does not yet offer a standardised API such as a JSON REST interface? Don't fret, web scraping with Java comes to the rescue.

💡 Interested in web scraping with Java? Check out our guide to the best Java web scraping libraries

Welcome to the world of web scraping

Web scraping, or web crawling, refers to the process of fetching and extracting arbitrary data from a website. This involves downloading the site's HTML code, parsing that HTML code, and extracting the desired data from it.

If the aforementioned REST API is not available, scraping is typically the only way to collect information from a site. It is a common business practice for obtaining data in an automated fashion and can be used for any subject of your choice: for example, to analyse changes in your competitor's pricing scheme, to aggregate the latest stories from different news agencies, or to collect address information for your latest marketing campaign.

As a scraper essentially does what a standard web browser does, there are barely any limits as to what information you can collect. The trickiest part is typically obtaining information from multimedia content (i.e. images, audio, video).

💡 Check out the advanced data extraction features of ScrapingBee and how they can help you to handle even more complex site setups.

In this post, we will walk you through how to set up a basic web crawler in Java, fetch a site, parse and extract the data, and store everything in a JSON structure.

Prerequisites

As we are going to use Java for our demo project, please make sure you have the following prerequisites in place, before proceeding.

Java 23 SDK
A suitable Java IDE for development (e.g. IntelliJ IDEA)
If not part of your IDE, Maven for dependency management

Of course, having a basic understanding of Java and the concept of XPath will also speed things up.

Java dependencies

Please make sure to add HtmlUnit as a dependency to your pom.xml file:

<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit</artifactId>
	<version>2.70.0</version>
</dependency>

as well as FasterXML's Jackson Databind:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.18.0</version>
</dependency>

💡 If you are using Eclipse, it is recommended to set the maximum length of the output in the detail pane (when you click in the variables tab) so that you will receive the entire HTML code of the page.

Let's scrape Craigslist

For our example here, we're going to focus on Craigslist and would like to get a list of all classified ads in New York, selling an iPhone 13.

As Craigslist doesn't expose a public API, our only option is to go down the scraping path and extract the data straight from the site. For that, we will fetch the site, collect the titles, prices, locations, and URLs of all items, and export it all as a JSON structure.

Finding the right search URL

First, let's take a look at what happens when you search for something on Craigslist.

Go to https://newyork.craigslist.org
Enter iphone 13 in the search box on the left
Press Enter

You'll be immediately redirected to the search page with all the found products. For the purpose of this example, the URL in the address bar will be the most interesting thing for now.

https://newyork.craigslist.org/search/moa?query=iphone%2013

At this point, we have established what the search URL for this particular query is, and what parameters (i.e. query) it requires.

You can now open your favourite IDE; it is time to code.
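As a small sketch of how that search URL can be assembled from a search term: the class and method names below (SearchUrlSketch, buildSearchUrl) are illustrative and not part of the original project. Note that URLEncoder applies form encoding, so the space becomes + instead of the %20 shown in the address bar. The later examples in this post encode the term the same way.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchUrlSketch {

    // Builds the search URL for a given term, using the host and path
    // observed in the browser's address bar
    static String buildSearchUrl(String searchQuery) {
        String encoded = URLEncoder.encode(searchQuery, StandardCharsets.UTF_8);
        return "https://newyork.craigslist.org/search/moa?query=" + encoded;
    }

    public static void main(String[] args) {
        // prints https://newyork.craigslist.org/search/moa?query=iphone+13
        System.out.println(buildSearchUrl("iphone 13"));
    }
}
```

Passing a Charset instead of the charset name as a string avoids the checked UnsupportedEncodingException of the older overload.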

Fetching the page

To make a request to a site, you'll first need an HTTP client to send that request. It just so happens that HtmlUnit comes with a default class for that task, appropriately called WebClient.

There are quite a few parameters you can tweak to customise its behaviour (e.g. proxy settings, CSS support, and more), but for our example we will use the bare configuration without CSS and JavaScript support.

// Define the search term
String searchQuery = "iphone 13";

// Instantiate the client
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);

// Set up the URL with the search term and send the request
String searchUrl = "https://newyork.craigslist.org/search/moa?query=" + URLEncoder.encode(searchQuery, StandardCharsets.UTF_8);
HtmlPage page = client.getPage(searchUrl);

At this point, we have the site's content in the page variable and could access the entire document with the asXml() method. However, we are more interested in particular data items of the HTML document.

To find them, we inspect the site's structure in the browser, using the Inspector feature of the developer tools (F12).

Based on this, we now know that every item is an <li> tag inside a list container, and that each <li> tag has the HTML class cl-static-search-result assigned.

Extracting data

With this knowledge, we can now use XPath to access the returned products and their item properties. HtmlUnit provides a number of convenience methods for this purpose (e.g. getHtmlElementById, getFirstByXPath, getByXPath), which allow you to use an XPath expression to precisely access data in the fetched document. Please refer to the Javadoc of HtmlUnit for more information on the supported methods.

Let's go through the following code step-by-step:

We are fetching all aforementioned <li> tags with the class cl-static-search-result and iterating through them
For each htmlItem, we are going to look
1. for the product details, under an <a> tag
2. for the product title, under a <div> tag with the class title
3. for the product price, under a <div> tag with the class price
4. for the product location, under a <div> tag with the class location
Once we have the details, title, price, and location, we are printing them on the screen.

public static void main(String[] args) throws IOException {
    var searchQuery = "iphone 13";
    var searchUrl = "https://newyork.craigslist.org/search/moa?query=%s".formatted(URLEncoder.encode(searchQuery, StandardCharsets.UTF_8));

    System.out.println("searchUrl = " + searchUrl);

    try (var client = new WebClient()) {
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.getOptions().setThrowExceptionOnScriptError(false);

        HtmlPage page = client.getPage(searchUrl);
        for (var htmlItem : page.<HtmlElement>getByXPath("//li[contains(@class,'cl-static-search-result')]")) {
            HtmlAnchor itemAnchor = htmlItem.getFirstByXPath(".//a");
            HtmlElement itemTitle = htmlItem.getFirstByXPath(".//div[@class='title']");
            HtmlElement itemPrice = htmlItem.getFirstByXPath(".//div[@class='price']");
            HtmlElement itemLocation = htmlItem.getFirstByXPath(".//div[@class='location']");

            if (itemAnchor != null && itemTitle != null) {
                System.out.printf("Name: %s, Price: %s, Location: %s, URL: %s%n", itemTitle.asNormalizedText(), (itemPrice == null) ? "N/A" : itemPrice.asNormalizedText(), (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText(), itemAnchor.getHrefAttribute());
            }
        }
    }
}

Voilà, we have parsed the whole page and managed to extract the individual product items!
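If you would like to try the XPath expressions without hitting the live site, the following sketch evaluates them with the XPath engine that ships with the JDK (javax.xml.xpath) rather than with HtmlUnit. The sample markup is hypothetical and heavily simplified, and it has to be well-formed XML here, whereas HtmlUnit also copes with real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical markup that only mimics the structure described above
    static final String SAMPLE = """
            <ol>
              <li class="cl-static-search-result">
                <a href="https://example.org/item/1">
                  <div class="title">iPhone 13 128GB</div>
                  <div class="details">
                    <div class="price">$400</div>
                    <div class="location">Brooklyn</div>
                  </div>
                </a>
              </li>
              <li class="other">ignored</li>
            </ol>
            """;

    static List<String> extract(String xml) {
        try {
            var document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            var lines = new ArrayList<String>();
            var items = (NodeList) xpath.evaluate(
                    "//li[contains(@class,'cl-static-search-result')]", document, XPathConstants.NODESET);
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                // Expressions starting with "." are evaluated relative to the current <li>
                String title = xpath.evaluate(".//div[@class='title']", item);
                String price = xpath.evaluate(".//div[@class='price']", item);
                String location = xpath.evaluate(".//div[@class='location']", item);
                String url = xpath.evaluate(".//a/@href", item);
                lines.add("%s | %s | %s | %s".formatted(title, price, location, url));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate the sample document", e);
        }
    }

    public static void main(String[] args) {
        // prints: iPhone 13 128GB | $400 | Brooklyn | https://example.org/item/1
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

The second <li> is skipped because its class attribute does not contain cl-static-search-result, which is the same filtering the HtmlUnit code relies on.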

💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Please check out the documentation here for more information.

Converting to JSON

While the previous example provided an excellent overview of how to quickly scrape a website, we could take this a step further and convert the data into a structured and machine-readable format, such as JSON.

For that, we just need to make small changes to our code and introduce a special object to hold our results.

POJO

We add an additional POJO (Plain Old Java Object) class, which will represent the JSON object and hold our data. In modern Java, we can use the record keyword to define a simple data holder class:

record Item(String title, BigDecimal price, String location, String url) {
}
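As a quick illustration of what the record keyword generates, here is a minimal sketch with made-up sample values. The wrapper class RecordSketch is only there to make the example self-contained.

```java
import java.math.BigDecimal;

class RecordSketch {

    // Same shape as the Item record above
    record Item(String title, BigDecimal price, String location, String url) {
    }

    public static void main(String[] args) {
        var item = new Item("iPhone 13", new BigDecimal("400"), "Brooklyn", "https://example.org/item/1");

        // Components are read through generated accessor methods, e.g. title()
        System.out.println(item.title());

        // equals(), hashCode(), and toString() are generated as well
        System.out.println(item);
    }
}
```

The second println prints Item[title=iPhone 13, price=400, location=Brooklyn, url=https://example.org/item/1]. Jackson has supported serialising records since version 2.12, so no further annotations should be needed for the mapping step.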

Mapping

Now, we need to instantiate our mapper:

private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

and use it to convert our data into JSON:

// Skip listings without a price, as the BigDecimal conversion below needs one
if (itemAnchor != null && itemTitle != null && itemPrice != null) {
    var itemName = itemTitle.asNormalizedText();
    var itemUrl = itemAnchor.getHrefAttribute();
    var itemPriceText = itemPrice.asNormalizedText();
    var itemLocationText = (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText();

    // The "," is a thousands separator (e.g. "$1,200"), so it is removed rather than turned into "."
    var item = new Item(itemName, new BigDecimal(itemPriceText.replace("$", "").replace(",", "")), itemLocationText, itemUrl);
    System.out.println("item = " + OBJECT_MAPPER.writeValueAsString(item));
}
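The price conversion is the most fragile part of this mapping, so it can be worth isolating. The following is a minimal sketch of such a helper. The names PriceSketch and parsePrice are hypothetical, and it assumes US-style formatting with a $ prefix and , as thousands separator.

```java
import java.math.BigDecimal;
import java.util.Optional;

class PriceSketch {

    // Converts a displayed price such as "$1,200" into a BigDecimal.
    // Returns an empty Optional if the text is missing or not a number.
    static Optional<BigDecimal> parsePrice(String text) {
        if (text == null) {
            return Optional.empty();
        }
        String cleaned = text.replace("$", "").replace(",", "").trim();
        try {
            return Optional.of(new BigDecimal(cleaned));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("$1,200")); // prints Optional[1200]
        System.out.println(parsePrice("free"));   // prints Optional.empty
    }
}
```

Returning an Optional leaves it to the caller to decide whether a listing without a usable price is skipped or stored without one.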

Let's take it a step further

So far, our project has given us a quick overview of what web scraping is, its fundamental concepts, and how to set up our own crawler using Java and XPath.

For now, it's a relatively simple example, taking a defined search term and returning as JSON all the products sold in the area of New York City. What if we wanted to get data from more than one city? Let's check it out.

Multi-city support

If you look closely at the URL we previously used for the search, you'll notice that Craigslist catalogues its ads by city and keeps that information as part of the URL's hostname.

For example, our ads for New York City are all behind the following URL:

https://newyork.craigslist.org

If we wanted to fetch the ads relevant to Boston, we'd be using https://boston.craigslist.org instead.

Now, let's say, we'd like to retrieve all iPhone 13 ads for the East Coast and, specifically, for New York, Boston, and Washington D.C. In that case, we'd simply revisit our code from Fetching the page and extend it a bit, to support the other cities as well:

public static void main(String[] args) throws IOException {
    var searchQuery = "iphone 13";
    var cities = List.of("newyork", "boston", "washingtondc");

    try (var client = new WebClient()) {
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.getOptions().setThrowExceptionOnScriptError(false);

        for (String city : cities) {
            var searchUrl = "https://%s.craigslist.org/search/moa?query=%s".formatted(city, URLEncoder.encode(searchQuery, StandardCharsets.UTF_8));

            System.out.println("searchUrl = " + searchUrl);

            HtmlPage page = client.getPage(searchUrl);
            for (var htmlItem : page.<HtmlElement>getByXPath("//li[contains(@class,'cl-static-search-result')]")) {
                HtmlAnchor itemAnchor = htmlItem.getFirstByXPath(".//a");
                HtmlElement itemTitle = htmlItem.getFirstByXPath(".//div[@class='title']");
                HtmlElement itemPrice = htmlItem.getFirstByXPath(".//div[@class='price']");
                HtmlElement itemLocation = htmlItem.getFirstByXPath(".//div[@class='location']");

                // Skip listings without a price, as the BigDecimal conversion below needs one
                if (itemAnchor != null && itemTitle != null && itemPrice != null) {
                    var itemName = itemTitle.asNormalizedText();
                    var itemUrl = itemAnchor.getHrefAttribute();
                    var itemPriceText = itemPrice.asNormalizedText();
                    var itemLocationText = (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText();

                    // The "," is a thousands separator (e.g. "$1,200"), so it is removed rather than turned into "."
                    var item = new Item(itemName, new BigDecimal(itemPriceText.replace("$", "").replace(",", "")), itemLocationText, itemUrl);
                    System.out.println("item = " + OBJECT_MAPPER.writeValueAsString(item));
                }
            }
        }
    }
}

What we did here was add a list of cities and iterate over it. Voilà, the request now runs for each city individually!

Output customisation

Your crawler may also have to support different output formats.

For example, you might have to support JSON and CSV. In that case, you could simply add a switch to your code, which changes the output format depending on its value:

public static void main(String[] args) {
   var outputType = args.length == 1 ? args[0].toLowerCase() : "";
   var searchQuery = "iphone 13";
   var cities = List.of("newyork", "boston", "washingtondc");

   var results = fetchCities(cities, searchQuery);

   switch (outputType) {
      case "json" -> asJson(results);
      case "csv" -> asCsv(results);
      default -> System.out.println("unknown output type");
   }
}

private static void asCsv(Map<String, List<Item>> results) {
   System.out.println("city,title,price,location,url");
   for (Map.Entry<String, List<Item>> entry : results.entrySet()) {
      for (Item item : entry.getValue()) {
         System.out.printf("%s,%s,%s,%s,%s%n", entry.getKey(), item.title(), item.price(), item.location(), item.url());
      }
   }
}

The fetchCities method would then return a map with the list of items for each city, and the asJson() and asCsv() methods would convert the data into the respective format.

If you now pass json as the first argument to your crawler call, it will return a JSON object for each entry (just as we originally showed under Mapping). If you pass csv, it will print a comma-separated line for each entry instead.
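One thing to keep in mind with the CSV output is that ad titles can themselves contain commas or quotes, which would shift the columns. A minimal sketch of field escaping, following the quoting convention described in RFC 4180, could look like this. The names CsvSketch, escape, and toCsvLine are hypothetical helpers and not part of the original project.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class CsvSketch {

    // Wraps a field in double quotes if it contains a comma, a double quote,
    // or a line break, and doubles any embedded double quotes
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuoting ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the escaped fields into one CSV line
    static String toCsvLine(String... fields) {
        return Arrays.stream(fields).map(CsvSketch::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints: newyork,"iPhone 13, like new",400
        System.out.println(toCsvLine("newyork", "iPhone 13, like new", "400"));
    }
}
```

For anything beyond a small script, a dedicated CSV library is usually the safer choice.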

Increasing scale with parallelisation

If you are planning to scrape a large number of sites, you might face the issue of slow performance. In that case, you could consider parallelising your requests to speed up the process.

Let's start by expanding our previous example by adding additional cities to our list:

var cities = List.of("newyork", "boston", "washingtondc", "losangeles", "chicago", "sanfrancisco", "seattle", "miami", "dallas", "denver");

Now, let's measure the time it takes to fetch all the cities sequentially by wrapping the original code into a timed method:

public static void main(String[] args) {
    timed(() -> {
        var outputType = args.length == 1 ? args[0].toLowerCase() : "";
        var searchQuery = "iphone 13";
        var cities = List.of("newyork", "boston", "washingtondc", "losangeles", "chicago", "sanfrancisco", "seattle", "miami", "dallas", "denver");
        var results = fetchCities(cities, searchQuery);
        switch (outputType) {
            case "json" -> asJson(results);
            case "csv" -> asCsv(results);
            default -> System.out.println("unknown output type");
        }
    });
}

private static void timed(Runnable action) {
    var start = System.currentTimeMillis();
    action.run();
    var end = System.currentTimeMillis();
    System.out.printf("time = %dms%n", end - start);
}

It turns out it runs in around 16 seconds:

time = 15861ms

Not great, not terrible, but it would not be acceptable for a large number of cities.

Let's parallelise the requests by using Java's virtual threads, which are great for I/O-bound tasks.

To do this, we need to change the fetchCities method to scrape each city on a separate virtual thread. This is done by wrapping the code in a CompletableFuture and running it on an executor created with Executors.newVirtualThreadPerTaskExecutor():

private static Map<String, List<Item>> fetchCities(List<String> cities, String searchQuery) {
    // One executor for all cities; closing it waits for the submitted tasks to finish
    try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
        return cities.stream().map(city -> Map.entry(city, CompletableFuture.supplyAsync(() -> {
            var searchUrl = "https://%s.craigslist.org/search/moa?query=%s".formatted(city, URLEncoder.encode(searchQuery, StandardCharsets.UTF_8));
            System.out.println("fetching: " + searchUrl);
            // WebClient is not thread-safe, so every task gets its own instance
            try (var client = new WebClient()) {
                client.getOptions().setCssEnabled(false);
                client.getOptions().setJavaScriptEnabled(false);
                client.getOptions().setThrowExceptionOnFailingStatusCode(false);
                client.getOptions().setThrowExceptionOnScriptError(false);

                var results = new ArrayList<Item>();
                HtmlPage page = client.getPage(searchUrl);
                for (var htmlItem : page.<HtmlElement>getByXPath("//li[contains(@class,'cl-static-search-result')]")) {
                    HtmlAnchor itemAnchor = htmlItem.getFirstByXPath(".//a");
                    HtmlElement itemTitle = htmlItem.getFirstByXPath(".//div[@class='title']");
                    HtmlElement itemPrice = htmlItem.getFirstByXPath(".//div[@class='price']");
                    HtmlElement itemLocation = htmlItem.getFirstByXPath(".//div[@class='location']");

                    // Skip listings without a price, as the BigDecimal conversion below needs one
                    if (itemAnchor != null && itemTitle != null && itemPrice != null) {
                        var itemName = itemTitle.asNormalizedText();
                        var itemUrl = itemAnchor.getHrefAttribute();
                        var itemPriceText = itemPrice.asNormalizedText();
                        var itemLocationText = (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText();
                        // The "," is a thousands separator (e.g. "$1,200"), so it is removed
                        var item = new Item(itemName, new BigDecimal(itemPriceText.replace("$", "").replace(",", "")), itemLocationText, itemUrl);
                        results.add(item);
                    }
                }
                return results;
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }, executor))).toList() // toList() submits all tasks before the first join()
                .stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().join()));
    }
}

Now, if you run it again, you'll see that the time has been reduced significantly:

time = 3473ms
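The effect can also be reproduced without any network access. In the following sketch, a sleep stands in for the slow HTTP request. The 200 ms delay and the names VirtualThreadSketch, fakeFetch, and fetchAll are assumptions made for the illustration, and it needs Java 21 or later.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;

class VirtualThreadSketch {

    // Stands in for a slow network request (hypothetical 200 ms latency)
    static String fakeFetch(String city) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "results for " + city;
    }

    static List<String> fetchAll(List<String> cities) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // Submit everything first, so that all tasks wait concurrently ...
            var futures = cities.stream()
                    .map(city -> CompletableFuture.supplyAsync(() -> fakeFetch(city), executor))
                    .toList();
            // ... and only then collect the results, in the original order
            return futures.stream().map(CompletableFuture::join).toList();
        }
    }

    public static void main(String[] args) {
        var cities = List.of("newyork", "boston", "washingtondc", "losangeles", "chicago");
        long start = System.nanoTime();
        var results = fetchAll(cities);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        results.forEach(System.out::println);
        System.out.printf("time = %dms%n", elapsedMs);
    }
}
```

Run sequentially, five such calls would take at least a second. Here the total should stay close to the duration of a single call, because the virtual threads spend their waiting time in parallel.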

Next steps

The examples so far provided a bit of insight into how to scrape Craigslist, but there are certainly still a few areas that could be improved:

Pagination handling
Support for more than one criterion
and more

Of course, there's a lot more to scraping than just fetching a single HTML page and running a few XPath expressions. Especially when it comes to distributed scraping, fully handling JavaScript, and CAPTCHAs, the topic can quickly become very complex. If you would rather have these things handled automatically, please check out our web scraping API. The first 1,000 API calls are on us!

Even more

We are almost at the end of this post, so thanks for staying with us until now, but we still have a couple of recommended articles for you.

Don't get blocked

Also check out our recent blog post on Web Scraping without getting blocked, which goes into detail on how to optimise your scraping approach in order to avoid being blocked by anti-scraping measures.

Scraping with Chrome and full JavaScript support

While HtmlUnit is a wonderful headless browser, you may still want to check out our other article on the Introduction to Headless Chrome, as it provides additional insight into how to use Chrome's headless mode, which features full JavaScript support, just as you'd expect from your daily driver browser.

One CSS selector, please

These days, CSS selectors are used for much more than just applying colours and spacing. Very often they are used in the very same context as XPath expressions, and if you happen to prefer CSS selectors, you should definitely also check out our tutorial on HTML parsing with Java using jsoup.

Python maybe?

Python has been one of the most popular languages for years at this point and is, in fact, commonly used for web scraping as well. If Python is your choice of language, you might just like our other guide on using Python for scraping web pages.

Or Groovy?

If you like Java, you're going to LOVE Groovy. Check out our guide to web scraping with Groovy. You may also like our guide about web scraping with Kotlin.

What about Scala?

Of course, we didn't forget about web scraping with Scala. You should check it out!

Code sample

You can find the full source code of this example in our GitHub repository.

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.