Best 10 Java Web Scraping Libraries

27 August 2024 | 10 min read

In this article, I will show you the most popular Java web scraping libraries and help you choose the right one. Web scraping is the process of extracting data from websites. At first glance, you might think that all you need is a standard HTTP client and basic programming skills, right?

In theory, yes, but you will quickly face challenges like session handling, cookies, dynamically loaded content and JavaScript execution, and even anti-scraping measures (for example, CAPTCHAs, IP blocking, and rate limiting).

This is where web scraping libraries come in handy. They provide high-level APIs that abstract away the complexity of web scraping, allowing you to focus on extracting the data you need.

I've been playing with Java for over 14 years, and I've seen libraries come and go - let me give you a helping hand in choosing from the plethora of available options!

1. ScrapingBee

ScrapingBee is a comprehensive platform that's designed to make web scraping trivial. It enables users to deal with common scraping challenges, including the most demanding ones like:

  • Avoiding CAPTCHAs
  • Scraping JavaScript-heavy websites
  • Rotating IP addresses
  • Handling rate limiting
  • and more

Under the hood, it uses headless browsers to mimic real user interactions.

It has dedicated support for no-code web scraping and Google search results scraping, and it can even return screenshots of the rendered website rather than raw HTML!

Technically speaking, ScrapingBee is a platform, not a library, which makes it technology-agnostic and usable directly without any external libraries.

Here's a quick example of how to use ScrapingBee with Java:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws Exception {

        var request = HttpRequest.newBuilder()
          .uri(URI.create("https://app.scrapingbee.com/api/v1/?api_key=YOUR_API_KEY"
            + "&url=" + URLEncoder.encode("https://example.com", StandardCharsets.UTF_8)
            + "&extract_rules=" + URLEncoder.encode("""
            {"post-title": "h1"}
            """, StandardCharsets.UTF_8)))
          .GET()
          .build();

        try (var client = HttpClient.newHttpClient()) {
            var response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }
}
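
The API can also return those screenshots mentioned earlier. Here's a minimal sketch of saving one to disk - it assumes the screenshot parameter described in the documentation and writes the binary response straight to a file:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class ScreenshotExample {
    public static void main(String[] args) throws Exception {

        // screenshot=true asks the API for a PNG of the rendered page instead of HTML
        var request = HttpRequest.newBuilder()
          .uri(URI.create("https://app.scrapingbee.com/api/v1/?api_key=YOUR_API_KEY"
            + "&url=" + URLEncoder.encode("https://example.com", StandardCharsets.UTF_8)
            + "&screenshot=true"))
          .GET()
          .build();

        try (var client = HttpClient.newHttpClient()) {
            // the body is binary, so we stream it to a file rather than a String
            var response = client.send(request, HttpResponse.BodyHandlers.ofFile(Path.of("screenshot.png")));
            System.out.println("Saved screenshot to: " + response.body());
        }
    }
}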

All the possible configuration options can be found in the documentation.

Ready to simplify your web scraping tasks? Sign up now to get your free API key and enjoy 1000 free credits to explore all that ScrapingBee has to offer!

2. Jsoup

Jsoup is one of the most popular Java libraries used for web scraping. It's lightweight and focuses mainly on raw HTML parsing and data extraction.

It will help you fetch webpage data, but won't handle advanced use cases like:

  • JavaScript execution
  • Advanced anti-scraping measures like IP rotation, CAPTCHA solving

Using Jsoup for web scraping is straightforward. To begin, you establish a connection to the target webpage using the connect() method. Once connected, you can retrieve the HTML content and start extracting the data you need.

Here's a quick example:

public static void main(String[] args) throws Exception {
    Document doc = Jsoup.connect("https://example.com").get();
    Elements h1Elements = doc.select("h1");
    h1Elements.forEach(h1 -> System.out.println("Title: " + h1.text()));
}

However, if you just want to parse HTML, you can do that as well without making any HTTP calls:

public static void main(String[] args) {
    Document doc = Jsoup.parse("<h1>Foo</h1><h2>Bar</h2>");
    Elements h1Elements = doc.select("h1");
    h1Elements.forEach(h1 -> System.out.println("Title: " + h1.text()));
}
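
Beyond selecting tags, Jsoup's CSS selectors and attribute accessors make it easy to pull out links together with their absolute URLs. Here's a minimal sketch (the User-Agent and timeout values are arbitrary):

public static void main(String[] args) throws Exception {
    Document doc = Jsoup.connect("https://example.com")
      .userAgent("Mozilla/5.0")   // some sites reject requests without a browser-like User-Agent
      .timeout(10_000)            // fail fast instead of hanging indefinitely
      .get();

    // select every anchor that has an href attribute
    for (Element link : doc.select("a[href]")) {
        // "abs:href" resolves the link against the page's base URI
        System.out.printf("%s -> %s%n", link.text(), link.attr("abs:href"));
    }
}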

You can find its source code on GitHub.

3. HtmlUnit

HtmlUnit is a "GUI-less browser for Java" with some JavaScript support. As the name suggests, it was designed for testing but will do the trick for general web scraping use cases.

HtmlUnit is a good choice for scraping dynamic websites that rely heavily on JavaScript, as it can execute JavaScript code and render the page as a real browser would.

Downsides:

  • HtmlUnit does not support IP rotation, making it vulnerable to IP blocking if a website implements rate limiting.
  • It is not as fast as other libraries like Jsoup, as it requires a full browser environment to execute JavaScript.

Here's a quick example of how to use HtmlUnit:

public static void main(String[] args) throws IOException {
    try (var webClient = new WebClient()) {
        HtmlPage page = webClient.getPage("https://example.com");
        HtmlAnchor anchor = page.getFirstByXPath("//a");
        if (anchor != null) {
            System.out.println("Found links:");
            System.out.printf("- '%s' -> %s%n", anchor.getVisibleText(), anchor.getHrefAttribute());
        } else {
            System.out.println("No anchors found!");
        }
    }
}
// Found links:
// - 'More information...' -> https://www.iana.org/domains/example 
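
For JavaScript-heavy pages, you typically tweak the WebClient options and give background scripts some time to finish before reading the DOM. Here's a minimal sketch (the 10-second wait is an arbitrary value):

public static void main(String[] args) throws IOException {
    try (var webClient = new WebClient()) {
        // enable JavaScript, but don't abort the crawl on script errors
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);

        HtmlPage page = webClient.getPage("https://example.com");

        // give background JavaScript (e.g. AJAX calls) up to 10 seconds to complete
        webClient.waitForBackgroundJavaScript(10_000);

        System.out.println(page.getTitleText());
    }
}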

You can find its source code on GitHub.

4. Selenium

Selenium is a browser automation framework designed for end-to-end testing but can also be leveraged for web scraping!

Controlling real browsers is Selenium's biggest advantage, but it also has a couple of downsides:

  • Scripts are often fragile and break easily when a web application's UI changes
  • Selenium uses real browsers; therefore, it's quite resource-intensive
  • It's not self-sufficient - it requires additional setup to interact with installed browsers

Here's a quick example of how to use Selenium with Java (remember to install a browser and a matching driver first):

public static void main(String[] args) {
    // requires chrome and chrome-driver installed
    System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
    WebDriver browser = new ChromeDriver();
    try {
        browser.get("https://www.example.com");
        var pageTitle = browser.getTitle();
        System.out.println(pageTitle);
    } finally {
        browser.quit();
    }
}
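
Because Selenium renders pages just like a user's browser, you usually need to wait for dynamically loaded elements before reading them. Here's a minimal sketch using Selenium 4's explicit waits and Chrome's headless mode (the 10-second timeout is arbitrary, and the driver is assumed to be set up as in the previous snippet):

public static void main(String[] args) {
    var options = new ChromeOptions();
    options.addArguments("--headless=new"); // run Chrome without a visible window
    WebDriver browser = new ChromeDriver(options);
    try {
        browser.get("https://www.example.com");
        // wait up to 10 seconds for the <h1> element to become visible
        var wait = new WebDriverWait(browser, Duration.ofSeconds(10));
        var heading = wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("h1")));
        System.out.println(heading.getText());
    } finally {
        browser.quit();
    }
}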

You can find its source code on GitHub.

5. Crawler4j

Crawler4j is a library dedicated to small and medium-scale web crawling. It supports robots.txt files out of the box.

Downsides:

  • Not-so-user-friendly visitor-based API
  • It does not handle JavaScript, limiting its effectiveness against dynamic pages
  • The project is unmaintained - the last release was in May 2018
  • Relies on a legacy com.sleepycat:je:5.0.84 artifact, which is unavailable in modern Maven repositories
  • No advanced anti-scraping measures like IP rotation, CAPTCHA solving

Here's a quick example of how to use Crawler4j:

public class BasicCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return href.startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root");
        config.setPolitenessDelay(1000);
        config.setMaxDepthOfCrawling(2);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.com");
        controller.start(BasicCrawler.class, 1);
    }
}

You can find its source code on GitHub.

6. Apache Nutch

Apache Nutch is an open-source web crawler and search engine software based on Apache Hadoop.

Technically, it's not a library but a command-line tool; still, it's worth mentioning due to its popularity.

Downsides:

  • It's quite complex, requires a lot of configuration, and might be overkill for simple web scraping tasks - see our guide on getting started with Apache Nutch
  • It does not handle JavaScript/AJAX
  • It's not self-sufficient - it requires additional setup to interact with installed browsers
  • No advanced anti-scraping measures like IP rotation, CAPTCHA solving

You can find its source code on GitHub.

7. Jaunt

Jaunt is a Java library used for web scraping and data extraction from HTML and XML documents.

Jaunt is popular for its simplicity and ease of use - it was designed to hide unnecessary complexity while still providing full DOM-level control.

Here's a quick example of how to use Jaunt:

public static void main(String[] args) throws Exception {
    UserAgent userAgent = new UserAgent();
    userAgent.visit("https://example.com");
    Elements links = userAgent.doc.findEach("<a>");
    for (Element link : links) {
        System.out.println("Link: " + link.getAt("href"));
    }
}

Downsides:

  • Limited JavaScript support
  • Not available via Maven repositories, making it hard to integrate into modern Java projects
  • No advanced anti-scraping measures like IP rotation, CAPTCHA solving
  • The free version expires after 30 days

8. WebMagic

WebMagic is a web crawling framework that's designed to be simple and flexible. It's built on top of Apache HttpClient and HtmlUnit, providing a high-level API for web scraping.

Downsides:

  • WebMagic doesn’t natively support JavaScript execution, so it might struggle with dynamic content and some anti-scraping measures.
  • No advanced anti-scraping measures like IP rotation, CAPTCHA solving

Here's a quick example of how to use WebMagic:

class WebMagicExample {

    public static void main(String[] args) {
        Spider.create(new ExampleCrawler())
          .addUrl("http://example.com")
          .thread(5)
          .run();
    }

    public static class ExampleCrawler implements PageProcessor {
        private final Site site = Site.me()
          .setRetryTimes(3)
          .setSleepTime(1000)
          .setTimeOut(10000);

        @Override
        public void process(Page page) {
            String title = page.getHtml().xpath("//title/text()").toString();
            System.out.println("Title: " + title);

            page.addTargetRequests(page.getHtml().links().regex("(https://www.example.com/\\w+)").all());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }
}
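
In larger crawls, extracted fields are usually handed to a Pipeline instead of being printed inside the processor. A minimal sketch: store the value with page.putField(...) in process() and attach WebMagic's built-in ConsolePipeline when creating the Spider:

public static void main(String[] args) {
    Spider.create(new ExampleCrawler())
      .addUrl("http://example.com")
      // every field stored via page.putField(...) is passed to the pipeline
      .addPipeline(new ConsolePipeline())
      .thread(5)
      .run();
}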

You can find its source code on GitHub.

9. Gecco

Gecco is an old-school lightweight web scraping framework that's designed to be simple and flexible. It's built on top of Apache HttpClient and Jsoup, providing a high-level API for web scraping.

What sets Gecco apart is its simplicity and ease of use. It's designed to hide unnecessary complexity while still providing full DOM-level control.

It can even handle distributed crawling with the help of Redis, and it integrates well with the Spring Framework and HtmlUnit.

However, Gecco is not actively maintained, doesn't work with new Java versions, and might struggle with modern web technologies and anti-scraping measures.

Here's a quick example of how to use Gecco:

First, we need to define the model class:

@Gecco(matchUrl="https://www.example.com", pipelines="consolePipeline")
public class ExamplePage implements HtmlBean {

    private static final long serialVersionUID = 1L;

    @Text
    @HtmlField(cssPath="h1")
    private String heading;

    public String getHeading() {
        return heading;
    }

    public void setHeading(String heading) {
        this.heading = heading;
    }
}

Then, we need to define a pipeline:

package com.example.crawl;

import com.geccocrawler.gecco.pipeline.Pipeline;

public class ConsolePipeline implements Pipeline<ExamplePage> {

    @Override
    public void process(ExamplePage examplePage) {
        System.out.println("Extracted Heading:");
        System.out.println(examplePage.getHeading());
    }
}

And finally, we can run the crawler:

package com.example.crawl;

import com.geccocrawler.gecco.GeccoEngine;

public class Main {
    public static void main(String[] args) {
        GeccoEngine.create()
          .classpath("com.example.crawl") // package to scan for models and pipelines
          .start("https://www.example.com")
          .thread(1)
          .interval(2000)
          .run();
    }
}

Note that models and pipelines are loaded dynamically, so make sure to provide the correct package name.

You can find its source code on GitHub.

10. StormCrawler

StormCrawler is a project that provides a collection of resources for building low-latency, scalable web crawlers on Apache Storm. It's designed to be modular and scalable, making it a good choice for large-scale web crawling and search engine development.

It's worth noting that, at the time of this writing, it's still in the Apache Incubator stage; some features might not work as expected, and the documentation is almost non-existent.

The initial configuration is quite complex, but it can be simplified by using the official Maven Archetype:

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.0

The part responsible for crawling is the CrawlTopology class:

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        String[] startUrls = {"http://example.com"};
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MemorySpout(startUrls));
        builder.setBolt("fetcher", new FetcherBolt()).shuffleGrouping("spout");
        builder.setBolt("parser", new ParserBolt()).shuffleGrouping("fetcher");
        builder.setBolt("status", new SimpleFetcherBolt()).shuffleGrouping("parser");

        Config conf = new Config();
        conf.setDebug(true);

        LocalCluster cluster;
        try {
            cluster = new LocalCluster();
            cluster.submitTopology("simple-crawler", conf, builder.createTopology());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }

        // wait and observe
        try {
            Thread.sleep(30000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        cluster.shutdown();

        return 0;
    }
}

It's important to note that it won't work if you try to run the main() method directly from your IDE. You need to submit the topology using a predefined command:

mvn clean compile exec:java -Dexec.mainClass=com.example.scrawler.CrawlTopology

You can find its source code on GitHub.

Conclusion

As you can see, there are plenty of Java libraries available for web scraping. The best one for you will depend on your specific use case and requirements.

  • Libraries like Jsoup and HtmlUnit are excellent choices for simple tasks and static content due to their ease of use and lightweight nature.
  • Selenium is a powerful option for projects requiring full browser automation, although it can be resource-intensive and may require more setup.
  • Apache Nutch is a robust tool for large-scale web crawling and search engine development, but it may be overkill for web scraping (especially if you want to use it as a library).
  • Jaunt is actively maintained and offers a simple API for web scraping. However, it has limited JavaScript support and is not available via modern Maven repositories.
  • Gecco is a lightweight and flexible framework that's easy to use, but it's not actively maintained and may struggle with modern Java versions.
  • While Crawler4j and WebMagic offer specialized features for crawling, they lack support for modern web technologies and may require additional effort to overcome certain limitations.
  • StormCrawler is a complex but promising project, but it's still in the Apache Incubator stage.

The reason why most libraries struggle with anti-scraping measures is that it often requires a lot of resources and infrastructure to overcome them. For example, bypassing rate-limiting or CAPTCHA might require rotating IP addresses, which can be quite complex to set up. This is why platforms like ScrapingBee shine - they handle all of these challenges for you, so you can focus on scraping and not infrastructure.

Grzegorz Piwowarek

Independent consultant, blogger at 4comprehension.com, trainer, Vavr project lead - teaching distributed systems, architecture, Java, and Golang