Web Scraping with Scala - Easily Scrape and Parse HTML

24 March 2025 (updated) | 13 min read

This tutorial explains how to use three technologies for web scraping with Scala. It first shows how to scrape a static HTML page using jsoup and Scala Scraper, then how to scrape a dynamic HTML website using Selenium.


💡 Interested in web scraping with Java? Check out our guide to the best Java web scraping libraries

Setting Up a Scala Project

The first step is to create a project in Scala. This tutorial uses Scala version 3.6.3 with sbt version 1.10.7.

The fastest way to start from scratch is to use the official Scala 3 giter8 template:

sbt new scala/scala3.g8

Name your project dev.draft. Then, modify the build.sbt file to include the dependencies for jsoup, scala-scraper, and selenium-java:

libraryDependencies ++= Seq(
  "org.scalameta" %% "munit" % "1.1.0",
  "org.jsoup" % "jsoup" % "1.18.3",
  "net.ruippeixotog" %% "scala-scraper" % "3.1.2",
  "org.seleniumhq.selenium" % "selenium-java" % "4.29.0"
)

Basic Web Scraping with jsoup

In the directory src/main/scala/, make a new package called dev.draft, and inside that package, make a file called JsoupScraper.scala with the following contents:

package dev.draft

import org.jsoup.*

@main def hello(): Unit =
  val doc = Jsoup.connect("https://en.wikipedia.org/").get()

Following the jsoup documentation, this particular line calls the connect method of the org.jsoup.Jsoup class to download the web page you're scraping:

val doc = Jsoup.connect("https://en.wikipedia.org/").get()

The Jsoup class in the org.jsoup package is the library's main entry point. The connect method used here downloads the page and parses it into a document in one step. jsoup also offers a parse method with a similar interface, but it only parses HTML you already have (a string or a local file) without downloading anything.
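To illustrate the difference, here's a minimal sketch that parses an HTML string directly with parse, with no network call involved (the markup is invented for the example):

```scala
package dev.draft

import org.jsoup.Jsoup

@main def parseLocal(): Unit =
  // parse works on HTML you already hold in memory (or in a local file),
  // so it never touches the network
  val html = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"
  val doc = Jsoup.parse(html)
  println(doc.title())            // Sample
  println(doc.select("p").text()) // Hello
```

This is handy for experimenting with selectors before pointing your scraper at a live site.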

In this case, doc is an org.jsoup.nodes.Document object that contains the fetched HTML content.

Now, if you call doc.title(), you should see the title of the page:

println(doc.title())

And the result:

Wikipedia, the free encyclopedia

For practical purposes, this tutorial only uses the selection (select) and extraction (text and attr) methods. However, jsoup can do more than run queries: it can also modify HTML documents, and it's even useful for unit-testing generated HTML code.
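As a small, hypothetical sketch of the modification side, the following parses an inline snippet (made up for the demo) and rewrites a link's href in place:

```scala
package dev.draft

import org.jsoup.Jsoup

@main def modifyHtml(): Unit =
  val doc = Jsoup.parse("""<div id="bar"><a href="/old">link</a></div>""")
  // attr(key, value) on a selection sets the attribute on every matched element
  doc.select("#bar a").attr("href", "/new")
  println(doc.select("#bar a").attr("href")) // /new
```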

🤖 Having trouble scraping websites that end up blocking you? Bypass anti-web scraping technology such as Cloudflare with our guide on Web Scraping without getting blocked

Selecting with jsoup

In this tutorial, you'll select from three sections on the Wikipedia home page:

  • In the news
  • On this day
  • Did you know

In your web browser, on the Wikipedia home page, right-click the In the news section and select Inspect from the context menu (available in both Firefox and Chrome). Since the relevant markup is contained in <div id="mp-itn" ...>, you'll use the element id with the value mp-itn to obtain the contents of the section:

val inTheNews = doc.select("#mp-itn b a")

If you use println(inTheNews), the result should look similar to the following:

<a href="/wiki/Tomb_of_Thutmose_II" title="Tomb of Thutmose II">Wadi <span class="nowrap">C-4</span></a>
<a href="/wiki/78th_British_Academy_Film_Awards" title="78th British Academy Film Awards">the British Academy Film Awards</a>
<a href="/wiki/2025_African_Union_Commission_Chairperson_election" title="2025 African Union Commission Chairperson election">is elected</a>
<a href="/wiki/Klaus_Iohannis" title="Klaus Iohannis">Klaus Iohannis</a>
<a href="/wiki/Ilie_Bolojan" title="Ilie Bolojan">Ilie Bolojan</a>
<a href="/wiki/2025_Guatemala_City_bus_crash" title="2025 Guatemala City bus crash">A bus falls off a bridge</a>
<a href="/wiki/Portal:Current_events" title="Portal:Current events">Ongoing</a>
<a href="/wiki/Deaths_in_2025" title="Deaths in 2025">Recent deaths</a>
<a href="/wiki/Wikipedia:In_the_news/Candidates" title="Wikipedia:In the news/Candidates">Nominate an article</a>

Using the same approach, you should find an id with the value mp-otd for the On this day section:

val onThisDay = doc.select("#mp-otd b a")

The result should look similar to the following:

<a href="/wiki/February_22" title="February 22">February 22</a>
<a href="/wiki/Robert_II_of_Scotland" title="Robert II of Scotland">Robert&nbsp;II</a>
<a href="/wiki/1959_Daytona_500" title="1959 Daytona 500">first edition of the Daytona&nbsp;500</a>
<a href="/wiki/Samuel_Byck" title="Samuel Byck">Samuel Byck</a>
<a href="/wiki/North_Korean_Embassy_in_Madrid_incident" title="North Korean Embassy in Madrid incident">A group broke into the North Korean embassy</a>
<a href="/wiki/Peder_Syv" title="Peder Syv">Peder Syv</a>
<a href="/wiki/James_Russell_Lowell" title="James Russell Lowell">James Russell Lowell</a>
<a href="/wiki/Clarence_13X" title="Clarence 13X">Clarence 13X</a>
<a href="/wiki/Bronwyn_Oliver" title="Bronwyn Oliver">Bronwyn Oliver</a>
<a href="/wiki/February_22" title="February 22">February 22</a>
<a href="/wiki/Wikipedia:Selected_anniversaries/February" title="Wikipedia:Selected anniversaries/February">Archive</a>
<a href="https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/" class="extiw" title="mail:daily-article-l">By email</a>
<a href="/wiki/List_of_days_of_the_year" title="List of days of the year">List of days of the year</a>
<a href="/wiki/Wikipedia:Selected_anniversaries" title="Wikipedia:Selected anniversaries">About</a>

Once again, follow the same steps to view the source code and get the contents of the Did you know section, and you should get an id with the value mp-dyk, which you can use to obtain this section's elements:

val didYouKnow = doc.select("#mp-dyk b a")

The result should look similar to the following:

<a href="/wiki/Statue_of_George_Washington_(Trenton,_New_Jersey)" title="Statue of George Washington (Trenton, New Jersey)">George Washington</a>
<a href="/wiki/Directorate_General_of_Higher_Education" title="Directorate General of Higher Education">Directorate General of Higher Education</a>
<a href="/wiki/Tabyana_Ali" title="Tabyana Ali">Tabyana Ali</a>
<a href="/wiki/Combination_Pizza_Hut_and_Taco_Bell" title="Combination Pizza Hut and Taco Bell">Combination Pizza Hut and Taco Bell</a>
<a href="/wiki/Edward_III%27s_Breton_campaign" title="Edward III's Breton campaign">Edward&nbsp;III's Breton campaign</a>
<a href="/wiki/Serpent%27s_Walk" title="Serpent's Walk">Serpent's Walk</a>
<a href="/wiki/Mordka_Towbin" title="Mordka Towbin">one of the first Polish film producers</a>
<a href="/wiki/Fuzhou_Road" title="Fuzhou Road">Fuzhou Road</a>
<a href="/wiki/Nekonomics" title="Nekonomics">nekonomics</a>
<a href="/wiki/Cat_islands_in_Japan" title="Cat islands in Japan">stray cat populations on islands</a>
<a href="/wiki/Wikipedia:Recent_additions" title="Wikipedia:Recent additions">Archive</a>
<a href="/wiki/Help:Your_first_article" title="Help:Your first article">Start a new article</a>
<a href="/wiki/Template_talk:Did_you_know" title="Template talk:Did you know">Nominate an article</a>

To grab data within the HTML document for each section above, you use the select method, which takes a string that represents a CSS selector. You use CSS selector syntax to extract elements from the document that meet the specified search criteria.

The selector criteria are as follows:

  • bar extracts all elements (tags) with that name, for example, <bar />.
  • As you saw before, #bar extracts all elements with that id, for example, <div id="bar">.
  • Similarly, .foo extracts all elements with that class, for example, <div class="foo">.
  • Selectors can be combined to extract elements that meet multiple criteria. For example, bar#baz.foo would match an element <bar> with id="baz" and class="foo".
  • A blank space between selectors (the descendant combinator) matches elements that satisfy the right-hand selector anywhere inside an element matching the left-hand one. For example, bar #baz .foo would match the innermost div in <bar><div id="baz"><div class="foo" /></div></bar>.
  • The > character (the child combinator), as in bar > #baz > .foo, matches only direct children, ignoring grandchildren and anything nested more deeply.

In the three examples above, you combined selectors with spaces, for example, #mp-otd b a. This selector matches every <a> element nested inside a <b> element within the section, which is where the link to each article lives.
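The difference between the descendant and child combinators can be checked against a small inline document (the markup below is invented for the demo):

```scala
package dev.draft

import org.jsoup.Jsoup

@main def selectorDemo(): Unit =
  val doc = Jsoup.parse(
    """<section id="baz">
      |  <b><a href="/a">direct</a></b>
      |  <div><b><a href="/b">nested</a></b></div>
      |</section>""".stripMargin)

  // descendant combinator: matches <a> at any depth under #baz
  println(doc.select("#baz b a").size)   // 2
  // child combinator: only <b> elements that are direct children of #baz
  println(doc.select("#baz > b a").size) // 1
```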

In addition to the select method, jsoup offers sibling-traversal methods for moving through the elements of a selection, such as next, nextAll, nextSibling, and nextElementSibling.
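Here's a short sketch of sibling traversal on a made-up list:

```scala
package dev.draft

import org.jsoup.Jsoup

@main def siblingDemo(): Unit =
  val doc = Jsoup.parse("<ul><li>one</li><li>two</li><li>three</li></ul>")
  // nextElementSibling moves from one element to the next at the same level
  val first = doc.select("li").first()
  println(first.nextElementSibling().text()) // two
  // Elements.next() shifts a whole selection one sibling forward
  println(doc.select("li:first-child").next().text()) // two
```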

Extracting with jsoup

Now that you have the required elements, the next step is to obtain the data inside each element. There are three kinds of content you may want from an HTML element (child elements, text, and attributes), each of which has a corresponding retrieval method in jsoup:

  • The children method is used to obtain child elements.
  • The text method is used to extract strings from elements like <div>No more pre-text</div>.
  • The attr method extracts the foo value from bar="foo" using .attr("bar").

For example, after importing scala.jdk.CollectionConverters.*, the following code builds a list of (title, href) pairs for each element and prints it out:

val otds = for(otd <- onThisDay.asScala) yield (otd.attr("title"), otd.attr("href"))
otds.foreach { case (title, href) => println(s"$title: $href") }

The result is:

February 22: /wiki/February_22
Robert II of Scotland: /wiki/Robert_II_of_Scotland
1959 Daytona 500: /wiki/1959_Daytona_500
Samuel Byck: /wiki/Samuel_Byck
North Korean Embassy in Madrid incident: /wiki/North_Korean_Embassy_in_Madrid_incident
Peder Syv: /wiki/Peder_Syv
James Russell Lowell: /wiki/James_Russell_Lowell
Clarence 13X: /wiki/Clarence_13X
Bronwyn Oliver: /wiki/Bronwyn_Oliver
February 22: /wiki/February_22
Wikipedia:Selected anniversaries/February: /wiki/Wikipedia:Selected_anniversaries/February
mail:daily-article-l: https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/
List of days of the year: /wiki/List_of_days_of_the_year
Wikipedia:Selected anniversaries: /wiki/Wikipedia:Selected_anniversaries

The following command retrieves only the headlines:

val headers = for (otd <- onThisDay.asScala) yield otd.text
headers.foreach(println)

The result:

February 22
Robert II
first edition of the Daytona 500
Samuel Byck
A group broke into the North Korean embassy
Peder Syv
James Russell Lowell
Clarence 13X
Bronwyn Oliver
February 22
Archive
By email
List of days of the year
About

Web Scraping with Scala Scraper

Inside the directory src/main/scala/dev/draft, make a file called ScalaScraper.scala with the following contents:

package dev.draft

import net.ruippeixotog.scalascraper.browser.*

@main def scalascraper(): Unit =
  val browser = JsoupBrowser()
  val doc = browser.get("https://en.wikipedia.org/")
  println("title: " + doc.title)

Following the Scala Scraper documentation, the first step is to call the JsoupBrowser() constructor. As the name suggests, this creates a browser implementation backed by jsoup. Unlike a real browser, JsoupBrowser doesn't run JavaScript; it only works with static HTML. In the code above, you create it with the following line:

val browser = JsoupBrowser()

You can then use the get method of the JsoupBrowser class to download the web page you're going to scrape:

val doc = browser.get("https://en.wikipedia.org/")

The get method downloads the whole page and parses it in one step, while parseFile parses a document you've already downloaded to disk.
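A brief sketch of both entry points; parseString is shown alongside because it's handy for experimenting without a file (the file name below is just a placeholder):

```scala
package dev.draft

import java.io.File
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

@main def parseLocalDocs(): Unit =
  val browser = JsoupBrowser()

  // parseString parses markup you already hold in memory
  val fromString = browser.parseString("<html><head><title>Local</title></head></html>")
  println(fromString.title) // Local

  // parseFile does the same for a document saved on disk;
  // "wikipedia-home.html" is a placeholder path for this sketch
  val saved = new File("wikipedia-home.html")
  if saved.exists then println(browser.parseFile(saved).title)
```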

Similarly to the jsoup example, let's get the page title:

val doc = browser.get("https://en.wikipedia.org/")
println("title: " + doc.title)

And the result:

title: Wikipedia, the free encyclopedia

For practical purposes, as with the jsoup examples, this tutorial only uses the selection operator (>>) and the extraction methods (text and attr). However, Scala Scraper offers much more: it can query and modify HTML documents and can be used to unit-test generated HTML code.

Selecting with Scala Scraper

First, import the DSL methods and conversions; this enables the >> operator:

import net.ruippeixotog.scalascraper.dsl.DSL.*
import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*
import net.ruippeixotog.scalascraper.dsl.DSL.Parse.*

The following code obtains the contents of the In the news section on the Wikipedia home page with Scala Scraper:

val inTheNews = doc >> elementList("#mp-itn b a")

If you use inTheNews.foreach(println), the resulting type and values should look similar to the following:

JsoupElement(<a href="/wiki/Tomb_of_Thutmose_II" title="Tomb of Thutmose II">Wadi <span class="nowrap">C-4</span></a>)
JsoupElement(<a href="/wiki/78th_British_Academy_Film_Awards" title="78th British Academy Film Awards">the British Academy Film Awards</a>)
JsoupElement(<a href="/wiki/2025_African_Union_Commission_Chairperson_election" title="2025 African Union Commission Chairperson election">is elected</a>)
JsoupElement(<a href="/wiki/Klaus_Iohannis" title="Klaus Iohannis">Klaus Iohannis</a>)
JsoupElement(<a href="/wiki/Ilie_Bolojan" title="Ilie Bolojan">Ilie Bolojan</a>)
JsoupElement(<a href="/wiki/2025_Guatemala_City_bus_crash" title="2025 Guatemala City bus crash">A bus falls off a bridge</a>)
JsoupElement(<a href="/wiki/Portal:Current_events" title="Portal:Current events">Ongoing</a>)
JsoupElement(<a href="/wiki/Deaths_in_2025" title="Deaths in 2025">Recent deaths</a>)
JsoupElement(<a href="/wiki/Wikipedia:In_the_news/Candidates" title="Wikipedia:In the news/Candidates">Nominate an article</a>)

To view the contents of the On this day section, use the id with the value mp-otd to obtain its elements:

val onThisDay = doc >> elementList("#mp-otd b a")

The content should look similar to the following:

JsoupElement(<a href="/wiki/February_22" title="February 22">February 22</a>)
JsoupElement(<a href="/wiki/Robert_II_of_Scotland" title="Robert II of Scotland">Robert&nbsp;II</a>)
JsoupElement(<a href="/wiki/1959_Daytona_500" title="1959 Daytona 500">first edition of the Daytona&nbsp;500</a>)
JsoupElement(<a href="/wiki/Samuel_Byck" title="Samuel Byck">Samuel Byck</a>)
JsoupElement(<a href="/wiki/North_Korean_Embassy_in_Madrid_incident" title="North Korean Embassy in Madrid incident">A group broke into the North Korean embassy</a>)
JsoupElement(<a href="/wiki/Peder_Syv" title="Peder Syv">Peder Syv</a>)
JsoupElement(<a href="/wiki/James_Russell_Lowell" title="James Russell Lowell">James Russell Lowell</a>)
JsoupElement(<a href="/wiki/Clarence_13X" title="Clarence 13X">Clarence 13X</a>)
JsoupElement(<a href="/wiki/Bronwyn_Oliver" title="Bronwyn Oliver">Bronwyn Oliver</a>)
JsoupElement(<a href="/wiki/February_22" title="February 22">February 22</a>)
JsoupElement(<a href="/wiki/Wikipedia:Selected_anniversaries/February" title="Wikipedia:Selected anniversaries/February">Archive</a>)
JsoupElement(<a href="https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/" class="extiw" title="mail:daily-article-l">By email</a>)
JsoupElement(<a href="/wiki/List_of_days_of_the_year" title="List of days of the year">List of days of the year</a>)
JsoupElement(<a href="/wiki/Wikipedia:Selected_anniversaries" title="Wikipedia:Selected anniversaries">About</a>)

Likewise, to get the contents of the Did you know section, use the id with the value mp-dyk to obtain its elements:

val didYouKnow = doc >> elementList("#mp-dyk b a")

The content should look similar to the following:

JsoupElement(<a href="/wiki/Statue_of_George_Washington_(Trenton,_New_Jersey)" title="Statue of George Washington (Trenton, New Jersey)">George Washington</a>)
JsoupElement(<a href="/wiki/Directorate_General_of_Higher_Education" title="Directorate General of Higher Education">Directorate General of Higher Education</a>)
JsoupElement(<a href="/wiki/Tabyana_Ali" title="Tabyana Ali">Tabyana Ali</a>)
JsoupElement(<a href="/wiki/Combination_Pizza_Hut_and_Taco_Bell" title="Combination Pizza Hut and Taco Bell">Combination Pizza Hut and Taco Bell</a>)
JsoupElement(<a href="/wiki/Edward_III%27s_Breton_campaign" title="Edward III's Breton campaign">Edward&nbsp;III's Breton campaign</a>)
JsoupElement(<a href="/wiki/Serpent%27s_Walk" title="Serpent's Walk">Serpent's Walk</a>)
JsoupElement(<a href="/wiki/Mordka_Towbin" title="Mordka Towbin">one of the first Polish film producers</a>)
JsoupElement(<a href="/wiki/Fuzhou_Road" title="Fuzhou Road">Fuzhou Road</a>)
JsoupElement(<a href="/wiki/Nekonomics" title="Nekonomics">nekonomics</a>)
JsoupElement(<a href="/wiki/Cat_islands_in_Japan" title="Cat islands in Japan">stray cat populations on islands</a>)
JsoupElement(<a href="/wiki/Wikipedia:Recent_additions" title="Wikipedia:Recent additions">Archive</a>)
JsoupElement(<a href="/wiki/Help:Your_first_article" title="Help:Your first article">Start a new article</a>)
JsoupElement(<a href="/wiki/Template_talk:Did_you_know" title="Template talk:Did you know">Nominate an article</a>)

As before, you combined the selectors with spaces, for example, #mp-otd b a. This selector matches every <a> element nested inside a <b> element within the section, which is where the link to each article lives.

Extracting with Scala Scraper

As with the jsoup example, the next step is to obtain the data inside each element. Scala Scraper's corresponding methods for the three different parts of HTML elements are as follows:

  • The children method is used to extract child elements.
  • The text method is used to extract text content. It extracts the string from elements like <div>No more pre-text</div>.
  • The attr method extracts attributes. For example, you'd use .attr("bar") to get the foo value from bar="foo".

For example, the following command obtains the title and the link href of each element:

val otds = for (otd <- onThisDay) yield (otd >> attr("title"), otd >> attr("href"))

If you print it out with otds.foreach { case (title, href) => println(s"$title: $href") }, the result should look similar to the following:

February 22: /wiki/February_22
Robert II of Scotland: /wiki/Robert_II_of_Scotland
1959 Daytona 500: /wiki/1959_Daytona_500
Samuel Byck: /wiki/Samuel_Byck
North Korean Embassy in Madrid incident: /wiki/North_Korean_Embassy_in_Madrid_incident
Peder Syv: /wiki/Peder_Syv
James Russell Lowell: /wiki/James_Russell_Lowell
Clarence 13X: /wiki/Clarence_13X
Bronwyn Oliver: /wiki/Bronwyn_Oliver
February 22: /wiki/February_22
Wikipedia:Selected anniversaries/February: /wiki/Wikipedia:Selected_anniversaries/February
mail:daily-article-l: https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/
List of days of the year: /wiki/List_of_days_of_the_year
Wikipedia:Selected anniversaries: /wiki/Wikipedia:Selected_anniversaries

The following instruction obtains just the headlines:

val headers = for (otd <- onThisDay) yield otd >> text
headers.foreach(println)

The result:

February 22
Robert II
first edition of the Daytona 500
Samuel Byck
A group broke into the North Korean embassy
Peder Syv
James Russell Lowell
Clarence 13X
Bronwyn Oliver
February 22
Archive
By email
List of days of the year
About
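Besides applying text to each element, Scala Scraper's DSL also provides selector-scoped extractors: text(selector) for the first match and texts(selector) for all matches. A small sketch on an inline document (the markup is invented):

```scala
package dev.draft

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.*
import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

@main def textExtractors(): Unit =
  val browser = JsoupBrowser()
  // a tiny inline document standing in for the Wikipedia page
  val doc = browser.parseString(
    """<div id="mp-otd"><b><a href="/wiki/A">first</a></b>
      |<b><a href="/wiki/B">second</a></b></div>""".stripMargin)

  // text(selector) extracts the text of the first matching element
  println(doc >> text("#mp-otd b a"))           // first
  // texts(selector) extracts the text of every matching element
  println((doc >> texts("#mp-otd b a")).toList) // List(first, second)
```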

Limitations of These Methods

One limitation of jsoup and Scala Scraper is that they can't scrape dynamic websites or single-page applications (SPAs). As mentioned before, JsoupBrowser only fetches and parses static HTML. If you want to scrape a dynamic website or interact with JavaScript, you'll need a browser automation tool like Selenium driving a real browser, typically in headless mode.

Advanced Web Scraping with Selenium

Selenium is a browser automation tool used for building bots, automating end-to-end tests, and scraping.

Below, you'll use Selenium to run the same examples you executed with jsoup and Scala Scraper.

Selenium drives a real browser through a browser-specific driver: chromedriver for Chrome, geckodriver for Firefox. Since version 4.6, Selenium ships with Selenium Manager, which downloads a matching driver automatically, so no manual setup is usually needed. If you prefer to manage the driver yourself, download the latest release for your browser and ensure it can be found on your system PATH.

In the directory src/main/scala/dev/draft, make a file called SeleniumScraper.scala with the following contents:

package dev.draft

import org.openqa.selenium.By
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

import java.time.Duration

@main def selenium(): Unit =
  val driver = new ChromeDriver(new ChromeOptions().addArguments("--headless=new"))
  driver.manage.deleteAllCookies()
  driver.manage.timeouts.pageLoadTimeout(Duration.ofSeconds(40))
  driver.manage.timeouts.implicitlyWait(Duration.ofSeconds(30))
  driver.get("https://en.wikipedia.org/")

  val inTheNews = driver.findElement(By.cssSelector("#mp-itn b a"))
  println(inTheNews.getText)
  val onThisDay = driver.findElement(By.cssSelector("#mp-otd b a"))
  println(onThisDay.getText)
  val didYouKnow = driver.findElement(By.cssSelector("#mp-dyk b a"))
  println(didYouKnow.getText)
  driver.quit()

In the code above, you obtain the same three sections of the Wikipedia home page using CSS selector syntax.
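Note that findElement returns only the first match. To mirror the jsoup examples, which collected every link in a section, you can use findElements (plural) together with asScala; a sketch along those lines:

```scala
package dev.draft

import scala.jdk.CollectionConverters.*
import org.openqa.selenium.By
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

@main def seleniumAll(): Unit =
  val driver = new ChromeDriver(new ChromeOptions().addArguments("--headless=new"))
  try
    driver.get("https://en.wikipedia.org/")
    // findElements (plural) returns every matching element, like jsoup's select
    val links = driver.findElements(By.cssSelector("#mp-otd b a")).asScala
    for link <- links do
      println(s"${link.getAttribute("title")}: ${link.getAttribute("href")}")
  finally
    driver.quit()
```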

As mentioned, Selenium can also scrape dynamic web pages. For example, on the Related Words web page, you can type a word to retrieve all related words and their respective links. The following code will retrieve all the dynamically generated words related to the word Draft:

package dev.draft

import org.openqa.selenium.By
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

import java.time.Duration

@main def relatedWords(): Unit =
  val driver = new ChromeDriver(new ChromeOptions().addArguments("--headless=new"))
  driver.manage.deleteAllCookies()
  driver.manage.timeouts.pageLoadTimeout(Duration.ofSeconds(40))
  driver.manage.timeouts.implicitlyWait(Duration.ofSeconds(30))
  driver.get("https://relatedwords.org/relatedto/" + "Draft")

  val relatedWords = driver.findElement(By.className("words"))
  println(relatedWords.getText)
  driver.quit()

Results:

outline
plan
blueprint
conscription
enlist
...

Conclusion

Scala Scraper and jsoup are sufficient when you need to parse a static HTML web page or validate generated HTML code. However, when you need to scrape dynamic web pages that render content with JavaScript, you need a tool like Selenium.

In this tutorial, you learned how to set up a Scala project and use jsoup and Scala Scraper to load and parse HTML. You were also introduced to some web scraping techniques. Finally, you saw how a headless browser library like Selenium can be used to scrape a dynamic website.

If you're a JVM fan, don't hesitate to take a look at our guide about web scraping with Kotlin.

If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, please check out the web scraping API from ScrapingBee. Did you know that the first 1,000 calls are on us?

Grzegorz Piwowarek

Independent consultant, blogger at 4comprehension.com, trainer, Vavr project lead - teaching distributed systems, architecture, Java, and Golang