Introduction
This post covers the main tools and techniques for web scraping in Ruby. We start with an introduction to building a web scraper using common Ruby HTTP clients and how to parse HTML documents in Ruby.
This approach to web scraping does have its limitations, however, and can come with a fair dose of frustration. Particularly in the context of single-page applications, we will quickly come across major obstacles due to their heavy use of JavaScript. We will take a closer look at how to address this, using web scraping frameworks, in the second part of this article.
Note: This article does assume that the reader is familiar with the Ruby platform. While there is a multitude of gems, we will focus on the most popular ones and use their GitHub metrics (usage, stars, and forks) as indicators. While we won't be able to cover all the use cases of these tools, we will provide good grounds for you to get started and explore more on your own.
Part I: Static pages
0. Setup
In order to be able to code along with this part, you may need to install the following gems:
gem install pry # debugging tool
gem install nokogiri # parsing gem
gem install httparty # HTTP request gem
Moreover, we will use open-uri, net/http, and csv, which are part of the standard Ruby library, so there's no need for a separate installation. As for Ruby, we are using version 3 for our examples, and our main playground will be the file scraper.rb.
1. Make a request with HTTP clients in Ruby
In this section, we will cover how to scrape a Wikipedia page with Ruby.
Imagine you want to build the ultimate Douglas Adams fan wiki. You would for sure start by getting data from Wikipedia. In order to send a request to any website or web app, you need an HTTP client. Let's take a look at our three main options: net/http, open-uri, and HTTParty. You can use whichever of the clients below you like the most, and it will work with step 2.
Net::HTTP
Ruby's standard library comes with an HTTP client of its own, namely the net-http gem. In order to make a request to Douglas Adams' Wikipedia page, we first need to convert our URL string into a URI object (URI ships with the standard library and is loaded by both net/http and open-uri). Once we have our URI, we can pass it to get_response, which will provide us with a Net::HTTPResponse object, whose body method contains the HTML document.
require 'open-uri'
require 'net/http'
url = "https://en.wikipedia.org/wiki/Douglas_Adams"
uri = URI.parse(url)
response = Net::HTTP.get_response(uri)
html = response.body
puts html
#=> "\n<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>Douglas Adams - Wikipedia</title>..."
Pro tip: Should you use Net::HTTP with a REST interface and need to handle JSON, simply require 'json' and parse the response with JSON.parse(response.body).
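For example, a minimal sketch of that pattern could look like the following; the JSON endpoint used here is purely hypothetical:
require 'net/http'
require 'uri'
require 'json'

# Hypothetical JSON endpoint, only to illustrate the request/parse pattern
uri = URI.parse("https://example.com/api/books.json")
response = Net::HTTP.get_response(uri)
books = JSON.parse(response.body) # plain Ruby hashes/arrays
puts books.inspect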
That's it - it works! However, the syntax of net/http may be a bit clunky and less intuitive than that of HTTParty or open-uri, which are, in fact, just elegant wrappers around net/http.
HTTParty
The HTTParty gem was created to "make HTTP fun". Indeed, with its intuitive and straightforward syntax, the gem has become widely popular in recent years. The following two lines are all we need to make a successful GET request:
require "HTTParty"
response = HTTParty.get("https://en.wikipedia.org/wiki/Douglas_Adams")
html = response.body
puts html
# => "<!DOCTYPE html>\n" + "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n" + "<head>\n" + "<meta charset=\"UTF-8\"/>\n" + "<title>Douglas Adams - Wikipedia</title>\n" + ...
get returns an HTTParty::Response object which, again, provides us with the details of the response and, of course, the content of the page. If the server sent a content type of application/json, HTTParty automatically parses the response as JSON and returns appropriate Ruby objects.
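For instance, against a hypothetical JSON API (the URL below is just a placeholder), the parsed Ruby structure is available right away via parsed_response:
require 'httparty'

# Placeholder JSON API; with an application/json content type,
# parsed_response is already a Ruby Hash or Array
response = HTTParty.get("https://example.com/api/books.json")
puts response.parsed_response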
OpenURI
The simplest solution, however, is making a request with the open-uri gem, which is also part of the standard Ruby library:
require 'open-uri'
html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
##<File:/var/folders/zl/8zprgb3d6yn_466ghws8sbmh0000gq/T/open-uri20200525-33247-1ctgjgo>
This provides us with a file-like object (a StringIO or Tempfile, depending on the response size) and allows us to read from the URL as if it were a file, line by line. The simplicity of OpenURI is already in its name: it only sends GET requests, and does that well, with sensible HTTP defaults for SSL and redirects.
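Here is a minimal sketch of both styles - reading the whole document at once with read, or going through it line by line:
require 'open-uri'

# read the whole document into a string...
html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams").read
puts html[0, 60]

# ...or iterate over it line by line, as with any other IO-like object
URI.open("https://en.wikipedia.org/wiki/Douglas_Adams").each_line do |line|
  puts line if line.include?("<title>")
end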
While we have covered the most straightforward ways to load the content of a URL here, there are of course quite a few other HTTP clients. So please feel free to also check out our article on the best Ruby HTTP clients.
2. Parsing HTML with Nokogiri
Once we have the HTML, we need to extract the parts that are of interest to us. As you probably noticed, each of the previous examples declared an html variable. We will now use it as the argument for the Nokogiri::HTML method. Don't forget require "nokogiri", though 🙂.
doc = Nokogiri::HTML(html)
# => #(Document:0x3fe41d89a238 {
# name = "document",
# children = [
# #(DTD:0x3fe41d92bdc8 { name = "html" }),
# #(Element:0x3fe41d89a10c {
# name = "html",
# attributes = [
# #(Attr:0x3fe41d92fe00 { name = "class", value = "client-nojs" }),
# #(Attr:0x3fe41d92fdec { name = "lang", value = "en" }),
# #(Attr:0x3fe41d92fdd8 { name = "dir", value = "ltr" })],
# children = [
# #(Text "\n"),
# #(Element:0x3fe41d93e7fc {
# name = "head",
# children = [ ...
Wonderful! We've now got a Nokogiri::HTML::Document object, which is essentially the DOM representation of our document and will allow us to query the document with both CSS selectors and XPath expressions.
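For example, the page title can be fetched either way with the document we just parsed:
# via a CSS selector...
doc.at_css("title").text     # => "Douglas Adams - Wikipedia"
# ...or via the equivalent XPath expression
doc.at_xpath("//title").text # => "Douglas Adams - Wikipedia"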
💡 If you'd like to find out more about the DOM and XPath, we have another lovely article on that: Practical XPath for Web Scraping
In order to select the right DOM elements, we need to do a bit of detective work with the browser's developer tools. In the example below, we are using Chrome to inspect whether a desired element has any attached class:
As we can see, Wikipedia does not exactly make extensive use of HTML classes, what a shame. Still, we can select elements by their tag. For instance, if we wanted to get all the paragraphs, we'd approach it by selecting all <p> elements and then fetching their text content:
description = doc.css("p").text
# => "\n\nDouglas Noel Adams (11 March 1952 – 11 May 2001) was an English author, screenwriter, essayist, humorist, satirist and dramatist. Adams was author of The Hitchhiker's Guide to the Galaxy, which originated in 1978 as a BBC radio comedy before developing into a \"trilogy\" of five books that sold more than 15 million copies in his lifetime and generated a television series, several stage plays, comics, a video game, and in 2005 a feature film. Adams's contribution to UK radio is commemorated in The Radio Academy's Hall of Fame.[1]\nAdams also wrote Dirk Gently's...
This approach resulted in a 4,336-word-long string. However, imagine you would like to get only the first introductory paragraph and the picture. You could either use a regular expression or let Ruby do this for you with the .split method.
In our example, we notice that the delimiters for paragraphs (\n) have been preserved, so we can simply split by newlines and take the first non-empty paragraph:
description = doc.css("p").text.split("\n").find{|e| e.length > 0}
Another way would be to trim the leading and trailing whitespace with .strip and select the first element from our string array:
description = doc.css("p").text.strip.split("\n")[0]
Alternatively, and depending on how the HTML is structured, sometimes an easier way could be to directly access the selector elements:
description = doc.css("p")[1]
#=> #(Element:0x3fe41d89fb84 {
# name = "p",
# children = [
# #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] }),
# #(Text " (11 March 1952 – 11 May 2001) was an English "),
# #(Element:0x3fe41e837560 {
# name = "a",
# attributes = [
# #(Attr:0x3fe41e833104 { name = "href", value = "/wiki/Author" }),
# #(Attr:0x3fe41e8330dc { name = "title", value = "Author" })],
# children = [ #(Text "author")]
# }),
# #(Text ", "),
# #(Element:0x3fe41e406928 {
# name = "a",
# attributes = [
# #(Attr:0x3fe41e41949c { name = "href", value = "/wiki/Screenwriter" }),
# #(Attr:0x3fe41e4191cc { name = "title", value = "Screenwriter" })],
# children = [ #(Text "screenwriter")]
# }),
Once we have found the element we are interested in, we need to call the .children method on it, which will return -- you've guessed it -- more nested DOM elements. We could iterate over them to get the text we need. Here's an example of the return values from two nodes:
doc.css("p")[1].children[0]
#=> #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] })
doc.css("p")[1].children[1]
#=> #(Text " (11 March 1952 – 11 May 2001) was an English ")
Now, let's find the article's main image. That should be easy, right? The most straightforward approach is to select the <img> tag, isn't it?
doc.css("img").count
#=> 16
Not quite, there are quite a few images on that page. 😳
Well, we could filter for some image specific data, couldn't we?
doc.css("img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}
#=> <img alt....
Perfect, that gave us the right image, right? 🥺
Sort of, but do hold your horses just for a second. The moment there's even a slight change to that alt text, our find() call won't find it any more.
All right, all right, what about using an XPath expression?
doc.xpath("/html/body/div[3]/div[3]/div[5]/div/table[1]/tbody/tr[2]/td/a/img")
#=> <img alt....
True, we got the image here as well and did not filter on an arbitrary tag but, as always with absolute paths, that can also quickly break if there is just a slight change to the DOM hierarchy.
So what now?
As mentioned before, Wikipedia isn't exactly generous when it comes to IDs, but there are a few distinctive HTML classes which seem to be pretty stable across Wikipedia.
doc.css(".infobox-image img")
#=> <img alt....
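The same image can also be matched with a relative XPath expression that anchors on the infobox table instead of the full absolute path - a sketch, assuming that class name stays stable:
# any <img> inside a table whose class attribute contains "infobox"
doc.xpath('//table[contains(@class, "infobox")]//img').first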
As you can see, getting the right (and stable) DOM path can be a bit tricky and does take some experience and analysis of the DOM tree, but it's also quite rewarding when you find the right CSS selectors or XPath expressions and they withstand the test of time without breaking on DOM changes. As so often, your browser's developer tools will be your best friend in this endeavour.
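Once the node is matched, the image URL itself is simply its src attribute; a short sketch (on Wikipedia this is typically a protocol-relative URL):
img = doc.at_css(".infobox-image img")
img_src = img["src"] if img # read the src attribute off the matched node
puts img_src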
If you're doing web scraping, you will often have to use proxies during your endeavours. Check out our guide on how to use a proxy with Ruby and Faraday to learn how to do so.
3. Exporting Scraped Data to CSV
All right, before we move on to the full-fledged web scraping framework we mentioned earlier, let's see how to actually use the data we just got from our website.
Once you've successfully scraped the website, you probably want to persist that data for later use. A convenient and interoperable way to do that is to save it as a CSV file. CSVs can not only be easily managed with Excel, but they are also a standard format for many other third-party platforms (e.g. mailing frameworks). Naturally, Ruby has you covered with the csv gem.
require "nokogiri"
require "csv"
require "open-uri"
html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
doc = Nokogiri::HTML(html)
description = doc.css("p").text.split("\n").find{|e| e.length > 0}
picture = doc.css("td a img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value
data_arr = []
data_arr.push(description, picture)
CSV.open('data.csv', "w") do |csv|
csv << data_arr
end
So, what were we doing here? Let's quickly recap.
- We imported the libraries we are going to use.
- We used OpenURI to load the content of the URL and provided it to Nokogiri.
- Once Nokogiri had the DOM, we politely asked it for the description and the picture URL. CSS selectors are truly elegant, aren't they? 🤩
- We added the data to our data_arr array.
- We used CSV.open to write the data to our CSV file.
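If you end up scraping several records rather than a single description/picture pair, you would typically write a header row followed by one row per record. A minimal sketch, assuming records is an array of such [description, picture] pairs:
require 'csv'

records = [[description, picture]] # here just our single record, for illustration
CSV.open('data.csv', 'w') do |csv|
  csv << ['description', 'picture'] # header row
  records.each { |row| csv << row } # one row per scraped record
end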
💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Feel free to check the documentation here.
Part II: Kimurai - a complete Ruby web scraping framework
So far we have focused on how to load the content of a URL, how to parse its HTML document into a DOM tree, and how to select specific document elements using CSS selectors and XPath expressions. While that all worked pretty well, there are still a few limitations, namely JavaScript.
More and more sites rely on JavaScript to render their content (in particular, of course, Single-Page Applications or sites which utilise infinite scroll for their data), in which case our Nokogiri implementation will only get the initial HTML bootstrap document without the actual data. Not getting the actual data is, let's say, less ideal for a scraper, right?
In these cases, we can use tools which specifically support JavaScript-powered sites. One of them is Kimurai, a Ruby framework specifically designed for web scraping. Like our previous example, it uses Nokogiri to access DOM elements, as well as Capybara to execute interactive actions typically performed by users (e.g. mouse clicks). On top of that, it also supports full integration of headless browsers (i.e. Headless Chrome and Headless Firefox) and PhantomJS.
In this part of the article, we will scrape a job listing web app. First, we will do it statically by just visiting different URL addresses and then, we will introduce some JS action.
Kimurai Setup
In order to scrape dynamic pages, you need to install a couple of tools -- below you will find the list with the macOS installation commands:
- Chrome and Firefox:
brew install --cask google-chrome firefox
- ChromeDriver:
brew install --cask chromedriver
- geckodriver:
brew install geckodriver
- PhantomJS:
brew install phantomjs
- Kimurai gem:
gem install kimurai
In this tutorial, we will use a simple Ruby file but you could also create a Rails app that would scrape a site and save the data to a database.
Static page scraping
Let's start with what Kimurai considers the bare minimum: a class with options for the scraper and a parse method:
class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  def parse(response, url:, data: {})
  end
end
In our class, we defined the following three fields:
- @name: you can name your scraper whatever you wish, or omit it altogether if your scraper consists of just one file;
- @start_urls: an array of start URLs, which will be processed one by one inside the parse method;
- @engine: the engine used for scraping; Kimurai supports four default engines. For our examples here, we'll be using Selenium with Headless Chrome.
Let's talk about the parse method now. It is the default entry method for the scraper and accepts the following arguments:
- response: the Nokogiri::HTML object, which we know from the prior part of this post;
- url: a string, which can either be passed to the method manually or will otherwise be taken from the @start_urls array;
- data: a storage for passing data between requests.
Just like when we used Nokogiri, you can also use CSS selectors and XPath expressions here to select the document elements you want to extract.
The browser object
Every Kimurai class also has a default browser field, which provides access to the underlying Capybara session object and allows you to interact with its browser instance (e.g. fill in forms or perform mouse actions).
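For illustration, here is a rough sketch of what such an interaction could look like inside a parse method; the field and button names are hypothetical and would need to match the actual page:
def parse(response, url:, data: {})
  browser.fill_in "q", with: "software engineer" # type into a (hypothetical) search field
  browser.click_button "Find jobs"               # submit the (hypothetical) form
  doc = browser.current_response                 # re-parse the page after the interaction
  doc.css("h2").each { |heading| puts heading.text }
end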
All right, let's dive into the page structure of our job site.
As we are interested in the job entries, we should first check whether there's a common parent element (ideally with its own HTML ID, right? 🤞). And we are in luck: the jobs are all contained within one single <td> with a resultsCol ID.
td#resultsCol
Now, we just need to find the tag of the individual entry elements and we can scrape the data. Fortunately, that's relatively straightforward as well: a <div> with a job_seen_beacon class.
div.job_seen_beacon
Following is our previous base class with a rough implementation and a @@jobs array to keep track of all the job entries we found.
require 'kimurai'

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.job_seen_beacon').each do |char_element|
      # code to get only the listings
    end
  end

  def parse(response, url:, data: {})
    scrape_page
    @@jobs
  end
end

JobScraper.crawl!
Time to get the actual data!
Let's check out the page once more and try to find the selectors for the data points we are after.
| Data point | Selector |
|---|---|
| Page URL | h2.jobTitle > a |
| Title | h2.jobTitle > a > span |
| Description | div.job-snippet |
| Company | div.companyInfo > span.companyName |
| Location | div.companyInfo > div.companyLocation |
| Salary | span.estimated-salary > span |
With that information, we should now be able to extract the data from all the job entries on our page.
def scrape_page
  doc = browser.current_response
  returned_jobs = doc.css('td#resultsCol')
  returned_jobs.css('div.job_seen_beacon').each do |char_element|
    # scraping individual listings
    title = char_element.css('h2.jobTitle > a > span').text.gsub(/\n/, "")
    link = "https://indeed.com" + char_element.css('h2.jobTitle > a').attributes["href"].value.gsub(/\n/, "")
    description = char_element.css('div.job-snippet').text.gsub(/\n/, "")
    company = char_element.css('div.companyInfo > span.companyName').text.gsub(/\n/, "")
    location = char_element.css('div.companyInfo > div.companyLocation').text.gsub(/\n/, "")
    salary = char_element.css('span.estimated-salary > span').text.gsub(/\n/, "")
    # creating a job object
    job = {title: title, link: link, description: description, company: company, location: location, salary: salary}
    # adding the object if it is unique
    @@jobs << job if !@@jobs.include?(job)
  end
end
Instead of creating an object, we could also create an array, depending on what data structure we'd need later:
job = [title, link, description, company, location, salary]
As the code currently is, we only get the first 15 results, or just the first page. In order to get data from the next pages, we can visit subsequent URLs:
def parse(response, url:, data: {})
  # scrape the first page
  scrape_page
  # the second page's start parameter is 20, so the counter starts at 2
  num = 2
  # visit the next pages and scrape them
  10.times do
    browser.visit("https://www.indeed.com/jobs?q=software+engineer&l=New+York,+NY&start=#{num}0")
    scrape_page
    num += 1
  end
  @@jobs
end
Last but not least, we could store our data in a CSV or JSON file by adding one of the following snippets to our parse method:
CSV.open('jobs.csv', "w") do |csv|
csv << @@jobs
end
or:
File.open("jobs.json","w") do |f|
f.write(JSON.pretty_generate(@@jobs))
end
Dynamic page scraping with Selenium and Headless Chrome
So far, our Kimurai code was not all that different from our previous example. Although we were now loading everything through a real browser, we still simply loaded the page, extracted the desired data items, and loaded the next page based on our URL template.
Real users wouldn't do that last step, would they? No, they wouldn't; they'd simply click the "next" button. That's exactly what we are going to check out now.
def parse(response, url:, data: {})
  10.times do
    # scrape the current page
    scrape_page
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔺 🔺 🔺 🔺 🔺 CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end
  @@jobs
end
We still call our scrape_page() method from before, but now we also use Capybara's browser object to find() (using an XPath expression) and click() the "next" button. We added two puts statements to see whether our scraper actually moves forward:
As you see, we successfully scraped the first page but then we encountered an error:
element click intercepted: Element <span class="pn">...</span> is not clickable at point (329, 300). Other element would receive the click: <input autofocus="" name="email" type="email" id="popover-email" class="popover-input-locationtst"> (Selenium::WebDriver::Error::ElementClickInterceptedError)
(Session info: headless chrome=83.0.4103.61)
We now have two options:
- check what we received in response and whether the DOM tree is any different (spoiler, it is 🙂)
- have the browser snap a screenshot and check whether the page looks any different
While option #1 is certainly the more thorough one, option #2 is often a shortcut to spotting any obvious changes. So, let's try that first.
As so often, it's quite straightforward to take a screenshot with Capybara. Simply call save_screenshot() on the browser object, and a screenshot (with a random name) will be saved in your working directory.
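If you prefer predictable file names over random ones, save_screenshot also accepts an explicit path, e.g.:
browser.save_screenshot("indeed_page_1.png")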
def parse(response, url:, data: {})
  10.times do
    # scrape the current page
    scrape_page
    # take a screenshot of the page
    browser.save_screenshot
    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺 CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end
  @@jobs
end
Voilà, Ruby will now save a screenshot of each page. This is the first page:
Lovely, just what we were looking for. Here is page number two:
Aha! A popup! After running this test a couple of times and inspecting the errors closely, we know that it comes in two versions and that our "next" button is not clickable while the popup is displayed. Fortunately, a simple browser.refresh takes care of that.
def parse(response, url:, data: {})
  10.times do
    scrape_page
    # if the popup is present, escape it by refreshing the page
    if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
      browser.refresh
    end
    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺 CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end
  @@jobs
end
Finally, our scraper works without a problem and after ten rounds, we end up with 155 job listings:
Here's the full code of our dynamic scraper:
require 'kimurai'
require 'csv'
require 'json'

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.job_seen_beacon').each do |char_element|
      # scraping individual listings
      title = char_element.css('h2.jobTitle > a > span').text.gsub(/\n/, "")
      link = "https://indeed.com" + char_element.css('h2.jobTitle > a').attributes["href"].value.gsub(/\n/, "")
      description = char_element.css('div.job-snippet').text.gsub(/\n/, "")
      company = char_element.css('div.companyInfo > span.companyName').text.gsub(/\n/, "")
      location = char_element.css('div.companyInfo > div.companyLocation').text.gsub(/\n/, "")
      salary = char_element.css('span.estimated-salary > span').text.gsub(/\n/, "")
      # creating a job object
      job = {title: title, link: link, description: description, company: company, location: location, salary: salary}
      # adding the object if it is unique
      @@jobs << job if !@@jobs.include?(job)
    end
  end

  def parse(response, url:, data: {})
    10.times do
      scrape_page
      # if the popup is present, escape it by refreshing the page
      if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
        browser.refresh
      end
      # find the "next" button and click it to move to the next page
      browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
      puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
      puts "🔺 🔺 🔺 🔺 🔺 CLICKED NEXT BUTTON 🔺 🔺 🔺 🔺 "
    end

    # persist the results to CSV and JSON
    CSV.open('jobs.csv', "w") do |csv|
      csv << @@jobs
    end
    File.open("jobs.json", "w") do |f|
      f.write(JSON.pretty_generate(@@jobs))
    end

    @@jobs
  end
end

jobs = JobScraper.crawl!
Alternatively, you could also replace the crawl! method with parse!, which returns the result of the parse method and thereby allows you to capture and print out the @@jobs array:
jobs = JobScraper.parse!(:parse, url: "https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY")
pp jobs
Conclusion
Web scraping is most definitely a very powerful tool when you need to access and analyse large amounts of (semi-structured) data from a number of different sources. While it allows you to quickly access, aggregate, and process that data, it can also be a challenging and daunting task, depending on the tools you are using and the data you want to handle. Nothing is more disappointing than believing you have found the one, perfect CSS selector, only to realise on page 500 that it won't work because of one small inconsistency - back to the drawing board.
What is important is that you use the right tools and the right approach to crawl a site. As we learned throughout this article, Ruby is a great choice and comes with many ready-to-use libraries for this purpose.
One important aspect to remember is to plan your crawling strategy so that you avoid being rate-limited by the site. We have another excellent article on that subject and on how to make sure your web crawler does not get blocked.
💡 If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, please check out our no-code web scraping API. Did you know the first 1,000 calls are on us?
Happy Scraping.