One of ScrapingBee's most important features is the ability to extract exactly the data you need, without post-processing the response content with external libraries.
To use this feature, pass an additional parameter named extract_rules. You specify a label for each element you want to extract along with its CSS selector, and ScrapingBee does the rest!
Let’s say that we want to extract the title and the subtitle of the data extraction documentation page. Their CSS selectors are h1 and span.text-20 respectively. To make sure that they’re the correct ones, you can run the JavaScript function document.querySelector("CSS_SELECTOR") in the console of your browser's developer tools on that page.
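To see what the extraction rules look like once they are attached to the request, here is a minimal sketch that serializes them to JSON and encodes them into the query string using only Ruby's standard library. The helper name build_request_uri is hypothetical and the API key is a placeholder; the endpoint and parameter names are the same ones used in the full code in this tutorial.

```ruby
require 'json'
require 'uri'

# Hypothetical helper (not part of any ScrapingBee library):
# builds the GET URI for an extraction request.
def build_request_uri(api_key, target_url, rules)
  uri = URI('https://app.scrapingbee.com/api/v1/')
  # encode_www_form percent-encodes every value, including the JSON rules
  uri.query = URI.encode_www_form(
    'api_key' => api_key,
    'url' => target_url,
    'extract_rules' => JSON.generate(rules)
  )
  uri
end

rules = { 'title' => 'h1', 'subtitle' => 'span.text-20' }
page = 'https://www.scrapingbee.com/documentation/data-extraction/'
puts build_request_uri('YOUR-API-KEY', page, rules)
```

Printing the URI makes it easy to check that the rules were encoded as a single extract_rules parameter before any request is sent.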
The full code will look like this:
require 'net/http'
require 'net/https'
require 'addressable/uri'
require 'json'

# Send a GET request to ScrapingBee with the given extraction rules
def extract_rules(user_url, rules)
  uri = Addressable::URI.parse("https://app.scrapingbee.com/api/v1/")
  api_key = "YOUR-API-KEY"
  uri.query_values = {
    'api_key' => api_key,
    'url' => user_url,
    'extract_rules' => rules
  }
  uri = URI(uri)

  # Create client
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER

  # Create request
  req = Net::HTTP::Get.new(uri)

  # Send the request and return the response
  http.request(req)
rescue StandardError => e
  puts "HTTP Request failed (#{e.message})"
end

url = "https://www.scrapingbee.com/documentation/data-extraction/"
rules = {
  "title": "h1",
  "subtitle": "span.text-20"
}
rules = rules.to_json # Convert the hash into a JSON string

response = extract_rules(url, rules)
puts response.body if response # response is nil when the request failed
And as you can see, the result is:
{"title": "Documentation - Data Extraction", "subtitle": "Extract data with CSS selector"}
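Since the result is a JSON string, you will usually want to turn it into a Ruby hash before using the values. The following is a minimal sketch of that step; the helper name parse_extraction is hypothetical, and the sample body is copied from the result shown above rather than fetched live.

```ruby
require 'json'

# Hypothetical helper: converts the JSON body returned by the API
# into a Ruby hash, falling back to an empty hash on invalid JSON.
def parse_extraction(body)
  JSON.parse(body)
rescue JSON::ParserError => e
  warn "Could not parse response (#{e.message})"
  {}
end

# Sample body copied from the result shown in this tutorial
body = '{"title": "Documentation - Data Extraction", "subtitle": "Extract data with CSS selector"}'
data = parse_extraction(body)
puts data['title']
puts data['subtitle']
```

The keys of the hash are the labels you chose in the extraction rules, so data['title'] holds the text matched by the h1 selector.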
You can find more about this feature in our documentation: Data Extraction. You can also learn more about CSS selectors on the W3Schools CSS Selectors page.