One of the most important features of ScrapingBee is the ability to extract exactly the data you need, without post-processing the response content using external libraries.
We can use this feature by specifying an additional parameter named extract_rules. We specify a label for each element we want to extract along with its CSS selector, and ScrapingBee will do the rest!
Let's say that we want to extract the title and the subtitle of the data extraction documentation page. Their CSS selectors are h1 and span.text-20 respectively. To make sure that they're the correct ones, you can run the JavaScript function document.querySelector("CSS_SELECTOR") in the developer tools console on that page.
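Before sending anything, it can help to see what the extract_rules value looks like on the wire. The snippet below is a minimal sketch, assuming the rules are passed as a JSON string in the query string: it encodes the two selectors above and prints both the JSON and the URL-encoded query fragment. No request is made.

```php
<?php
// Labels (array keys) and CSS selectors (values) from the paragraph above
$rules = array(
    'title' => 'h1',
    'subtitle' => 'span.text-20',
);

// The rules are serialized to a JSON string
$encoded = json_encode($rules);
echo $encoded . PHP_EOL;

// http_build_query() URL-encodes the JSON so it is safe inside a query string
$query = http_build_query(array('extract_rules' => $encoded));
echo $query . PHP_EOL;
```

Each key becomes the label under which the matching element's content is returned, so pick names that make sense in your own code.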
The full code will look like this:
<?php
// Get cURL resource
$ch = curl_init();
// Set base url & API key
$BASE_URL = "https://app.scrapingbee.com/api/v1/?";
$API_KEY = "YOUR-API-KEY";
// Set the extract rules:
$rules = array(
    'title' => 'h1',
    'subtitle' => 'span.text-20'
);
$rules = json_encode($rules);
// Set parameters
$parameters = array(
    'api_key' => $API_KEY,
    'url' => 'https://www.scrapingbee.com/documentation/data-extraction', // The URL to scrape
    'extract_rules' => $rules
);
// Building the URL query
$query = http_build_query($parameters);
// Set the URL for cURL
curl_setopt($ch, CURLOPT_URL, $BASE_URL.$query);
// Set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
// Return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Send the request and save response to $response
$response = curl_exec($ch);
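// Stop if the request failed (curl_exec returns false on failure;
// an empty but successful response body should not count as an error)
if ($response === false) {
    die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}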
echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;
// Close curl resource to free up system resources
curl_close($ch);
?>
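Once the request succeeds, the extracted values still have to be read from the response body. The following is a minimal sketch, assuming the body is a JSON object keyed by the labels we defined in the rules. The $response string here is a hand-written stand-in with placeholder values, not real API output, so the snippet runs without an API key.

```php
<?php
// Stand-in for the $response string returned by curl_exec().
// The values are placeholders, not real API output.
$response = '{"title": "Example title", "subtitle": "Example subtitle"}';

// Decode the JSON body into an associative array
$data = json_decode($response, true);

// json_decode() returns null when the body is not valid JSON
if (!is_array($data)) {
    die('Could not decode response: ' . json_last_error_msg());
}

// Read each label, falling back to null if the key is missing
$title = $data['title'] ?? null;
$subtitle = $data['subtitle'] ?? null;

echo 'Title: ' . $title . PHP_EOL;
echo 'Subtitle: ' . $subtitle . PHP_EOL;
```

In the full script, you would drop the stand-in assignment and decode the real $response instead, ideally after checking that the HTTP status code is 200.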