The introduction of the Fetch API changed how JavaScript developers make HTTP calls: they no longer have to download third-party packages just to make an HTTP request. While that was great news for frontend developers, fetch could only be used in the browser, so backend developers still had to rely on different third-party packages. That was, until node-fetch came along, aiming to provide the same fetch API that browsers support. In this article, we will take a look at how node-fetch can be used to help you scrape the web!
Prerequisites
To get the full benefit of this article, you should have:
- Some experience with writing ES6 JavaScript.
- A proper understanding of promises and some experience with async/await.
What is the Fetch API?
Fetch is a specification that aims to standardize what requests, responses, and everything in between should look like; the standard calls this whole process fetching (hence the name fetch). The browser fetch API and node-fetch are implementations of this specification. The biggest and most important difference between fetch and its predecessor, XHR, is that it is built around Promises. This means that developers no longer have to fear callback hell, messy code, and the extremely verbose API that XHR has.
There are a few more technical differences as well: for example, when a request returns with an HTTP status code 404, the promise returned from the fetch call doesn't get rejected.
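To see this in action, here is a minimal sketch (the URL is hypothetical, and the same behaviour applies to both the browser fetch and node-fetch):
const fetch = require('node-fetch');
(async () => {
  const response = await fetch('https://example.com/this-page-does-not-exist');
  // The promise resolved even though the server answered with an error status
  console.log(response.status); // e.g. 404
  console.log(response.ok);     // false for any status outside the 200-299 range
})();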
node-fetch brings all of this to the server side. This means that developers no longer have to learn different APIs, their various terminologies, or how fetching actually happens behind the scenes to perform HTTP requests from the server side. It's as simple as running npm install node-fetch and writing HTTP requests almost the same way you would in a browser.
Scraping the web with node-fetch and cheerio
To get the ball rolling, you must first install cheerio alongside node-fetch. While node-fetch allows us to get the HTML of any page, the result is just a bunch of text, so you need some tooling to extract what you need from it. cheerio helps with that: it provides a very intuitive jQuery-like API and allows you to extract data from the HTML you received with node-fetch.
Make sure you have a package.json; if not:
- Generate one by running npm init
- Then install cheerio and node-fetch by running the following command: npm install cheerio node-fetch
For the purpose of this article, we will scrape Reddit:
const fetch = require('node-fetch');
const getReddit = async () => {
const response = await fetch('https://reddit.com/');
const body = await response.text();
console.log(body); // prints a page chock-full of HTML
return body;
};
fetch has a single mandatory argument: the resource URL. When fetch is called, it returns a promise which resolves to a Response object as soon as the server responds with the headers. At this point, the body is not yet available. The returned promise resolves regardless of whether the request failed; it is only rejected due to network errors like connectivity issues, meaning that the promise can resolve even if the server responds with a 500 Server Error.
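As a small sketch of how you might handle both cases yourself (the helper function and error messages here are illustrative, not part of node-fetch):
const fetch = require('node-fetch');

const getPage = async (url) => {
  try {
    const response = await fetch(url);
    // The promise resolved, but the server may still have answered with a 4xx/5xx status
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    return await response.text();
  } catch (error) {
    // Network-level problems (DNS failure, connection refused, ...) end up here,
    // along with the error thrown above for non-2xx statuses
    console.error(error.message);
    return null;
  }
};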
The Response class implements the Body interface, which wraps a ReadableStream and provides a convenient set of promise-based methods meant for stream consumption. Body.text() is one of them, and since Response implements Body, all the methods that Body has can be used on a Response instance. Calling any of these methods returns a promise that eventually resolves to the data.
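For example, if the resource you request returns JSON rather than HTML, Body.json() will parse it for you. A minimal sketch, assuming a hypothetical API endpoint:
const fetch = require('node-fetch');

const getJson = async () => {
  const response = await fetch('https://some-api.com/products'); // hypothetical endpoint
  // Like text(), json() reads the stream to completion and returns a promise,
  // here resolving to the parsed JavaScript object
  const data = await response.json();
  return data;
};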
Back to our Reddit example: with the HTML text we received, we can use cheerio to create a DOM, then query it to extract what interests you. For example, if you want a list of all the posts in the feed, you could get the selector for the post list (using your browser's dev tools) and then use cheerio like this:
const fetch = require('node-fetch');
const cheerio = require('cheerio');
const getReddit = async () => {
// get html text from reddit
const response = await fetch('https://reddit.com/');
// using await to ensure that the promise resolves
const body = await response.text();
// parse the html text and extract titles
const $ = cheerio.load(body);
const titleList = [];
// using CSS selector
$('._eYtD2XCVieq6emjKBH3m').each((i, title) => {
const titleNode = $(title);
const titleText = titleNode.text();
titleList.push(titleText);
});
console.log(titleList);
};
getReddit();
cheerio.load() allows you to parse any HTML text into a query-able DOM. cheerio provides various methods to extract components out of the now constructed DOM, one of which is each(); this method allows you to iterate over a list of nodes. How do we know that we get a list? We're looking for a list of the titles of the posts on Reddit's home page; currently, the class name of one such title is _eYtD2XCVieq6emjKBH3m, but it may change in the future.
By iterating over the list using each(), you get each HTML element, which you can feed back to cheerio to once again extract the text out of each title.
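The same approach works for other parts of the markup too. As a sketch, here is how you might collect the href attribute of every link on a page (a generic selector this time, rather than the Reddit-specific class above):
const fetch = require('node-fetch');
const cheerio = require('cheerio');

const getLinks = async (url) => {
  const response = await fetch(url);
  const body = await response.text();
  const $ = cheerio.load(body);
  const links = [];
  // Iterate over every anchor tag and pull out its href attribute
  $('a').each((i, element) => {
    links.push($(element).attr('href'));
  });
  return links;
};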
This process is fairly intuitive and can be done with any website, as long as the website in question does not have anti-scraping mechanisms to throttle, limit, or prevent you from scraping. While these can be worked around, the effort and dev time required to do so may simply be unaffordable. This guide can help you out in such cases!
Using the options parameter in node-fetch
fetch has a single mandatory argument and one optional argument: the options object. The options object allows you to customize the HTTP request to suit your needs; whether you want to send a cookie along with your request or make a POST request (fetch makes GET requests by default), you'll need to define the options argument.
The most common properties that you will make use of are:
- method - The HTTP request method; it is set to GET by default.
- headers - The headers that you want to pass along with the request.
- body - The body of your request; you would use the body property if you were making, for example, a POST request.
You can find out about the other properties available for customizing your HTTP request here. Now, to put it all together, let's send a POST request with some cookies and a few query parameters:
const fetch = require('node-fetch');
const { URL, URLSearchParams } = require('url');

(async () => {
  // Build the target URL and attach the query parameters
  const url = new URL('https://some-url.com');
  const params = { param: 'test' };
  const queryParams = new URLSearchParams(params).toString();
  url.search = queryParams;

  const fetchOptions = {
    method: 'POST',
    headers: {
      'cookie': '<cookie>',
      'content-type': 'application/json', // we are sending a JSON body
    },
    body: JSON.stringify({ hello: 'world' }),
  };
  await fetch(url, fetchOptions);
})();
Using the URL module, it's very easy to attach query parameters to the URL of the website that you wish to scrape. The URLSearchParams class in particular is useful for this.
To send an HTTP POST request, you simply set the method property to POST. You would do the same for any other HTTP request method like PUT or DELETE. To send any cookies alongside the request, you have to make use of the cookie header.
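As another sketch, assuming a hypothetical endpoint and a placeholder cookie value, a DELETE request looks almost identical, and the Response object lets you inspect the headers the server sent back:
const fetch = require('node-fetch');

(async () => {
  const response = await fetch('https://some-url.com/resource/42', { // hypothetical endpoint
    method: 'DELETE',
    headers: {
      'cookie': 'session=<cookie-value>', // placeholder cookie
    },
  });
  // Response headers are exposed through a Headers object
  console.log(response.status);
  console.log(response.headers.get('content-type'));
})();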
Making fetch requests in parallel
At times you may want to make multiple fetch calls to different URLs at the same time.
Doing them one after the other will ultimately lead to bad performance and hence long wait times for your end users.
To solve this problem, you should parallelize your code. Sending an HTTP request consumes very few of your computer's resources; it takes time only because your computer is waiting, idle, for the server to respond. We call this kind of task "I/O-bound", as opposed to tasks that are slow because they consume a lot of computing power, which are "CPU-bound".
"io bound" tasks can be efficiently parallelized with promises. And since fetch
is promise-based, you can make use of Promise.all
to make multiple fetch calls at the same time like this:
const newProductsPagePromise = fetch('https://some-website.com/new-products');
const recommendedProductsPagePromise = fetch('https://some-website.com/recommended-products');
// Returns a promise that resolves to a list of the results
Promise.all([newProductsPagePromise, recommendedProductsPagePromise]);
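Since Promise.all resolves to an array in the same order as the promises you passed in, you can destructure the responses and read their bodies afterwards. A sketch, reusing the hypothetical URLs from above:
const fetch = require('node-fetch');

(async () => {
  // Both requests are in flight at the same time
  const [newProductsPage, recommendedProductsPage] = await Promise.all([
    fetch('https://some-website.com/new-products'),
    fetch('https://some-website.com/recommended-products'),
  ]);
  // Reading both bodies can be parallelized the same way
  const [newProductsHtml, recommendedProductsHtml] = await Promise.all([
    newProductsPage.text(),
    recommendedProductsPage.text(),
  ]);
  console.log(newProductsHtml.length, recommendedProductsHtml.length);
})();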
Conclusion
With that, you've just mastered node-fetch for web scraping. Although fetch is great for simple use cases, it can get a tad difficult to get right when you have to deal with Single Page Applications that use JavaScript to render most of their pages. Challenging tasks like scraping concurrently and such have to be handled by hand, as node-fetch is simply an HTTP request client like any other.
The other benefit of using node-fetch is that it's much more efficient than using a headless browser. Even for simple tasks like submitting a form, headless browsers are slow and use a lot of server resources.
Since you've mastered node-fetch, give ScrapingBee a try: you get the first 1000 requests for free to try it out. Check out the getting started guide here!
Scraping the web is challenging given the fact that anti-scraping mechanisms get smarter day by day. Even if you manage to do it, getting it done right can be quite a tedious task. ScrapingBee allows you to skip the noise and focus only on what matters the most: the data.
Resources
- node-fetch GitHub Repo - Contains a whole bunch of examples that address common and advanced use cases.
- The Fetch API MDN Documentation - Goes into extreme detail on how the Fetch API works in the browser, and also contains examples of how you would use fetch().
- A JavaScript developer's guide to cURL - If you liked this article, you will love this guide about how to use cURL with JavaScript.
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.