Scraping the Web With Node.js

Before web-based APIs became the prominent way of sharing data between services, we had web scraping. Web scraping is a data extraction technique in which you pull information directly from websites.

There are many ways this can be accomplished. It can be done manually by copying and pasting data from a website, by using specialized software, or by building your own scripts to scrape data. In this tutorial, we will show you how to build a simple web scraper that gets some general movie information from IMDB. The technologies we will be using to accomplish this are:

  • NodeJS
  • ExpressJS: The Node framework that everyone uses and loves.
  • Request: Helps us make HTTP calls
  • Cheerio: Implementation of core jQuery specifically for the server (helps us traverse the DOM and extract data)

Setup

Our setup will be pretty simple. If you’re already familiar with NodeJS, go ahead and set up your project and include Express, Request, and Cheerio as your dependencies.

Here is our package.json file to get all the dependencies we need for our project.


{
  "name"         : "node-web-scrape",
  "version"      : "0.0.1",
  "description"  : "Scrape le web.",
  "main"         : "server.js",
  "author"       : "Scotch",
  "dependencies" : {
    "express"    : "latest",
    "request"    : "latest",
    "cheerio"    : "latest"
  }
}

With your package.json file all ready to go, just install your dependencies with:

npm install
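
If you already have a package.json in place (for example from running npm init), you can also add the same dependencies from the command line and let npm record them for you:

npm install express request cheerio --save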

With that setup, let’s take a look at what we’ll be creating. In this tutorial, we will make a single request to IMDB and get:

  • name of a movie
  • release year
  • IMDB community rating

Once we compile this information, we will save it to a JSON file on our computer. Please see the code examples below for our setup. For this tutorial we will not have a front-end user interface and will rely on our command window to guide us.
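
To give you an idea of where we are headed, the finished output.json will look something like this (the values shown here are only illustrative; yours will reflect whatever the page returns at the time you scrape it):

{
    "title": "Anchorman 2: The Legend Continues",
    "release": "2013",
    "rating": "6.3"
}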

Our Application

Our web scraper is going to be very minimalistic. The basic flow will be as follows:

  1. Launch web server
  2. Visit a URL on our server that activates the web scraper
  3. The scraper will make a request to the website we want to scrape
  4. The request will capture the HTML of the website and pass it along to our server
  5. We will traverse the DOM and extract the information we want
  6. Next, we will format the extracted data into a format we need
  7. Finally, we will save this formatted data into a JSON file on our machine

If you’ve been following our other NodeJS tutorials, you should be pretty familiar with how the structure of an application works. For this tutorial, we will put all of the logic in our server.js file.

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){

  //All the web scraping magic will happen here

})

app.listen('8081')

console.log('Magic happens on port 8081');

exports = module.exports = app;
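
At this point you can already take the skeleton for a spin. Start the server from your project folder with:

node server.js

You should see the "Magic happens on port 8081" message in your console. The /scrape route won’t do anything interesting yet, but it confirms that Express is wired up correctly.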

Making the Request

Now that we have the boilerplate of the application done, let’s get into the fun stuff. We are now on Step 3, and that is making the request to the external website we would like to scrape.

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){
    // The URL we will scrape from - in our example Anchorman 2.

    var url = 'http://www.imdb.com/title/tt1229340/';

    // The structure of our request call
    // The first parameter is our URL
    // The callback function takes 3 parameters: an error, a response object, and the html of the page

    request(url, function(error, response, html){

        // First we'll check to make sure no errors occurred when making the request

        if(!error){
            // Next, we'll utilize the cheerio library on the returned html which will essentially give us jQuery functionality

            var $ = cheerio.load(html);

            // Finally, we'll define the variables we're going to capture

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};
        }
    })
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;

The request function takes two parameters, the URL and a callback. For the URL parameter we will set the link of the IMDB movie we want to extract information from. In the callback, we will capture 3 parameters: error, response, and html.
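
One optional refinement, shown here only as a sketch: besides checking for an error, you can also look at the status code on the response object before handing the HTML off to Cheerio. Request exposes it as response.statusCode, so a slightly more defensive version of the callback could start like this:

request(url, function(error, response, html){

    // Only continue if the request succeeded and IMDB answered with a 200 OK
    if(!error && response.statusCode === 200){
        var $ = cheerio.load(html);

        // ... traverse the DOM as shown in the next sections
    }
})

The rest of the tutorial sticks with the simpler if(!error) check to keep the code short.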

Traversing the DOM

Movie Title

Now we are ready to start traversing the DOM and extracting information. First, let’s get the movie name. We’ll head over to IMDB, open up Developer Tools, and inspect the movie title element. We are looking for a unique element that will help us single out the movie title. We notice that the <h1> tag is our best bet for the movie title and that its class, header, is unique. This seems like a good starting spot.

[Screenshot: inspecting the movie title element on the IMDB page with Developer Tools]

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){
    
    var url = 'http://www.imdb.com/title/tt1229340/';

    request(url, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};

            // We'll use the unique header class as a starting point.

            $('.header').filter(function(){

           // Let's store the data we filter into a variable so we can easily see what's going on.

                var data = $(this);

           // In examining the DOM we notice that the title rests within the first child element of the header tag. 
           // Utilizing jQuery we can easily navigate and get the text by writing the following code:

                title = data.children().first().text();

           // Once we have our title, we'll store it in our json object.

                json.title = title;
            })
        }
    })
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;

Release Year

Now we are able to get the movie title. Next, we’ll repeat the process, this time looking for a unique element in the DOM for the movie’s release year. We notice that the year is also contained within the <h1> tag, specifically inside the header’s last child element. This gives us enough information to extract the year with the following code:

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){
    
    var url = 'http://www.imdb.com/title/tt1229340/';

    request(url, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};

            $('.header').filter(function(){
                var data = $(this);
                title = data.children().first().text();
            
                // We will repeat the same process as above. This time we notice that the release year is located within the header's last child element.
                // The following line will move us to the exact location of the release year.

                release = data.children().last().children().text();

                json.title = title;

                // Once again, once we have extracted the data, we'll save it to our json object

                json.release = release;
            })
        }
    })
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;
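
If the children().first() and children().last() hops feel a little opaque, here is a tiny self-contained sketch you can run on its own. The markup in it is made up and deliberately simplified (it is not IMDB’s real HTML), but it is shaped like the header we inspected, so it shows exactly what each call returns:

var cheerio = require('cheerio');

// Made-up, simplified markup shaped like the IMDB header we inspected
var sampleHtml = '<h1 class="header">' +
                 '<span class="itemprop">Anchorman 2: The Legend Continues</span>' +
                 '<span class="nobr">(<a href="/year/2013/">2013</a>)</span>' +
                 '</h1>';

var $ = cheerio.load(sampleHtml);
var header = $('.header');

// The first child of the header holds the title text
console.log(header.children().first().text());           // Anchorman 2: The Legend Continues

// The year sits inside the header's last child, one level deeper
console.log(header.children().last().children().text()); // 2013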

Community Rating

Finally, to get the community rating, we repeat the process above. This time, though, we notice that there is a unique class name that will help us get the information very easily. The class name is .star-box-giga-star, so let’s write some code to extract that information.

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){
    
    var url = 'http://www.imdb.com/title/tt1229340/';

    request(url, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};

            $('.header').filter(function(){
                var data = $(this);
                title = data.children().first().text();
            
                release = data.children().last().children().text();

                json.title = title;
                json.release = release;
            })

            // Since the rating is in a different section of the DOM, we'll have to write a new jQuery filter to extract this information.

            $('.star-box-giga-star').filter(function(){
                var data = $(this);

                // The .star-box-giga-star class was exactly where we wanted it to be.
                // To get the rating, we can simply call .text(); there's no need to traverse the DOM any further

                rating = data.text();

                json.rating = rating;
            })
        }
    })
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;
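
One small caveat: depending on the markup around the rating, .text() can pick up stray whitespace and line breaks. If that shows up in your output, trimming the string is enough to clean it up, for example:

rating = data.text().trim();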

That’s all there is to it. If you wanted to extract more information, you can do so by repeating the steps we did above.

  1. Find a unique element or attribute on the DOM that will help you single out the data you need
  2. If no unique element exists on the particular tag, find the closest tag that does and set that as your starting point
  3. If needed, traverse the DOM to get to the data you would like to extract
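
For example, if you also wanted to grab the plot summary, you would follow those same steps: inspect the page in Developer Tools, find a container you can target, and read its text. The selector below is purely hypothetical (check the live page for the real class name), but the pattern is identical to the filters we wrote above, and it would sit alongside them inside the request callback:

// Hypothetical selector - inspect the page to find the real one
$('.plot-summary').filter(function(){
    var data = $(this);

    // Store the text in a new property on our json object
    json.plot = data.text();
})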

Formatting and Saving

Now that we have the data extracted, let’s format it and save it to our project folder. We have been storing the extracted data in a variable called json, so let’s write that variable out to disk. You’ll notice that earlier we required the fs library; if you weren’t sure what that was for, it gives us access to the computer’s file system. Take a look at the code below to see how we write files with it.

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){

    var url = 'http://www.imdb.com/title/tt1229340/';

    request(url, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};

            $('.header').filter(function(){
                var data = $(this);
                title = data.children().first().text();
                release = data.children().last().children().text();

                json.title = title;
                json.release = release;
            })

            $('.star-box-giga-star').filter(function(){
                var data = $(this);
                rating = data.text();

                json.rating = rating;
            })
        }

        // To write to the file system we will use the built-in 'fs' library.
        // In this example we pass 3 parameters to the writeFile function
        // Parameter 1 : 'output.json' - the name of the file that will be created
        // Parameter 2 : JSON.stringify(json, null, 4) - the data to write; the extra JSON.stringify step makes our JSON easier to read
        // Parameter 3 : callback function - a callback that lets us know the status of the write

        fs.writeFile('output.json', JSON.stringify(json, null, 4), function(err){

            console.log('File successfully written! - Check your project directory for the output.json file');

        })

        // Finally, we'll just send out a message to the browser reminding you that this app does not have a UI.
        res.send('Check your console!')

    });
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;

With this code in place, you are all set to scrape and save the data. Let’s start up our node server, navigate to http://localhost:8081/scrape, and see what happens.

  • If everything went smoothly your browser should display a message telling you to check your command prompt.
  • When you check your command prompt you should see a message saying that your file was successfully written and that you should check your project folder.
  • Once you get to your project folder you should see a new file created called output.json.
  • Opening this file will give you a nicely formatted JSON document containing the extracted data.

Congrats! You just wrote your first web scraper!

Putting It All Together

In this tutorial, we built a simple web scraper that extracted movie information from an IMDB page. We covered using the Request and Cheerio libraries to make external requests and add jQuery functionality to our NodeJS server. We showed you how to traverse the DOM using jQuery in Node and how to write to the file system. I hope you enjoyed this article. Feel free to ask any questions below.

A Note on Web Scraping

Web scraping falls within a gray area of the law. Scraping data for personal use within limits is generally okay, but you should always get permission from the website owner before doing so. Our example here was deliberately minimal (we made only a single request to IMDB) so that it would not interfere with IMDB’s operations. Please scrape responsibly.