coderholic

Scraping the web with Node.io

Node.io is a relatively new screen-scraping framework that lets you scrape data from websites using JavaScript, a language that I think is perfectly suited to the task. It's built on top of Node.js, but you don't need to know any Node.js to get started, and you can run your node.io jobs straight from the command line.

The existing documentation is pretty good and includes a few detailed examples, such as the one below, which returns the number of Google search results for some given keywords:

var nodeio = require('node.io');
var options = {timeout: 10};

exports.job = new nodeio.Job(options, {
    input: ['hello', 'foobar', 'weather'],
    run: function (keyword) {
        var self = this, results;
        this.getHtml('http://www.google.com/search?q=' + encodeURIComponent(keyword), function (err, $) {
            results = $('#resultStats').text.toLowerCase();
            self.emit(keyword + ' has ' + results);
        });
    }
});

Running this from the command line gives you the following output:

$ node.io google.js 
hello has about 878,000,000 results
foobar has about 2,630,000 results
weather has about 719,000,000 results
OK: Job complete

Scraping Multiple Pages

Unfortunately, some of the documentation simply says "coming soon", so you're left to guess the best way to put together more advanced scraping workflows. For example, I wanted to scrape the search results from GitHub. If you search for "django" then you (currently) get 6067 results spread over 203 pages.

What I could figure out from the documentation is that a node.io job passes through several stages: input, run, reduce, and output. The documentation also mentions that multiple invocations of the run method can execute in parallel, so the logical approach seems to be to pass a page number to run and have it scrape the results from a single page. That way, lots of different pages can be scraped in parallel.
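To make the order of those stages concrete, here is a minimal sketch in plain JavaScript. It is only a conceptual model and not node.io's implementation or API: `runStages` and the shape of the job object are hypothetical names made up for illustration, and the run calls happen one after another rather than in parallel.

```javascript
// Conceptual model only: this is NOT node.io's code or API.
// runStages and the job shape used here are hypothetical names.
function runStages(job) {
    var emitted = [];

    // input: produce the list of work items (here, page numbers)
    var items = job.input();

    // run: called once per item. node.io can run these in parallel;
    // this sketch runs them one after another to stay simple.
    items.forEach(function (item) {
        job.run(item, function emit(result) {
            emitted.push(result);
        });
    });

    // reduce: optionally combine or filter everything that was emitted
    var reduced = job.reduce ? job.reduce(emitted) : emitted;

    // output: hand the final results over, e.g. to be printed
    job.output(reduced);
    return reduced;
}

var printed = [];
var results = runStages({
    input: function () { return [1, 2, 3]; },
    run: function (page, emit) { emit('scraped page ' + page); },
    reduce: function (lines) { return lines.slice().reverse(); },
    output: function (lines) { printed = lines; }
});
```

Each page number goes through run exactly once, and everything emitted along the way is collected before reduce and output see it.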

To calculate the total number of pages, and pass the page numbers to the run method, I implemented an input method. There's not much documentation on this, but the key thing is to make sure it returns false once you're done, otherwise it'll keep getting called again and again. The other key thing is that you need to pass your data to the run method via the callback function, and it needs to be wrapped in an array. Here's the complete GitHub search results scraper:

var nodeio = require('node.io');
exports.job = new nodeio.Job({benchmark: true, max: 50}, {
    input: function(start, num, callback) {
        if(start !== 0) return false; // We only want the input method to run once
        var self = this;

        this.getHtml('https://github.com/search?type=Repositories&language=python&q=django&repo=&langOverride=&x=0&y=0&start_value=1', function(err, $) {
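            if (err) return self.exit(err); // Stop here; $ is not usable after an error
            var total_pages = parseInt($('.pager_link').last().text, 10); // .text is a string, so convert it
            for (var i = 1; i <= total_pages; i++) { // <= so that the last page is included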
                callback([i]); // The page number will be passed to the run method
            }
            callback(null, false);
        });
    }, 
    run: function(page_number) {
        var self = this;
        this.getHtml('https://github.com/search?type=Repositories&language=python&q=django&repo=&langOverride=&x=0&y=0&start_value=' + page_number, function(err, $) {
            if (err) {
                console.log("ERROR", err);
                self.retry();
            }
            else {
                $('.result').each(function(listing) {
                    var project = {};
                    var title = $('h2 a', listing).fulltext;
                    project.author = title.substring(0, title.indexOf(" / "));
                    project.title = title.substring(title.indexOf(" / ") + 3);
                    project.link = "https://github.com" + $('h2 a', listing).attribs.href;
                    var language = $('.language', listing).fulltext;
                    project.language = language.substring(1, language.length - 1); // Strip off leading and trailing brackets
                    project.description = $('.description', listing).fulltext;
                    self.emit(project);
                });
            }
        });
    }
});
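The string handling inside the run method can be pulled out into small functions that are easy to test without hitting GitHub. The sketch below assumes the same formats the scraper relies on: a heading of the form "author / title" and a language wrapped in brackets. `parseHeading` and `stripBrackets` are hypothetical helper names, and the sample values are made up.

```javascript
// Hypothetical helpers; the input formats are the ones the scraper above assumes.

// Split a heading such as "someuser / someproject" into author and title.
function parseHeading(heading) {
    var separator = ' / ';
    var index = heading.indexOf(separator);
    if (index === -1) {
        // No separator found: treat the whole heading as the title
        return {author: '', title: heading};
    }
    return {
        author: heading.substring(0, index),
        title: heading.substring(index + separator.length)
    };
}

// Remove one leading "(" and one trailing ")" if both are present.
function stripBrackets(text) {
    var trimmed = text.trim();
    if (trimmed.charAt(0) === '(' && trimmed.charAt(trimmed.length - 1) === ')') {
        return trimmed.substring(1, trimmed.length - 1);
    }
    return trimmed;
}

var project = parseHeading('someuser / someproject'); // made-up sample heading
var language = stripBrackets('(Python)');
```

Unlike the bare substring calls, these versions leave the text alone when the expected separator or brackets are missing.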

While my solution works, I'm sure it's not optimal. Because I implemented an input method, there's no way to specify a search term from the command line, which is far from ideal. Hopefully I'll be able to improve the scraper once some additional documentation is written, or after I've dug through the node.io code some more.
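One small step in that direction is to stop hard-coding the search URL in two places. The sketch below is a hypothetical helper, `buildSearchUrl`, that reuses the query parameters from the scraper and varies only the term, language, and page. It does not answer the question of how to get a term from the node.io command line into the job; it only gives that value a single place to go.

```javascript
// Hypothetical helper: builds the search URL used by the scraper for any term and page.
// The query parameters are copied from the scraper; only q, language and
// start_value vary.
function buildSearchUrl(term, language, page) {
    return 'https://github.com/search?type=Repositories' +
        '&language=' + encodeURIComponent(language) +
        '&q=' + encodeURIComponent(term) +
        '&repo=&langOverride=&x=0&y=0' +
        '&start_value=' + page;
}

var url = buildSearchUrl('django', 'python', 2);
```

Both the input and run methods could then call the same function, with page 1 in input and the page number in run.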

There's lots more that node.io can do. It has built-in functions to do things like calculate the PageRank of a domain and resolve domain names to IPs, plus lots of other useful utilities. Like Node.js, it also has full support for CoffeeScript. It's a fantastic tool to have in your toolbox!

Posted on 15 Apr 2011