
I have created a simple web scraper that pulls in article titles and URLs from this website: http://espn.go.com/college-football/. However, the scraper only returns 46-50 articles instead of all the articles on the site. I've tried changing the CSS selector that cheerio uses, but nothing changes with regard to the number of articles it scrapes. Here is the code I'm using:

var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var mongo = require('mongoskin');
var db = mongo.db("mongodb://localhost:27017/test", { native_parser: true });

var url = 'http://espn.go.com/college-football/';

function Headline(title, link) {
    this.Title = title;
    this.link = link;
}

request(url, function (error, response, html) {
    if (!error) {
        var $ = cheerio.load(html);

        var result = [];

        // Grab the article titles/URLs from the news feed
        $('.text-container h1 a.realStory', '#news-feed-content').each(function (i, elem) {
            console.log($(elem).text(), elem.attribs.href);
            var articleObject = new Headline($(elem).text(), elem.attribs.href);
            result.push(articleObject);
        });

        // Write the scraped articles to disk...
        fs.writeFile('espn_articles.json', JSON.stringify(result, null, 4), function (err) {
            if (err) throw err;
            console.log('File successfully written! - Check your project directory for the espn_articles.json file');
        });

        // ...and store them in MongoDB
        db.collection('articles').insert(result, function (error, record) {
            if (error) throw error;
            console.log("data saved");
        });
    }
});
  • Scroll down the page and you'll see content gets added via infinite scroll loading. Commented Jun 27, 2016 at 19:05
  • @charlietfl Can you elaborate? I know the page loads more content as you scroll, but I'd like to know how to get the articles from scrolling all the way down. Commented Jun 27, 2016 at 19:28

2 Answers


Here's an example using Osmosis.

var osmosis = require('osmosis'); // npm install osmosis
// db below is the mongoskin connection from the question

osmosis('http://espn.go.com/college-football/')
    .find('#news-feed-content .text-container')
    .set({
        author:   '.author',
        category: '.category-link',
        title:    '.realStory',
        link:     '.realStory@href',
        blurb:    'p'
    })
    .follow('.realStory@href')
    .set({
        date:    '.article-meta @data-date',
        images:  [ 'picture @srcset' ],
        content: '.article-body'
    })
    .data(function (article) {
        /*
        { author: '...',
          category: '...',
          title: 'Harbaugh, Michigan reel in Florida OL Herbert',
          link: '...',
          blurb: 'Jim Harbaugh and Michigan have landed another recruit from SEC country in Kai-Leon Herbert of Florida.',
          date: '2016-07-06T17:25:09Z',
          images: [ '...', '...' ],
          content: '...'
        }
        */

        db.collection('articles').insert(article, function (error, record) {
            // ...
        });
    })
    .log(console.log)
    .error(console.log)
    .debug(console.log);
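
Note that .follow('.realStory@href') tells Osmosis to visit each article's own page, which is why the second .set can pull the date, images, and full body that aren't available in the feed listing.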



When you take a look at the page with the Chrome dev tools, you'll see that it makes an API call every time it renders more posts. Here's the URL: http://cdn.espn.go.com/core/now?render=true&partial=nowfeed&xhr=1&sport=ncf&offset=0&device=desktop&userab=8

I assume that the offset param is used for pagination.
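
If the endpoint really does return JSON when xhr=1 is set (worth verifying in the dev tools), a minimal sketch of paging through it with the same request module from the question might look like this; the stop condition and the shape of each page are assumptions, so inspect a real response first:

var request = require('request');

// Sketch only: assumes xhr=1 makes the endpoint return JSON and that the
// feed is paged by bumping offset. The stop condition below is a guess;
// replace it with whatever the real payload uses to signal the end.
var baseUrl = 'http://cdn.espn.go.com/core/now?render=true&partial=nowfeed&xhr=1&sport=ncf&device=desktop&userab=8';
var MAX_PAGES = 10; // safety cap so the sketch can't loop forever

function fetchPage(offset, callback) {
    request({ url: baseUrl + '&offset=' + offset, json: true }, function (error, response, body) {
        if (error) return callback(error);
        callback(null, body); // with json: true, body is the parsed payload
    });
}

function fetchAll(offset, pages, done) {
    fetchPage(offset, function (err, body) {
        if (err) return done(err);
        if (!body || offset >= MAX_PAGES) return done(null, pages); // hypothetical stop condition
        pages.push(body);
        fetchAll(offset + 1, pages, done);
    });
}

fetchAll(0, [], function (err, pages) {
    if (err) throw err;
    console.log('fetched ' + pages.length + ' pages of the feed');
});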

Keep in mind that scraping is "illegal" in certain cases, so it's better to ask for permission first.

Hope it helps!

2 Comments

Thanks @Sam. I contacted them and made sure I could scrape the content before starting the project. I guess I'm still lost on how to grab the articles from scrolling down. Should I implement something like PhantomJS?
Hi, in this case you don't need to. Just use a variable for the offset value in the URL. Good luck!
