Implementing a Generic Web Scraper using Node.js

Question

I would like to implement a basic web scraper using Node.js that is as generic as possible. I want the application to be able to parse and return the text from any HTML, ignoring any Markup/CSS/Script, without having to know the structure of the HTML being parsed ahead of time.

I have been looking at using this library:

https://github.com/cheeriojs/cheerio

With the code below I can extract the text from the body tag, but the result also contains CSS and JavaScript. What would be the best way to extract only the text, without the CSS/JavaScript?

Code:

var request = require('request');
var cheerio = require('cheerio');
var URL = require('url-parse');

var pageToVisit = "http://www.arstechnica.com";
console.log("Visiting page " + pageToVisit);
request(pageToVisit, function (error, response, body) {
    if (error) {
        console.log("Error: " + error);
        // Stop here: response is undefined when the request itself fails
        return;
    }
    // Check status code (200 is HTTP OK)
    console.log("Status code: " + response.statusCode);
    if (response.statusCode === 200) {
        // Parse the document body
        var $ = cheerio.load(body);
        console.log($('body').text());
    }
});

Reading the library's documentation I saw that it provides a .remove() method. Maybe you could use it to remove unwanted elements.
– t.m.adam, Commented Jan 15, 2019 at 17:31

Silvio Biasiol · Accepted Answer · 2019-01-15 15:45:45Z

2

Looking at other answers, I've seen that you can use regular expressions to do this. Here's an example:

let scriptRegex = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
let styleRegex = /<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi;

// Example HTML content
const str = `
my cool html content
<style>
...
</style>
my cool html content
<style type="text/css">
...
</style>
my cool html content
<script> 
... 
</script>
my cool html content`;

// Strip the <script> and <style> blocks from the HTML
let result = str.replace(scriptRegex, '');
result = result.replace(styleRegex, '');

// There you go :)
console.log('Substitution result: ', result);

Hope it helps!
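
The snippet above removes the script and style blocks but leaves every other tag in place. As a rough sketch of going all the way to plain text with regular expressions only, something like the following could work. The helper name htmlToText and the sample page are assumptions for illustration. Regular expressions are not a real HTML parser, so this can break on unusual markup, and it does not decode entities such as &amp;.

```javascript
// htmlToText is a hypothetical helper; the regexes only approximate HTML parsing.
function htmlToText(html) {
    return html
        // Drop whole <script> and <style> blocks, including multi-line contents.
        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, ' ')
        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, ' ')
        // Drop HTML comments.
        .replace(/<!--[\s\S]*?-->/g, ' ')
        // Replace every remaining tag with a space so adjacent words stay separated.
        .replace(/<[^>]+>/g, ' ')
        // Collapse whitespace runs left behind by the removals.
        .replace(/\s+/g, ' ')
        .trim();
}

// Made-up sample with a multi-line style block and a script containing "<".
const page = `<div><style type="text/css">
p { color: red; }
h1 { margin: 0; }
</style>
<h1>Title</h1>
<script>if (a < b) { run(); }</script>
<p>Some <b>bold</b> text.</p></div>`;

console.log(htmlToText(page));
```

For anything beyond simple pages, a parser-based approach is usually more reliable than patterns like these.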



bristweb · Accepted Answer · 2019-01-15 21:28:26Z

0

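I believe cheerio.load(body) gives you a jQuery-style wrapper around the parsed document rather than a browser DOM, so there is no innerText property on it. You can remove the script and style elements and then call .text(), something like this:

    // Parse the document body
    var $ = cheerio.load(body);
    $('script, style').remove();
    console.log($('body').text());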

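If you would rather work with a real DOM, you could parse the HTML with jsdom, something like this:

    // Parse the HTML into a DOM (as far as I know, jsdom does not implement innerText, so use textContent)
    const { JSDOM } = require('jsdom');
    const doc = new JSDOM(body, { url: pageToVisit }).window.document;
    // textContent includes script and style text, so remove those elements first
    doc.querySelectorAll('script, style').forEach(function (el) { el.remove(); });
    console.log(doc.body.textContent);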

