I would like to implement a basic web scraper using Node.js that is as generic as possible. I want the application to be able to parse and return the text from any HTML, ignoring any Markup/CSS/Script, without having to know the structure of the HTML being parsed ahead of time.

I have been looking at using this library:

https://github.com/cheeriojs/cheerio

With the code below I am able to extract text from the body tag; however, the output also contains CSS and JavaScript. What would be the best way to extract only the text and not include the CSS/JavaScript?

Code:

var request = require('request');
var cheerio = require('cheerio');
var URL = require('url-parse');

var pageToVisit = "http://www.arstechnica.com";
console.log("Visiting page " + pageToVisit);
request(pageToVisit, function (error, response, body) {
    if (error) {
        console.log("Error: " + error);
        return; // response is undefined when the request fails
    }
    // Check status code (200 is HTTP OK)
    console.log("Status code: " + response.statusCode);
    if (response.statusCode === 200) {
        // Parse the document body
        var $ = cheerio.load(body);
        console.log($('body').text());
    }
});

  • Reading the library's documentation, I saw that it provides a .remove() method. Maybe you could use it to remove unwanted elements. (Commented Jan 15, 2019 at 17:31)
  • That would look like $('script,style').remove() (Commented Jan 15, 2019 at 23:15)

2 Answers

Looking at other answers, I've seen that you can use regular expressions to do this. Here's an example:

let scriptRegex = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
// Match <style>, with or without attributes, and everything up to the nearest </style>,
// including content that spans several lines
let styleRegex = /<style\b[^>]*>[\s\S]*?<\/style>/gi;

// An example html content
const str = `
my cool html content
<style>
...
</style>
my cool html content
<style type="text/css">
...
</style>
my cool html content
<script> 
... 
</script>
my cool html content`;

// Strip the script and style blocks from the html
let result = str.replace(scriptRegex, '');
result = result.replace(styleRegex, '');

// There you go :)
console.log('Substitution result: ', result);
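
If you also want to get rid of the remaining markup, the same idea can be pushed a bit further. The following is only a rough sketch: htmlToText is a made-up helper name (it is not part of cheerio or any other library), the entity list is deliberately incomplete, and regex-based stripping will trip over unusual or malformed markup, so a real parser is the safer choice for arbitrary pages.

```javascript
// Hypothetical helper (not a library function): reduce an HTML string to its visible text.
function htmlToText(html) {
  return html
    // Drop comments first so a commented-out <script> cannot confuse later steps
    .replace(/<!--[\s\S]*?-->/g, '')
    // Drop script and style elements together with their contents
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Replace every remaining tag with a space so adjacent words stay separated
    .replace(/<[^>]+>/g, ' ')
    // Decode a handful of common entities (not exhaustive); &amp; goes last
    // so that text such as "&amp;lt;" is not decoded twice
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')
    // Collapse runs of whitespace into single spaces
    .replace(/\s+/g, ' ')
    .trim();
}

// Made-up sample input for illustration
const sample = `
<html><head><style type="text/css">p { color: red; }</style></head>
<body><h1>Title</h1><p>Fish &amp; chips</p>
<script>var x = "<b>not text</b>";</script></body></html>`;

console.log(htmlToText(sample));
```

For the sample above this prints "Title Fish & chips". In the question's code you would pass body to the helper instead of the sample string.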

Hope it helps!

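I believe cheerio.load(body) gives you a jQuery-style wrapper around the parsed document rather than browser DOM nodes, so the object it returns has no innerText property. The closest equivalent is to remove the script and style elements and then call .text(), something like this:

    // Parse the document body
    var $ = cheerio.load(body);
    // Drop script and style elements so their contents are not part of the text
    $('script, style').remove();
    console.log($('body').text());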

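If you would rather work with a standard DOM, you could parse the HTML with JSDOM instead. As far as I know, jsdom does not implement innerText, so remove the script and style elements and read textContent, something like this:

    // Parse the document body
    const { JSDOM } = require('jsdom');
    const document = new JSDOM(body, { url: pageToVisit }).window.document;
    // Drop script and style elements so their contents are not part of the text
    document.querySelectorAll('script, style').forEach((el) => el.remove());
    console.log(document.body.textContent);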
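
Whichever property ends up supplying the text, the result usually keeps the indentation and blank lines of the original markup. A small clean-up pass along these lines might help; tidyText is a hypothetical helper name and the sample string is made up for illustration.

```javascript
// Hypothetical helper: tidy the whitespace in text pulled out of a DOM.
function tidyText(text) {
  return text
    .split(/\r?\n/)
    // Collapse spaces, tabs and non-breaking spaces inside each line
    .map((line) => line.replace(/[ \t\u00a0]+/g, ' ').trim())
    // Drop lines that are empty after trimming
    .filter((line) => line.length > 0)
    .join('\n');
}

// Made-up sample resembling raw text extracted from a page
const raw = '\n\n   Headline  \n\t\n  First   paragraph.\n\n\n  Second\tparagraph. \n';
console.log(tidyText(raw));
```

This keeps one line per non-empty source line, which is often easier to read than collapsing everything into a single string.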
