I would like to implement a basic web scraper using Node.js that is as generic as possible. I want the application to be able to parse and return the text from any HTML, ignoring any Markup/CSS/Script, without having to know the structure of the HTML being parsed ahead of time.
I have been looking at using this library:
https://github.com/cheeriojs/cheerio
With the code below I am able to extract the text from the body tag; however, the result also contains CSS and JavaScript. What would be the best way to extract only the text, without the CSS/JavaScript?
Code:
var request = require('request');
var cheerio = require('cheerio');
var URL = require('url-parse');

var pageToVisit = "http://www.arstechnica.com";
console.log("Visiting page " + pageToVisit);
request(pageToVisit, function (error, response, body) {
  if (error) {
    console.log("Error: " + error);
    return; // response is undefined when the request itself fails
  }
  // Check status code (200 is HTTP OK)
  console.log("Status code: " + response.statusCode);
  if (response.statusCode === 200) {
    // Parse the document body
    var $ = cheerio.load(body);
    console.log($('body').text());
  }
});
Cheerio has a .remove() method. Maybe you could use it to remove unwanted elements before reading the text:
$('script,style').remove()