How to use Node.js to create modified versions of html documents?

Question

I am trying to do this:

Read html document "myDocument.html" with Node
Insert contents of another html document named "foo.html" immediately after the open body tag of myDocument.html.
Insert contents of yet another html document named "bar.html" immediately before the close body tag of myDocument.html.
Save the modified version of "myDocument.html".

To do the above, I would need to search the DOM with Node to find the open and closing body tags. How can this be done?

Shrey Gupta · Accepted Answer · 2013-11-14 06:29:15Z

Very simply, you can use the built-in file system module that ships with Node.js (var fs = require("fs")). It lets you read the HTML into a string, perform string replacements, and save the result by writing it to a file.

The advantage is that this solution is completely native and requires no external libraries. Because the document is never parsed and re-serialized, everything outside the inserted content stays exactly as it was in the original HTML file.

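var fs = require('fs');

// Read each file as a UTF-8 string (without an encoding, readFile returns a Buffer).
fs.readFile('myDocument.html', 'utf8', function (err, myDocData) {
    if (err) throw err;
    fs.readFile('foo.html', 'utf8', function (err, fooData) { // reads foo file
        if (err) throw err;
        // replace() returns a new string, so the result has to be assigned back.
        myDocData = myDocData.replace(/<body>/, "<body>" + fooData); // adds foo file to HTML
        fs.readFile('bar.html', 'utf8', function (err, barData) { // reads bar file
            if (err) throw err;
            myDocData = myDocData.replace(/<\/body>/, barData + "</body>"); // adds bar file to HTML
            fs.writeFile('myDocumentNew.html', myDocData, function (err) { // writes new file
                if (err) throw err;
            });
        });
    });
});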
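The read, modify, write sequence can also be sketched without nested callbacks by using the synchronous fs calls and plain string slicing. This is only an illustration under some assumptions: the helper name `buildDocument` and its `dir`/`outName` parameters are made up for the example, the document is assumed to contain one real `<body>` element, and the closing tag is assumed to be written in lowercase.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: reads myDocument.html, foo.html and bar.html from `dir`,
// inserts foo right after the opening body tag and bar right before the last
// closing body tag, then writes the result to `outName` in the same directory.
function buildDocument(dir, outName) {
  const read = (name) => fs.readFileSync(path.join(dir, name), 'utf8');
  const html = read('myDocument.html');

  // Opening tag may carry attributes, e.g. <body class="home">.
  const open = /<body(\s[^>]*)?>/i.exec(html);
  // Assumption: the closing tag is lowercase; the last occurrence is used.
  const closeAt = html.lastIndexOf('</body>');
  if (!open || closeAt === -1) {
    throw new Error('No <body> element found in myDocument.html');
  }

  const openEnd = open.index + open[0].length;
  const result =
    html.slice(0, openEnd) + read('foo.html') +
    html.slice(openEnd, closeAt) + read('bar.html') +
    html.slice(closeAt);

  fs.writeFileSync(path.join(dir, outName), result);
  return result;
}
```

Calling `buildDocument('.', 'myDocumentNew.html')` would process the files in the current directory. Because the pieces are concatenated rather than passed to `replace()`, any `$` characters in foo.html or bar.html are inserted literally.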

Andrew · Accepted Answer · 2013-11-14 06:16:16Z

A simple but not fully robust way to do this (assuming str holds the contents of myDocument.html and read() returns a file's contents as a string):

str = str.replace(/(<body.*?>)/i, "$1"+read('foo.html'));

str = str.replace(/(<\/body>)/i, read('bar.html')+'$1');

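This will not work correctly if myDocument.html contains more than one "<body ..." or "</body>" (for example inside JavaScript). Also, foo.html and bar.html must not contain "$1", "$2", and so on, because replace() treats those sequences in the replacement string as references to captured groups.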
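The "$1" caveat can be seen, and sidestepped, in a short sketch. When the second argument to `replace()` is a function instead of a string, its return value is inserted as-is and "$" sequences are not expanded. The helper names and the sample strings below are made up for illustration; the "multiple body tags" limitation still applies.

```javascript
// Hypothetical helpers: the replacer function receives the matched tag and
// returns the text to put in its place, with no "$" expansion.
function insertAfterBodyOpen(html, insert) {
  return html.replace(/<body(\s[^>]*)?>/i, (tag) => tag + insert);
}

function insertBeforeBodyClose(html, insert) {
  return html.replace(/<\/body>/i, (tag) => insert + tag);
}

// Sample strings standing in for the file contents.
const doc = '<body id="main"><p>hi</p></body>';
const price = '<p>Cost: $1</p>';

// String replacement: the "$1" inside the inserted HTML is expanded too.
const naive = doc.replace(/(<body.*?>)/i, '$1' + price);

// Function replacement: the inserted HTML is kept literally.
const safe = insertAfterBodyOpen(doc, price);

console.log(naive);
console.log(safe);
```

In the first result the price text is corrupted because "$1" is replaced by the captured body tag; in the second it survives unchanged.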

If you can edit myDocument.html, you can instead leave "placeholders" in it (as HTML comments), like

<!--foo.html-->

Then it's easy: just replace each placeholder with the contents of the corresponding file.
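A minimal sketch of the placeholder idea, assuming the placeholders are HTML comments whose text is the file name. The function name `fillPlaceholders` and the `parts` object (a map from placeholder name to replacement HTML) are hypothetical; in practice the values would come from reading foo.html and bar.html.

```javascript
// Replaces every <!--name--> comment whose name is a key of `parts` with the
// corresponding HTML. Comments with unknown names are left untouched.
function fillPlaceholders(html, parts) {
  return html.replace(/<!--\s*([\w.-]+)\s*-->/g, (comment, name) =>
    Object.prototype.hasOwnProperty.call(parts, name) ? parts[name] : comment
  );
}

// Sample input standing in for myDocument.html and the two fragments.
const template = '<body><!--foo.html--><p>x</p><!--bar.html--><!-- note --></body>';
const filled = fillPlaceholders(template, {
  'foo.html': '<h1>$&</h1>',
  'bar.html': '<footer></footer>',
});

console.log(filled);
```

Because the replacement is produced by a function, the fragments may contain "$" sequences without being altered, and the approach does not depend on where the body tags are.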

1 Comment

Peter Lyons Over a year ago

I am required to link to this codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Peter Lyons · Accepted Answer · 2013-11-14 07:05:23Z

Use the cheerio library, which has a simplified jQuery-ish API.

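var cheerio = require('cheerio');
var dom = cheerio.load(myDocumentHTMLString); // parse the HTML string into a queryable document
dom('body').prepend(fooHTMLString);           // insert right after the opening body tag
dom('body').append(barHTMLString);            // insert right before the closing body tag
var finalHTML = dom.html();                   // serialize the modified document back to a string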

And just to be clear, since the legions of pro-regex individuals are already appearing in droves: yes, you need a real parser. No, you cannot use a regular expression. Read Stack Overflow co-founder Jeff Atwood's post on parsing HTML the Cthulhu way.

edited Nov 14, 2013 at 7:05

answered Nov 14, 2013 at 6:26

8 Comments

Shrey Gupta Over a year ago

Is the extra library really necessary though?

Peter Lyons Over a year ago

Glad you asked! codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Shrey Gupta Over a year ago

Darn, you have the Jeff Atwood on your side there. However, seeing as only the body tag needs to be identified, I don't think a whole parsing library would be necessary as such. Nevertheless, you may want to mention the filesystem functions, as the OP specifically mentions reading and saving the file; not just modifying the dom.

Peter Lyons Over a year ago

Yes, but Stack Overflow requires "thoroughly researched" questions and attempted code snippets. Reading files in Node is clearly documented and the web is full of examples. OP has >2K rep. I think the crux of his question has to do with the HTML modification, not stuff you learn in your first 90 seconds of a Node.js tutorial.

Supr Over a year ago

Jeff doesn't appear to be on anyone's side in that post:

It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding **every trivial HTML processing task be handled by a full-blown parsing engine**. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.
