Removing script from HTML string using node-html-parser

Question

I have an API which receives a string containing HTML code and stores it in a database. I'm using the node-html-parser package to perform some logic on the HTML.

Among other things, I want to remove any potentially malicious scripts. According to the documentation, the package should be able to do this when instructed via the options object (see the 'Global Methods' heading in the package documentation).

My code:

const parser = require('node-html-parser');
const html = `<p>My text</p><script></script>`
const options = {
    blockTextElements: {
        script: false
    }
}
const root = parser.parse(html, options)
return ({ html: root.innerHTML})

I tried modifying the options object with script: true, noscript: false, and noscript: true as well, but none of these removed the script tags from the HTML.

Am I doing something wrong?

You should be mindful that this strategy still leaves you wide open to other JavaScript injection vectors, such as on* attributes in the HTML markup itself. Depending on how you piece the markup back together at the end, you may also still be vulnerable to markup à la <scr<script>Ha!</script>ipt> alert(document.cookie);</script> (h/t to this SO thread). You really should re-evaluate mitigations for this type of attack depending on your broader threat model.
– esqew, Commented May 28, 2022 at 21:37
A safer approach would be to use a more security-focused library like sanitize-html, which is specifically geared to minimize or eliminate these potential attack vectors by letting you configure an explicit allow-list of HTML element types and attributes that fit your use case.
– esqew, Commented May 28, 2022 at 21:40
@esqew this is really insightful. I'll have a look into integrating this package for sanitisation. Thank you for making me aware.
– Sam, Commented May 28, 2022 at 21:55
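
The nested-tag payload in the first comment can be reproduced in a few lines. This is only a sketch: it uses a plain regex replace rather than node-html-parser, and the helper names stripScriptsOnce and stripScriptsUntilStable are hypothetical, introduced here for illustration.

```javascript
// Naive single-pass removal of <script>...</script> blocks with a regex.
// stripScriptsOnce is a hypothetical helper, for illustration only.
function stripScriptsOnce(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// The nested payload quoted in the comment above.
const payload = '<scr<script>Ha!</script>ipt> alert(document.cookie);</script>';

// One pass removes only the inner element; the leftover "<scr" and "ipt>"
// pieces then join up into a new, complete script element.
const once = stripScriptsOnce(payload);
console.log(once);

// Repeating the pass until the string stops changing handles this payload,
// but it does nothing about on* attributes or other injection vectors.
function stripScriptsUntilStable(html) {
  let previous;
  let current = html;
  do {
    previous = current;
    current = stripScriptsOnce(current);
  } while (current !== previous);
  return current;
}

console.log(stripScriptsUntilStable(payload));
```

Even the repeated version only addresses this one pattern, which is why the comments point towards an allow-list sanitizer instead of hand-rolled stripping.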

Jaood_xD · Accepted Answer · 2022-05-28 21:24:48Z

3

It seems node-html-parser is somewhat buggy with script: false, but we can still use this library to work with the DOM. My solution is to use querySelectorAll to find all the <script> tags and remove them, so the final solution might look like:

const parser = require('node-html-parser');
let html = '<html>asdasd<script></script></html>';

// Convert the plain HTML string to a DOM
let dom = parser.parse(html);

// Select all the script tags in the DOM and remove them
dom.querySelectorAll('script').forEach(x => x.remove());

// The DOM now contains everything except the script tags.
// To transform the DOM back to plain HTML, call toString()
console.log(dom.toString());

edited May 28, 2022 at 21:24

answered May 28, 2022 at 20:18
