I have an API which receives a string containing HTML code and stores it in a database. I'm using the node-html-parser package to perform some logic on the HTML.

Among other things, I want to remove any potentially malicious scripts. According to the documentation, the package should be able to do this when instructed via the options object (see the 'Global Methods' heading in the package documentation).

My code:

const parser = require('node-html-parser');
const html = `<p>My text</p><script></script>`
const options = {
    blockTextElements: {
        script: false
    }
}
const root = parser.parse(html, options)
return ({ html: root.innerHTML})

I also tried modifying the options object with script: true, noscript: false, and noscript: true, but none of these removed the script tags from the HTML.

Am I doing something wrong?

  • You should be mindful that this strategy still leaves you wide open to other JavaScript injection vectors, such as on* attributes in the HTML markup itself, among others. Depending on how you're piecing the markup back together on the tail end, you may also still be very vulnerable to markup a la <scr<script>Ha!</script>ipt> alert(document.cookie);</script> (h/t to this SO thread). You really should re-evaluate mitigations for this type of attack depending on your broader threat model. Commented May 28, 2022 at 21:37
  • A safer approach would be to use a more security-focused library like sanitize-html, which is specifically geared to minimize or eliminate these potential attack vectors by letting you configure an explicit allow-list of HTML element types and attributes that fit your use case. Commented May 28, 2022 at 21:40
  • @esqew this is really insightful. I'll have a look into integrating this package for sanitisation. Thank you for making me aware. Commented May 28, 2022 at 21:55
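
To make the two pitfalls from the comments above concrete, here is a minimal, dependency-free sketch. It does not use node-html-parser; stripScriptsOnce is a hypothetical helper written only for this illustration, and it assumes the stripping happens in a single pass over the raw string, which may or may not match how your pipeline reassembles markup.

```javascript
// Hypothetical helper (not part of any library): a naive, single-pass,
// string-level removal of <script>...</script> blocks.
function stripScriptsOnce(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// Pitfall 1: nested markup. Removing the inner block splices the
// surrounding fragments back together into a working <script> tag.
const nested = '<scr<script>Ha!</script>ipt> alert(document.cookie);</script>';
const afterOnePass = stripScriptsOnce(nested);
console.log(afterOnePass.includes('<script>')); // true

// Pitfall 2: event-handler attributes. There is no <script> element here,
// so the markup passes through unchanged, handler included.
const handler = '<img src="x" onerror="alert(1)">';
console.log(stripScriptsOnce(handler) === handler); // true
```

Both checks print true, which is why an allow-list sanitizer is generally a safer choice than removing script elements alone.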

1 Answer

It seems that node-html-parser is somewhat buggy when script: false is set, but we can still use the library to work with the DOM. My solution is to use querySelectorAll to find all the <script> tags and remove them, so the final solution might look like this:

const parser = require('node-html-parser');
let html = '<html>asdasd<script></script></html>';
// Convert the plain HTML string to a DOM
let dom = parser.parse(html);
// Select all the script tags in the DOM and remove them
dom.querySelectorAll('script').forEach(x => x.remove());

// The DOM now contains everything except the script tags.
// To transform the DOM back to plain HTML, call toString()
console.log(dom.toString());