Parsing HTML text with JS - Extra Node?

Question

Hi everyone.

I'm building a piece of software that parses the text of a given HTML document, and when I collect all of the paragraphs from the HTML, I find an extra node.

I've created the

 <p id="original_content_js"> Original content via JS:<br> </p>

to hold the output of the parsing, so I can compare it to the data being parsed (the original text).

This is the HTML code:

<p id="original_content_js">
Original content via JS:<br>
</p>

<div id="original_text">    

        <h3>Molly's Sheep</h3>
        <p>
            Molly had a little sheep. <br>
            Molly didn't like her sheep. Ir was too hairy.<br>
            So Molly took a big knife, and cut all of her sheep's fur.<br>
            Now Molly's sheep is cold.<br>
        </p>
        <p>
            But what Molly did not know, was that her sheep is a magical sheep;<br>
            Molly's sheep grows hair instantly, magically!<br>
            Oh, how wonderful, Molly's sheep,<br>
            Making hair, each and each<br>
            Hair grows quickly after cut,<br>
            That's what the story's all about.
        <p>     
    </div>

And this is the parsing code:

 var html_text_name = "original_text";
 var html_text = document.getElementById(html_text_name);
 // Live collection of every <p> element inside the container
 var text_paragraphs = html_text.getElementsByTagName("p");
 for (var x = 0; x < text_paragraphs.length; x++) {
    // Wrap each paragraph's markup in "ABC" ... "CBA" and append it to the output
    document.getElementById("original_content_js").innerHTML += "ABC" +
    text_paragraphs[x].innerHTML + "CBA <br>";
 }

And the result I get in the original_content_js paragraph is:

 Original content via JS:
 ABC Molly had a little sheep. 
 Molly didn't like her sheep. Ir was too hairy.
 So Molly took a big knife, and cut all of her sheep's fur.
 Now Molly's sheep is cold.
 CBA 
 ABC But what Molly did not know, was that her sheep is a magical sheep;
 Molly's sheep grows hair instantly, magically!     
 Oh, how wonderful, Molly's sheep,
 Making hair, each and each
 Hair grows quickly after cut,
 That's what the story's all about. CBA 
 ABC CBA

So you can see that I'm getting what I expect, 2 paragraphs each wrapped in "ABC" and "CBA", except that there is one more empty node at the end. Where does the extra node come from?

Brad Mash · Accepted Answer · 2015-10-12 23:45:17Z


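Your paragraphs are not properly closed. The very last p tag in your HTML (the one just before </div>) is an opening <p> where a closing </p> was intended. When the browser's parser reaches that <p>, it implicitly closes the second paragraph and starts a third, empty one, which is then closed by the </div>. So the DOM really does contain three p elements, and the collection returned by getElementsByTagName("p") has a length of 3 instead of 2. Changing that last <p> to </p> removes the extra node. If you need to detect this kind of mistake in HTML you don't control, you will need to write a regex to check for it... but beware... writing regex for HTML parsing is a scary thing, and it is typically impossible to do accurately 100% of the time.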

EDIT: I'm not saying you shouldn't write a regex for checking if tags are properly closed depending on your situation... I'm just saying, be careful.
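
As a rough illustration of the kind of check meant above, here is a minimal sketch that just counts opening and closing p tags in an HTML string. The function name countParagraphTags and the sample string are made up for this example, and this is only a heuristic: it knows nothing about comments, scripts, or the places where HTML legitimately allows </p> to be left out, so a mismatch is a hint to look closer, not proof of an error.

```javascript
// Heuristic only: counts <p ...> and </p> tags in a string of HTML.
// countParagraphTags is a hypothetical helper, not part of any library.
function countParagraphTags(html) {
  // Matches <p> or <p attr="..."> but not <pre>, <param>, etc.
  const opening = html.match(/<p(\s[^>]*)?>/gi) || [];
  // Matches </p> with optional whitespace before the closing bracket.
  const closing = html.match(/<\/p\s*>/gi) || [];
  return { opening: opening.length, closing: closing.length };
}

// Made-up sample with the same mistake as in the question:
// the last tag is <p> where </p> was intended.
const sample = [
  "<p>first paragraph</p>",
  "<p>second paragraph",
  "<p>",
].join("\n");

const counts = countParagraphTags(sample);
console.log(counts); // { opening: 3, closing: 1 }

if (counts.opening !== counts.closing) {
  console.log("Warning: <p> tags look unbalanced");
}
```

Run on the HTML from the question, the same idea would report 3 opening tags against 1 closing tag, which points at the stray <p> before the </div>.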
