
Hello everyone.

I'm building software that parses text from given HTML, and when I collect all of the paragraphs from the HTML, I find an extra node.

I've created the

 <p id="original_content_js"> Original content via JS:<br> </p>

to hold the data produced by the parsing, so I can compare it to the data being parsed (the original text).

This is the HTML code:

<p id="original_content_js">
Original content via JS:<br>
</p>

<div id="original_text">    

        <h3>Molly's Sheep</h3>
        <p>
            Molly had a little sheep. <br>
            Molly didn't like her sheep. Ir was too hairy.<br>
            So Molly took a big knife, and cut all of her sheep's fur.<br>
            Now Molly's sheep is cold.<br>
        </p>
        <p>
            But what Molly did not know, was that her sheep is a magical sheep;<br>
            Molly's sheep grows hair instantly, magically!<br>
            Oh, how wonderful, Molly's sheep,<br>
            Making hair, each and each<br>
            Hair grows quickly after cut,<br>
            That's what the story's all about.
        <p>     
    </div>

And this is the parsing code:

 var html_text_name = "original_text";
 var html_text = document.getElementById(html_text_name);
 var text_paragaphs = html_text.getElementsByTagName("p");
 for (var x=0; x<text_paragaphs.length; x++){
    document.getElementById("original_content_js").innerHTML += "ABC" +
    text_paragaphs[x].innerHTML + "CBA <br>";
 }

And the result I get into the original_content_js paragraph is:

 Original content via JS:
 ABC Molly had a little sheep. 
 Molly didn't like her sheep. Ir was too hairy.
 So Molly took a big knife, and cut all of her sheep's fur.
 Now Molly's sheep is cold.
 CBA 
 ABC But what Molly did not know, was that her sheep is a magical sheep;
 Molly's sheep grows hair instantly, magically!     
 Oh, how wonderful, Molly's sheep,
 Making hair, each and each
 Hair grows quickly after cut,
 That's what the story's all about. CBA 
 ABC CBA

So you can see that I'm getting what I expect, 2 paragraphs wrapped in "ABC" and "CBA", except for an extra empty node at the end. Why is there an extra node?

1 Answer


The last p tag in your HTML is an opening tag (<p>) where a closing tag (</p>) should be. The second paragraph is never properly closed, so the browser's parser ends it at that point and starts a third, empty paragraph. That is why getElementsByTagName("p") returns 3 elements instead of 2, and text_paragaphs.length is 3. Changing that final <p> to </p> fixes it. If you can't control the markup, you will need to check for this yourself. You could write a regex for that, but beware: parsing HTML with regex is a scary thing, and it is typically impossible to get right 100% of the time.

EDIT: I'm not saying you shouldn't write a regex for checking if tags are properly closed depending on your situation... I'm just saying, be careful.
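
If fixing the markup is not an option, one possible workaround is to skip paragraphs that contain only whitespace instead of reaching for a regex. The following is a minimal sketch, not a definitive fix: nonEmptyParagraphs and wrapParagraphs are hypothetical helper names introduced here, and the commented-out browser usage assumes the ids from the question.

```javascript
// Hypothetical helper (not from the original post): keep only the elements
// whose innerHTML contains something other than whitespace.
function nonEmptyParagraphs(paragraphs) {
  // Array.from accepts array-likes such as the HTMLCollection
  // returned by getElementsByTagName.
  return Array.from(paragraphs).filter(function (p) {
    return p.innerHTML.trim() !== "";
  });
}

// Hypothetical helper: build the "ABC ... CBA <br>" output from the question,
// ignoring the empty paragraph created by the stray opening <p>.
function wrapParagraphs(paragraphs) {
  return nonEmptyParagraphs(paragraphs)
    .map(function (p) {
      return "ABC" + p.innerHTML + "CBA <br>";
    })
    .join("");
}

// In the browser (assuming the ids used in the question):
// var source = document.getElementById("original_text");
// document.getElementById("original_content_js").innerHTML +=
//     wrapParagraphs(source.getElementsByTagName("p"));
```

This only hides the symptom. The DOM still contains three p elements, so correcting the stray tag remains the cleaner fix whenever you control the HTML.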
