Better way of extracting text from HTML in Javascript

Question

I'm trying to scrape text from an HTML string by using container.innerText || container.textContent where container is the element from which I want to extract text.

Usually, the text I want to extract is located in <p> tags. So for the HTML below as an example:

<div id="container">
    <p>This is the first sentence.</p>
    <p>This is the second sentence.</p>
</div>

Using

var container = document.getElementById("container");
var text = container.innerText || container.textContent; // the text I want

will return This is the first sentence.This is the second sentence. without a space between the first period and the start of the second sentence.

My overall goal is to parse text using the Stanford CoreNLP, but its parser cannot detect that these are 2 sentences because they are not separated by a space. Is there a better way of extracting text from HTML such that the sentences are separated by a space character?

The HTML I'm parsing will have the text I want mostly in <p> tags, but the HTML may also contain <img>, <a>, and other tags embedded between <p> tags.

Any purpose of jQuery tag?

– A. Wolff, Commented Nov 24, 2014 at 18:30

Niet the Dark Absol · Accepted Answer · 2014-11-24 18:36:05Z

3

As a dirty hack, try using this:

container.innerHTML.replace(/<.*?>/g," ").replace(/ +/g," ");

This will replace all tags with a space, then collapse multiple spaces into a single one.

Note that if there is a > inside an attribute value, this will mess you up. Avoiding this problem will require more elaborate parsing, such as looping through all text nodes and putting them together.
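To make the caveat concrete, here is a minimal sketch of the same regex approach wrapped in a function. The name stripTags is hypothetical (it is not from the answer), and it collapses all whitespace with \s+ rather than only spaces, which is an assumption about what the parser downstream wants.

```javascript
// Hypothetical helper (not part of the original answer): the regex hack as a function.
function stripTags(html) {
  return html
    .replace(/<.*?>/g, " ") // replace each tag-like run with a space
    .replace(/\s+/g, " ")   // collapse runs of whitespace, including newlines
    .trim();
}

// Works for simple markup.
const simple = "<p>This is the first sentence.</p><p>This is the second sentence.</p>";
const simpleResult = stripTags(simple);
// "This is the first sentence. This is the second sentence."

// Breaks when an attribute value contains ">": the lazy match stops too early.
const tricky = '<a title="a > b">link</a>';
const trickyResult = stripTags(tricky);
// 'b">link'
```

The second case shows attribute debris leaking into the output, which is why walking the text nodes is the safer route.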

Longer but more robust method:

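function recurse(result, node) {
    var c = node.childNodes, l = c.length, i;
    for (i = 0; i < l; i++) {
        // Text node: append its contents followed by a space.
        if (c[i].nodeType == 3) result += c[i].nodeValue + " ";
        // Element node: descend into it, carrying the accumulated text along.
        if (c[i].nodeType == 1) result = recurse(result, c[i]);
    }
    return result;
}
var text = recurse("", container);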

Assuming I haven't made a stupid mistake, this will perform a depth-first search for text nodes, appending their contents to the result as it goes.
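A variation on the same depth-first idea, as a rough sketch: collect the pieces in an array and join them at the end, skipping whitespace-only text nodes. The names collectText, textNode, and elementNode are made up for illustration, and the mock tree of plain objects only stands in for real DOM nodes so the snippet can run outside a browser.

```javascript
// Node type values as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Depth-first walk that gathers trimmed text-node contents into an array.
function collectText(node, parts = []) {
  for (const child of Array.from(node.childNodes || [])) {
    if (child.nodeType === TEXT_NODE) {
      const value = child.nodeValue.trim();
      if (value) parts.push(value); // skip whitespace-only nodes
    } else if (child.nodeType === ELEMENT_NODE) {
      collectText(child, parts);
    }
  }
  return parts;
}

// Hypothetical stand-ins for DOM nodes (assumption: only these three properties are needed).
const textNode = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const elementNode = (...children) => ({ nodeType: ELEMENT_NODE, childNodes: children });

// Mirrors the <div id="container"> from the question, indentation included.
const mockContainer = elementNode(
  textNode("\n    "),
  elementNode(textNode("This is the first sentence.")),
  textNode("\n    "),
  elementNode(textNode("This is the second sentence.")),
  textNode("\n")
);

const extracted = collectText(mockContainer).join(" ");
// "This is the first sentence. This is the second sentence."
```

In a browser you would pass a real element, for example collectText(document.getElementById("container")).join(" "). One trade-off to be aware of: trimming every text node also puts a space around inline elements, so "Hello <a>world</a>." comes out as "Hello world .".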

edited Nov 24, 2014 at 18:36

answered Nov 24, 2014 at 18:30

Niet the Dark Absol


2 Comments

BrockLee Over a year ago

I have found a hacky way around this that takes care of the spaces as well as the > symbols, but I was hoping for more of a legitimate way of doing this task. If I can't find one, then I suppose this will have to suffice.

BrockLee Over a year ago

Fancy recursion you have there. This looks better than the way I was planning on extracting text. I'll try this out.

AWolf · Accepted Answer · 2020-12-20 00:27:31Z

2

jQuery has the method text() that does what you want. Will this work for you?

I'm not sure it covers everything that's in your container, but it works in my example. It also takes the text of an <a> tag and appends it to the result.

Update 20.12.2020

If you're not using jQuery, you could implement the text method with vanilla JS like this:

const nodes = Array.from(document.querySelectorAll("#container > *"));
const text = nodes
  .filter((node) => !!node.textContent)
  .map((node) => node.textContent)
  .join(" ");

This uses querySelectorAll("#container > *") to get every child element of the container. Array.from converts the result so we can work with Array methods like filter, map, and join.

Finally, the text is generated by filtering out elements without textContent, using map to get each element's text, and using join to add a space separator between the texts.
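The filter/map/join step can be tried without a page by feeding it plain objects. This is only a sketch: joinText is a hypothetical name, the objects stand in for the <p>, <img>, and <p> children of the demo markup, and it assumes you select the container's child elements (for example with "#container > *") rather than the container itself.

```javascript
// Hypothetical helper: join the trimmed textContent of each node with a single space.
function joinText(nodes) {
  return Array.from(nodes)
    .map((node) => (node.textContent || "").trim())
    .filter((text) => text.length > 0) // drop nodes with no text, such as <img>
    .join(" ");
}

// Plain objects standing in for DOM elements (assumption: only textContent is read).
const mockChildren = [
  { textContent: "This is the first sentence." },
  { textContent: "" }, // an <img> has empty textContent
  { textContent: "Third sentence." },
];

const joined = joinText(mockChildren);
// "This is the first sentence. Third sentence."
```

In a browser the call would look like joinText(document.querySelectorAll("#container > *")). Because only direct children are selected, text nested deeper inside a child is still picked up once through that child's textContent.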

$(function() {
    var textToParse = $('#container').text();
    $('#output').text(textToParse); // insert as plain text, not as markup
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="container">
    <p>This is the first sentence.</p>
    <p>This is the second sentence.</p>
    <img src="http://placehold.it/200x200" alt="Nice picture">
    <p>Third sentence.</p>
</div>

<h2>output:</h2>
<div id="output"></div>

edited Dec 20, 2020 at 0:27

answered Nov 24, 2014 at 18:47

AWolf


1 Comment

BrockLee Over a year ago

This answer seems to be the most efficient way to get text from HTML without having to resort to any cheap hacks, as I'm able to extract the sentences separated by whitespace with this. I was actually a bit hesitant to use jQuery because I'm using this to make a WordPress plugin with the TinyMCE API, and I wasn't sure how to load jQuery into the script I'm writing. I think this is the correct answer to my problem, though I now need to find out how to load jQuery into my WordPress plugin. Thanks.

Anupam Basak · Accepted Answer · 2014-11-24 19:28:53Z

1

You may use jQuery to traverse down the elements.

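Here is the code:

$(document).ready(function()
{
    // Start with every descendant element of the container.
    var children = $("#container").find("*");
    var text = "";

    // .html() returns the inner HTML of the first element in the set,
    // or undefined once the set is empty.
    while (children.html() !== undefined)
    {
        text += children.html()+"\n";
        // Move each element in the set to its next sibling.
        children = children.next();
    }

    alert(text);
});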

Here is the fiddle: http://jsfiddle.net/69wezyc5/

answered Nov 24, 2014 at 19:28

Anupam Basak


PeterKA · Accepted Answer · 2014-11-24 18:52:39Z

You can use the following function to extract and process the text as shown. It walks through all the child nodes of the target element, then the child nodes of those nodes, and so on, adding a space before each piece of text:

function getInnerText( sel ) {
    var txt = '';
    $( sel ).contents().each(function() {
        var children = $(this).children();
        // Parenthesize the conditional so the space is prepended to its result.
        txt += ' ' + ( this.nodeType === 3 ?
            this.nodeValue : children.length ?
            getInnerText( this ) : $(this).text() );
    });
    return txt;
}

alert( getInnerText( '#container' ) );

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<div id="container">
    Some other sentence
    <p>This is the first sentence.</p>
    <p>This is the second sentence.</p>
</div>
