I'm trying to scrape text from an HTML string by using container.innerText || container.textContent where container is the element from which I want to extract text.
Usually, the text I want to extract is located in <p> tags. For example, given the HTML below:
<div id="container">
<p>This is the first sentence.</p>
<p>This is the second sentence.</p>
</div>
Using
var container = document.getElementById("container");
var text = container.innerText || container.textContent; // the text I want
will return "This is the first sentence.This is the second sentence." with no space between the first period and the start of the second sentence.
My overall goal is to parse the text with Stanford CoreNLP, but its parser cannot detect that these are two sentences because they are not separated by a space. Is there a better way of extracting text from HTML so that the sentences are separated by a space character?
The HTML I'm parsing will have the text I want mostly in <p> tags, but the HTML may also contain <img>, <a>, and other tags embedded between <p> tags.
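For illustration, here is a minimal sketch of one possible approach, assuming the text of interest lives in <p> elements: collect the text of each paragraph separately and join the pieces with a space. The helper name extractParagraphText is hypothetical, not part of any library, and the sketch drops any text that sits outside <p> tags.

```javascript
// Hypothetical helper (an assumption for illustration): joins the text of
// every <p> inside `container` with a single space.
function extractParagraphText(container) {
  // querySelectorAll returns all <p> descendants in document order.
  var paragraphs = Array.from(container.querySelectorAll("p"));
  return paragraphs
    .map(function (p) {
      // textContent includes text from nested tags such as <a>;
      // collapse runs of whitespace and trim the ends.
      return (p.textContent || "").replace(/\s+/g, " ").trim();
    })
    .filter(function (text) {
      // Skip paragraphs with no text, e.g. ones that only hold an <img>.
      return text.length > 0;
    })
    .join(" ");
}
```

For the example HTML above, calling extractParagraphText(document.getElementById("container")) should give "This is the first sentence. This is the second sentence." Joining per paragraph keeps the separator explicit instead of relying on how innerText handles block boundaries.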