Extract text array from HTML in JavaScript

Question

We are doing dynamic translation of HTML documents using a translation service API (e.g., Azure). For that we need to strip the markup and extract only the text, because the APIs have a character limit and we don't want to spend it on markup characters.

So given HTML like this:

<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back</div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>

We want the text values in an array, like:

["Hello", "There", "World", "We are back", "Members", "Name", "Age", "Satt", "10", "Matt", "20"]

What is the best approach to do this? Should I use regular expressions to parse the HTML and extract the text, or should I use some kind of recursive algorithm to collect the texts?

Any help is appreciated. Thanks.

If you want to translate the document, wouldn't you need to re-inject the translated text back into the document once you've translated it? If that's true, then it's probably a lot better to separate the strings from the document and render the document with the right translation straight away. Look into the i18n standard.
– Olian04, Commented Jun 12, 2021 at 12:27

caramba · Accepted Answer · 2021-06-12 13:18:49Z

3

Update: you can select all the HTML you need and then run a regex over it.

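const result = [];
// Captures runs of letters, digits, spaces and "!" that sit directly between ">" and "<".
const regex = />([a-zA-Z \d\!]+)</gm;
// Only the first matching element is read; in this snippet that is the outer wrapper <div>.
const str = document.querySelectorAll('body *:not(style,script)')[0].innerHTML;
let m;

while ((m = regex.exec(str)) !== null) {
  const text = m[1].trim();
  if (text) result.push(text); // skip whitespace-only matches
}

console.log(result);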

<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back<span>Yeah!</span></div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>

Follow this link for more information about the regex: https://regex101.com/r/NF7sXZ/1/
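The character class above only accepts ASCII letters, digits, spaces and "!", so text containing commas, accents or other punctuation would be skipped. As a rough sketch of one way to loosen that, the variant below works on a plain HTML string and accepts anything except angle brackets. The function name `extractTexts` and the sample string are made up for illustration. It still assumes well-formed markup, does not decode entities such as `&amp;`, and would also pick up the contents of `<script>` or `<style>` if they were present in the string.

```javascript
// Hypothetical helper: applies the regex idea above to an HTML string.
function extractTexts(html) {
  // [^<>]+ matches any run of characters that contains no angle bracket,
  // so punctuation and non-ASCII letters are kept.
  const regex = />([^<>]+)</g;
  const result = [];
  let m;
  while ((m = regex.exec(html)) !== null) {
    const text = m[1].trim();
    if (text) result.push(text); // ignore whitespace between tags
  }
  return result;
}

const sample = '<div>We are back <span>Yeah!</span></div><p>Hello, World</p>';
console.log(extractTexts(sample));
// → ["We are back", "Yeah!", "Hello, World"]
```

Because the pattern needs a ">" before the text and a "<" after it, text at the very start or end of the string (outside any tag) is not captured.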

As charlietfl pointed out in the comments, the first version of this answer does not work with the following markup:

<div>We are back <span>Yeah!</span></div>

Because that markup was not part of the question, this might still be a valid solution:

const result = [];
// You could also use the same selector as in the updated answer above.
const items = document.querySelectorAll('body div, body p, body th, body td, body span');

items.forEach(item => {
  // Keep only elements whose single child node is a text node,
  // i.e. elements that contain nothing but text.
  const only = item.childNodes.length === 1 ? item.childNodes[0] : null;
  if (only && only.nodeType === Node.TEXT_NODE) {
    result.push(item.innerText);
  }
});

console.log(result);

<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back<span>Yeah!</span></div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>
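The single-child check skips elements with mixed content such as `<div>We are back <span>Yeah!</span></div>`. One way around that, sketched here rather than taken from the original answer, is the recursive approach the question mentions: walk the tree and collect the text nodes themselves instead of looking at elements. The helper name `collectText` and the small `el`/`txt` builders are hypothetical; the builders only stand in for real DOM nodes so the example runs outside a browser. In a page you would presumably call `collectText(document.body)`.

```javascript
// Node type values as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: walks a node tree depth-first and collects trimmed text.
// It only relies on nodeType, nodeName, nodeValue and childNodes.
function collectText(node, out = []) {
  if (node.nodeType === TEXT_NODE) {
    const text = node.nodeValue.trim();
    if (text) out.push(text); // skip whitespace-only text nodes
  } else if (node.nodeType === ELEMENT_NODE) {
    const tag = (node.nodeName || '').toUpperCase();
    // Script and style contents are not translatable text.
    if (tag !== 'SCRIPT' && tag !== 'STYLE') {
      for (const child of node.childNodes) collectText(child, out);
    }
  }
  return out;
}

// Stand-ins for DOM nodes (assumption: used here only for the demo).
const el = (nodeName, ...childNodes) => ({ nodeType: ELEMENT_NODE, nodeName, childNodes });
const txt = (nodeValue) => ({ nodeType: TEXT_NODE, nodeValue });

const tree = el('DIV',
  txt('We are back '),
  el('SPAN', txt('Yeah!')),
  txt('\n  '),
  el('SCRIPT', txt('var x = 1;')));

console.log(collectText(tree));
// → ["We are back", "Yeah!"]
```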

edited Jun 12, 2021 at 13:18

answered Jun 12, 2021 at 12:20

caramba


2 Comments

charlietfl Over a year ago

Fails with: <div>We are back <span>Yeah!</span></div>. Going to need a lot more elaborate check for text

caramba Over a year ago

Thanks for your feedback, @charlietfl. I've updated the answer with a regex approach. It should now also work with the markup you pointed out.

Jack Fleeting · Accepted Answer · 2021-06-12 17:02:48Z

1

A non-regex approach to the problem, using XPath:

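// Select every text node under any <div>, in document order.
const result = document.evaluate("//div//text()", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const words = [];
for (let i = 0; i < result.snapshotLength; i++) {
  const node = result.snapshotItem(i);
  const target = node.nodeValue.trim();
  // Skip whitespace-only text nodes (indentation and line breaks between tags).
  if (target.length > 0) {
    words.push(target);
  }
}

console.log(words);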

The output is your expected array.
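Since the snapshot holds the text nodes themselves, the same loop could also keep a reference to each node, which would let you write the translated strings back into the document (the concern raised in the comment on the question). A minimal sketch of that idea follows; `collectEntries` and `writeBack` are hypothetical names, the translated strings are hard-coded in place of a real translation API call, and plain objects with a `nodeValue` property stand in for DOM text nodes so the example runs outside a browser.

```javascript
// Hypothetical helper: pairs each non-empty text node with its trimmed text.
function collectEntries(textNodes) {
  const entries = [];
  for (const node of textNodes) {
    const text = node.nodeValue.trim();
    if (text) entries.push({ node, text });
  }
  return entries;
}

// Hypothetical helper: writes translations back in the same order,
// keeping the whitespace that surrounded the original text.
function writeBack(entries, translated) {
  entries.forEach(({ node, text }, i) => {
    // A function replacer avoids special handling of "$" in the translation.
    node.nodeValue = node.nodeValue.replace(text, () => translated[i]);
  });
}

// Stand-ins for text nodes (assumption: in a page these would come from snapshotItem(i)).
const nodes = [{ nodeValue: '\n  Hello ' }, { nodeValue: '   ' }, { nodeValue: 'World' }];
const entries = collectEntries(nodes);

console.log(entries.map(e => e.text)); // the strings you would send for translation
writeBack(entries, ['Hallo', 'Welt']); // placeholder for the translated strings
console.log(nodes.map(n => n.nodeValue));
```

This relies on the translation service returning one string per input string, in the same order.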

answered Jun 12, 2021 at 17:02

Jack Fleeting
