We are dynamically translating HTML documents using a translation service API (e.g., Azure). To do that, we need to strip the markup and extract only the text, because the APIs have a character limit and we don't want to send useless markup characters to them.

So, given HTML like the following:

<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back</div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>
</div>
We want the text values in an array, like:
["Hello", "There", "World", "We are back", "Members", "Name", "Age", "Satt", "10", "Matt", "20"]

What is the best approach to do this? Should I use regular expressions to parse the HTML and extract the text, or should I use some kind of recursive algorithm to collect it?

Any help is appreciated. Thanks.

  • Maybe a regex can do that. Commented Jun 12, 2021 at 12:21
  • If you want to translate the document, wouldn't you need to re-inject the translated text back into the document once you've translated it? If that's true, then it's probably a lot better to separate the strings from the document and render the document with the right translation straight away. Look into the i18n standard. Commented Jun 12, 2021 at 12:27

2 Answers

Update: You can select the HTML you need and then run a regex over it.

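const result = [];
// Capture runs of letters, digits, spaces and "!" that sit directly between ">" and "<".
const regex = />([a-zA-Z \d!]+)</gm;
// querySelectorAll returns every element under <body> except <style> and <script>;
// [0] is the first of them in document order, here the outermost <div>.
const str = document.querySelectorAll('body *:not(style,script)')[0].innerHTML;
let m;

// Because of the "g" flag, each exec() call continues after the previous match.
while ((m = regex.exec(str)) !== null) {
  result.push(m[1]);
}

console.log(result);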
<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back<span>Yeah!</span></div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>
</div>

Follow this link for more information about the regex: https://regex101.com/r/NF7sXZ/1/
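
To try the pattern outside a browser, it can be wrapped in a small function that takes an HTML string. This is only a sketch: `extractTextWithRegex` and the `sample` string are made-up names for illustration, and the character class is the same one used above.

```javascript
// Sketch: apply the pattern above to an HTML string instead of innerHTML.
function extractTextWithRegex(html) {
  // Letters, digits, spaces and "!" sitting directly between ">" and "<".
  const regex = />([a-zA-Z \d!]+)</g;
  const result = [];
  for (const match of html.matchAll(regex)) {
    const text = match[1].trim();
    // Drop matches that consist of spaces only.
    if (text.length > 0) {
      result.push(text);
    }
  }
  return result;
}

// Hypothetical sample input, including the mixed text/element case from the comments.
const sample = '<div><p>Hello</p><div>We are back <span>Yeah!</span></div><td>10</td></div>';
console.log(extractTextWithRegex(sample));
// -> [ 'Hello', 'We are back', 'Yeah!', '10' ]

// Text containing any other character is skipped entirely.
console.log(extractTextWithRegex('<p>Hello, world</p>'));
// -> []
```

The second call shows the main limitation: a comma, an apostrophe or a non-ASCII letter anywhere in the text makes the whole run fail to match, so the character class would have to be widened for real documents.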


As charlietfl pointed out in the comments, the first version of this answer does not work with the following markup:

<div>We are back <span>Yeah!</span></div>

Because that markup was not part of the question, this might still be a valid solution:

const result = [];
// You could also use the same selector as in the updated answer above.
const items = document.querySelectorAll('body div, body p, body th, body td, body span');

items.forEach(item => {
  // Keep only elements whose single child node is a text node,
  // i.e. elements that contain nothing but text.
  if (item.childNodes.length === 1 && item.firstChild.nodeType === Node.TEXT_NODE) {
    result.push(item.innerText);
  }
});

console.log(result);
<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back<span>Yeah!</span></div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>
</div>

2 Comments

Fails with: <div>We are back <span>Yeah!</span></div>. You're going to need a much more elaborate check for text.
Thanks for your feedback, @charlietfl. I've updated the answer with a regex approach. It should now also work with the markup you pointed out.

A non-regex approach to the problem, using XPath:

// Select every text node that has a <div> ancestor.
const result = document.evaluate("//div//text()", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const words = [];
for (let i = 0; i < result.snapshotLength; i++) {
  const node = result.snapshotItem(i);
  const target = node.nodeValue.trim();
  // Skip whitespace-only nodes such as the indentation between tags.
  if (target.length > 0) {
    words.push(target);
  }
}

console.log(words);

For the markup in the question, the output is the expected array.
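
The question also asks about a recursive algorithm. A minimal sketch of that idea follows; it is a different technique from the XPath query, not part of the original answer. `collectTextNodes`, `text` and `el` are hypothetical helper names, and the plain-object nodes only mimic the `nodeType`, `nodeName`, `nodeValue` and `childNodes` properties of real DOM nodes so that the example also runs outside a browser.

```javascript
// Node type values as defined by the DOM (Node.ELEMENT_NODE and Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: returns every non-blank text node under `node`, in document order.
function collectTextNodes(node, found = []) {
  if (node.nodeType === TEXT_NODE) {
    if (node.nodeValue.trim().length > 0) {
      found.push(node);
    }
    return found;
  }
  // Skip elements whose text is not meant for readers.
  if (node.nodeType === ELEMENT_NODE && /^(script|style)$/i.test(node.nodeName)) {
    return found;
  }
  // Recurse into the children; Array.from also accepts a browser NodeList.
  for (const child of Array.from(node.childNodes || [])) {
    collectTextNodes(child, found);
  }
  return found;
}

// Plain-object stand-ins for DOM nodes (an assumption made for this example only).
const text = (value) => ({ nodeType: TEXT_NODE, nodeName: '#text', nodeValue: value, childNodes: [] });
const el = (name, ...children) => ({ nodeType: ELEMENT_NODE, nodeName: name, nodeValue: null, childNodes: children });

const root = el('DIV',
  text('\n  '),
  el('DIV', text('We are back '), el('SPAN', text('Yeah!'))),
  el('SCRIPT', text('var x = 1;')),
  el('TD', text('10'))
);

const nodes = collectTextNodes(root);
console.log(nodes.map((n) => n.nodeValue.trim()));
// -> [ 'We are back', 'Yeah!', '10' ]
```

In a browser the call would presumably be `collectTextNodes(document.body)`. Because the function returns the text nodes rather than only their strings, a translated string can later be written back by assigning to the matching node's `nodeValue`, which covers the re-injection concern raised in the comments on the question.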
