I have an HTML string such as:

<p>
    <strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.
</p>

I want to convert this into a JavaScript array that looks like:

['<p>', '<strong>', '<em>', 'Lorem Ipsum ', '</em>', '</strong>', 'is simply dummy text of the printing ', '<em>', 'and', '</em>', 'typesetting industry.', '</p>']

That is, it should take the HTML string and break it down into a flat array of tags and text content.

I have tried to use DOMParser as per this question:

const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.childNodes]
  .map(child => child.outerHTML || child.textContent);

However, this simply returns:

['<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>']

I have also searched for various regex-based solutions, but haven't found any that break down the string exactly as I require.

Any suggestions?

Thanks

  • Don't use regex! Commented Jan 5, 2021 at 3:42
  • What's the point? If you create a div with const frag = document.createElement('div'); frag.innerHTML = thatString;, then you can get Elements from that frag. Commented Jan 5, 2021 at 4:14

1 Answer

I'd make a recursive function to iterate over a given node and return an array of the text representation of its children:

const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is upper-case for HTML elements ("P"); localName stays lower-case ("p").
      output.push(`<${child.localName}>`);
      output.push(...parseNode(child));
      output.push(`</${child.localName}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
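
The indented HTML at the top of the question contains whitespace-only text nodes (the line breaks and indentation between tags), which a walk like this would also emit. As a rough sketch, assuming those nodes are unwanted, they can be skipped with a trim() check. The name parseNodeSkippingBlankText is made up for illustration, and the node type constants are declared locally only so the function can also run outside a browser:

```javascript
// Values of Node.ELEMENT_NODE and Node.TEXT_NODE, declared locally so the
// function does not depend on the browser-only Node global.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical variant: same recursive walk, minus whitespace-only text nodes.
const parseNodeSkippingBlankText = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      // Keep the text unchanged, but drop nodes that trim() reduces to nothing.
      if (child.textContent.trim() !== '') {
        output.push(child.textContent);
      }
    } else if (child.nodeType === ELEMENT_NODE) {
      // localName is lower-case for HTML elements.
      const name = child.localName;
      output.push(`<${name}>`);
      output.push(...parseNodeSkippingBlankText(child));
      output.push(`</${name}>`);
    }
  }
  return output;
};
```

Text that has real content keeps its leading and trailing spaces; only nodes consisting entirely of whitespace are dropped.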

If you need to keep attributes too, you could take the element's outerHTML and capture everything between the tag name and the first > (this assumes no attribute value is serialized with a literal > in it):

const str = `<p style="color:green"><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
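      // Capture whatever follows the tag name in the opening tag, up to the first ">".
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      // localName is lower-case for HTML elements, unlike tagName ("P").
      output.push(`<${child.localName}${attribs}>`);
      output.push(...parseNode(child));
      output.push(`</${child.localName}>`);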
    }
  }
  return output;
};
console.log(parseNode(doc.body));
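
As an alternative to matching outerHTML with a regular expression, the opening tag can be rebuilt from the element's attributes collection. The following is only a sketch: openingTag and parseNodeWithAttributes are illustrative names, the escaping is a simplification (it handles only & and "), and the node type constants are declared locally so the code can also run outside a browser.

```javascript
// Values of Node.ELEMENT_NODE and Node.TEXT_NODE, declared locally so the
// functions do not depend on the browser-only Node global.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: rebuilds an opening tag from the element's attributes.
const openingTag = el => {
  const attribs = Array.from(el.attributes)
    .map(({ name, value }) => {
      // Simplified escaping (an assumption): only & and " are handled.
      const escaped = value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
      return ` ${name}="${escaped}"`;
    })
    .join('');
  return `<${el.localName}${attribs}>`;
};

const parseNodeWithAttributes = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === ELEMENT_NODE) {
      output.push(openingTag(child));
      output.push(...parseNodeWithAttributes(child));
      output.push(`</${child.localName}>`);
    }
  }
  return output;
};

// Browser-only usage; skipped where DOMParser is unavailable (e.g. plain Node.js).
if (typeof DOMParser !== 'undefined') {
  const doc = new DOMParser().parseFromString(
    '<p style="color:green" title="a > b"><em>Lorem</em> ipsum</p>',
    'text/html'
  );
  console.log(parseNodeWithAttributes(doc.body));
}
```

Because the attribute values are read from the DOM rather than from serialized markup, a > inside a value does not cut the opening tag short.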

If void ("self-closing") elements such as <input> should not be expanded into an opening and closing pair, check whether the element's outerHTML contains </:

const str = `<p style="color:green"><input readonly value="x"/><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
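      // Capture whatever follows the tag name in the opening tag, up to the first ">".
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      // localName is lower-case for HTML elements, unlike tagName ("INPUT").
      output.push(`<${child.localName}${attribs}>`);
      if (child.outerHTML.includes('</')) {
        // Not self-closing: emit the children and the closing tag.
        output.push(...parseNode(child));
        output.push(`</${child.localName}>`);
      }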
    }
  }
  return output;
};
console.log(parseNode(doc.body));
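
Checking outerHTML for </ could misfire if an attribute value is ever serialized with those characters in it. Another option, sketched here under that assumption, is to compare the element name against the fixed list of void elements from the HTML standard. VOID_ELEMENTS and parseNodeSkippingVoid are illustrative names, and attributes are left out to keep the example short:

```javascript
// Values of Node.ELEMENT_NODE and Node.TEXT_NODE, declared locally so the
// function does not depend on the browser-only Node global.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Void elements listed in the HTML standard; they never have an end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical variant: emits no children and no closing tag for void elements.
const parseNodeSkippingVoid = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === ELEMENT_NODE) {
      const name = child.localName;
      output.push(`<${name}>`);
      if (!VOID_ELEMENTS.has(name)) {
        output.push(...parseNodeSkippingVoid(child));
        output.push(`</${name}>`);
      }
    }
  }
  return output;
};
```

This trades the string check for a hard-coded list, so it would need updating if the set of void elements in the standard changes.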

8 Comments

Could you convert the magic numbers to constants?
@TedBrownlow For a quick reference: developer.mozilla.org/en-US/docs/Web/API/Node/nodeType
Does this take into account a self-closing tag?
@MatinKajabadi If expanding them is a problem for you, I guess you can check if the outerHTML contains </ or not
@MatinKajabadi My parser, or DOMParser? DOMParser will interpret the markup as accurately as it can. If the element must be a valid self-closing element, the browser will omit the end tags in the outerHTML.