Using JavaScript, how do I transform an HTML string into an array of HTML tags and text content?

Question

I have an HTML string such as:

<p>
    <strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.
</p>

I want to convert this into a JavaScript array that looks like:

['<p>', '<strong>', '<em>', 'Lorem Ipsum ', '</em>', '</strong>', 'is simply dummy text of the printing ', '<em>', 'and', '</em>', 'typesetting industry.', '</p>']

That is, it takes the HTML string and breaks it down into an array of tags and text content.

I have tried using DOMParser, as per this question:

const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.childNodes]
  .map(child => child.outerHTML || child.textContent);

However, this simply returns:

['<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>']

I have also searched for various regex-based solutions, but haven't been able to find any that break down the string exactly as I require.

Any suggestions?

Thanks

What's the point? If you create a div with const frag = document.createElement('div'); frag.innerHTML = thatString;, then you can get Elements from that frag.
– StackSlave, Commented Jan 5, 2021 at 4:14

CertainPerformance · Accepted Answer · 2021-01-05 04:09:17Z

I'd make a recursive function to iterate over a given node and return an array of the text representation of its children:

const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
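      // tagName is uppercase in HTML documents; lowercase it to match the desired output
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);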
    }
  }
  return output;
};
console.log(parseNode(doc.body));
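
One thing to watch with the indented, multi-line input from the question: the whitespace between tags is parsed into text nodes too, so the array picks up entries like '\n    '. A minimal sketch of dropping them, assuming whitespace-only entries are unwanted (dropBlankTokens is a hypothetical helper name, not part of the code above):

```javascript
// Hypothetical helper: remove entries that contain only whitespace,
// such as the indentation between tags in a pretty-printed HTML string.
const dropBlankTokens = tokens => tokens.filter(token => token.trim() !== '');

// Example input shaped like parseNode output for an indented string.
const tokens = ['<p>', '\n    ', '<strong>', 'Lorem Ipsum ', '</strong>', '\n', '</p>'];
console.log(dropBlankTokens(tokens));
```

Text that merely starts or ends with a space, like 'Lorem Ipsum ', is left untouched.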

If you need to keep attributes too, you could take the element's outerHTML and capture everything between the tag name and the first >:

const str = `<p style="color:green"><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
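      // tagName is uppercase in HTML documents; lowercase it to match the desired output
      const tag = child.tagName.toLowerCase();
      // Capture whatever follows the tag name in the opening tag, i.e. the attributes
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);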
    }
  }
  return output;
};
console.log(parseNode(doc.body));
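
The regex above stops at the first >, so it can cut short an attribute value that itself contains a > character. A sketch of an alternative that builds the opening tag from the element's attributes collection instead; openTag and escapeAttr are hypothetical helpers, and escaping only & and " is an assumption about what the output needs:

```javascript
// Hypothetical helper: escape the characters that would break a double-quoted attribute value.
const escapeAttr = value => value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');

// Hypothetical helper: serialize an element's opening tag from its
// localName and attributes rather than by slicing outerHTML.
const openTag = el => {
  // Array.from accepts the NamedNodeMap a real element exposes, as well as a plain array.
  const attribs = Array.from(el.attributes)
    .map(({ name, value }) => ` ${name}="${escapeAttr(value)}"`)
    .join('');
  return `<${el.localName}${attribs}>`;
};

// Stand-in object with the same shape as a DOM element, so the sketch runs without a browser.
const fakeP = {
  localName: 'p',
  attributes: [{ name: 'title', value: 'a > b' }, { name: 'hidden', value: '' }],
};
console.log(openTag(fakeP)); // <p title="a > b" hidden="">
```

Inside parseNode this would replace the match call and the opening-tag template literal with output.push(openTag(child)). Note that boolean attributes come out as hidden="" rather than a bare hidden.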

If you need void (self-closing) elements such as <input> not to be given a closing tag, check whether the element's outerHTML contains </:

const str = `<p style="color:green"><input readonly value="x"/><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase in HTML documents; lowercase it to match the desired output
      const tag = child.tagName.toLowerCase();
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      if (child.outerHTML.includes('</')) {
        // Not self closing:
        output.push(...parseNode(child));
        output.push(`</${tag}>`);
      }
    }
  }
  return output;
};
console.log(parseNode(doc.body));
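
The </ check relies on how the browser happens to serialize the element. A sketch of a stricter variant that looks the tag name up in the list of void elements from the HTML Standard; VOID_ELEMENTS and isVoidElement are hypothetical names introduced here:

```javascript
// Void elements per the HTML Standard: they never have content or an end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: accepts a tag name in any case (tagName is uppercase in HTML documents).
const isVoidElement = tagName => VOID_ELEMENTS.has(tagName.toLowerCase());

console.log(isVoidElement('INPUT')); // true
console.log(isVoidElement('p'));     // false
```

In parseNode, the condition would then become if (!isVoidElement(child.tagName)).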

edited Jan 5, 2021 at 4:09

answered Jan 5, 2021 at 3:44

CertainPerformance

8 Comments

Ted Brownlow Over a year ago

Could you convert the magic numbers to constants?

Jon P Over a year ago

@TedBrownlow for a quick reference : developer.mozilla.org/en-US/docs/Web/API/Node/nodeType

Mateen Kajabadi Over a year ago

does this take into account a self closing tag?

CertainPerformance Over a year ago

@MatinKajabadi If expanding them is a problem for you, I guess you can check if the outerHTML contains </ or not

CertainPerformance Over a year ago

@MatinKajabadi My parser, or DOMParser? DOMParser will interpret the markup as accurately as it can. If the element must be a valid self-closing element, the browser will omit the end tags in the outerHTML.
