17

Given this HTML as a string "html", how can I split it into an array where each header <h marks the start of an element?

Begin with this:

<h1>A</h1>
<h2>B</h2>
<p>Foobar</p>
<h3>C</h3>

Result:

["<h1>A</h1>", "<h2>B</h2><p>Foobar</p>", "<h3>C</h3>"]

What I've tried:

I wanted to use String.split() with a regex, but the result puts each <h into its own array element. I need to figure out how to capture from the start of one <h up to the next <h, including the first one but excluding the second.

var html = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>';
var foo = html.split(/(<h)/);

Edit: Regex is not a requirement in any way; it's just the only solution I thought would work for splitting HTML strings in this way in general.

14
  • 3
    Why would you want to use regex for that? Commented Dec 28, 2015 at 10:11
  • If there's a way not to use regex, I'm totally willing to use that instead :) Commented Dec 28, 2015 at 10:12
  • You use a language hosted in the most advanced HTML parser on the planet, not using those HTML parsing capabilities is kinda silly. Commented Dec 28, 2015 at 10:13
  • 1
    What other work, please explain. (This is an XY problem, i.e. you've decided on a solution already and don't bother explaining the task anymore. Please explain the task itself, not the anticipated solution.) Commented Dec 28, 2015 at 10:18
  • 1
    @DonnyP Check out document.createDocumentFragment() Commented Dec 28, 2015 at 10:46

5 Answers

26

In your example you can use:

/
  <h   // Match literal <h
  (.)  // Match any character and save in a group
  >    // Match literal >
  .*?  // Match any character zero or more times, non greedy
  <\/h // Match literal </h
  \1   // Match whatever was captured by (.)
  >    // Match literal >
/g
var str = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>'
str.match(/<h(.)>.*?<\/h\1>/g); // ["<h1>A</h1>", "<h2>B</h2>", "<h3>C</h3>"]

But please don't parse HTML with regular expressions; read "RegEx match open tags except XHTML self-contained tags".
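To make that warning concrete, here is a minimal sketch of two inputs where the same pattern stops working. Both input strings are hypothetical and made up for this illustration; they do not come from the question.

```javascript
// The pattern from the answer above, unchanged.
var headingPattern = /<h(.)>.*?<\/h\1>/g;

// Hypothetical inputs, invented for this illustration.
var withAttribute = '<h1 id="top">A</h1><h2>B</h2>';
var multiLine = '<h1>A\nB</h1>';

var attributeResult = withAttribute.match(headingPattern);
var multiLineResult = multiLine.match(headingPattern);

// The <h1 id="top"> heading is skipped because the pattern expects ">" right after the digit.
console.log(attributeResult); // [ '<h2>B</h2>' ]

// No match at all, because "." does not match a line break.
console.log(multiLineResult); // null
```

Each of these cases can be patched individually, but real markup tends to keep producing new ones, which is the point the linked question makes.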


13 Comments

That is an amazing SO question / answer.
Downvote for advocating regex to an HTML problem. At your reputation, you should know better than that.
@DonnyP This is not code golf. "Doing it in one line" is not the goal of it. His answer is inappropriate to the problem. HTML cannot be dealt with using regular expressions. This will crash and burn, just wait and see when you try it out on real life code.
@DonnyP I think you got it! I showed you that it is possible with your example data, but I also warn you that you should reconsider your approach, especially if you don't know what data you are dealing with. Feel free to try it and see if it works on all your data sets. If it does, then great! But if it doesn't, it's simply because you are trying to start a fire with water :-)
@DonnyP HTML is not "too variable". HTML is in a category of languages (non-regular) that regular expressions inherently cannot describe. This is a hard technical limitation of regular expressions. Trying to do it anyway means one of two things: Either you limit yourself to a strict sub-set of HTML that can be described as a regular language (you're not doing that, you take unknown code off of GitHub), or you have a nasty one-liner bug sitting in your code. I wonder if "But it's only one line!" is a good enough reason for the latter.
10

From the comments to the question, this seems to be the task:

I'm taking dynamic markdown that I'm scraping from GitHub. Then I want to render it to HTML, but wrap every title element in a ReactJS <WayPoint> component.

The following is a completely library-agnostic, DOM-API based solution.

function waypointify(html) {
    var div = document.createElement("div"), nodes;

    // parse HTML and convert into an array (instead of NodeList)
    div.innerHTML = html;
    nodes = [].slice.call(div.childNodes);

    // add <waypoint> elements and distribute nodes by headings
    div.innerHTML = "";
    nodes.forEach(function (node) {
        if (!div.lastChild || /^h[1-6]$/i.test(node.nodeName)) {
            div.appendChild( document.createElement("waypoint") );
        }
        div.lastChild.appendChild(node);
    });

    return div.innerHTML;
}

Doing the same in a modern library with fewer lines of code is absolutely possible; see it as a challenge.

This is what it produces with your sample input:

<waypoint><h1>A</h1></waypoint>
<waypoint><h2>B</h2><p>Foobar</p></waypoint>
<waypoint><h3>C</h3></waypoint>


2

I'm sure someone could shorten the for loop that puts the angle brackets back in, but this is how I'd do it.

var html = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>';

//split on ><
var arr = html.split(/></g);

//split removes the >< so we need to determine where to put them back in.
for(var i = 0; i < arr.length; i++){
  if(arr[i].substring(0, 1) != '<'){
    arr[i] = '<' + arr[i];
  }

  if(arr[i].slice(-1) != '>'){
    arr[i] = arr[i] + '>';
  }
}

Alternatively, we could remove the first and last bracket, do the split, and then add the angle brackets back to every piece.

var html = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>';

//remove first and last characters
html = html.substring(1, html.length-1);

//do the split on ><
var arr = html.split(/></g);

//add the brackets back in
for(var i = 0; i < arr.length; i++){
    arr[i] = '<' + arr[i] + '>';
}

Oh, of course this will fail with elements that have no content.
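As a rough illustration of that caveat, the sketch below wraps the first loop from this answer in a function and runs it on an element with no content. The function name `splitOnTagBoundaries` and the `'<p></p><h1>A</h1>'` input are assumptions made up for this example.

```javascript
// The first approach from this answer, wrapped in a function (hypothetical name).
function splitOnTagBoundaries(html) {
  var parts = html.split(/></);

  // split() removes the "><" separators, so put the missing brackets back.
  for (var i = 0; i < parts.length; i++) {
    if (parts[i].substring(0, 1) != '<') {
      parts[i] = '<' + parts[i];
    }
    if (parts[i].slice(-1) != '>') {
      parts[i] = parts[i] + '>';
    }
  }
  return parts;
}

// An element with no content is cut between its opening and closing tag.
console.log(splitOnTagBoundaries('<p></p><h1>A</h1>'));
// [ '<p>', '</p>', '<h1>A</h1>' ]

// On the question's input, every element becomes its own entry.
console.log(splitOnTagBoundaries('<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>'));
// [ '<h1>A</h1>', '<h2>B</h2>', '<p>Foobar</p>', '<h3>C</h3>' ]
```

The second call also shows that this approach splits at every tag boundary, so the paragraph is not grouped with the heading before it, unlike the result asked for in the question.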

1 Comment

If you use a look ahead, you can actually keep the separator you are looking for: stackoverflow.com/questions/12001953/…
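The lookahead idea from this comment could be sketched as follows. The function name `splitAtHeadings` is an assumption for this example, and the pattern only looks for `<h1>` to `<h6>` opening tags; it is still a regex over an HTML string, so the same caveats raised elsewhere in this thread apply.

```javascript
// Split in front of every <h1>..<h6> opening tag without consuming it.
// (?=...) is a lookahead: it matches a position, so the heading stays in the result.
// [\s>] after the digit allows headings with attributes, e.g. <h2 class="x">,
// and the required digit keeps tags such as <hr> or <header> from matching.
function splitAtHeadings(html) {
  return html.split(/(?=<h[1-6][\s>])/i);
}

var sections = splitAtHeadings('<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>');
console.log(sections);
// [ '<h1>A</h1>', '<h2>B</h2><p>Foobar</p>', '<h3>C</h3>' ]
```

Any content that comes before the first heading ends up as its own first element, and text that merely looks like a heading tag (inside a comment or an attribute value, for instance) would still trigger a split.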
1

I just came across this question because I needed the same thing in one of my projects. I did the following, and it works well for HTML strings whose tags sit directly next to each other (no whitespace between > and <).


const splitArray = data.split("><").map((item, index, parts) => {
    // split() removes the "><" separators, so restore the missing brackets.
    // The first piece keeps its own "<" and the last piece keeps its own ">".
    const prefix = index === 0 ? "" : "<"
    const suffix = index === parts.length - 1 ? "" : ">"

    return prefix + item + suffix
})

console.log(splitArray)

where data is the HTML string


0

Hi, I used this function to convert an HTML string into an array of tags:

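  // Declared as a static method, so it has to live inside a class body.
  static getArrayTagsHtmlString(str) {
    const arrayElements = []
    str.split(">").forEach((element) => {
      if (element.includes("<")) {
        // split() drops the ">" delimiter, so restore it on pieces that contain a tag
        arrayElements.push(element + ">")
      } else if (element !== "") {
        // keep trailing text, skip the empty piece left after the final ">"
        arrayElements.push(element)
      }
    })
    return arrayElements
  }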

Happy code

