I have a string of elements on multiple lines (I can change it to a single line if necessary) and I want to split it on the <section> element. I thought this would be easy, just str.split(regex) or even str.split('<section'), but it's not working: it never breaks the sections out.

I've tried using a regular expression: SecRegex = /<section.?>[\s\S]?</section>/; var fndSection = result.split(SecRegex);

I've also tried var fndSection = result.split('<section');

I've looked all over the net, and from what I've found, one of the two methods above should have worked.

result = '

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<list>Title</list>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>'

Code

SecRegex = /<section.*?>[\s\S]*?<\/section>/;
var fndSection = result.split(SecRegex);

console.log("result string " + fndSection);

This is the result I'm getting from the code I have

result string <chapter id="chap2"> <title>THEORY</title> , , , , <chapter id="chap1"> <para0> <title></title></para0> </chapter> 
result string <chapter id="chap1"> <para0> <title></title></para0> </chapter> 
result string <chapter

As you can see, the <section> elements are missing from the result.

What I want is each <section>.*?</section> match as a string in an array.

Thank you everyone for looking at this and helping me. I appreciate all your help.

Maxine

  • You shouldn't use regular expressions to parse HTML. I would use an HTML parser. Commented May 16, 2019 at 15:52
  • it's an SGML document Commented May 16, 2019 at 15:55
  • The question marks in '/<section.?>[\s\S]?</section>/;' each allow at most one character. You need to replace them with a star '*', which will match zero or more characters! Commented May 16, 2019 at 16:00
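
The quantifier point in the last comment can be checked directly. The following is a minimal sketch, assuming a made-up one-section string (`sample`) and illustrative variable names that do not appear in the question:

```javascript
// Hypothetical one-section sample, used only to compare the two patterns.
const sample = '<section id="a">\n<title>t</title>\n</section>';

// "?" on its own means "zero or one": ".?" allows at most one character
// before ">", and "[\s\S]?" allows at most one character before "</section>".
const zeroOrOne = /<section.?>[\s\S]?<\/section>/;

// "*?" means "zero or more, as few as possible" (a lazy quantifier).
const lazy = /<section.*?>[\s\S]*?<\/section>/;

console.log(zeroOrOne.test(sample)); // false
console.log(lazy.test(sample));      // true
```

The lazy form stops at the first closing tag it reaches, which is what keeps adjacent sections from being merged into one match.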

3 Answers

Your expression looks pretty great! You might just want to slightly modify it, maybe to something similar to:

/<section[a-z="'\s]+>([\s\S]*?)<\/section>/gmi
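
One thing to be aware of with this expression: the character class after `<section` only admits letters, `=`, quotes and whitespace, so an opening tag whose attributes contain digits or hyphens would not match. A small sketch of the difference, where the sample tags and the broader `[^>]*` pattern are assumptions for illustration and not part of the answer:

```javascript
// Hypothetical tags, not taken from the question's document.
const plain = '<section id="Next section"><title>t</title></section>';
const withDigit = '<section id="sec-1"><title>t</title></section>';

// Pattern from the answer: attribute text limited to letters, =, quotes, whitespace.
const narrow = /<section[a-z="'\s]+>([\s\S]*?)<\/section>/i;

// Assumed alternative: any run of characters other than ">" inside the opening tag.
const broad = /<section\b[^>]*>([\s\S]*?)<\/section>/i;

console.log(narrow.test(plain));     // true
console.log(narrow.test(withDigit)); // false
console.log(broad.test(withDigit));  // true
```

For the ids shown in the question either pattern behaves the same; the broader one only matters if the real document uses other characters in attribute values.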

RegEx

If this isn't the expression you want, you can modify it on regex101.com.

RegEx Circuit

You can also visualize your expressions in jex.im.

JavaScript Test

const regex = /<section[a-z="'\s]+>([\s\S]*?)<\/section>/gmi;
const str = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>`;
const subst = `$1`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);


In case you might want to capture the section tags as well, you can simply wrap your entire expression in a capturing group:

const regex = /(<section[a-z="'\s]+>([\s\S]*?)<\/section>)/gmi;
const str = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>`;
const subst = `\n$1\n`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: \n', result);
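
The replace calls above produce a single string. If an array of sections is the goal, the same expression could be fed to String#matchAll instead. This is a minimal sketch, assuming a shortened stand-in string (`sample`) and hypothetical names (`sectionPattern`, `sections`, `outer`, `inner`):

```javascript
// Hypothetical two-section sample standing in for the full document string.
const sample = [
  '<chapter id="chap2"> <title>THEORY</title>',
  '<section id="Thoery">',
  '<title>theory Section</title>',
  '</section>',
  '',
  '<section id="Next section">',
  '<title>title</title>',
  '</section>',
].join('\n');

// Group 1 is the whole element, group 2 is its inner content.
const sectionPattern = /(<section[a-z="'\s]+>([\s\S]*?)<\/section>)/gi;

// matchAll requires the g flag and yields one match object per section.
const sections = Array.from(sample.matchAll(sectionPattern), (m) => ({
  outer: m[1],
  inner: m[2].trim(),
}));

console.log(sections.length);   // 2
console.log(sections[1].inner); // <title>title</title>
```

Because only the matches are collected, the surrounding <chapter> text does not appear in the array.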

3 Comments

Thank you, this was great. I'm getting the sections now, but I'm also getting the <chapter> elements. I thought using a specific regex would only give me the <section> elements. Is this not correct?
Hi! So take the input string and replace all the <chapter> elements? I'm not sure this will work, because I have whole <chapter></chapter> elements and also <chapter><section> elements.
The end result of all this is that I'm trying to get the last <section> element and add a </chapter> element to the end of it. That's my big headache, because I'm honestly stumped on it.

You don't need to split the string - you want to extract the data that matches your pattern from it. You can do that using String#match. Note that you need to add the g flag to get all matches:

var result = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<list>Title</list>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>`;
// the g flag is added -------------------------↓
var SecRegex = /<section.*?>[\s\S]*?<\/section>/g;
var fndSection = result.match(SecRegex);


console.log("result string ", fndSection);
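
To see why split drops the sections while match returns them, here is a minimal side-by-side sketch. The one-line `sample` string and the name `pattern` are made up for illustration:

```javascript
// Hypothetical one-line sample with text before, between and after two sections.
const sample = 'A<section id="x">one</section>B<section id="y">two</section>C';
const pattern = /<section.*?>[\s\S]*?<\/section>/g;

// split treats every match as a separator and discards it,
// keeping only the text around the matches.
const pieces = sample.split(pattern);
console.log(pieces); // [ 'A', 'B', 'C' ]

// match with the g flag returns the matched substrings themselves.
const found = sample.match(pattern);
console.log(found);
// [ '<section id="x">one</section>', '<section id="y">two</section>' ]
```

This is also why the output in the question shows chapter text separated by commas: the array returned by split was concatenated onto a string, which joins its leftover pieces with commas.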

However, you are better off parsing the DOM and extracting the information you want from there - this is simple using DOMParser:

var result = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<list>Title</list>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>`

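// Parse the string as an HTML document; unknown tags such as <chapter> are kept as generic elements
var parser = new DOMParser();
var doc = parser.parseFromString(result, "text/html");

// Copy the live HTMLCollection of <section> elements into a plain array
var sections = [...doc.getElementsByTagName("section")];
// Serialize each element, including its own tags, back to a string
var fndSection = sections.map(section => section.outerHTML);
console.log(fndSection);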

1 Comment

I put it into an array because I'm trying to get a count of the section elements so I can take action on the last occurrence of a section within a chapter.
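
For the goal described in this comment (counting the sections and acting on the last one in each chapter), one possible plain-string approach is sketched below. It assumes that a </section> is the last in its chapter whenever the next tag after it is not another <section>, and that the closing </chapter> is missing there. The sample string and the names `lastSectionClose`, `closed` and `sectionCount` are hypothetical:

```javascript
// Hypothetical sample: two chapters whose closing </chapter> tags are missing.
const sample = [
  '<chapter id="a"> <title>A</title>',
  '<section id="one"><title>t</title></section>',
  '',
  '<section id="two"><title>t</title></section>',
  '',
  '<chapter id="b"> <title>B</title>',
  '<section id="three"><title>t</title></section>',
].join('\n');

// A </section> counts as "last in its chapter" when the text that follows it
// is not optional whitespace plus another <section ...> opening tag.
const lastSectionClose = /<\/section>(?!\s*<section\b)/g;

// Append a closing </chapter> after each such </section>.
const closed = sample.replace(lastSectionClose, '</section>\n</chapter>');

// Count the sections; match returns null when nothing matches, hence the fallback.
const sectionCount = (sample.match(/<section\b[^>]*>[\s\S]*?<\/section>/g) || []).length;

console.log(sectionCount); // 3
console.log(closed);
```

If the real document already closes some chapters right after their last section, this sketch would add a duplicate </chapter> there, so that case would need an extra check.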

Do not use RegEx on HTML (or any cousin of HTML). Collect your <section>s into a NodeList. Convert that NodeList into an Array. Convert each Node into a String. This could be done in one line:

const strings = Array.from(document.querySelectorAll('section')).map(section => section.outerHTML);

The following demo is a breakdown of the example above.

// Collect all <section>s into a NodeList
const sections = document.querySelectorAll('section');

// Convert NodeList into an Array
const array = Array.from(sections);

/*
Iterate through Array -- on each <section>...
convert it into a String
*/
const strings = array.map(section => section.outerHTML);

// View array as a template literal for a cleaner look
console.log(`${strings}`);

// Verifying it's an array of multiple elements
console.log(strings.length);

// Verifying that they are in fact strings
console.log(typeof strings[0]);

<chapter id="chap1">
  <para0>
    <title></title>
  </para0>
</chapter>

<chapter id="chap2">
  <title>THEORY</title>
  <section id="Thoery">
    <title>theory Section</title>
    <para0 verstatus="ver">
      <title>Theory Para 0 </title>
      <text>blah blah</text>
    </para0>
  </section>

  <section id="Next section">
    <title>title</title>
    <para0>
      <title>Title</title>
      <text>blah blah</text>
    </para0>
  </section>

  <section id="More sections">
    <title>title</title>
    <para0>
      <title>Title</title>
      <text>blah blah</text>
    </para0>
  </section>

  <section id="section">
    <title>title</title>
    <para0>
      <title>Title</title>
      <text>blah blah</text>
    </para0>
  </section>

  <chapter id="chap1">
    <para0>
      <title></title>
    </para0>
  </chapter>

  <chapter id="chap1">
    <para0>
      <title></title>
    </para0>
  </chapter>

  <chapter>
    <title>Chapter Title</title>
    <section id="Section ID">
      <title>Section Title</title>
      <para0>
        <title>Para0 Title</title>
        <para>blah blah</para>
      </para0>
    </section>

    <section id="Next section">
      <title>title</title>
      <para0>
        <line>Title</line>
        <text>blah blah</text>
      </para0>
    </section>

    <section id="More sections">
      <title>title</title>
      <para0>
        <list>Title</list>
        <text>blah blah</text>
      </para0>
    </section>

    <section id="section">
      <title>title</title>
      <para0>
        <title>Title</title>
        <text>blah blah</text>
      </para0>
    </section>

    <ipbchap>
      <tags></tags>
    </ipbchap>
