
Description

I'm attempting to extract URLs and/or CDATA from XML. My current solution works, but it only returns the first element. How do I return multiple elements with this specific regex?

The XML is in the form of:

<MediaFile>
https://some_url.com/file.mp4
</MediaFile>
<MediaFile>
https://some_url2.com/file.mp4
</MediaFile>

and

<MediaFile>
<!CDATA some data here with spaces sometimes>
</MediaFile>
...etc

What I'm trying to achieve

In my example, there are three MediaFile tags, and I'm trying to extract the content of each one: two URLs and one CDATA section. The final output should look something like

1st url https://example1.com/file.mp4
2nd url https://example2.com/file.mp4
3rd url <!CDATA some data example>

What I've tried:

link to regex101

const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;

const regex = /<MediaFile[^>]*type="video\/mp4"[^>]*>([\s\S]*?)<\/MediaFile>/gm;

const res = regex.exec(data);

console.log('1st url', res[1]);
console.log('2nd url', res[2]);
console.log('3rd url', res[3]);

  • Possible duplicate of How can I match multiple occurrences with a regex in JavaScript similar to PHP's preg_match_all()? Commented Sep 20, 2019 at 13:54
  • It is not possible to reliably parse XML with a regular expression. It's the wrong tool for this job. Why not use an XML parser and save yourself a headache? Commented Sep 20, 2019 at 13:57
  • @spender xml parser doesn't work for that specific kind of xml. As these are external XMLs I have no control on what kind of XML I'll get. Commented Sep 20, 2019 at 14:30
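
The reason only the first element comes back is that a single regex.exec(data) call returns one match object. Its index 1 is the only capture group in the pattern, so res[2] and res[3] are undefined. A minimal sketch of one way around this, assuming an environment that supports String.prototype.matchAll (ES2020), keeps the question's pattern and collects group 1 from every match. The shortened sample string and the name contents are illustrative and not part of the original post:

```javascript
// Shortened sample input (assumption: attributes trimmed for readability).
const data =
  '<MediaFile type="video/mp4" width="640">https://example1.com/file.mp4</MediaFile>' +
  '<MediaFile type="video/mp4" width="1024">https://example2.com/file.mp4</MediaFile>' +
  '<MediaFile type="video/mp4" width="1024"><!CDATA some data example></MediaFile>';

// Same pattern as in the question; matchAll requires the g flag.
const regex = /<MediaFile[^>]*type="video\/mp4"[^>]*>([\s\S]*?)<\/MediaFile>/g;

// matchAll yields one match object per <MediaFile> element;
// index 1 of each match is the captured element content.
const contents = Array.from(data.matchAll(regex), (m) => m[1].trim());

contents.forEach((content, i) => console.log(`match ${i + 1}:`, content));
```

On older runtimes, calling regex.exec(data) repeatedly in a while loop until it returns null should give the same list, because the g flag makes each call resume from the previous match.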

2 Answers


It is probably better not to use regular expressions here, and to let the browser parse the markup and select the elements with querySelectorAll() instead:

const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;

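// Mask the "<!CDATA" opener so the HTML parser does not turn it into a comment.
const container = document.createElement('div');
container.innerHTML = data.replace(/<!CDATA/g, '!CDATA');
// Read each element's content, then restore the masked opener and the escaped ">".
const arr = Array.from(container.querySelectorAll('MediaFile'))
  .map(el => el.innerHTML.replace('!CDATA', '<!CDATA')
                         .replace('&gt;', '>'));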

console.log(arr.join('\n'));

With a little extra effort you can mask the <!CDATA ... > sections with a replace() before creating the DOM element, and later restore them to their intended form by applying .replace('!CDATA','<!CDATA').replace('&gt;','>') to the innerHTML strings of the MediaFile elements.


1 Comment

Nice solution but unfortunately it removes the <! from <!CDATA

You can try to parse it.

   const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;
    
    const parser = new DOMParser();
    const xmlDoc = parser.parseFromString(data,"text/html");
    
    console.log(xmlDoc.getElementsByTagName("MediaFile")[0].innerHTML);
    console.log(xmlDoc.getElementsByTagName("MediaFile")[1].innerHTML);
    console.log(xmlDoc.getElementsByTagName("MediaFile")[2].innerHTML);

2 Comments

text/html or application/xml
Thanks for the answer; however, this doesn't work because it converts <!CDATA to <!-- CDATA
