Extract URLs and CDATA from XML string with regex

Question

Description

I'm attempting to extract URLs and/or CDATA from XML. My current solution works, but it only returns the first element. How do I return multiple elements with this specific regex?

The XML is in the form of:

<MediaFile>
https://some_url.com/file.mp4
</MediaFile>
<MediaFile>
https://some_url2.com/file.mp4
</MediaFile>

and

<MediaFile>
<!CDATA some data here with spaces sometimes>
</MediaFile>
...etc

What I'm trying to achieve

In my example, there are 3 MediaFile tags, and I'm trying to extract their 3 contents: two URLs and one CDATA section. The final output should look something like

1st url https://example1.com/file.mp4
2nd url https://example2.com/file.mp4
3rd url <!CDATA some data example>

What I've tried:

link to regex101

const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;

const regex = /<MediaFile[^>]*type="video\/mp4"[^>]*>([\s\S]*?)<\/MediaFile>/gm;

const res = regex.exec(data);

console.log('1st url', res[1]);
console.log('2nd url', res[2]);
console.log('3rd url', res[3]);

Possible duplicate of How can I match multiple occurrences with a regex in JavaScript similar to PHP's preg_match_all()?
– MonkeyZeus, Commented Sep 20, 2019 at 13:54
It is not possible to reliably parse XML with a regular expression. It's the wrong tool for this job. Why not use an XML parser and save yourself a headache?
– spender, Commented Sep 20, 2019 at 13:57
@spender An XML parser doesn't work for that specific kind of XML. As these are external XMLs, I have no control over what kind of XML I'll get.
– kemicofa, Commented Sep 20, 2019 at 14:30

Carsten Massmann · Accepted Answer · 2019-09-20 14:46:39Z

1

It is probably better not to use regular expressions here, and to parse the string with the DOM and document.querySelectorAll() instead:

const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;

var o=document.createElement('div');o.innerHTML=data.replace(/<!CDATA/g,'!CDATA');
var arr=Array.from(o.querySelectorAll('MediaFile'))
             .map(el=>el.innerHTML.replace('!CDATA','<!CDATA')
                                  .replace('&gt;','>'))

console.log(arr.join('\n'));

With a little "extra effort" you can mask the <!CDATA ... > sections with a replace() before creating the DOM element and later restore them to their intended form by applying .replace('!CDATA','<!CDATA').replace('&gt;','>') to the .innerHTML strings of the MediaFile elements.

edited Sep 20, 2019 at 14:46

answered Sep 20, 2019 at 14:00

Carsten Massmann


1 Comment

kemicofa Over a year ago

Nice solution but unfortunately it removes the <! from <!CDATA

Roman Panevnyk · Accepted Answer · 2019-09-20 13:53:09Z

1

You can try to parse it.

   const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;
    
    const parser = new DOMParser();
    const xmlDoc = parser.parseFromString(data,"text/html");
    
    console.log(xmlDoc.getElementsByTagName("MediaFile")[0].innerHTML);
    console.log(xmlDoc.getElementsByTagName("MediaFile")[1].innerHTML);
    console.log(xmlDoc.getElementsByTagName("MediaFile")[2].innerHTML);

answered Sep 20, 2019 at 13:53

Roman Panevnyk


2 Comments

Roman Panevnyk Over a year ago

text/html or application/xml

kemicofa Over a year ago

Thanks for the answer; however, this doesn't work because it converts <!CDATA to <!-- CDATA
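
For reference, the behavior asked about in the question (collecting every match instead of only the first) can be sketched with String.prototype.matchAll. In the question's snippet, regex.exec(data) returns a single match, so res[1] is that match's capture group while res[2] and res[3] are undefined, because the pattern has only one group. The sketch below keeps the question's regex and assumes a runtime that supports matchAll; the data string is a shortened stand-in for the sample (attributes trimmed), and the name contents is only illustrative. It does not make regex-based parsing of arbitrary XML reliable.

```javascript
// Shortened stand-in for the sample data from the question (attributes trimmed).
const data =
  '<MediaFile type="video/mp4">https://example1.com/file.mp4</MediaFile>' +
  '<MediaFile type="video/mp4">https://example2.com/file.mp4</MediaFile>' +
  '<MediaFile type="video/mp4"><!CDATA some data example></MediaFile>';

// Same pattern as in the question; matchAll requires the g flag.
const regex = /<MediaFile[^>]*type="video\/mp4"[^>]*>([\s\S]*?)<\/MediaFile>/g;

// Each match is an array: index 0 is the full match, index 1 is capture group 1.
// trim() drops the line breaks around the content seen in the multi-line form.
const contents = Array.from(data.matchAll(regex), (m) => m[1].trim());

contents.forEach((c, i) => console.log(`match ${i + 1}:`, c));
```

On runtimes without matchAll, the same result can be obtained by calling regex.exec(data) repeatedly in a while loop until it returns null, pushing m[1] on each iteration.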
