Can't use javascript regex to get everything between html/xml tags

Question

So I receive some XML as plain text (and no, I can't use DOM or JSON, because apparently I am not allowed to). I want to extract all elements enclosed in a certain element and put them into an array, where I can pull out the text of the individual segments. I am used to POSIX regex; I will never really understand the point of PCRE regex, nor do I get the syntax.

Now here is the code I am using:

var strResponse = objResponse.text;
var strRegex = new RegExp("<item>(.*?)<\/item>","i");
var arrMatches = "";
var match;
while (match = strRegex.exec(strResponse)) {
    arrMatches[] = match[1];
}

I have no idea why it won't find any matches with this code. Can someone please help me with this, and perhaps explain what exactly I keep doing wrong with the PCRE syntax?

"Can't use javascript regex to get everything between html/xml tags…" Exactly, you can't use a JavaScript regex to parse HTML/XML. HTML and XML are not regular languages, and so cannot be parsed reliably with regular expressions. Many have tried. Many have failed. You'll need recursive descent, or a state machine, etc.; that is, a proper parser. If it's XML, it'll be a lot simpler than if it's HTML, which is not well-formed and thus requires dramatically more domain-specific knowledge.
– T.J. Crowder, Commented Jul 6, 2011 at 14:19
Separately: You should be getting a syntax error with that code; this arrMatches[] = match[1]; is invalid. You have to have something within the []. It's not clear what you're using the brackets for, as you've assigned a string to arrMatches.
– T.J. Crowder, Commented Jul 6, 2011 at 14:23
The dot character would be my first suspect, but without some sample markup I can't be sure. Can you post some sample markup as well?
– Mrchief, Commented Jul 6, 2011 at 14:23
@T.J., yeah, I forgot to define it as an array, because I had to quickly write the last part just so the example would make more sense.
– user328570, Commented Jul 7, 2011 at 6:26
@Kris: Even if you assign an array to the variable, the syntax arrMatches[] = match[1]; is still incorrect. You need something inside the [] on the left. (I think you probably meant either arrMatches.push(match[1]); or arrMatches[arrMatches.length] = match[1];, both of which will add match[1] to the array.)
– T.J. Crowder, Commented Jul 7, 2011 at 6:46

stema · Accepted Answer · 2011-07-06 14:29:32Z

1

If those tags are on different lines, the . will not match the newline characters, and therefore your expression will not match. This is just a guess; I don't know your source.

You can try

var strRegex = new RegExp("<item>([\\s\\S]*?)<\\/item>","i");

[\\s\\S] is a character class containing all whitespace and all non-whitespace characters. Line breaks are covered by the whitespace characters.
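As a rough sketch of how the suggested pattern could be used to collect every match, the snippet below puts it in a loop. Several things here are assumptions rather than part of the answer above: the sample markup is made up (the question did not include any), the pattern is written as a regex literal, and the "g" flag is added because without it exec() keeps returning the first match and the while loop would never finish. The invalid arrMatches[] = ... from the question is replaced with push(), as suggested in the comments.

```javascript
// Made-up sample input; the question did not show any real markup.
const strResponse = [
  "<items>",
  "<item>first",
  "entry</item>",
  "<ITEM>second entry</ITEM>",
  "</items>"
].join("\n");

// [\s\S] matches any character including newlines; "i" ignores case.
// "g" (an addition, not in the answer) lets exec() advance through the string.
const itemRegex = /<item>([\s\S]*?)<\/item>/gi;

const arrMatches = [];
let match;
while ((match = itemRegex.exec(strResponse)) !== null) {
  // match[1] is the text captured between the tags.
  arrMatches.push(match[1]);
}

console.log(arrMatches);
```

This only works for flat, non-nested item elements without attributes; nested or malformed markup will produce wrong results, which is the limitation the comments keep pointing out.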

edited Jul 6, 2011 at 14:29

answered Jul 6, 2011 at 14:22

stema


13 Comments

T.J. Crowder Over a year ago

You need to either double your backslashes (as you're using a string), or much better, use literal syntax: var rex = /<item>([\s\S]*?)<\/item>/i; But it's all in vain regardless: you can't use regular expressions as an entire solution to parsing XML or HTML.

stema Over a year ago

@T.J. Crowder I added the backslashes to stick to his code. As the OP wrote he is not able to use a parser, so he knows the difficulties too, I assume. I do.

T.J. Crowder Over a year ago

@Kris: Again, I'm not suggesting you use anything "external." I'm suggesting that you can write a simple recursive-descent parser for XML, which will be dramatically more reliable than trying to use a screwdriver to hammer in a nail. Nothing external, you're either allowed to write code to solve this problem or you're not. It doesn't have to be big, not with XML.

T.J. Crowder Over a year ago

@Kris: This page will get you started. You shouldn't have too much trouble finding examples. "Recursive descent" just means that because elements can be nested, the parser will tend to call itself. You could also use a state machine. I'm curious who it is that won't "allow" you to use the excellent parser built into the browser, though.
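To illustrate what "the parser will tend to call itself" can look like, here is a minimal sketch of a recursive-descent parser. It is not the code from the linked page, and it assumes a tiny XML subset: plain elements, nesting, and text only, with no attributes, comments, CDATA, entities, or self-closing tags. The names parseXml, collectText, and textOf are hypothetical helpers invented for this example.

```javascript
// Minimal recursive-descent sketch for a tiny XML subset (assumption:
// no attributes, comments, CDATA, entities, or self-closing tags).
function parseXml(input) {
  let pos = 0;

  // Parses one element starting at "<name>" and returns { name, children }.
  function parseElement() {
    const open = /^<([A-Za-z_][\w.-]*)>/.exec(input.slice(pos));
    if (!open) throw new Error("Expected opening tag at position " + pos);
    const name = open[1];
    pos += open[0].length;

    const children = [];
    const closeTag = "</" + name + ">";
    while (pos < input.length) {
      if (input.startsWith(closeTag, pos)) {
        pos += closeTag.length;
        return { name, children };
      }
      if (input.startsWith("</", pos)) {
        throw new Error("Mismatched closing tag at position " + pos);
      }
      if (input[pos] === "<") {
        // Nested element: this is the recursive step.
        children.push(parseElement());
      } else {
        // Text runs until the next "<" (or the end of the input).
        let end = input.indexOf("<", pos);
        if (end === -1) end = input.length;
        children.push(input.slice(pos, end));
        pos = end;
      }
    }
    throw new Error("Missing " + closeTag);
  }

  const root = parseElement();
  if (input.slice(pos).trim() !== "") {
    throw new Error("Unexpected content after root element");
  }
  return root;
}

// Concatenates all text inside a node, including text of nested elements.
function textOf(node) {
  return node.children
    .map(c => (typeof c === "string" ? c : textOf(c)))
    .join("");
}

// Collects the text content of every descendant element with the given name.
function collectText(node, name, out = []) {
  for (const child of node.children) {
    if (typeof child === "string") continue;
    if (child.name === name) out.push(textOf(child));
    collectText(child, name, out);
  }
  return out;
}

const doc = parseXml(
  "<items><item>first\nentry</item><item>second <b>bold</b></item></items>"
);
const items = collectText(doc, "item");
console.log(items);
```

Unlike the regex approach, this tracks nesting and reports mismatched or missing closing tags, though a real XML document would need the omitted features handled as well.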

T.J. Crowder Over a year ago

@Kris: I was about to point you at the bits underlying that (I didn't realize you could use jQuery on your project; this is a place where jQuery isn't actually doing much, just dealing with an IE-ism). I'd recommend posting your own answer saying it turns out you could use the DOM parser after all (that's what that's doing) and accepting that.

Community · Accepted Answer · 2017-05-23 11:59:17Z

0

The best way to complete this task is to parse the text as proper HTML and navigate it with the DOM parser, as described in the question "Javascript function to parse HTML string into DOM?". Regex is fragile for this and is generally a poor fit for parsing non-regular text such as HTML structure.

edited May 23, 2017 at 11:59

answered Feb 11, 2013 at 12:07

user328570
