So I receive some XML as plain text (and no, I can't use the DOM or JSON, because apparently I am not allowed to). I want to extract every element enclosed in a certain element and put them into an array, so that I can then pull the text out of the individual segments. I am used to POSIX regex, and I have never really understood the point of PCRE-style regex, nor do I get the syntax.

Now here is the code I am using:

var strResponse = objResponse.text;
var strRegex = new RegExp("<item>(.*?)<\/item>","i");
var arrMatches = "";
var match;
while (match = strRegex.exec(strResponse)) {
    arrMatches[] = match[1];
}

I have no idea why it won't find any matches with this code, can someone please help me on this and perhaps elaborate on what exactly it is I am continuously doing wrong with the PCRE syntax?

  • "Can't use javascript regex to get everything between html/xml tags…" Exactly, you can't use a JavaScript regex to parse html/xml. HTML and XML are not regular languages, and so cannot be parsed reliably with regular expressions. Many have tried. Many have failed. You'll need recursive descent, or a state machine, etc. -- e.g., a proper parser. If it's XML, it'll be a lot simpler than if it's HTML, which is not well-formed and thus requires dramatically more domain-specific knowledge. Commented Jul 6, 2011 at 14:19
  • Separately: You should be getting a syntax error with that code, this arrMatches[] = match[1]; is invalid. You have to have something within the []. It's not clear what you're using the brackets for, as you've assigned a string to arrMatches. Commented Jul 6, 2011 at 14:23
  • The dot character would be my first suspect but then, without some sample markup, I can't be sure. Can you post some sample markup as well? Commented Jul 6, 2011 at 14:23
  • @T.J., yeah I forgot to define it as an array, because I had to just quickly write the last part, to make it give more sense. Commented Jul 7, 2011 at 6:26
  • @Kris: Even if you assign an array to the variable, the syntax arrMatches[] = match[1]; is still incorrect. You need something inside the [] on the left. (I think you probably meant either arrMatches.push(match[1]); or arrMatches[arrMatches.length] = match[1];, both of which will add match[1] to the array.) Commented Jul 7, 2011 at 6:46

2 Answers

If those tags are on different lines, the . will not match the newline characters, and therefore your expression will not match. This is just a guess; I don't know your source.

You can try

var strRegex = new RegExp("<item>([\\s\\S]*?)<\\/item>","i");

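[\\s\\S] is a character class containing all whitespace and all non-whitespace characters, so it matches any character at all. Line breaks are covered by the whitespace characters. The backslashes are doubled because the pattern is written as a string.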
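To show how this pattern might fit into the loop from the question, here is a minimal sketch. It assumes two changes that are not part of the answer above: the "g" flag (without it, exec() keeps returning the first match and the loop never ends) and arrMatches.push(...) in place of the invalid arrMatches[] syntax pointed out in the comments. The function name extractItems and the sample string are hypothetical, made up for illustration only.

```javascript
// Hypothetical helper: collects the inner text of every <item>...</item> pair.
function extractItems(xmlText) {
  // [\s\S] matches any character, including newlines.
  // "g" lets exec() advance through the string on each call (assumption:
  // this flag is an addition, it is not in the answer's pattern).
  var itemRegex = /<item>([\s\S]*?)<\/item>/gi;
  var matches = [];
  var match;
  while ((match = itemRegex.exec(xmlText)) !== null) {
    matches.push(match[1]); // capture group 1 is the text between the tags
  }
  return matches;
}

// Made-up sample input with an item that spans two lines.
var sample = "<items>\n<item>first\nentry</item>\n<item>second</item>\n</items>";
var items = extractItems(sample);
console.log(items.length);
```

This only works for flat, non-nested <item> elements without attributes; nested items, CDATA sections, or comments containing the tag text would still need a real parser, as the comments point out.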

13 Comments

You need to either double your backslashes (as you're using a string), or much better, use literal syntax: var rex = /<item>([\s\S]*?)<\/item>/i; But it's all in vain regardless, you can't use regular expressions as an entire solution to parsing XML or HTML.
@T.J. Crowder I added the backslashes to stick to his code. As the OP wrote he is not able to use a parser, so he knows the difficulties too, I assume. I do.
@Kris: Again, I'm not suggesting you use anything "external." I'm suggesting that you can write a simple recursive-descent parser for XML, which will be dramatically more reliable than trying to use a screwdriver to hammer in a nail. Nothing external, you're either allowed to write code to solve this problem or you're not. It doesn't have to be big, not with XML.
@Kris: This page will get you started. You shouldn't have too much trouble finding examples. "Recursive descent" just means that because elements can be nested, the parser will tend to call itself. You could also use a state machine. I'm curious who it is that won't "allow" you to use the excellent parser built into the browser, though.
@Kris: I was about to point you at the bits underlying that (I didn't realize you could use jQuery on your project; this is a place where jQuery isn't actually doing much, just dealing with an IE-ism). I'd recommend posting your own answer saying it turns out you could use the DOM parser after all (that's what that's doing) and accepting that.

The best way to complete this task is to parse the text as proper HTML and navigate it with the DOM parser, as described in: Javascript function to parse HTML string into DOM? Regex is very error-prone for this and is in general not well suited to parsing non-regular text such as HTML structure.
