PHP RegEx Grouping Multiple Matches

Question

I'm just trying my hand at crafting my very first regex. I want to be able to match a pseudo HTML element and extract useful information such as tag name, attributes etc.:

$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

if (preg_match('/<(\w+?)(\s\w+?\s*=\s*".*?")+\s*>/', $string, $matches)) {
    print_r($matches);
}

Except, I'm getting:

Array ( [0] =>  [1] => testtag [2] => gamma="abc" )

Anyone know how I can get the other attributes? What am I missing?

Your very first regex should not be for matching HTML/XML, as this is the one thing that regexes are genuinely bad at. Believe me, they suck at it, and you should avoid using them for it right from the start.
– Tomalak, Commented Jul 6, 2009 at 15:59
But you have to admit it's a good way to learn their limitations. ;)
– Alan Moore, Commented Jul 6, 2009 at 18:04
Probably, yes. ;-) It's easy to develop an "anything goes" attitude with regex, making you think that everything that is represented as text is text. XML and HTML are not text, they are structured data, and should be processed with data tools, not text tools. The best time to present the warning is when someone is just beginning with regex. :)
– Tomalak, Commented Jul 7, 2009 at 8:28
Thanks to all the people who tried to answer my question. It's looking like it's not possible to do it the way I wanted. Bah humbug! Why use one line of code when you can use twenty or even a whole library? Down with PHP, long live .NET!
– Guillermo Phillips, Commented Jul 11, 2009 at 14:45

Gumbo · Accepted Answer · 2009-07-06 15:50:00Z

3

Try this regular expression:

/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]*))*)\s*>/

But you really shouldn’t use regular expressions for a context-free language like HTML. Use a real parser instead.
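To make it clearer what the pattern above gives you, here is a minimal sketch that runs it against the string from the question. The variable names ($pattern, $tagName, $attributeText) are just illustrative. Note that the attributes come back as one block of text in group 2, not as separate matches.

```php
<?php
// Input string taken from the question.
$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

// Pattern from this answer, written as a single-quoted PHP string
// (the single quotes inside the pattern have to be escaped).
// Group 1 is the tag name, group 2 is the whole run of attributes.
$pattern = '/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\'">\s]*))*)\s*>/';

$tagName = null;
$attributeText = null;

if (preg_match($pattern, $string, $matches)) {
    $tagName = $matches[1];        // testtag
    $attributeText = $matches[2];  // ' alpha="value" beta="xyz" gamma="abc"' (with the leading space)
}

var_dump($tagName, $attributeText);
```

Splitting $attributeText into individual name/value pairs still needs a second step.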


3 Comments

Tim Lytle Over a year ago

Care to elaborate on what you mean by 'real parser'?

Tomalak Over a year ago

@Tim Lytle: Regexes are not parsers. At most, they are part of one. A real parser is an XML DOM parser, for example: it can parse languages, whereas regexes can only find patterns.

Tim Lytle Over a year ago

@Tomalak Ah, did not understand what he meant. Makes perfect sense now.

Peter Boughton · Accepted Answer · 2009-07-06 17:57:58Z

1

As has been said, don't use RegEx for parsing HTML documents.

Try this PHP parser instead: http://simplehtmldom.sourceforge.net/
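If you would rather not pull in a library, PHP's bundled DOM extension can do a similar job. This is a different tool from the one linked above, and the sketch below assumes the DOM extension is enabled and PHP 7.1 or later. The helper name readAttributes is made up for this example.

```php
<?php
// Hypothetical helper: returns the attributes of the first <$tagName>
// element found in $markup, or null if there is no such element.
function readAttributes(string $markup, string $tagName): ?array
{
    // Unknown tags such as <testtag> make libxml report errors;
    // collect them internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($markup);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $document->getElementsByTagName($tagName)->item(0);
    if ($element === null) {
        return null;
    }

    // Copy each attribute node into a plain name => value array.
    $attributes = [];
    foreach ($element->attributes as $attribute) {
        $attributes[$attribute->nodeName] = $attribute->nodeValue;
    }

    return $attributes;
}

print_r(readAttributes('<testtag alpha="value" beta="xyz" gamma="abc"  >', 'testtag'));
```

The parser copes with the unclosed tag and the stray whitespace on its own, which is the kind of detail a regex has to spell out by hand.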


Alan Moore · Accepted Answer · 2009-07-06 18:01:32Z

0

Your second capturing group matches the attributes one at a time, each time overwriting the previous one. If you were using .NET regexes, you could use the Captures array to retrieve the individual captures, but I don't know of any other regex flavor that has that feature. Usually you have to do something like capture all of the attributes in one group, then use another regex on the captured text to break out the individual attributes.
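The two-step approach described above could look roughly like this in PHP. It is only a sketch: the helper name parsePseudoTag is invented for the example, it assumes every attribute value is double-quoted as in the question, and it uses PHP 7.1+ type declarations.

```php
<?php
// Hypothetical helper: returns ['tag' => ..., 'attributes' => [...]],
// or null if the input does not contain a tag in the expected shape.
function parsePseudoTag(string $input): ?array
{
    // Step 1: capture the tag name (group 1) and the entire run of
    // attributes (group 2). The repeated part is a non-capturing group,
    // so nothing gets overwritten on each repetition.
    if (!preg_match('/<(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*>/', $input, $m)) {
        return null;
    }

    // Step 2: run a second regex over the captured attribute text to
    // break out the individual name/value pairs.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);

    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }

    return ['tag' => $m[1], 'attributes' => $attributes];
}

print_r(parsePseudoTag('<testtag alpha="value" beta="xyz" gamma="abc"  >'));
```

For the string in the question this yields the tag name testtag and all three attributes (alpha, beta, gamma) rather than only the last one.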

This is why people tend to either love regexes or hate them (or both). You can do some truly amazing things with them, but you also keep running into simple tasks like this one that are ridiculously hard, if not impossible.
