I'm just trying my hand at crafting my very first regex. I want to be able to match a pseudo HTML element and extract useful information such as tag name, attributes etc.:

$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

if (preg_match('/<(\w+?)(\s\w+?\s*=\s*".*?")+\s*>/', $string, $matches)) {
    print_r($matches);
}

Except, I'm getting:

Array ( [0] =>  [1] => testtag [2] => gamma="abc" ) 

Anyone know how I can get the other attributes? What am I missing?

  • Your very first regex should not be for matching HTML/XML, as this is the one thing that regexes are genuinely bad at. Believe me, they suck at it, and you should avoid using them for it right from the start. Commented Jul 6, 2009 at 15:59
  • But you have to admit it's a good way to learn their limitations. ;) Commented Jul 6, 2009 at 18:04
  • Probably, yes. ;-) It's easy to develop an "anything goes" attitude with regex, making you think that everything that is represented as text is text. XML and HTML are not text, they are structured data, and should be processed with data tools, not text tools. Best time to present the warning is when someone just begins with regex. :) Commented Jul 7, 2009 at 8:28
  • Thanks to all the people who tried to answer my question. It's looking like it's not possible to do it the way I wanted. Bah humbug! Why use one line of code when you can use twenty or even a whole library? Down with PHP, long live .NET! Commented Jul 11, 2009 at 14:45

3 Answers

Try this regular expression:

/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]*))*)\s*>/

But you really shouldn’t use regular expressions for a context-free language like HTML. Use a real parser instead.
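As a rough sketch of what "a real parser" could look like here, the example below uses PHP's bundled DOM extension. This is only one possible choice, not something the question prescribes. It assumes the DOM extension is enabled, and it assumes the element is closed (the `</testtag>` is added for illustration), since an XML parser will reject an unclosed tag.

```php
<?php
// Assumption: the fragment is well-formed XML, so a closing tag is added here.
$string = '<testtag alpha="value" beta="xyz" gamma="abc"  ></testtag>';

$doc = new DOMDocument();
$doc->loadXML($string);

// The root element of the parsed fragment.
$element = $doc->documentElement;

// Collect every attribute into a name => value map.
$attributes = [];
foreach ($element->attributes as $attr) {
    $attributes[$attr->name] = $attr->value;
}

echo $element->tagName, "\n";
print_r($attributes);
```

The parser handles quoting and whitespace itself, so there is no pattern to maintain, at the cost of requiring well-formed input.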


3 Comments

Care to elaborate on what you mean by 'real parser'?
@Tim Lytle: Regexes are no parsers. They are part of parsers, at most. A real parser is an XML DOM parser, for example - it can parse languages, whereas regexes can only find patterns.
@Tomalak Ah, did not understand what he meant. Makes perfect sense now.

As has been said, don't use RegEx for parsing HTML documents.

Try this PHP parser instead: http://simplehtmldom.sourceforge.net/


Your second capturing group matches the attributes one at a time, each time overwriting the previous one. If you were using .NET regexes, you could use the Captures array to retrieve the individual captures, but I don't know of any other regex flavor that has that feature. Usually you have to do something like capture all of the attributes in one group, then use another regex on the captured text to break out the individual attributes.
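The two-step approach described above might look something like the following in PHP. Treat it as a sketch: it assumes double-quoted attribute values only, as in the question's sample string, and the variable names (`$tagPattern`, `$pairs`, `$attributes`) are made up for the example.

```php
<?php
$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

// Step 1: capture the tag name and the whole run of attributes as ONE group,
// using a non-capturing group for the repeated part so nothing is overwritten.
$tagPattern = '/<(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*>/';

$attributes = [];
if (preg_match($tagPattern, $string, $tag)) {
    // Step 2: run a second regex over the captured text to split out
    // each name="value" pair.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $tag[2], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }

    echo $tag[1], "\n";   // the tag name
    print_r($attributes); // alpha, beta and gamma with their values
}
```

With the sample string, `$tag[1]` holds `testtag` and `$attributes` ends up with all three attributes rather than just the last one.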

This is why people tend to either love regexes or hate them (or both). You can do some truly amazing things with them, but you also keep running into simple tasks like this one that are ridiculously hard, if not impossible.
