3

I found this useful regex code here while looking to parse HTML tag attributes:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

It works great, but it's missing one key element that I need. Some attributes are event triggers that contain inline JavaScript code, like this:

onclick="doSomething(this, 'foo', 'bar');return false;"

Or:

onclick='doSomething(this, "foo", "bar");return false;'

I can't figure out how to get the original expression to not count the quotes from the JS (single or double) while it's nested inside the set of quotes that contain the attribute's value.

I SHOULD add that this is not being used to parse an entire HTML document. It's used as an argument in an older "array to select menu" function that I've updated. One of the arguments is a tag that can append extra HTML attributes to the form element.

I've made an improved function and am deprecating the old... but in case somewhere in the code is a call to the old function, I need it to parse these into the new array format. Example:

// Old Function
function create_form_element($array, $type, $selected="", $append_att="") { ... }
// Old Call
create_form_element($array, SELECT, $selected_value, "onchange=\"something(this, '444');\"");

The new version takes an array of attr => value pairs to add extra attributes to the tag.

create_select($array, $selected_value, array('style' => 'width:250px;', 'onchange' => "doSomething('foo', 'bar')"));

This is merely a backwards compatibility issue where all calls to the OLD function are routed to the new one, but the $append_att argument in the old function needs to be made into an array for the new one, hence my need to use regex to parse small HTML snippets. If there is a better, light-weight way to accomplish this, I'm open to suggestions.

4
  • 3
    do not parse html with regex... -_- Commented Mar 8, 2010 at 10:06
  • 2
    stackoverflow.com/questions/1732348/… Commented Mar 8, 2010 at 10:10
  • 2
    This is exactly why you shouldn't parse html with regex. Commented Mar 8, 2010 at 10:21
  • 1
    so, title="Hello 'cruel' World" wouldn't work either. Commented Mar 8, 2010 at 10:40
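
For the "better, light-weight way" the question asks about, the comments above point away from regex. One possible sketch, assuming PHP's DOM extension is available: wrap the attribute string in a throwaway tag and let the HTML parser read the attributes. The helper name `attrs_via_dom()` and the dummy `div` wrapper are illustrative choices, not part of the original code.

```php
<?php
// Hypothetical helper: parse an attribute string by wrapping it in a dummy element.
function attrs_via_dom(string $attrString): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div ' . $attrString . '></div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    $element = $doc->getElementsByTagName('div')->item(0);
    if ($element !== null) {
        foreach ($element->attributes as $attribute) {
            // The parser has already removed the outer quotes from the value.
            $result[$attribute->name] = $attribute->value;
        }
    }
    return $result;
}

print_r(attrs_via_dom("onchange=\"something(this, '444');\" style='width:250px;'"));
```

Nested quotes of the other kind are handled by the parser, so the `onclick` examples from the question need no special treatment. Note that the parser also decodes character entities in values.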

3 Answers

4

The problem with your regular expression is that it tries to handle both single and double quotes at the same time. It doesn't support attribute values that contain the other quote. This regex will work better:

(\w+)=("[^<>"]*"|'[^<>']*'|\w+)
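
As a rough sketch of how this pattern could feed the new array-based function, the helper below runs the regex over an old-style attribute string and builds an attr => value array. The name `parse_attr_string()` is hypothetical, and stripping the outer quotes is an added step that the regex itself does not perform.

```php
<?php
// Hypothetical helper: convert an old-style attribute string into an attr => value array.
function parse_attr_string(string $attrs): array
{
    // The regex from this answer: "double quoted", 'single quoted', or a bare word.
    $pattern = '/(\w+)=("[^<>"]*"|\'[^<>\']*\'|\w+)/';
    preg_match_all($pattern, $attrs, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $value = $m[2];
        $first = $value[0];
        // Remove the surrounding quotes only; quotes of the other kind inside the value are kept.
        if (strlen($value) >= 2 && ($first === '"' || $first === "'") && substr($value, -1) === $first) {
            $value = (string) substr($value, 1, -1);
        }
        $result[$m[1]] = $value;
    }
    return $result;
}

$old = "onchange=\"something(this, '444');\" style='width:250px;'";
print_r(parse_attr_string($old));
```

Because `\w+` does not match dashes, a name such as `data-foo` would be captured as `foo` with this pattern as posted.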

2 Comments

Close, but HTML 4.01 attribute values can contain angle brackets. Also, an attribute name and an unquoted attribute value may contain dashes, dots and colons. A better expression is thus: ([\w\-.:]+)\s*=\s*("[^"]*"|'[^']*'|[\w\-.:]+) (Pedantic, I know...)
Well, this doesn't work for text that contains something like a="b+c" without a tag around it. Maybe try: (<\w.)([\w\-.:]+)\s*=\s*("[^"]*"|'[^']*'|[\w\-.:]+)
2

The following regex patterns follow the HTML syntax spec available here:

http://www.w3.org/TR/html-markup/syntax.html

regex patterns

// valid tag names
$tagname = '[0-9a-zA-Z]+';
// valid attribute names
$attr = "[^\s\\x00\"'>/=\pC]+";
// valid unquoted attribute values
$uqval = "[^\s\"'=><`]*";
// valid single-quoted attribute values
$sqval = "[^'\\x00\pC]*";
// valid double-quoted attribute values
$dqval = "[^\"\\x00\pC]*";
// valid attribute-value pairs
$attrval = "(?:\s+$attr\s*=\s*\"$dqval\")|(?:\s+$attr\s*=\s*'$sqval')|(?:\s+$attr\s*=\s*$uqval)|(?:\s+$attr)"; 

and the final regex query will be

    // start tags + all attr formats
    $patt[] = "<(?'starttags'$tagname)(?'tagattrs'($attrval)*)\s*(?'voidtags'[/]?)>";

    // end tags
    $patt[] = "</(?'endtags'$tagname)\s*>"; // end tag

    // full regex pcre pattern
    $patt = implode("|", $patt);
    // search and match
    preg_match_all("#$patt#imuUs",$data,$matches);

hope this helps.

0

Even better would be to use backreferences. In PHP the regular expression would be:

([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["'])(.*?)\\2

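Here \\2 (which PHP's string escaping turns into \2) is a backreference to the second group, (["']), so the closing quote must be the same character as the opening one.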

This regular expression also matches attribute names containing _, - and :, which are allowed according to W3C. However, it won't match attributes whose values are not enclosed in quotes.
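
A minimal sketch of the backreference pattern applied to the two quoting styles from the question. The sample input and the variable names are made up for illustration; the second attribute is named `onchange` only so that the two results have distinct array keys.

```php
<?php
// Pattern from this answer. In a single-quoted PHP string, \2 reaches the regex engine unchanged
// and refers back to whichever quote character opened the value.
$pattern = '/([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["\'])(.*?)\2/s';

// Illustrative input: one double-quoted and one single-quoted value, each nesting the other quote.
$input = <<<'HTML'
onclick="doSomething(this, 'foo', 'bar');return false;" onchange='doSomething(this, "foo", "bar");return false;'
HTML;

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$attrs = [];
foreach ($matches as $m) {
    // $m[1] = attribute name, $m[2] = opening quote, $m[3] = value without the outer quotes
    $attrs[$m[1]] = $m[3];
}
print_r($attrs);
```

The lazy `(.*?)` stops at the first quote that matches the opening one, so quotes of the other kind inside the value are left alone. As posted, the `+` after the first character class means a name must be at least two characters long.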

5 Comments

What about: "attrib_name = unquoted_value"?
Spaces in attribute/value definitions are, as far as I know, not allowed. Or isn't that what you are asking?
Sorry for not being clear. Your solution matches attributes with values that are quoted (either name="dq val" or name='sq val'), but fails to match attributes with unquoted values (name=uq_val).
Yes, I'm aware of that, and I also mentioned that in my answer ;) Nevertheless, according to the W3 XHTML spec, attribute values must always be quoted. (w3.org/TR/xhtml1/#h-4.4) And now with HTML5 and data attributes filled with objects, the quoting is almost necessary.
Koen, extra whitespace between an attribute, the = sign, and the attribute value is allowed. Also, I don’t see how XHTML is relevant anno 2013. In HTML quotes around attribute values are optional (in general), but depending on the attribute value you want to use you’re still gonna need them every now and then. See Unquoted attribute values in HTML, CSS and JavaScript for more information.
