PHP regex for valid XML tag name

Question

What is a good general regex (in PHP terms) to determine if a string is a valid XML tag name?

I started with /[^>]+/i, but that also matches something like 4 <<, which obviously isn't a valid tag name.

So I tried combining all valid characters, like /[a-z][a-z0-9_-]*/i, which also isn't quite right, as XML allows virtually any character in tag names, including characters from non-English languages.

I'm stuck on that now - should I just check if there are whitespace characters? Or is there more to it?

Gordon · Accepted Answer · 2011-09-21 06:53:48Z

10

Why don't you just use an XML parser/generator, which already knows the rules?

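// Returns true if $elementName is accepted by the DOM extension as an element name.
function isValidXmlElementName($elementName)
{
    try {
        // The DOMElement constructor throws a DOMException for an invalid name.
        new DOMElement($elementName);
    } catch (DOMException $e) {
        return false;
    }
    return true;
}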

var_dump(isValidXmlElementName(' ')); // false 
var_dump(isValidXmlElementName('1')); // false
var_dump(isValidXmlElementName('-')); // false
var_dump(isValidXmlElementName('a')); // true

Mark Byers · Answer · 2011-09-21 06:40:29Z

4

From the XML specification:

[4]     NameStartChar      ::=      ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]    NameChar       ::=      NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]     Name       ::=      NameStartChar (NameChar)*

1 Comment

F.P Over a year ago

That looks good, but how can I adapt that to a PHP regex? Will the interpreter understand range values like #xC0-#xD6?
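
On the question in the comment: PCRE does not understand the spec's #xC0 notation, but PHP's preg functions accept \x{C0}-style code point escapes when the u (UTF-8) modifier is set. The following is a minimal sketch of the productions quoted above translated that way. The function name isXmlName is made up for illustration, and the pattern only checks the XML 1.0 Name production: it applies no namespace rules, so a colon is accepted anywhere in the name.

```php
<?php
// Sketch only: the XML 1.0 Name production as a PCRE pattern.
// isXmlName() is a hypothetical helper, not part of PHP or the XML spec.
function isXmlName(string $name): bool
{
    // Production [4] NameStartChar, with #xNNNN rewritten as \x{NNNN}.
    // Single quotes keep the backslashes literal so PCRE sees the escapes.
    $nameStartChar = ':A-Z_a-z'
        . '\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}'
        . '\x{370}-\x{37D}\x{37F}-\x{1FFF}'
        . '\x{200C}-\x{200D}\x{2070}-\x{218F}'
        . '\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}'
        . '\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}'
        . '\x{10000}-\x{EFFFF}';

    // Production [4a] NameChar: NameStartChar plus "-", ".", digits, etc.
    $nameChar = $nameStartChar
        . '\-.0-9\x{B7}\x{300}-\x{36F}\x{203F}-\x{2040}';

    // Production [5] Name. The u modifier makes PCRE treat the pattern and
    // the subject as UTF-8; \A and \z anchor the match to the whole string.
    $pattern = '/\A[' . $nameStartChar . '][' . $nameChar . ']*\z/u';

    // preg_match() returns false for a subject that is not valid UTF-8,
    // which this comparison also treats as "not a name".
    return preg_match($pattern, $name) === 1;
}

var_dump(isXmlName('a'));       // bool(true)
var_dump(isXmlName('élément')); // bool(true)
var_dump(isXmlName('1'));       // bool(false)
var_dump(isXmlName('4 <<'));    // bool(false)
```

The subject string has to be UTF-8 encoded for this to work. A parser-based check remains the safer choice when the DOM extension is available, since it does not depend on a hand-copied character table.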

hoppa · Answer · 2011-09-21 06:55:06Z

From the same specification, stated a bit more clearly:

"Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references."

As far as I can tell from that, almost anything goes. As Gordon says in his answer, using a parser that already knows the rules is best!
