What is a good general regex (in PHP terms) to determine if a string is a valid XML tag name?

I started with /[^>]+/i, but that also matches something like 4 <<, which obviously isn't a valid tag name.

So I tried listing the valid characters explicitly, as in /[a-z][a-z0-9_-]*/i, but that isn't quite right either: it is unanchored, and XML allows far more than ASCII letters in tag names, including characters from non-Latin scripts.

I'm stuck on that now - should I just check if there are whitespace characters? Or is there more to it?

3 Answers

Why don't you just use an XML parser/generator that already knows the rules? The DOMElement constructor throws a DOMException when given an invalid name:

function isValidXmlElementName($elementName)
{
    try {
        new DOMElement($elementName);
    } catch (DOMException $e) {
        return false;
    }
    return true;
}

var_dump(isValidXmlElementName(' ')); // false 
var_dump(isValidXmlElementName('1')); // false
var_dump(isValidXmlElementName('-')); // false
var_dump(isValidXmlElementName('a')); // true

From the XML specification:

[4]     NameStartChar      ::=      ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]    NameChar       ::=      NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]     Name       ::=      NameStartChar (NameChar)*

1 Comment

That looks good, but how can I adapt that in PHP regex? Will the interpreter understand the range values like #xC0-#xD6?
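
PCRE does not understand the #xC0-#xD6 notation directly, but each code point can be written as \x{C0} once the pattern carries the /u (UTF-8) modifier. Below is a minimal sketch that transcribes productions [4], [4a] and [5] quoted above into a PHP pattern. The function name isValidXmlNameRegex is made up for illustration, and the code assumes the input is a UTF-8 string. It follows the quoted grammar only, so it may not agree with DOMElement in every edge case.

```php
<?php
// Hypothetical helper: checks a string against the XML "Name" production
// as quoted from the specification ([4], [4a], [5]).
function isValidXmlNameRegex(string $name): bool
{
    // [4] NameStartChar, written as the inside of a character class.
    $nameStartChar = ':A-Z_a-z'
        . '\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}'
        . '\x{370}-\x{37D}\x{37F}-\x{1FFF}'
        . '\x{200C}-\x{200D}\x{2070}-\x{218F}'
        . '\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}'
        . '\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}'
        . '\x{10000}-\x{EFFFF}';

    // [4a] NameChar = NameStartChar plus "-", ".", digits and a few extras.
    $nameChar = $nameStartChar
        . '\-.0-9\x{B7}\x{300}-\x{36F}\x{203F}-\x{2040}';

    // [5] Name = NameStartChar (NameChar)*
    // \A and \z anchor the match to the whole string; /u enables UTF-8 mode.
    $pattern = '/\A[' . $nameStartChar . '][' . $nameChar . ']*\z/u';

    // preg_match returns false on invalid UTF-8, which is also "not valid".
    return preg_match($pattern, $name) === 1;
}

var_dump(isValidXmlNameRegex(' '));       // false
var_dump(isValidXmlNameRegex('1'));       // false
var_dump(isValidXmlNameRegex('-'));       // false
var_dump(isValidXmlNameRegex('a'));       // true
var_dump(isValidXmlNameRegex('élément')); // true
```

Note that the colon is allowed by this grammar, so namespace-prefixed names such as a:b pass. If you need names without a prefix, dropping ':' from $nameStartChar should be enough.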
From the same specification, stated a bit more clearly:

"Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references."

As far as I can tell, almost anything goes apart from whitespace and most symbol and punctuation characters. As Gordon states in his answer, using a parser that already knows the rules is the best option!
