I'm trying to detect whether a string is XML/HTML formatted, or some other format like CSV or JSON, which may contain HTML as data, or just generic text which may contain random < or > characters. I am NOT trying to validate complete XML or HTML documents--the strings I am testing may just be snippets of XML/HTML, or they may be snippets of something else. So, my criteria are that the string must contain at least one properly-formatted XML tag, and that tag must start at the beginning of the string, barring any whitespace. (At this point, you may have guessed that I am trying to auto-detect the mime-type of textual content before sending it back to the browser. BTW, I'm in PHP.)

I have a regex that will detect the XML/HTML tag:

~<[a-z]+.*?(>.*?</[a-z]+>|/>)~i

And I have a regex that will tell me if the tag starts the string, ignoring whitespace:

~^\s*<~

Problem is, I cannot figure out how to combine both of these into a single regex. The difficulty seems to stem from the "greedy" aspect of regex, particularly if the subject contains nested tags. Help?

  • Try: /<([^>]+)>.+?<\/\1>/ Commented Sep 13, 2013 at 21:36
  • ~^(\s+)?<[a-z]+.*?(>.*?</[a-z]+>|/>)~i ? Commented Sep 13, 2013 at 21:40
  • BTW you should also consider that <?xml version="1.0"?><xmltag attr="1" /> is valid XML. Commented Sep 13, 2013 at 21:46
  • @elclanrs Does that address the preceding whitespace? Commented Sep 13, 2013 at 21:49
  • @dev-null-dweller Yes. I tried that one, but it doesn't work if the subject contains nested XML tags. Commented Sep 13, 2013 at 21:50

1 Answer

The following example seems to work for me:

<?php

$multiline = <<<'EOD'
<html>
<a>Another Tag</a>
</html>
EOD;

$singletag = <<<'EOD'
<html/>
EOD;


$badformat = <<<'EOD'
<html><html>
EOD;

$nothtml = <<<'EOD'
<html><html>
EOD;

$regex = '~^\s*<([a-z\:]+)[^>]*(?:/>|>.*</\1>)~sim';
echo preg_match($regex, $multiline) . "\n"; // 1
echo preg_match($regex, $singletag) . "\n"; // 1
echo preg_match($regex, $badformat) . "\n"; // 0
echo preg_match($regex, $nothtml) . "\n"; // 0

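If you were using this on multiline HTML (which sounds likely), you didn't have the right modifiers:

  • s for PCRE_DOTALL: the . character also matches newlines, so .* can span the lines between the opening and closing tags
  • m for PCRE_MULTILINE: ^ and $ match at the start and end of every line, not only at the start and end of the whole string. Note that this lets the pattern match a tag that begins any line of the subject; if the tag must begin the whole string, drop m or use \A in place of ^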
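
To see what the modifiers do in practice, here is a minimal sketch. The variable names and the two variant patterns ($anchored and $noDotAll) are assumptions made for illustration; only $withM is the pattern from the answer.

```php
<?php
// $withM is the answer's pattern. $anchored and $noDotAll are illustrative
// variants (assumptions, not part of the answer).
$withM    = '~^\s*<([a-z\:]+)[^>]*(?:/>|>.*</\1>)~sim';
$anchored = '~\A\s*<([a-z\:]+)[^>]*(?:/>|>.*</\1>)~si'; // \A = start of subject only
$noDotAll = '~\A\s*<([a-z\:]+)[^>]*(?:/>|>.*</\1>)~i';  // . stops at newlines

$tagFirst = "  \n<p>\nhello\n</p>";       // tag after leading whitespace
$tagLater = "name,value\n<b>bold</b>,1";  // CSV-like text, tag starts line 2

$results = [
    'withM_tagFirst'    => preg_match($withM, $tagFirst),    // 1
    'withM_tagLater'    => preg_match($withM, $tagLater),    // 1: ^ matched at line 2
    'anchored_tagFirst' => preg_match($anchored, $tagFirst), // 1
    'anchored_tagLater' => preg_match($anchored, $tagLater), // 0
    'noDotAll_tagFirst' => preg_match($noDotAll, $tagFirst), // 0: .* cannot cross "\n"
];
print_r($results);
```

The second result is the one to watch: with m, a CSV row that happens to begin with a tag is enough for a match, which is probably not what the question wants.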

By the way:

  • I also made this more strict, so that it has to find a matching closing tag (using \1 backreference)
  • There are other valid starts to HTML/XML documents, as noted in the comments (e.g. HTML doctype or XML header). Regex may not be the best solution for this.
  • You can also consider not being so strict in requiring a tag at the beginning of the file, or creating further rules for creating a score for "best guess" document type.
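
One way to accommodate the other valid starts mentioned above is to allow an optional prolog before the first tag. This is only a rough sketch, not a validator: the function name looksLikeMarkup and the prolog pattern are assumptions for illustration, and the DOCTYPE part deliberately handles only the simple form without an internal subset.

```php
<?php
// Hypothetical helper: true when the string starts (after optional
// whitespace) with an optional XML declaration, comments and/or a simple
// DOCTYPE, followed by an element that is either self-closed or has a
// matching closing tag somewhere later in the string.
function looksLikeMarkup(string $text): bool
{
    $prolog = '(?:<\?xml\b.*?\?>\s*)?'      // optional XML declaration
            . '(?:<!--.*?-->\s*)*'          // any number of comments
            . '(?:<!DOCTYPE\b[^>]*>\s*)?'   // optional simple DOCTYPE
            . '(?:<!--.*?-->\s*)*';         // comments after the DOCTYPE
    // Tag name, then attributes, then "/>" or ">...</name>".
    $element = '<([a-z][a-z0-9:._-]*)(?=[\s/>])[^>]*(?:/>|>.*</\1\s*>)';

    return preg_match('~\A\s*' . $prolog . $element . '~si', $text) === 1;
}

var_dump(looksLikeMarkup('<?xml version="1.0"?><xmltag attr="1" />'));          // bool(true)
var_dump(looksLikeMarkup("<!DOCTYPE html>\n<html><body></body></html>"));       // bool(true)
var_dump(looksLikeMarkup('// <html></html>'));                                  // bool(false)
var_dump(looksLikeMarkup('<html><html>'));                                      // bool(false)
```

It uses \A rather than ^ with the m modifier, so the tag (or prolog) has to begin the whole string rather than any line.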

4 Comments

Thanks so much for pointing out the modifiers...and I hate to beg the question, but, supposing I need this to also work with the HTML doctype or XML headers (as noted), what then? Is there a better way, besides regex, to do this?
@CodeCavalier I suppose it depends on your goals and what kind of accuracy you need -- do you need to be 100% sure that it is actually HTML if you say it is? Or do you need to just assign the best guess for each? What if it looks like both (i.e., HTML with JS in it or JS with HTML in it)? If you need 100% precision, you could employ a parser for each language. On the other hand, if you need a best guess, you could relax your rules a little bit and look for hints (well-formed tags or JSON structure would be a good start).
@CodeCavalier Keep in mind that there are so many edge cases that perfection is going to be difficult, so you need to figure out what tradeoffs are acceptable. For example, what about a JS file that starts out with // <html></html>, or an HTML file that starts out with <html><!-- { ... }?
@CodeCavalier I typed all that and then I just read your comment about not wanting to add content to a format that doesn't match. First of all, it should be easy to tell if a file is CSV. Check for presence of commas on every line (if it's more than one column). You can also probably assume it's CSV or JSON if it doesn't contain any HTML tags, etc. Lastly, the only truly correct approach is to keep track of the format in the first place. Any chance you can do that?
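
For what it's worth, the "best guess" idea from these comments could be sketched roughly as follows. Everything here is an assumption for illustration: the function name guessMimeType, the order of the checks, reporting XML and HTML alike as text/html, and the naive CSV heuristic (same non-zero comma count on every line, which quoted fields containing commas will defeat).

```php
<?php
// Hypothetical best-guess detector. Checks run from most to least specific.
function guessMimeType(string $text): string
{
    $trimmed = trim($text);
    if ($trimmed === '') {
        return 'text/plain';
    }

    // JSON: only try objects and arrays, so a bare number or word is not
    // reported as JSON.
    if ($trimmed[0] === '{' || $trimmed[0] === '[') {
        json_decode($trimmed);
        if (json_last_error() === JSON_ERROR_NONE) {
            return 'application/json';
        }
    }

    // Markup: the string begins with a tag that is self-closed or has a
    // matching closing tag. \A anchors to the start of the whole string.
    if (preg_match('~\A<([a-z][a-z0-9:]*)[^>]*(?:/>|>.*</\1>)~si', $trimmed) === 1) {
        return 'text/html';
    }

    // CSV: at least two lines, each with the same non-zero number of commas.
    $lines = preg_split('~\R~', $trimmed);
    if (count($lines) > 1) {
        $counts = array_map(function ($line) {
            return substr_count($line, ',');
        }, $lines);
        if ($counts[0] > 0 && count(array_unique($counts)) === 1) {
            return 'text/csv';
        }
    }

    return 'text/plain';
}

echo guessMimeType('{"html": "<b>x</b>"}') . "\n"; // application/json
echo guessMimeType("  <p>hi</p>") . "\n";          // text/html
echo guessMimeType("a,b\n1,2\n3,4") . "\n";        // text/csv
echo guessMimeType('// <html></html>') . "\n";     // text/plain
```

Checking JSON before markup means JSON that carries HTML as data is not mistaken for HTML. As said above, though, keeping track of the format in the first place is still the only fully reliable approach.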
