Detecting HTML formatted strings with regex

Question

I'm trying to detect whether a string is XML/HTML formatted, or some other format like CSV or JSON, which may contain HTML as data, or just generic text which may contain random < or > characters. I am NOT trying to validate complete XML or HTML documents--the strings I am testing may just be snippets of XML/HTML, or they may be snippets of something else. So, my criteria are that the string must contain at least one properly-formatted XML tag, and that tag must start at the beginning of the string, barring any whitespace. (At this point, you may have guessed that I am trying to auto-detect the mime-type of textual content before sending it back to the browser. BTW, I'm in PHP.)

I have a regex that will detect the XML/HTML tag:

~<[a-z]+.*?(>.*?</[a-z]+>|/>)~i

And I have a regex that will tell me if the tag starts the string, ignoring whitespace:

~^\s*<~

Problem is, I cannot figure out how to combine both of these into a single regex. The difficulty seems to stem from the "greedy" aspect of regex, particularly if the subject contains nested tags. Help?

BTW you should also consider that <?xml version="1.0"?><xmltag attr="1" /> is valid XML.
– dev-null-dweller, Commented Sep 13, 2013 at 21:46
@dev-null-dweller Yes. I tried that one, but it doesn't work if the subject contains nested XML tags.
– Code Cavalier, Commented Sep 13, 2013 at 21:50

Nicole · Accepted Answer · 2013-09-13 21:47:16Z

The following example seems to work for me:

<?php

$multiline = <<<'EOD'
<html>
<a>Another Tag</a>
</html>
EOD;

$singletag = <<<'EOD'
<html/>
EOD;


$badformat = <<<'EOD'
<html><html>
EOD;

$nothtml = <<<'EOD'
Plain text with a stray < and > in it
EOD;

$regex = '~^\s*<([a-z\:]+)[^>]*(?:/>|>.*</\1>)~sim';
echo preg_match($regex, $multiline) . "\n"; // 1
echo preg_match($regex, $singletag) . "\n"; // 1
echo preg_match($regex, $badformat) . "\n"; // 0
echo preg_match($regex, $nothtml) . "\n"; // 0

If you were using this on multiline HTML (which sounds likely), you didn't have the right modifiers:

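s for PCRE_DOTALL: the . character also matches newlines
m for PCRE_MULTILINE: ^ and $ also match at the start and end of each line, not only at the start and end of the whole string. Note that this lets the tag begin on any line; drop m (or anchor with \A) if the tag must open the string.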
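To see what each modifier changes, here is a small sketch. The sample string and variable names are made up for illustration; only preg_match and the s and m flags come from the answer.

```php
<?php
// Hypothetical sample: a line of plain text, then a tag spread over several lines.
$subject = "note\n<p>\ntext\n</p>";

// Without s, "." stops at a newline, so the closing tag is never reached.
$withoutS = preg_match('~<p>.*</p>~', $subject);
// With s (PCRE_DOTALL), "." also matches "\n".
$withS = preg_match('~<p>.*</p>~s', $subject);

// Without m, "^" matches only at the very start of the subject.
$withoutM = preg_match('~^\s*<p>~', $subject);
// With m (PCRE_MULTILINE), "^" also matches right after each newline.
$withM = preg_match('~^\s*<p>~m', $subject);

echo "$withoutS $withS $withoutM $withM\n"; // 0 1 0 1
```

The last pair is the one to watch: with m, a tag on the second line counts as "at the start", which may or may not be what you want.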

By the way:

I also made this more strict, so that it has to find a matching closing tag (using \1 backreference)
There are other valid starts to HTML/XML documents, as noted in the comments (e.g. HTML doctype or XML header). Regex may not be the best solution for this.
You can also consider not being so strict in requiring a tag at the beginning of the file, or creating further rules for creating a score for "best guess" document type.
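If you do want to stay with regex and still accept an XML header or doctype, one possible sketch is below. The function name looksLikeMarkup is hypothetical, and the pattern is an assumption about what "valid start" should mean, not a full XML/HTML grammar: it skips an optional XML declaration, a simple doctype, and leading comments, then requires a self-closed tag or a matching closing tag. It anchors with \A and leaves out the m modifier so the tag has to open the string.

```php
<?php
// Hypothetical helper, not part of the original answer.
function looksLikeMarkup(string $text): bool
{
    $pattern = '~\A\s*'
        . '(?:<\?xml\b[^>]*\?>\s*)?'             // optional XML declaration
        . '(?:<!DOCTYPE\b[^>]*>\s*)?'            // optional doctype (no internal subset)
        . '(?:<!--.*?-->\s*)*'                   // optional leading comments
        . '<([a-z][a-z0-9:-]*)(?=[\s/>])[^>]*'   // opening tag name, then attributes
        . '(?:/>|>.*</\1\s*>)'                   // self-closed, or a matching closing tag later on
        . '~si';

    return preg_match($pattern, $text) === 1;
}

var_dump(looksLikeMarkup('<?xml version="1.0"?><xmltag attr="1" />')); // bool(true)
var_dump(looksLikeMarkup("<!DOCTYPE html>\n<html><body></body></html>")); // bool(true)
var_dump(looksLikeMarkup('<html><html>'));          // bool(false)
var_dump(looksLikeMarkup('{"html": "<b>x</b>"}'));  // bool(false)
```

The lookahead after the tag name stops the engine from backtracking to a shorter name, so <ab>x</a> is not accepted. A lone void element such as <br> is still rejected, which matches the strictness of the regex above.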


4 Comments

Code Cavalier Over a year ago

Thanks so much for pointing out the modifiers...and I hate to beg the question, but, supposing I need this to also work with the HTML doctype or XML headers (as noted), what then? Is there a better way, besides regex, to do this?

Nicole Over a year ago

@CodeCavalier I suppose it depends on your goals and what kind of accuracy you need -- do you need to be 100% sure that it is actually HTML if you say it is? Or do you just need to assign the best guess for each? What if it looks like both (i.e., HTML with JS in it or JS with HTML in it)? If you need 100% precision, you could employ a parser for each language. On the other hand, if you need a best guess, you could relax your rules a little bit and look for hints (well-formed tags or JSON structure would be a good start).
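A rough sketch of the "best guess from hints" idea, assuming JSON is checked first and markup second. The function name guessContentType and the returned labels are illustrative choices, not something from the thread; json_decode and json_last_error are the standard PHP functions.

```php
<?php
// Hypothetical helper: returns a best-guess label, not a verified content type.
function guessContentType(string $text): string
{
    $trimmed = trim($text);

    // JSON hint: the snippet starts like an object or array and decodes cleanly.
    if ($trimmed !== '' && ($trimmed[0] === '{' || $trimmed[0] === '[')) {
        json_decode($trimmed);
        if (json_last_error() === JSON_ERROR_NONE) {
            return 'application/json';
        }
    }

    // Markup hint: the first non-whitespace content is a tag that is
    // self-closed or has a matching closing tag later on.
    if (preg_match('~\A<([a-z][a-z0-9:-]*)(?=[\s/>])[^>]*(?:/>|>.*</\1\s*>)~si', $trimmed) === 1) {
        return 'text/html';
    }

    return 'text/plain';
}

echo guessContentType('{"html": "<b>x</b>"}'), "\n"; // application/json
echo guessContentType('<p>hi</p>'), "\n";            // text/html
echo guessContentType('a < b and c > d'), "\n";      // text/plain
```

Checking JSON before markup means JSON that carries HTML as data is not mistaken for HTML. The order of the checks is the design decision here, so adjust it to whichever mistake is cheaper for you.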

Nicole Over a year ago

@CodeCavalier Keep in mind that there are so many edge cases that perfection is going to be difficult, so you need to figure out what tradeoffs are acceptable. For example, what about a JS file that starts out // <html></html> or an HTML file that starts out <html><!-- { ... }.

Nicole Over a year ago

@CodeCavalier I typed all that and then I just read your comment about not wanting to add content to a format that doesn't match. First of all, it should be easy to tell if a file is CSV. Check for presence of commas on every line (if it's more than one column). You can also probably assume it's CSV or JSON if it doesn't contain any HTML tags, etc. Lastly, the only truly correct approach is to keep track of the format in the first place. Any chance you can do that?
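The comma check described above could look roughly like this. The function name looksLikeCsv is hypothetical, and the rule (every non-empty line has at least one comma) is a deliberately naive assumption: it cannot see single-column CSV, and a one-line JSON object would also pass, so it is best run after the JSON and markup checks.

```php
<?php
// Hypothetical helper: true when every non-empty line contains at least one comma.
function looksLikeCsv(string $text): bool
{
    // Split on any newline sequence and ignore blank lines.
    $lines = preg_split('~\R~', trim($text));
    $lines = array_filter($lines, fn ($line) => trim($line) !== '');

    if (count($lines) === 0) {
        return false;
    }

    foreach ($lines as $line) {
        if (strpos($line, ',') === false) {
            return false;
        }
    }

    return true;
}

var_dump(looksLikeCsv("name,age\nAnn,30\n")); // bool(true)
var_dump(looksLikeCsv("hello\nworld"));       // bool(false)
```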
