What is the best way to check whether a string contains valid HTML with correct syntax?

I tried HTMLParser from the html.parser module, assuming that if parsing raises no error, the string is valid HTML. However, that didn't help, because it parses even invalid strings without raising any errors.

from html.parser import HTMLParser

parser = HTMLParser()

parser.feed('<h1> hi')
parser.close()

I expected it to raise an exception since the closing tag is missing, but it didn't.

5
  • Which Python version do you use? docs.python.org/3/library/html.parser.html - quote: "This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.". You can also read this answer: stackoverflow.com/questions/24749103/… Commented Jul 4, 2019 at 11:39
  • @s3n0 I used Python 3. I hadn't seen that documentation. Is there another library that is recommended in such cases? Commented Jul 4, 2019 at 11:44
  • Of course... as I mentioned before... read this answer please: stackoverflow.com/a/27174001/9808870 ...it might help if you need it. Commented Jul 4, 2019 at 11:46
  • The answer in above link only checks if the tag is self closing or not. What I want is to find out if a string is valid html text. @s3n0 Commented Jul 4, 2019 at 11:52
  • Yes, the same module can be used :) (from bs4 import BeautifulSoup). Read this question and its answer: stackoverflow.com/questions/24856035/… Commented Jul 4, 2019 at 12:14

2 Answers

    from bs4 import BeautifulSoup

    st = """<html>
    <head><title>I'm title</title></head>
    </html>"""
    st1 = "who are you"

    # find() returns the first tag in the document, or None if there are no tags
    print(bool(BeautifulSoup(st, "html.parser").find()))   # True
    print(bool(BeautifulSoup(st1, "html.parser").find()))  # False

1 Comment

This doesn't work. It returns True for invalid html like <div>div> and <div<>div<

The standard HTMLParser from html.parser doesn't validate HTML markup; it only tokenizes the content of the string.
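Because HTMLParser only reports tokens, any tag matching has to be layered on top of it. The following is a minimal sketch of that idea, not a real validator: the names TagBalanceChecker and find_tag_errors are hypothetical, and the check is stricter than the HTML specification, which allows some end tags (for example </p> and </li>) to be omitted.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records unbalanced tags while HTMLParser tokenizes."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of elements that are currently open
        self.errors = []     # human-readable problems found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"unexpected closing tag </{tag}>")


def find_tag_errors(html):
    """Return a list of tag-balance problems; an empty list means none were found."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    errors = list(checker.errors)
    # Anything still on the stack was opened but never closed.
    errors.extend(f"unclosed element <{tag}>" for tag in checker.open_tags)
    return errors


print(find_tag_errors("<h1> hi"))       # ['unclosed element <h1>']
print(find_tag_errors("<h1> hi</h1>"))  # []
```

This only checks that start and end tags nest properly. It says nothing about doctypes, required child elements, or attribute syntax, so an empty result does not mean the string is valid HTML.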

You might want to take a look at py_w3c. The module doesn't appear to be actively maintained, but it is effective at identifying errors:

from py_w3c.validators.html.validator import HTMLValidator


val = HTMLValidator()
val.validate_fragment("<h1> hey yo")

for error in val.errors:
    print(error.get("message"))

$ python3.7 html-parser.py
Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
Element “head” is missing a required instance of child element “title”.
End of file seen and there were open elements.
Unclosed element “h1”.

1 Comment

This solution (py_w3c) sends the HTML to the W3C server, so it is not usable offline and generates unnecessary traffic.
