Validating whether a string is valid HTML in Python?

Question

What is the best way to check whether a string contains valid HTML with correct syntax?

I tried HTMLParser from the html.parser module, on the assumption that if parsing produces no error, the string is valid HTML. However, that didn't help, because it parses even invalid strings without raising any errors.

from html.parser import HTMLParser

parser = HTMLParser()

parser.feed('<h1> hi')
parser.close()

I expected it to throw some exception or error since the closing tag is missing but it didn't.

Which Python version do you use? docs.python.org/3/library/html.parser.html - quote: "This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.". You can also read this answer: stackoverflow.com/questions/24749103/…
– s3n0, Commented Jul 4, 2019 at 11:39
@s3n0 I used Python 3. I hadn't seen that documentation. Is there some other library that is recommended in such cases?
– Sumit, Commented Jul 4, 2019 at 11:44
Of course... as I mentioned before... read this answer please: stackoverflow.com/a/27174001/9808870 ...it might help if you need it.
– s3n0, Commented Jul 4, 2019 at 11:46
The answer in the above link only checks whether the tag is self-closing or not. What I want is to find out whether a string is valid HTML text. @s3n0
– Sumit, Commented Jul 4, 2019 at 11:52
Yes, the same module may be used :) (from bs4 import BeautifulSoup). Read this question+answer: stackoverflow.com/questions/24856035/…
– s3n0, Commented Jul 4, 2019 at 12:14

Rahul Verma · Accepted Answer · 2019-07-04 11:57:40Z

3

    from bs4 import BeautifulSoup

    st = """<html>
    <head><title>I'm title</title></head>
    </html>"""
    st1 = "who are you"

    # find() returns the first tag in the document, or None if there is no tag
    print(bool(BeautifulSoup(st, "html.parser").find()))   # True
    print(bool(BeautifulSoup(st1, "html.parser").find()))  # False


1 Comment

Guilherme Garnier Over a year ago

This doesn't work. It returns True for invalid html like <div>div> and <div<>div<

Carlos Damázio · Accepted Answer · 2019-07-04 12:41:27Z

3

The standard HTMLParser from html.parser doesn't validate HTML markup; it only tokenizes the string.
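To illustrate the tokenizing behaviour, here is a minimal sketch of a tag-balance check built on top of HTMLParser by keeping a stack of open elements. The names `TagBalanceChecker`, `find_tag_errors` and `VOID_ELEMENTS` are hypothetical helpers invented for this example, not part of the standard library. This is an assumption-laden approximation rather than a real validator: it is stricter than the HTML specification (which lets some end tags, such as `</p>` or `</li>`, be omitted) and it will not catch malformed text such as `<div>div>`.

```python
from html.parser import HTMLParser

# Void elements never take an end tag in HTML, so they are not tracked.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records mismatched or unclosed tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of elements that are currently open
        self.errors = []     # human-readable problems found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"Unexpected end tag </{tag}>")


def find_tag_errors(markup):
    """Return a list of tag-balance problems (empty if none were found)."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    errors = list(checker.errors)
    # Anything still on the stack was opened but never closed.
    errors.extend(f"Unclosed element <{tag}>" for tag in checker.open_tags)
    return errors


print(find_tag_errors("<h1> hi"))       # ['Unclosed element <h1>']
print(find_tag_errors("<h1> hi </h1>")) # []
```

The parser itself still raises nothing; the errors come only from the bookkeeping added in the handler methods.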

You might want to take a look at py_w3c. The module doesn't appear to be actively maintained, but it is effective at identifying errors:

from py_w3c.validators.html.validator import HTMLValidator


val = HTMLValidator()
val.validate_fragment("<h1> hey yo")

for error in val.errors:
    print(error.get("message"))

$ python3.7 html-parser.py
Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
Element “head” is missing a required instance of child element “title”.
End of file seen and there were open elements.
Unclosed element “h1”.


1 Comment

Jan Wilmans Over a year ago

This solution (py_w3c) sends the HTML to the W3C server... not usable offline and generates unnecessary traffic
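If an offline check is needed, one rough option is to test whether the string is well-formed XML using the standard library. This is a swapped-in technique, not HTML validation: it assumes XHTML-style markup, so it rejects valid HTML that uses unclosed void elements such as `<br>` or named entities such as `&nbsp;`. The helper name `is_well_formed_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Hypothetical helper: True if markup parses as one XML element tree."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml("<h1> hi"))        # False: <h1> is never closed
print(is_well_formed_xml("<h1> hi </h1>"))  # True
print(is_well_formed_xml("<div>div>"))      # False: <div> is never closed
```

Nothing is sent over the network, but the strictness means this only suits input that is expected to follow XML rules.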
