
When parsing HTML with BeautifulSoup or PyQuery, they rely on a parser such as lxml or html5lib. Let's say I have a file containing the following:

<span>  é    and    ’  </span>

In my environment they come out incorrectly encoded. Using PyQuery:

>>> doc = pq(filename=PATH, parser="xml")
>>> doc.text()
'é and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html")
>>> doc.text()
'Ã\x83© and ââ\x82¬â\x84¢'
>>> doc = pq(filename=PATH, parser="soup")
>>> doc.text()
'é and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html5")
>>> doc.text()
'é and â\u20ac\u2122'
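
For reference, here is the equivalent check through BeautifulSoup directly, feeding it the raw bytes so it can detect the encoding on its own (a minimal sketch, not something shown above):

from bs4 import BeautifulSoup

with open(PATH, 'rb') as f:
    # bytes in, let BeautifulSoup sniff the encoding itself
    soup = BeautifulSoup(f.read(), 'html.parser')

print(soup.span.get_text())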

Beyond the fact that the encoding seems incorrect, one of the main problems is that doc.text() returns an instance of str instead of bytes, which doesn't seem normal according to the question I asked yesterday.
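
For what it's worth, here is a quick check of the bytes-versus-str behaviour at the lxml level (a minimal sketch, using the same PATH):

from lxml import etree

tree = etree.parse(PATH)
root = tree.getroot()

# Serialising gives bytes by default, but the decoded serialisation is a str:
print(type(etree.tostring(root)))                      # <class 'bytes'>
print(type(etree.tostring(root, encoding='unicode')))  # <class 'str'>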

Also, passing encoding='utf-8' to PyQuery seems to have no effect; I tried 'latin1' as well and nothing changed. I also tried adding some metadata, because I read that lxml uses it to figure out which encoding to apply, but it doesn't change anything either:

<!DOCTYPE html>
<html lang="fr" dir="ltr">
<head>
<meta http-equiv="content-type" content="text/html;charset=latin1"/>
<span>  é    and    ’  </span>
</head>
</html>  
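
For reference, lxml also lets you force the decoder at the parser level instead of relying on the meta tag. This is only a sketch of what I understand that to look like, assuming the file really is UTF-8:

from lxml import etree

# Override whatever the meta tag claims and decode the bytes as UTF-8.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(PATH, parser)

print(tree.getroot().findtext('.//span'))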

If I use lxml directly, the result is a bit different:

>>> from lxml import etree
>>> tree = etree.parse(PATH)
>>> tree.docinfo.encoding
'UTF-8'

>>> result = etree.tostring(tree.getroot(), pretty_print=False)
>>> result
b'<span>  &#233;    and    &#8217;  </span>'

>>> import html
>>> html.unescape(result.decode('utf-8'))
'<span>  é    and    \u2019  </span>\n'
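
If the file on disk matches what lxml reports, reading the raw bytes back should decode cleanly as well (a quick check, not something I ran above):

with open(PATH, 'rb') as f:
    raw = f.read()

print(raw)                  # é is b'\xc3\xa9' and ’ is b'\xe2\x80\x99' in UTF-8
print(raw.decode('utf-8'))  # should decode without error if the file is valid UTF-8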

Erf, this is driving me a bit crazy; any help would be appreciated.

  • I think the problem is in the filename=PATH, because when I run from pyquery import PyQuery as pq; html = '<span> é and ’ </span>'; doc = pq(html, parser='html'); print(doc.text()), it returns "é and '" – Commented Sep 1, 2018 at 11:49

1 Answer


I think I figured it out. Even though BeautifulSoup and PyQuery let you do it, it seems to be a bad idea to open a file containing special UTF-8 characters directly. What confused me the most is the '’' symbol, which my Windows terminal does not handle correctly. So the solution is to pre-process the file before parsing it:

from pyquery import PyQuery as pq


def pre_process_html_content(html_content, encoding=None):
    """Pre-process bytes coming from a file or a request."""
    if not isinstance(html_content, bytes):
        raise TypeError("html_content must be bytes, not " + str(type(html_content)))

    html_content = html_content.decode(encoding or 'utf-8')

    # Handle weird symbols here
    html_content = html_content.replace('\u2019', "'")

    return html_content


def sanitize_html_file(path, encoding=None):
    """Read a file as bytes and return its decoded, cleaned-up content."""
    with open(path, 'rb') as f:
        content = f.read()
    encoding = encoding or 'utf-8'

    return pre_process_html_content(content, encoding)


def open_pq(path, parser=None, encoding=None):
    """Shortcut for opening an HTML file with PyQuery."""
    content = sanitize_html_file(path, encoding)
    parser = parser or 'xml'

    return pq(content, parser=parser)


doc = open_pq(PATH)
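
As a side note, and not part of the fix above: if the remaining problem is only how the Windows console displays the text, reconfiguring stdout on Python 3.7+ may be enough. A sketch:

import sys

# Assumes Python 3.7+; asks the console stream to emit UTF-8 instead of the legacy code page.
sys.stdout.reconfigure(encoding='utf-8')
print(doc.text())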