Given the flexibility of HTML, extracting paragraphs as a user sees them in a browser seems to be a non-trivial task.
For now, I have a not-so-robust solution:
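import lxml.etree

# Accept either an HTML string or an already-parsed tree (Python 3 names used here).
tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser()) if isinstance(html, str) else html
# Drop elements whose content is never rendered as paragraph text.
for skiptag in ('//script', '//iframe', '//style',
                '//link', '//meta', '//noscript', '//option'):
    for node in tree.xpath(skiptag):
        # Note: remove() also discards the node's tail text.
        node.getparent().remove(node)
paragraphs = lxml.etree.tostring(tree, encoding='unicode', method='text')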
The problems I am facing are mainly about how to handle irregular (free-style) markup.
One quite common case is that many paragraphs are written on a single line of HTML (e.g. the code below), and my code parses them into one paragraph.
<p>bla, bla 1.</p><p><u><span class="colored"><strong>bla, bla 2.</strong></span></u></p><p>bla, bla. 3;</p><p>bla, bla. 3</p>
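One direction that might help with this case is to treat block-level tags, rather than line breaks in the source, as paragraph boundaries. The following is only a minimal sketch of that idea, not a robust solution. It uses the standard library's `html.parser` instead of lxml so that it is self-contained, and the names `BLOCK_TAGS`, `SKIP_TAGS`, `ParagraphExtractor`, and `extract_paragraphs` are hypothetical. The tag lists are assumptions and certainly incomplete; CSS can also change what a browser renders as a block, which this sketch ignores.

```python
from html.parser import HTMLParser

# Assumption: these tags start a new paragraph. Illustrative, not exhaustive.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose text content is not shown as paragraph text.
# Void tags such as <link> and <meta> carry no text, so they are not listed.
SKIP_TAGS = {"script", "style", "noscript", "iframe", "option"}


class ParagraphExtractor(HTMLParser):
    """Collect visible text, starting a new paragraph at every block-level tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, normalize whitespace, and keep non-empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Ignore text that sits inside a skipped element.
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return a list of paragraph strings found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last block tag
    return parser.paragraphs
```

Applied to the one-line sample above, `extract_paragraphs` returns four separate strings, one per `<p>` element, because inline tags such as `<u>`, `<span>`, and `<strong>` do not trigger a split. The same boundary idea could presumably be carried over to an lxml tree by walking the elements and splitting on the same tag set.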
My questions are:
- Is there a good general way to parse out paragraphs correctly?
- In this particular case, how should I change my code to correctly get paragraphs from a single line of HTML, given that not only <p> but many other constructs can represent a paragraph in free-style markup?
- Any general advice?