Parsing out paragraphs in one line of html using python

Question

Given how flexible HTML is, extracting paragraphs the way a user sees them in a browser seems to be a non-trivial task.

For now, I have a not-so-robust solution:

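import lxml.etree

# Accept either raw HTML (text or bytes) or an already-parsed tree.
tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser()) if isinstance(html, (str, bytes)) else html
# Drop elements whose content is never shown as page text.
for skiptag in ('//script', '//iframe', '//style',
                '//link', '//meta', '//noscript', '//option'):
    for node in tree.xpath(skiptag):
        node.getparent().remove(node)
# Concatenate all remaining text nodes into one string.
paragraphs = lxml.etree.tostring(tree, encoding='unicode', method='text')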

The problems I am facing are mainly about how to handle irregular (free-style) markup.

One common case is that many paragraphs are written on a single line of HTML (e.g. the code below), and my code parses them into one paragraph.

<p>bla, bla 1.</p><p><u><span class="colored"><strong>bla, bla 2.</strong></span></u></p><p>bla, bla. 3;</p><p>bla, bla. 3</p>

My questions are:

Is there a good general way to parse out paragraphs correctly?
In this particular case, how should I change my code to correctly get the paragraphs from one line of HTML, given that <p> is not the only way to mark up a paragraph in free-style HTML?
Any general advice?

Raymond Hettinger · Accepted Answer · 2011-11-07 02:27:24Z

3

Use the xpath method to loop over all paragraphs:

for para in tree.xpath("//p"):
    ...
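
As a minimal sketch of what the loop body might do, the example below runs that xpath loop over the one-line sample from the question and collects one string per <p>. The variable names and the use of itertext() to flatten nested inline tags are assumptions for illustration, not part of the original answer.

```python
import lxml.etree

# Sample markup from the question: four paragraphs on a single line.
html = ('<p>bla, bla 1.</p><p><u><span class="colored"><strong>bla, bla 2.'
        '</strong></span></u></p><p>bla, bla. 3;</p><p>bla, bla. 3</p>')

tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())

# One entry per <p>; itertext() walks nested inline tags such as
# <u>, <span>, and <strong>, so their text is included.
paragraphs = [''.join(para.itertext()).strip() for para in tree.xpath('//p')]

for text in paragraphs:
    print(text)
```

This only covers paragraphs marked up with <p>; text separated by other block-level elements would need additional xpath expressions.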


ekhumoro · Accepted Answer · 2011-11-07 17:55:37Z

1

Have a look at html2text.

It may not do exactly what you want, but it's only a 500-line script, so it should be pretty easy to adapt to your particular needs.
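
To give a rough idea of what a small, adaptable script could look like, here is a sketch that uses only the standard library's html.parser. It is not html2text's code. The class name ParagraphExtractor, the helper extract_paragraphs, and the choice of which tags count as paragraph boundaries (BLOCK_TAGS) are all assumptions to adjust for your own pages.

```python
from html.parser import HTMLParser

# Assumption: these tags start or end a paragraph; extend the set as needed.
BLOCK_TAGS = {'p', 'div', 'br', 'li', 'tr', 'blockquote', 'pre',
              'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
# Tags whose contents are not shown as page text.
SKIP_TAGS = {'script', 'style', 'noscript', 'option', 'iframe'}


class ParagraphExtractor(HTMLParser):
    """Collect visible text, splitting at block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty paragraphs.
        text = ' '.join(''.join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Ignore text inside skipped elements such as <script>.
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = '<div>intro<p>one</p><p>two <b>bold</b></p><script>x=1</script>tail</div>'
print(extract_paragraphs(sample))
```

Because the boundaries come from a tag set rather than from <p> alone, text that sits directly inside a <div> or after a <br> is also split into its own paragraph.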
