0

Given how flexible HTML is, extracting paragraphs as a user would see them in a browser seems to be a non-trivial task.

For now, I have a solution that is not very robust:

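import lxml.etree

# Accept either raw HTML (str or bytes) or an already-parsed tree.
tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser()) if isinstance(html, (str, bytes)) else html
# Drop elements whose content is not rendered as paragraph text.
for skiptag in ('//script', '//iframe', '//style',
                '//link', '//meta', '//noscript', '//option'):
    for node in tree.xpath(skiptag):
        # Note: remove() also discards the node's tail text.
        node.getparent().remove(node)
paragraphs = lxml.etree.tostring(tree, encoding='unicode', method='text')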

The main problem I am facing is how to handle irregular (free-style) markup.

A common case is that several paragraphs are written on a single line of HTML (as in the sample below). Because the text serialization inserts no separator between adjacent elements, my code merges them into one paragraph.

<p>bla, bla 1.</p><p><u><span class="colored"><strong>bla, bla 2.</strong></span></u></p><p>bla, bla. 3;</p><p>bla, bla. 3</p> 

My questions are:

  • Is there a good general way to parse out paragraphs correctly?
  • In this particular case, how should I change my code to get the paragraphs from a single line of HTML correctly, given that <p> is not the only way free-style markup can represent a paragraph?
  • Any general advice?

2 Answers

3

Use the xpath method to loop over all paragraphs:

for para in tree.xpath("//p"):
    ...
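
To make the loop body concrete, here is a minimal sketch, assuming Python 3 and lxml. The function name extract_paragraphs and the PARAGRAPH_TAGS list are illustrative choices, not part of lxml; the list widens the "//p" query to a few other tags that often act as paragraphs, and the XPath keeps only the innermost match so nested containers are not reported twice.

```python
import lxml.etree

# Assumption: tags treated as paragraph containers; adjust for your pages.
PARAGRAPH_TAGS = ("p", "li", "h1", "h2", "h3", "h4", "h5", "h6")


def extract_paragraphs(html):
    """Return the text of each innermost paragraph-like element, in document order."""
    tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
    is_para = " or ".join("self::%s" % tag for tag in PARAGRAPH_TAGS)
    has_inner = " or ".join(".//%s" % tag for tag in PARAGRAPH_TAGS)
    # Select paragraph-like elements that contain no other paragraph-like element.
    xpath = "//*[%s][not(%s)]" % (is_para, has_inner)
    paragraphs = []
    for para in tree.xpath(xpath):
        # itertext() yields the text of the element and all of its descendants.
        text = " ".join("".join(para.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


sample = ('<p>bla, bla 1.</p><p><u><span class="colored"><strong>bla, bla 2.'
          '</strong></span></u></p><p>bla, bla. 3;</p><p>bla, bla. 3</p>')
for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

One limitation of this sketch: text that sits directly inside an outer container next to a nested paragraph (for example, loose text in an <li> that also holds a <p>) is skipped, and markup that fakes paragraphs with <div> or <br> is not covered at all.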

1

Have a look at html2text.

It may not do exactly what you want, but it is only a 500-line script, so it should be fairly easy to adapt to your particular needs.
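
For a sense of what such an adaptation might look like, here is a small sketch that is not html2text itself: it uses only the standard library's html.parser (Python 3) in an event-driven style and starts a new paragraph at every block-level tag. The names ParagraphExtractor, html_to_paragraphs, BLOCK_TAGS and SKIP_TAGS are hypothetical, and the tag sets are assumptions to tune for your pages.

```python
from html.parser import HTMLParser

# Assumption: tags whose start or end marks a paragraph boundary.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose text should not appear in the output (from the question's skip list).
SKIP_TAGS = {"script", "style", "iframe", "noscript", "option"}


class ParagraphExtractor(HTMLParser):
    """Collect visible text, splitting it at block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, collapse whitespace, and keep non-empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


print(html_to_paragraphs("<div>first<script>x</script> part<br>second part</div>"))
```

Inline tags such as <u>, <span> and <strong> are ignored, so their text stays inside the surrounding paragraph, while one-line input like <p>a</p><p>b</p> still yields separate entries.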
