Python: Separating an HTML snippet into paragraphs

Question

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:

'''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''

Should become:

['<p class="my_class">Hello!</p>',
 "<p>What's up?</p>",
 '<p style="whatever: whatever;">Goodbye!</p>']

What would be a good way to approach this?

Very near (or even identical, if you will) duplicate here: stackoverflow.com/questions/972749/… Quick answer: use BeautifulSoup.
– ChristopheD, Commented Feb 16, 2010 at 22:27

Crast · Accepted Answer · 2010-02-16 23:02:12Z

5

If your string only contains paragraphs, you may be able to get away with a nicely crafted regex and re.split(). However, if your string is more complex HTML, or not always valid HTML, you might want to look at the BeautifulSoup package.

Usage goes like:

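# BeautifulSoup 4 (the bs4 package) spelling of the original BeautifulSoup 3 / Python 2 code
from bs4 import BeautifulSoup

# some_html is the HTML string to split
soup = BeautifulSoup(some_html, "html.parser")

# serialize each <p> element back to a string
paragraphs = [str(p) for p in soup.find_all("p")]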
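For the simple case mentioned at the start, where the string holds nothing but well-formed, non-nested p tags, a regex sketch might look like the following. It uses re.findall() rather than re.split(), and the names P_ELEMENT and split_paragraphs are made up for illustration. It is not a general HTML parser and will break on nested or unclosed paragraphs, or on attribute values containing ">".

```python
import re

# Matches one <p ...>...</p> element. Assumes well-formed, non-nested
# paragraphs; \b keeps it from matching tags such as <pre>.
P_ELEMENT = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL | re.IGNORECASE)


def split_paragraphs(html):
    """Return every <p>...</p> element found in html as its own string."""
    return P_ELEMENT.findall(html)


snippet = '''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''

paragraphs = split_paragraphs(snippet)
for paragraph in paragraphs:
    print(paragraph)
```

Anything messier than this is better handled by a real parser such as BeautifulSoup.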

edited Feb 16, 2010 at 23:02

answered Feb 16, 2010 at 22:28

1 Comment

Mike Graham Over a year ago

Regular expressions are the wrong tool for this. HTML is not a regular language, so regexes are inherently unable to parse it. Using an HTML parser, as you show in the latter part of your post, is more robust as well as easier and more readable.

Mike Graham · Accepted Answer · 2010-02-16 22:33:20Z

2

Use lxml.html to parse the HTML into the form you want. This is essentially the same advice as from the people recommending BeautifulSoup, except that lxml is still being actively developed and BeautifulSoup development has slowed.

answered Feb 16, 2010 at 22:33

Lukáš Lalinský · Accepted Answer · 2010-02-16 22:27:27Z

0

Use BeautifulSoup to parse the HTML and iterate over the paragraphs.

answered Feb 16, 2010 at 22:27

4 Comments

dubiousjim Over a year ago

BeautifulSoup also works but is only necessary if the HTML might be ugly/invalid. The stdlib etree can also do this. I prefer lxml because it's more powerful. At one point there was talk of including BeautifulSoup in it; I don't know where that's gone.

Lukáš Lalinský Over a year ago

xml.etree can parse XML, which the code in the question is not.

dubiousjim Over a year ago

I believe I've used it to parse html. Maybe I'm misremembering. But this seems to confirm my memory: effbot.org/zone/element-index.htm#usage

dubiousjim Over a year ago

or maybe the issue is that we only have a snippet here...?

dubiousjim · Accepted Answer · 2010-02-16 22:27:54Z

0

Either xml.etree (std lib) or lxml.etree (enhanced) makes this easy to do, but I'm not going to get the answer cred for this because I don't remember the exact syntax. I keep mixing it up with similar packages and have to look it up afresh every time.
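As a rough sketch of the stdlib route, the following uses xml.etree.ElementTree. It only works when the snippet happens to be well-formed XML, which the example in the question is; HTML entities such as &nbsp; or unclosed tags such as a bare br would make the parser raise an error. Because the snippet has several top-level elements, it is wrapped in a dummy root element first. The function name split_paragraphs_etree and the wrapper tag name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_paragraphs_etree(snippet):
    """Return each <p> element of an XML-compatible snippet as a string."""
    # An XML document needs exactly one top-level element, so wrap the
    # snippet in a throwaway root before parsing.
    root = ET.fromstring("<root>" + snippet + "</root>")
    paragraphs = []
    for p in root.iter("p"):
        # Drop the whitespace that follows the closing tag; otherwise
        # tostring() would append it to the serialized element.
        p.tail = None
        paragraphs.append(ET.tostring(p, encoding="unicode"))
    return paragraphs


snippet = '''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''

paragraphs = split_paragraphs_etree(snippet)
for paragraph in paragraphs:
    print(paragraph)
```

For HTML that is not valid XML, an HTML-aware parser such as lxml.html or BeautifulSoup is the safer choice.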

answered Feb 16, 2010 at 22:27
