1

Possible Duplicate:
Strip html from strings in python

While making a small browser-like application, I ran into the problem of splitting out the text between the different tags. Consider the string

<html> <h1> good morning </h1> welcome </html>

I need the following output: ['good morning','welcome']

How can I do that in Python?


3 Answers

3

I would use xml.etree.ElementTree:

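import xml.etree.ElementTree as ET

def get_text(element):
    # itertext() walks the element and all of its descendants,
    # yielding each text and tail string in document order
    for text in element.itertext():
        text = text.strip()
        if text:  # skip whitespace-only strings between tags
            yield text

root = ET.fromstring('<html> <h1> good morning </h1> welcome </html>')
print(list(get_text(root)))  # ['good morning', 'welcome']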
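
One caveat: ET.fromstring only accepts well-formed XML, so real-world HTML with an unclosed tag such as <br> raises ET.ParseError. Here is a minimal sketch of guarding against that. The helper name texts_or_none is made up for illustration, and returning None on failure is just one possible choice:

```python
import xml.etree.ElementTree as ET

def texts_or_none(markup):
    """Hypothetical helper: stripped text chunks, or None if markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Markup such as an unclosed <br> cannot be parsed as XML
        return None
    return [t.strip() for t in root.itertext() if t.strip()]

well_formed = texts_or_none('<html> <h1> good morning </h1> welcome </html>')
broken = texts_or_none('<html> <h1> good morning <br> </h1> welcome </html>')
```

If you expect input like the second string, an HTML parser is probably the safer choice.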

1

You can use one of Python's HTML/XML parsers.

Beautiful Soup is popular, and lxml is popular too.

Both of those are third-party packages; you could use the standard library too:

http://docs.python.org/library/xml.etree.elementtree.html
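
For HTML that is not well-formed XML, the standard library also has html.parser.HTMLParser (Python 3). A rough sketch of collecting the text with it follows. The class name TextCollector and the function extract_text are made up for this example and are not part of any library:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the non-blank text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return parser.chunks

result = extract_text('<html> <h1> good morning </h1> welcome </html>')
```

Unlike ElementTree, this does not require every tag to be closed, so something like extract_text('<p>one<br>two') should still work.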


0

I would use the Python library Beautiful Soup to achieve your goal. With its help it takes just a couple of lines:

from bs4 import BeautifulSoup
# Name the parser explicitly; 'html.parser' ships with Python
soup = BeautifulSoup('<html> <h1> good morning </h1> welcome </html>', 'html.parser')
print(list(soup.stripped_strings))  # ['good morning', 'welcome']
