How to remove texts within html tags in python? [duplicate]

Question

Possible Duplicate:
Strip html from strings in python

While making a small browser-like application, I am facing the problem of splitting out the text between the different tags. Consider the string

<html> <h1> good morning </h1> welcome </html>

I need the following output: ['good morning','welcome']

How can I do that in Python?

mgilson · Accepted Answer · 2012-10-08 18:19:40Z

3

I would use xml.etree.ElementTree:

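import xml.etree.ElementTree as ET

def get_text(element):
    """Yield the stripped text and tail of each direct child of element."""
    for child in element:
        # child.text is the text inside the tag; child.tail follows its closing tag.
        for chunk in (child.text, child.tail):
            if chunk and chunk.strip():
                yield chunk.strip()

root = ET.fromstring('<html> <h1> good morning </h1> welcome </html>')
print(list(get_text(root)))  # ['good morning', 'welcome']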
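
The generator above only looks at the direct children of the root, so text inside deeper nesting would be missed. A minimal sketch of one way around that, assuming the input is well-formed XML, is to lean on `Element.itertext()`, which walks the whole subtree in document order. The helper name `extract_text` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def extract_text(markup):
    """Return the non-empty, whitespace-stripped text fragments in document order."""
    # ET.fromstring raises ParseError if the markup is not well-formed XML.
    root = ET.fromstring(markup)
    # itertext() yields every text and tail string inside root, at any depth.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]

print(extract_text('<html> <h1> good morning </h1> welcome </html>'))
# ['good morning', 'welcome']
```

Because ElementTree is an XML parser, this will likely fail on real-world HTML with unclosed tags such as `<br>`; it is only a fit when the markup is known to be well formed.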

dm03514 · Accepted Answer · 2012-10-08 18:10:50Z

1

You can use one of Python's HTML/XML parsers.

Beautiful Soup is popular. lxml is popular too.

Those are third-party packages; you could use the standard library too:

http://docs.python.org/library/xml.etree.elementtree.html
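
As a rough illustration of the standard-library route, here is a sketch that uses `html.parser.HTMLParser` rather than the ElementTree module linked above, on the assumption that the input may be HTML that is not well-formed XML. The class name `TextCollector` and the helper `html_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the stripped text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)

def html_text(markup):
    """Feed markup through the parser and return the collected text chunks."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.chunks

print(html_text('<html> <h1> good morning </h1> welcome </html>'))
# ['good morning', 'welcome']
```

Note that `handle_data` also receives the contents of `<script>` and `<style>` elements, so a fuller version would probably want to track the current tag and skip those.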

halex · Accepted Answer · 2012-10-08 18:29:44Z

0

I would use the Python library Beautiful Soup to achieve your goal. It takes just a couple of lines:

from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup('<html> <h1> good morning </h1> welcome </html>', 'html.parser')
print(list(soup.stripped_strings))
