Python: strip html from text data

Question

My question is slightly related to: Strip HTML from strings in Python

I am looking for a simple way to strip HTML code from text. For example:

string = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(string)

This should then yield 'foo bar'.

Is there any simple tool to achieve this in Python? The HTML code could be nested.

I think you might want to use the accepted answer on the question you linked. How is what you're doing different?
– girasquid, Commented Jan 5, 2011 at 18:47
In the related question, the user wanted stripIt('<HTML_TAG>foo</HTML_TAG>') to yield foo, while in my case I want it to return ''.
– Jernej, Commented Jan 5, 2011 at 18:48
Right, my mistake. I didn't see the edit to your question and thought that "something" was the tag you wanted stripped out.
– girasquid, Commented Jan 5, 2011 at 18:49
Is "SOME_VALID_HTML_TAG" set to a particular tag? Do you want the outermost tag to be removed?
– milkypostman, Commented Jan 5, 2011 at 19:16

Hugh Bothwell · Accepted Answer · 2011-01-06 00:47:24Z

6

import lxml.html
import re

def stripIt(s):
    doc = lxml.html.fromstring(s)   # parse html string
    txt = doc.xpath('text()')       # ['foo ', ' bar']
    txt = ' '.join(txt)             # 'foo   bar'
    return re.sub(r'\s+', ' ', txt) # 'foo bar' (raw string avoids an invalid-escape warning)

s = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(s)

returns

foo bar

answered Jan 6, 2011 at 0:47

Hugh Bothwell


2 Comments

Vahid CH Over a year ago

I think lxml is better than the other modules; this works like a charm.

mmmdreg Over a year ago

This is good because there is only one space between the resulting 'foo' and 'bar', as OP requested. Some of the other solutions leave two spaces.

milkypostman · Accepted Answer · 2011-01-05 19:45:51Z

5

from bs4 import BeautifulSoup  # BeautifulSoup 4; the answer originally targeted BeautifulSoup 3

def removeTags(html, *tags):
    soup = BeautifulSoup(html, 'html.parser')
    for name in tags:
        # replace every matching element, including its contents, with nothing
        for element in soup.find_all(name):
            element.replace_with("")

    return soup


testhtml = '''
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>text here<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>'''

print(removeTags(testhtml, 'b', 'p'))

answered Jan 5, 2011 at 19:45

milkypostman


mmmdreg · Accepted Answer · 2013-05-24 03:04:53Z

4

You could use regex:

import re

def stripIt(s):
  txt = re.sub(r'<[^<]+?>.*?</[^<]+?>', '', s) # Remove each tag pair and the text it encloses
  return re.sub(r'\s+', ' ', txt)              # Normalize whitespace

However, I would prefer Hugh Bothwell's solution as it would be more robust than pure regex.
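
To illustrate why a parser tends to be more robust here, the sketch below (Python 3; the function name strip_it_regex is made up for this example) applies the same two substitutions to a flat input and to an input where a tag is nested inside a tag of the same name. The non-greedy match stops at the first closing tag it finds, so part of the nested markup is left behind.

```python
import re

def strip_it_regex(s):
    # Same two substitutions as in the regex answer, written with raw strings.
    txt = re.sub(r'<[^<]+?>.*?</[^<]+?>', '', s)
    return re.sub(r'\s+', ' ', txt)

# Flat markup: the tag pair and its contents are removed cleanly.
flat = strip_it_regex('foo <b> something </b> bar')

# Same-name nesting: the match ends at the inner </div>,
# so the outer closing tag and the text before it survive.
nested = strip_it_regex('a <div>x <div>y</div> z</div> b')

print(repr(flat))    # 'foo bar'
print(repr(nested))  # 'a z</div> b'
```

A pattern like this cannot count nesting depth, which is the part a real HTML parser handles for you.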

answered May 24, 2013 at 3:04

mmmdreg


Brent Newey · Accepted Answer · 2011-01-05 19:44:07Z

2

Try this solution:

from BeautifulSoup import BeautifulSoup

def stripIt(string, tag):
    soup = BeautifulSoup(string)

    rmtags = soup.findAll(tag)
    for t in rmtags:
        string = string.replace(str(t), '')
    return string

string = 'foo <p> something </p> bar'
print stripIt(string, 'p')
>>> foo  bar

string = 'foo <a>bar</a> baz <a>quux</a>'
print stripIt(string, 'a')
>>> foo  baz

Edit: This only works on validly nested tags, so for example:

string = 'blaz <div>baz <div>quux</div></div>'
print stripIt(string, 'div')
>>> blaz

string = 'blaz <a>baz <a>quux</a></a>'
print stripIt(string, 'a')
>>> blaz <a>baz </a>

edited Jan 5, 2011 at 19:44

answered Jan 5, 2011 at 19:34

Brent Newey


tobib · Accepted Answer · 2013-02-21 10:50:37Z

2

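If anyone has this problem and is already working with the Jinja templating language: you can use the filter striptags in templates and the function jinja2.filters.do_striptags() in your code. Note that striptags removes only the markup and keeps the text between the tags, so it matches the linked question more closely than this one.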

answered Feb 21, 2013 at 10:50

tobib


scoffey · Accepted Answer · 2011-01-05 19:47:49Z

You can take advantage of HTMLParser by overriding methods accordingly:

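from html.entities import name2codepoint
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):

    def __init__(self):
        # convert_charrefs=False so the charref/entityref handlers below are called
        super().__init__(convert_charrefs=False)
        self.text_parts = []  # per instance, so parsers do not share state
        self.depth = 0        # number of currently open tags

    def handle_data(self, data):
        # keep only text that is outside every tag
        if self.depth == 0:
            self.text_parts.append(data.strip())

    def handle_charref(self, ref):
        # ref is decimal ('65') or hexadecimal ('x41')
        if ref.lower().startswith('x'):
            data = chr(int(ref[1:], 16))
        else:
            data = chr(int(ref))
        self.handle_data(data)

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_entityref(self, ref):
        try:
            data = chr(name2codepoint[ref])
            self.handle_data(data)
        except KeyError:
            pass  # unknown entity name: drop it

    def get_stripped_text(self):
        return ' '.join(self.text_parts)

def strip_html_from_text(html):
    parser = HTMLStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.get_stripped_text()

def main():
    import sys
    html = sys.stdin.read()
    text = strip_html_from_text(html)
    print(text)

if __name__ == '__main__':
    main()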
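
One caveat with a depth counter like the one above: void elements such as <br> written without a trailing slash never produce an end tag, so the counter would stay above zero and all later text would be dropped. A possible Python 3 variant that skips those elements is sketched below. The names VOID_ELEMENTS, OutsideTextParser and text_outside_tags are made up for this example; only html.parser.HTMLParser comes from the standard library.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_ELEMENTS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'source', 'track', 'wbr'}

class OutsideTextParser(HTMLParser):
    """Collects only the text that is not enclosed in any element."""

    def __init__(self):
        # The default convert_charrefs=True decodes entities such as &amp;
        # before the text reaches handle_data.
        super().__init__()
        self.parts = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def text_outside_tags(html):
    parser = OutsideTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace into single spaces.
    return ' '.join(''.join(parser.parts).split())

print(text_outside_tags('foo <b> something </b> bar'))
print(text_outside_tags('one<br> two <i>x</i> three'))
```

This still assumes reasonably well-formed input: an unclosed non-void tag would hide everything after it.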
