220

I have a text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

using pure Python, with no external module I want to have this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content() but I need to achieve the same in pure Python using builtin or std library for 2.6+

How can I do that?

2
  • 2
    Any specific reason why you don't want to use an external module.? Commented Mar 12, 2012 at 6:08
  • 1
    no permissions to install modules on the server... Commented Mar 13, 2012 at 4:32

5 Answers 5

448

Using a regex

Using a regex, you can clean everything inside <> :

import re
# as per recommendation from @freylis, compile once only
CLEANR = re.compile('<.*?>') 

def cleanhtml(raw_html):
  cleantext = re.sub(CLEANR, '', raw_html)
  return cleantext

Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nsbm'. If that is the case, then you might want to write the regex as

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.

Using BeautifulSoup

You could also use BeautifulSoup additional package to find out all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup I recommend "lxml" as mentioned in alternative answers (much more robust than the default one (html.parser) (i.e. available without additional install).

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But it doesn't prevent you from using external libraries, so I recommend the first solution.

EDIT: To use lxml you need to pip install lxml.

Sign up to request clarification or add additional context in comments.

16 Comments

if you want to compile regexp, best way is compile outside function. In you exemple every call cleanhtml must be compile regexp again
BeautifulSoup is good when the markup is heavy, else try to avoid it as it's very slow.
Great answer. You forgot the colon at the end of def cleanhtml(raw_html) though :)
Nice answer. You might want to explicitly set your parser in BeautifulSoup, using cleantext = BeautifulSoup(raw_html, "html.parser").text
The first half of this answer should be removed because it is terribly wrong to try this. HTML needs to be parsed as a tree and understood that <script> and other tags can contain anything. I say this with the politest regard and c24b acknowledged this.
|
52

Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

1 Comment

This worked for me but be carefull of the html tags from autoclose type. Example : </br> I got a "ParseError: mismatched tag: line 1, column 9" cause this tag is close without being open before. This is the same for all html tags autoclosed.
42

Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

However, as lvc mentions xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

10 Comments

I like your regex approach, maybe it will be better if performance's an important factor.
And in addition, it works with strings not starting with an xml tag, it that would be the case
@DouglasCamata regex is not more performant than an xml parser.
It's worth noting that this will break if you have a text < in your document.
@PatrickT you need to export it - import xml.etree
|
9

There's a simple way to this in any C-like language. The style is not Pythonic but works with pure Python:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

    return out

The idea based in a simple finite-state machine and is detailed explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class(about smart debugging with python) I give you a link: https://www.udacity.com/course/software-debugging--cs259. It's free!

2 Comments

This will break on mismatched quotes, and is quite slow due to adding to the output character by character. But it ilustrates enough, that writing a primitive character-by-character parser isn't a big deal.
This answer is great for teaching HTML or Python, but misses a crucial point for production use: meeting standards is hard, and using a well-supported library can avoid weeks of research and/or bug-hunting in an otherwise healthy deadline.
-15
global temp

temp =''

s = ' '

def remove_strings(text):

    global temp 

    if text == '':

        return temp

    start = text.find('<')

    end = text.find('>')

    if start == -1 and end == -1 :

        temp = temp + text

    return temp

newstring = text[end+1:]

fresh_start = newstring.find('<')

if newstring[:fresh_start] != '':
    
    temp += s+newstring[:fresh_start]

remove_strings(newstring[fresh_start:])

return temp

1 Comment

Your answer is: a) awfully formated (violates pep8 for example), b) overkill because there are tools to do the same, c) prone to fail (what happens when html has > character in one of the attributes?), d) global in XXI century in such trivial case?

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.