Python code to remove HTML tags from a string [duplicate]

Question

I have a text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

Using pure Python, with no external module, I want to get this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content() but I need to achieve the same in pure Python using builtin or std library for 2.6+

How can I do that?

Any specific reason why you don't want to use an external module?
– RanRag, Commented Mar 12, 2012 at 6:08

Benjamin Loison · Accepted Answer · 2024-09-05 16:06:53Z

448

Using a regex

Using a regex, you can clean everything inside <> :

import re
# as per recommendation from @freylis, compile once only
CLEANR = re.compile('<.*?>') 

def cleanhtml(raw_html):
  cleantext = re.sub(CLEANR, '', raw_html)
  return cleantext

Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nbsp;'. If that is the case, then you might want to write the regex as

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.
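As a rough illustration of what the entity-aware pattern does, here is a minimal sketch. The sample string and the name `strip_then_decode` are made up for this example, and `html.unescape` assumes Python 3.4+ (it is not available on the 2.6 the question targets). Note that the pattern deletes entities rather than decoding them, and that it only matches lowercase names and hex digits as written.

```python
import html
import re

# Same pattern as above: tags, plus named, decimal, and hexadecimal entities.
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    # Entities are removed outright, not converted to characters.
    return CLEANR.sub('', raw_html)

def strip_then_decode(raw_html):
    # Hypothetical alternative: strip only the tags, then decode entities.
    return html.unescape(re.sub('<.*?>', '', raw_html))

sample = '<p>Fish&nbsp;&amp;&#32;chips</p>'  # made-up input
print(repr(cleanhtml(sample)))          # 'Fishchips'
print(repr(strip_then_decode(sample)))  # 'Fish\xa0& chips'
```

Whether deleting or decoding entities is preferable depends on what the cleaned text is used for.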

Using BeautifulSoup

You could also use the additional BeautifulSoup package to extract all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers; it is much more robust than the default one (html.parser, which is available without an additional install).

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But this still requires external libraries, so I recommend the first solution.

EDIT: To use lxml you need to pip install lxml.

edited Sep 5, 2024 at 16:06

Benjamin Loison


answered Oct 19, 2012 at 21:26

c24b



16 Comments

freylis Over a year ago

If you want to compile the regexp, the best way is to compile it outside the function. In your example, every call to cleanhtml has to compile the regexp again.

Ethan Over a year ago

BeautifulSoup is good when the markup is heavy, else try to avoid it as it's very slow.

bjesus Over a year ago

Great answer. You forgot the colon at the end of def cleanhtml(raw_html) though :)

Zemogle Over a year ago

Nice answer. You might want to explicitly set your parser in BeautifulSoup, using cleantext = BeautifulSoup(raw_html, "html.parser").text

ldmtwo Over a year ago

The first half of this answer should be removed because it is terribly wrong to try this. HTML needs to be parsed as a tree and understood that <script> and other tags can contain anything. I say this with the politest regard and c24b acknowledged this.


pdaawr · Accepted Answer · 2022-06-06 10:41:01Z

52

Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
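A short sketch of how this behaves, using made-up inputs: `fromstring()` expects well-formed XML with a single root element, so ordinary HTML with an unclosed tag such as `<br>` raises `ParseError` instead of returning text. `ParseError` assumes Python 2.7+ or 3.2+.

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    # fromstring() requires well-formed XML with exactly one root element.
    return ''.join(ET.fromstring(text).itertext())

well_formed = '<div><h1>Title</h1><p>A long text</p></div>'
print(remove_tags(well_formed))  # TitleA long text

# An unclosed <br> is valid HTML but not well-formed XML.
try:
    remove_tags('<div>line one<br>line two</div>')
except ET.ParseError as exc:
    print('ParseError:', exc)
```

Note that the text pieces are joined without separators; use `' '.join(...)` if you want spaces between them.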

edited Jun 6, 2022 at 10:41

pdaawr


answered Mar 12, 2012 at 6:04

lvc


1 Comment

1ronmat Over a year ago

This worked for me, but be careful with self-closing HTML tags. Example: </br>. I got a "ParseError: mismatched tag: line 1, column 9" because this tag is closed without having been opened before. The same applies to all self-closing HTML tags.

Benjamin Loison · Accepted Answer · 2024-09-05 16:07:13Z

42

Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
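To make the limitation concrete, here is a small sketch with made-up inputs. It shows the pattern working on simple markup, stopping early at a `>` inside an attribute value, and swallowing ordinary text that happens to sit between a literal `<` and a later `>`.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Simple markup is handled as expected.
print(repr(remove_tags('<p>Hello <b>world</b></p>')))  # 'Hello world'

# The match ends at the first '>', which here is inside the attribute value.
print(repr(remove_tags('<a title=">">link</a>')))      # '">link'

# A literal '<' in the text is treated as the start of a tag.
print(repr(remove_tags('a < b and c > d')))            # 'a  d'
```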

However, as lvc mentions xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:

import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
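If the input is not well-formed XML, another standard-library option, not covered by this answer, is the HTML parser module. The following is only a sketch, assuming Python 3 (`html.parser`; on Python 2 the module is named `HTMLParser` and the constructor takes no `convert_charrefs` argument). The class name `TextExtractor` is made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):  # hypothetical helper name
    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are skipped.
        self.parts.append(data)

def remove_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(remove_tags('<div><h1>Title</h1><p>text<br>more</p></div>'))
print(remove_tags('<a title=">">link</a>'))
```

Be aware that this sketch also returns the contents of `<script>` and `<style>` elements as text; filtering those would need extra state in `handle_starttag` and `handle_endtag`.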

edited Sep 5, 2024 at 16:07

Benjamin Loison


answered Mar 12, 2012 at 5:57

Amber


10 Comments

Douglas Camata Over a year ago

I like your regex approach, maybe it will be better if performance's an important factor.

kiril Over a year ago

And in addition, it works with strings that do not start with an XML tag, if that is the case.

Slater Victoroff Over a year ago

@DouglasCamata regex is not more performant than an xml parser.

Slater Victoroff Over a year ago

It's worth noting that this will break if you have a text < in your document.

Amber Over a year ago

@PatrickT you need to import it - import xml.etree.ElementTree


Igor Medeiros · Accepted Answer · 2019-10-10 20:43:35Z

9

There's a simple way to do this in any C-like language. The style is not Pythonic, but it works with pure Python:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
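As a possible variation on the same state machine (a sketch, not the version from the videos), the function can remember which quote character opened an attribute value, so that an apostrophe inside a double-quoted value does not flip the state, and it can collect characters in a list and join them once at the end. The variable names `in_tag` and `quote_char` are chosen for this example.

```python
def remove_html_markup(s):
    in_tag = False
    quote_char = None  # the quote that opened the current attribute value
    out = []

    for c in s:
        if in_tag:
            if quote_char:
                # Inside a quoted value: only the matching quote ends it.
                if c == quote_char:
                    quote_char = None
            elif c in ('"', "'"):
                quote_char = c
            elif c == '>':
                in_tag = False
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)

print(remove_html_markup('<a title=">">link</a>'))      # link
print(remove_html_markup('<a title="it\'s">x</a>'))     # x
print(remove_html_markup('<b>foo</b> "bar"'))           # foo "bar"
```

Like the original, this still fails on an unterminated quote inside a tag, since everything after it is treated as part of the attribute value.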

PS - If you're interested in the class (about smart debugging with Python), here is the link: https://www.udacity.com/course/software-debugging--cs259. It's free!

edited Oct 10, 2019 at 20:43

answered Jan 22, 2013 at 17:27

Igor Medeiros


2 Comments

Tomasz Gandor Over a year ago

This will break on mismatched quotes, and is quite slow due to adding to the output character by character. But it illustrates well enough that writing a primitive character-by-character parser isn't a big deal.

jpaugh Over a year ago

This answer is great for teaching HTML or Python, but misses a crucial point for production use: meeting standards is hard, and using a well-supported library can avoid weeks of research and/or bug-hunting in an otherwise healthy deadline.

Benjamin Loison · Accepted Answer · 2024-09-05 16:07:58Z

-15

temp = ''
s = ' '

def remove_strings(text):
    # Accumulates output in the module-level `temp`; reset it before reuse.
    global temp
    if text == '':
        return temp
    start = text.find('<')
    end = text.find('>')
    if start == -1 and end == -1:
        # No tags left: keep the remaining text as-is.
        temp = temp + text
        return temp
    # Skip past the current tag and keep the text up to the next one.
    newstring = text[end+1:]
    fresh_start = newstring.find('<')
    if newstring[:fresh_start] != '':
        temp += s + newstring[:fresh_start]
    remove_strings(newstring[fresh_start:])
    return temp

edited Sep 5, 2024 at 16:07

Benjamin Loison


answered Feb 25, 2013 at 9:39

user1899895


1 Comment

Drachenfels Over a year ago

Your answer is: a) awfully formatted (it violates PEP 8, for example), b) overkill, because there are tools that do the same, c) prone to fail (what happens when the HTML has a > character in one of the attributes?), d) a global in the 21st century, in such a trivial case?
