0

I want to remove all the tags from an HTML file using Python's re module. For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>', '', string). The result is an empty string, because the greedy pattern matches from the first opening angle bracket to the last closing one and removes everything in between. How can I get around this?

5 Answers

1

You can make the match non-greedy: '<.*?>'

You also need to be careful: HTML is a crafty beast and can thwart your regexes.
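To make the difference concrete, here is a small sketch comparing the greedy pattern from the question with the non-greedy one. The function name strip_tags_regex and the sample strings are made up for illustration; the last call shows one way HTML can still trip up the non-greedy version (a ">" inside an attribute value).

```python
import re

def strip_tags_regex(text):
    # Non-greedy: each match stops at the first '>' after a '<'.
    return re.sub(r'<.*?>', '', text)

line = '<h1>Hello World!</h1>'

# Greedy pattern from the question: one match spans the whole line.
greedy = re.sub(r'<.*>', '', line)

# Non-greedy pattern: each tag is matched separately.
non_greedy = strip_tags_regex(line)

# Pitfall: a '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the result.
broken = strip_tags_regex('<a title="a > b">link</a>')

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello World!'
print(repr(broken))      # ' b">link'
```

So the non-greedy pattern fixes the example in the question, but it is only safe on input you control.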


1

Parse the HTML using BeautifulSoup, then only retrieve the text.

1 Comment

Is BeautifulSoup a Python module, or what is it?
1

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
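If installing a third-party library is not an option, the same "parse, then keep only the text" idea can be sketched with the standard library's html.parser. This is a swapped-in alternative, not lxml itself, and the names TextExtractor and strip_tags are invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes; tags are parsed and discarded."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags, never for the tags themselves.
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

print(strip_tags('<h1>Hello World!</h1>'))
print(strip_tags('<p>1 &lt; 2</p>'))
```

Because the parser tracks where tags start and end, an escaped angle bracket in the text is treated as text rather than as part of a tag. For messy real-world pages, lxml or BeautifulSoup are likely to be more forgiving.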


1

Use a parser, either lxml or BeautifulSoup:

import lxml.html

mystring = '<h1>Hello World!</h1>'  # sample input taken from the question
print(lxml.html.fromstring(mystring).text_content())  # Python 3 print() call

Related questions:

Using regular expressions to parse HTML: why not?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms


0

Beautiful Soup is great for parsing HTML!

You might not need it right now, but it is worth learning. It will help you in the future too.
