Processing an HTML file using Python

Question

I want to remove all the tags from an HTML file, and I am using Python's re module to do it. For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>','',string). The result is an empty string, because the regexp matches from the first angle bracket to the last one and removes everything in between. How can I get around this?

Ned Batchelder · Accepted Answer · 2011-10-08 03:45:06Z

1

You can make the match non-greedy: '<.*?>'

You also need to be careful: HTML is a crafty beast and can thwart your regexes.
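To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the non-greedy one, using only the standard re module. The variable names (html, greedy, non_greedy, tricky) are made up for illustration, and the last case is just one example of how HTML can trip up a regex, not an exhaustive list.

```python
import re

html = '<h1>Hello World!</h1>'

# Greedy: .* runs to the LAST '>' on the line, so everything is removed.
greedy = re.sub(r'<.*>', '', html)

# Non-greedy: .*? stops at the FIRST '>', so each tag is removed separately.
non_greedy = re.sub(r'<.*?>', '', html)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello World!'

# A '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the output.
tricky = '<a title="a > b">link</a>'
print(repr(re.sub(r'<.*?>', '', tricky)))  # ' b">link'
```

The non-greedy version is fine for simple, well-behaved input like the line in the question. For arbitrary HTML, the last case is the kind of thing that makes a parser the safer choice.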

edited Oct 8, 2011 at 3:45

answered Oct 8, 2011 at 3:38


Sunjay Varma · Accepted Answer · 2011-10-08 03:36:08Z

1

Parse the HTML using BeautifulSoup, then only retrieve the text.

answered Oct 8, 2011 at 3:36


1 Comment

PaulDaviesC Over a year ago

Is BeautifulSoup a Python module, or what is it?

akonsu · Accepted Answer · 2011-10-08 03:39:55Z

1

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
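As a rough illustration of the parser-based route, here is a sketch that uses the standard library's html.parser instead of lxml, purely so it runs without installing anything. It is a swapped-in stand-in, not the lxml API. The names TextExtractor and strip_tags_parser are hypothetical helpers invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags; tag markup never reaches here.
        self.chunks.append(data)


def strip_tags_parser(html):
    """Return the text content of an HTML string (illustrative sketch)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags_parser('<h1>Hello World!</h1>'))      # Hello World!
print(strip_tags_parser('<a title="a > b">link</a>'))  # link
```

Because the parser understands quoted attribute values, the '>' inside title="a > b" is not mistaken for the end of the tag, which is the case a simple regex gets wrong.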

answered Oct 8, 2011 at 3:39


Community · Accepted Answer · 2017-05-23 12:27:32Z

1

Use a parser, either lxml or BeautifulSoup:

import lxml.html

# mystring is assumed to hold the HTML source as a string
print(lxml.html.fromstring(mystring).text_content())


varunl · Accepted Answer · 2011-10-08 06:22:30Z

0

Beautiful Soup is great for parsing HTML!

You might not need it now, but it's worth learning to use. It will help you in the future too.

answered Oct 8, 2011 at 6:22
