Python removing website html tags not working

Question

I am trying to remove the text inside the <> (HTML tags) and write the result to a new file. For example, one line of text could be:

< asdf> Text <here>more text< /asdf >

The program should then write "Text more text" to the output file, leaving out everything inside the HTML tags.

This is my attempt so far:

import urllib.request

data=urllib.request.urlopen("some website").read()
text1=data.decode("utf-8")

import re

def asd(text1):
    x=re.compile("<>")
    y=re.sub(x,"",text1)
    file1=open("textfileoutput.txt","w")
    file1.write(y)
    return y

asd(text1)

It doesn't seem to write the clean version; the output still has the tags. Thank you for your help.

Your regular expression will only match the literal string "<>". I suggest a solution like the one in "BeautifulSoup Grab Visible Webpage Text".
– Galen, Commented Dec 14, 2017 at 2:24
You are right. I fixed it by replacing that line with x=re.compile(r"<[^>]+>"). The program works now. Thank you.
– Jaakkath, Commented Dec 14, 2017 at 2:28
What if the tag contains a > somewhere in it? As alecxe pointed out, trying to parse HTML with regular expressions is usually not the best approach.
– Galen, Commented Dec 14, 2017 at 2:33

alecxe · Accepted Answer · 2017-12-14 02:25:55Z

2

x=re.compile("<>")

I am not sure why you think this expression would match < asdf> or < /asdf >. It matches only the literal two-character string <>.
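A quick way to confirm this is to run the pattern against the sample line from the question. This is only an illustrative check; the variable name sample is made up for the example.

```python
import re

# Sample line taken from the question.
sample = "< asdf> Text <here>more text< /asdf >"

# "<>" has nothing between the brackets, so it matches only the
# two-character string "<>" and never a tag with a name inside it.
matches = re.findall("<>", sample)
unchanged = re.sub("<>", "", sample)

print(matches)              # no matches in the sample line
print(unchanged == sample)  # the substitution leaves the line as it was
print(re.sub("<>", "", "a<>b"))  # only a literal "<>" gets removed
```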

In any case, approaching HTML with regular expressions can rarely be justified. Use a more appropriate tool for the task - an HTML parser.

Example using BeautifulSoup and its unwrap() method:

In [1]: from bs4 import BeautifulSoup

In [2]: html = "<asdf>Text more text</asdf>"

In [3]: soup = BeautifulSoup(html, "html.parser")

In [4]: soup.asdf.unwrap()
Out[4]: <asdf></asdf>

In [5]: print(soup)
Text more text
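If installing BeautifulSoup is not an option, the standard library's html.parser module can probably do a similar job. This is a different technique from unwrap(), shown only as a rough sketch: TextExtractor and strip_tags are names invented for this example, and the second input is a made-up tag with a > inside a quoted attribute.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that keeps text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called with the text found between tags; tags never arrive here.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<asdf>Text more text</asdf>"))
print(strip_tags('<a title="a > b">link</a>'))
```

Note that this keeps every text node, including the contents of script and style elements, so a real page may need extra filtering.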


1 Comment

Sraw Over a year ago

For those who care about performance: BeautifulSoup is really slow, even when it uses lxml as the parser. If your HTML is definitely well-formed and you trust your regular expression, there is no problem with using it.

Jacky Wang · Accepted Answer · 2017-12-14 02:29:46Z

1

Simply replacing re.compile("<>") with re.compile(r"<[^<>]*>") is enough.
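Put together, the question's function with that pattern might look like the sketch below. The names TAG_RE, strip_tags_to_file and out_path are made up for this example, the sample line stands in for the downloaded page, and the file goes to the system temp directory only to keep the demo free of side effects.

```python
import os
import re
import tempfile

# Pattern from this answer: "<", then any run of non-bracket characters, then ">".
TAG_RE = re.compile(r"<[^<>]*>")


def strip_tags_to_file(text, path):
    """Remove tag-like spans from text, write the result to path, return it."""
    cleaned = TAG_RE.sub("", text)
    # "with" closes the file even if the write fails.
    with open(path, "w", encoding="utf-8") as out:
        out.write(cleaned)
    return cleaned


sample = "< asdf> Text <here>more text< /asdf >"
out_path = os.path.join(tempfile.gettempdir(), "textfileoutput.txt")
result = strip_tags_to_file(sample, out_path)
print(repr(result))
```

The leading space in the result comes from the sample line itself; call .strip() on the result if that matters.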


1 Comment

Galen Over a year ago

What if the tag contains a > somewhere in it?
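To make that concern concrete, here is a small sketch with a made-up tag whose quoted attribute value contains a >. The input string is an assumption chosen for illustration, not something from the question.

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")

# Hypothetical input: the ">" inside the quoted attribute ends the match early.
html = '<a title="a > b">link</a>'
leftover = TAG_RE.sub("", html)

# Part of the attribute leaks into the output instead of just "link".
print(repr(leftover))
```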
