I am trying to remove the text inside angle brackets (HTML tags) and write the result to a new file. For example, one line of text could be:
< asdf> Text <here>more text< /asdf >
So the program should write "Text more text" to the output file, leaving out everything inside the HTML tags.
This is my attempt so far:
import re
import urllib.request

data = urllib.request.urlopen("some website").read()
text1 = data.decode("utf-8")

def asd(text1):
    x = re.compile("<>")
    y = re.sub(x, "", text1)
    file1 = open("textfileoutput.txt", "w")
    file1.write(y)
    return y

asd(text1)
It doesn't seem to write the clean version; the output still has the tags. Thank you for your help.
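
A possible direction, offered as a sketch rather than a definitive fix: the pattern "<>" matches only the literal two characters "<>" with nothing between them, so tags that have content are never touched. A pattern such as "<[^>]*>" matches "<", then any run of characters other than ">", then ">". The names strip_tags, write_clean, and TAG_PATTERN below are made up for illustration, and the approach assumes the tags never contain a ">" inside an attribute value.

```python
import re

# "<", then any characters that are not ">", then the closing ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Return text with every <...> span removed."""
    return TAG_PATTERN.sub("", text)

def write_clean(text, path="textfileoutput.txt"):
    """Strip tags and write the result to path (hypothetical helper)."""
    cleaned = strip_tags(text)
    # The with-block closes the file, so the data is flushed to disk.
    with open(path, "w", encoding="utf-8") as out:
        out.write(cleaned)
    return cleaned

sample = "< asdf> Text <here>more text< /asdf >"
print(repr(strip_tags(sample)))
```

The space after the first tag is kept, so calling .strip() on the result may be wanted. A regular expression is not a full HTML parser; for real pages with comments or script blocks, the standard library's html.parser module is likely a more robust choice.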