I have an HTML file encoded in UTF-8. I want to output it to a text file, also encoded in UTF-8. Here's the code I'm using:
import codecs
import re
IN = codecs.open("E2P3.html","r",encoding="utf-8")
codehtml = IN.read()
#codehtml = codehtml.decode("utf-8")
texte = re.sub("<br>","\n",codehtml)
#texte = texte.encode("utf-8")
OUT = codecs.open("E2P3.txt","w",encoding="utf-8")
OUT.write(texte)
IN.close()
OUT.close()
As you can see, I've tried using both 'decode' and 'codecs'. Neither works: my output text file still shows up as Occidental (Windows-1252), and some entities become gibberish. What am I doing wrong here?
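One way to tell whether the file on disk really is UTF-8, independent of what an editor guesses when opening it, is to read the raw bytes back and try to decode them. Below is a minimal sketch of that check, assuming a reasonably recent Python; the helper names `convert_br_to_newlines` and `is_valid_utf8` are made up for illustration and are not part of any library.

```python
import codecs
import re


def convert_br_to_newlines(src_path, dst_path):
    """Read UTF-8 HTML, replace <br> tags with newlines, write UTF-8 text."""
    with codecs.open(src_path, "r", encoding="utf-8") as src:
        html = src.read()
    text = re.sub("<br>", "\n", html)
    with codecs.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


def is_valid_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If `is_valid_utf8("E2P3.txt")` returns True after the conversion, the bytes were written correctly, and the Windows-1252 label probably comes from how the viewing program guesses the encoding of a file that carries no byte order mark.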