I have an HTML file encoded in UTF-8. I want to output it to a text file, also encoded in UTF-8. Here's the code I'm using:

import codecs
import re
IN = codecs.open("E2P3.html","r",encoding="utf-8")
codehtml = IN.read()

#codehtml = codehtml.decode("utf-8") 

texte = re.sub("<br>","\n",codehtml)

#texte = texte.encode("utf-8") 

OUT = codecs.open("E2P3.txt","w",encoding="utf-8")
OUT.write(texte)

IN.close()
OUT.close()

As you can see, I've tried using both 'decode' and 'codecs'. Neither works: my output text file is detected as Occidental (Windows-1252) and some characters become gibberish. What am I doing wrong here?

  • Why do you think that the output file is encoded as Windows-1252? Are you using an editor that can't detect a UTF-8 file without a BOM? Commented Feb 15, 2014 at 21:47

1 Answer

When opening a UTF-8 file with the codecs module, as you did, the contents of the file are automatically decoded into Unicode strings, so you must not try to decode them again.

The same is true when writing the file; if you write it using the codecs module, the Unicode string you're passing will automatically be encoded to whatever encoding you specified.

To make it explicit that you're dealing with Unicode strings, it might be a better idea to use Unicode literals, as in

texte = re.sub(u"<br>", u"\n",codehtml)

although it doesn't really matter in this case (which could also be written as

texte = codehtml.replace(u"<br>", u"\n")

since you're not actually using a regular expression).
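The points above can be put together in a minimal sketch. It uses `io.open` rather than `codecs.open`, which is an assumption on my part (it behaves the same way for this purpose on both Python 2 and 3); the function name `br_to_newlines` and the sample file names are hypothetical. The file is decoded once on read and encoded once on write, with no manual `decode`/`encode` calls in between.

```python
import io
import os
import tempfile


def br_to_newlines(src_path, dst_path):
    """Read UTF-8 HTML, turn <br> into newlines, write UTF-8 text."""
    # Opening with an encoding decodes the bytes to Unicode on read.
    with io.open(src_path, "r", encoding="utf-8") as src:
        html = src.read()
    # Plain string replacement is enough; no regular expression needed.
    text = html.replace(u"<br>", u"\n")
    # Opening with an encoding encodes the Unicode string on write.
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text


# Hypothetical sample input containing U+2019 (right single quote).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "sample.html")
dst_path = os.path.join(workdir, "sample.txt")
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"It\u2019s here<br>next line")

result = br_to_newlines(src_path, dst_path)

# Inspect the raw bytes to check what was actually written.
with open(dst_path, "rb") as f:
    raw = f.read()
```

Looking at `raw` instead of opening the file in an editor removes the editor's encoding guess from the picture: U+2019 should appear as the three bytes `E2 80 99`.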

If the application doesn't recognize the UTF-8 file, it might help to save it with a BOM (Byte Order Mark). This is generally discouraged, but if the application can't recognize a UTF-8 file otherwise, it's worth a try:

OUT = codecs.open("E2P3.txt","w",encoding="utf-8-sig")
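
To see what the "utf-8-sig" codec changes, here is a small sketch that compares the bytes produced by the two codecs. The sample string is made up for illustration; the only difference should be the three BOM bytes at the start.

```python
import codecs

text = u"It\u2019s"  # hypothetical sample containing U+2019

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # BOM (EF BB BF) prepended

# The payload is identical; only the leading marker differs.
same_payload = with_bom == codecs.BOM_UTF8 + plain

# Reading back with "utf-8-sig" strips the BOM again, whereas plain
# "utf-8" keeps it as a leading U+FEFF character.
round_trip = with_bom.decode("utf-8-sig")
kept_bom = with_bom.decode("utf-8")
```

Whether the consuming application honours the BOM is up to that application, so this is something to test rather than rely on.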

3 Comments

The problem I have isn't with the re module, though. The text contains characters such as ’ (or U+2019 in Unicode), and these characters become 'â€™' once I open the text file with other applications. So this means that the codecs module does not, actually, encode my file in utf-8. I just don't understand why.
â€™ is UTF-8 for U+2019! More precisely, it is what the three UTF-8 bytes of U+2019 look like when they are read as Windows-1252. If you see these characters, it means that whatever editor you're using thinks it's reading a Windows-1252 file. The editor is wrong, not the file.
Oh. Well that explains a lot then! I'm supposed to use that text file with a text analysis program (not a text editor), so the problem probably comes from that program. I think I'll just replace or delete these entities, then. Thanks for your help!
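
The misreading discussed in these comments can be reproduced in a few lines. This is only an illustrative sketch: it encodes U+2019 as UTF-8 and then decodes the same bytes as Windows-1252 ("cp1252" in Python), which is what an application does when it guesses the wrong encoding.

```python
apostrophe = u"\u2019"  # right single quotation mark

# Correct UTF-8 encoding: three bytes, E2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# The same bytes interpreted as Windows-1252 yield three characters:
# U+00E2 (a with circumflex), U+20AC (euro sign), U+2122 (trade mark sign).
misread = utf8_bytes.decode("cp1252")

# Decoding with the right codec recovers the original character.
recovered = utf8_bytes.decode("utf-8")
```

Seeing this three-character sequence in the consuming program is therefore a sign that the file is valid UTF-8 and that the program is reading it with the wrong encoding.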
