What is the proper way to use codecs' encoding in Python?

Question

I have an HTML file encoded in utf-8. I want to output it to a text file, encoded in utf-8. Here's the code I'm using:

import codecs
import re
IN = codecs.open("E2P3.html","r",encoding="utf-8")
codehtml = IN.read()

#codehtml = codehtml.decode("utf-8") 

texte = re.sub("<br>","\n",codehtml)

#texte = texte.encode("utf-8") 

OUT = codecs.open("E2P3.txt","w",encoding="utf-8")
OUT.write(texte)

IN.close()
OUT.close()

As you can see, I've tried using both 'decode' and 'codecs'. Neither of these works; my output text file defaults to Occidental (Windows-1252) and some entities become gibberish. What am I doing wrong here?

Why do you think that the output file is encoded as Windows-1252? Are you using an editor that can't detect a UTF-8 file without a BOM?
– Tim Pietzcker, Commented Feb 15, 2014 at 21:47

Tim Pietzcker · Accepted Answer · 2014-02-15 21:56:28Z


When opening a UTF-8 file with the codecs module, as you did, the contents of the file are automatically decoded into Unicode strings, so you must not try to decode them again.

The same is true when writing the file; if you write it using the codecs module, the Unicode string you're passing will automatically be encoded to whatever encoding you specified.
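As a minimal sketch of that round trip (the temporary path and the sample string are made up for illustration, not taken from the question), the following writes a Unicode string through codecs.open, looks at the raw bytes that landed on disk, and reads the text back. On Python 3, the built-in open() with an encoding argument should behave the same way.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file; the name is an assumption for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# A Unicode string containing U+2019 (right single quotation mark).
text = u"It\u2019s<br>done"

# Writing: codecs encodes the Unicode string to UTF-8 for us.
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(text)

# Inspect the bytes on disk without any decoding.
with open(path, "rb") as raw:
    raw_bytes = raw.read()

# Reading: codecs decodes the UTF-8 bytes back into a Unicode string,
# so no further .decode() call is needed (or wanted).
with codecs.open(path, "r", encoding="utf-8") as src:
    decoded = src.read()
```

Here raw_bytes holds the three UTF-8 bytes E2 80 99 in place of the apostrophe, while decoded is equal to the original string.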

To make it explicit that you're dealing with Unicode strings, it might be a better idea to use Unicode literals, as in

texte = re.sub(u"<br>", u"\n",codehtml)

although it doesn't really matter in this case (which could also be written as

texte = codehtml.replace(u"<br>", u"\n")

since you're not actually using a regular expression).
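To illustrate that the two spellings agree for a literal pattern like this one, here is a small sketch with an invented sample string:

```python
import re

# Sample input made up for this sketch.
codehtml = u"first line<br>second line<br>third line"

# "<br>" contains no regex metacharacters, so both calls do the same thing.
via_regex = re.sub(u"<br>", u"\n", codehtml)
via_replace = codehtml.replace(u"<br>", u"\n")
```

Both variables end up holding the same three-line string. The equivalence only holds while the pattern has no special regex characters in it.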

If the application doesn't recognize the UTF-8 file, it might help to save it with a BOM (Byte Order Mark). This is generally discouraged, but if the application can't recognize a UTF-8 file otherwise, it's worth a try:

OUT = codecs.open("E2P3.txt","w",encoding="utf-8-sig")
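A short sketch of what "utf-8-sig" changes on disk, using a hypothetical scratch file and sample text: the file starts with the three BOM bytes, which a "utf-8-sig" reader strips again and a plain "utf-8" reader passes through as U+FEFF.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file for this sketch.
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# Writing with utf-8-sig prepends the UTF-8 BOM (EF BB BF).
with codecs.open(path, "w", encoding="utf-8-sig") as out:
    out.write(u"caf\u00e9")

with open(path, "rb") as raw:
    raw_bytes = raw.read()

has_bom = raw_bytes.startswith(codecs.BOM_UTF8)

# Reading with utf-8-sig removes the BOM again.
with codecs.open(path, "r", encoding="utf-8-sig") as src:
    without_bom = src.read()

# Reading with plain utf-8 keeps it as a leading U+FEFF character.
with codecs.open(path, "r", encoding="utf-8") as src:
    with_bom = src.read()
```

Whether the consuming application actually uses the BOM to detect UTF-8 is something to check against that application.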

edited Feb 15, 2014 at 21:56

answered Feb 15, 2014 at 21:45

Tim Pietzcker


3 Comments

kormak Over a year ago

The problem I have isn't with the re module, though. The text contains characters such as ’ (or U+2019 in Unicode), and these characters become 'â€™' once I open the text file with other applications. So this means that the codecs module does not, actually, encode my file in utf-8. I just don't understand why.

Tim Pietzcker Over a year ago

â€™ is UTF-8 for U+2019! If you see these characters, it means that whatever editor you're using thinks it's reading a Windows-1252 file. The editor is wrong, not the file.
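The effect can be reproduced in a few lines. This is only a sketch of the mechanism, assuming the other program decodes the bytes as Windows-1252 ("cp1252" in Python):

```python
# U+2019, the right single quotation mark from the text.
apostrophe = u"\u2019"

# Its UTF-8 encoding is the three bytes E2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# Decoding those bytes as Windows-1252 yields three separate characters.
misread = utf8_bytes.decode("cp1252")

# Decoding them as UTF-8 gives the original character back.
correct = utf8_bytes.decode("utf-8")
```

misread is the three-character string 'â€™', which matches what shows up in the other application, while correct is the original apostrophe.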

kormak Over a year ago

Oh. Well that explains a lot then! I'm supposed to use that text file with a text analysis program (not a text editor), so the problem probably comes from that program. I think I'll just replace or delete these entities, then. Thanks for your help!
