
I have this text in a file - Recuérdame (notice it's a Spanish word). When I read this file with a Python script, I get this text as Recu&#xE9;rdame.

I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it, or is my terminal playing tricks on me?


4 Answers


Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).

For example, if you know the encoding is UTF-8:

with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')   # -sig takes care of BOM if present
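If you don't know the encoding up front, a third-party detector such as chardet (an assumption here; it is not in the standard library) can make an educated guess from the raw bytes:

import chardet  # third-party: pip install chardet

with open('foo.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
contents = raw.decode(guess['encoding'])

Detection is heuristic, so treat the result as a guess rather than a guarantee.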

The text in your file does not seem to be encoded Unicode text at all, however; the accented character is apparently stored as the XML entity &#xE9;, which will have to be converted manually (tip of the hat to jleedev for the link).
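On Python 3 (3.4 and later), the standard library can do that conversion for you; a minimal sketch, assuming the file contents have already been read into a str:

import html

print(html.unescape('Recu&#xE9;rdame'))   # -> Recuérdame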


2 Comments

What's a BOM (in the context of -sig)?
@MovieYoda: Ah, check out this article. Basically, when it takes multiple bytes together to represent a single character (as can be the case with UTF-8), those bytes could be interpreted in a different order than intended (this order is called endianness). Because of this, a special unambiguous (and optional, in the case of UTF-8) mark is placed at the beginning of the file to indicate the endianness of the file. -sig removes the BOM if it's present so you don't get the marker appearing as part of your unicode string.

It is not a Unicode string. It's a string in whatever encoding it is encoded in. Hence it's a UTF-8 or a Latin-1 or something else string. In this case, &#xE9; is an HTML/XML entity representing é, specifically. It's an escaping mechanism used in HTML and XML to represent non-ASCII data.

To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
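To illustrate the point about encodings: the same character é corresponds to different bytes depending on the encoding, while the entity is just plain ASCII text (Python 3 syntax, only a sketch):

# One logical character, different byte representations:
b'\xe9'.decode('latin-1')       # -> 'é' (one byte in Latin-1)
b'\xc3\xa9'.decode('utf-8')     # -> 'é' (two bytes in UTF-8)
b'&#xE9;'.decode('ascii')       # -> '&#xE9;' (the entity itself is plain ASCII)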

2 Comments

Yes and no. It represents a numeric code point. You can’t say it’s an escaped UTF-8 character. It may be a Unicode character, but that’s something different.
Sure, all characters that exist in the set of Unicode characters are Unicode characters, of course. But with that definition, anything that can be decoded into Unicode is a Unicode string, including ASCII strings, and then the term "Unicode string" loses all meaning. A Unicode string is a string of Unicode data, and in Python, that's something held in a unicode object. Anything that is encoded should not be called a Unicode string, it just makes people confused.

It is HTML, and this construct is called an "entity". You can use

import re

def entity_decode(match):
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return unichr(int(entity, base))    # Python 2; on Python 3 use chr()

print re.sub("(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#xE9;rdame")

to decode all entities.

Edit: Yes, of course they are not Latin-1; now it should work with all entities.

2 Comments

No, there are entities that are not Latin-1, such as &Alpha;, a Greek Alpha. They are UCS-2, which is two bytes and quite tricky to combine with your technique.
It was a problem with your Latin-1 decoding technique, yes. Now you are using unichr, which works with numeric entities. It still, however, does not work with named entities. And once you add that, your code will be the same as effbot's code, which everyone else links to already. :-)
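To illustrate that last comment: a sketch (Python 3 syntax) of extending the regex approach to named entities via the standard library's name2codepoint table. The U+FFFD fallback for unknown names is my own choice, not anything from the answer:

import re
from html.entities import name2codepoint

def entity_decode(match):
    hash_, is_hex, body = match.groups()
    if not hash_:                        # named entity such as &Alpha;
        name = is_hex + body             # re-attach a leading 'x' (e.g. &xi;)
        return chr(name2codepoint.get(name, 0xFFFD))
    return chr(int(body, 16 if is_hex else 10))

print(re.sub("(?i)&(#?)(x?)([^;]+);", entity_decode, "Recu&#xE9;rdame &Alpha;"))
# -> Recuérdame Α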

Working with xlrd, I have a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose wrapping arguments in str() in certain calls with xlrd has specific encoding problems.
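For what it's worth, that error is classic Python 2 behaviour: str() on a unicode string implicitly encodes it as ASCII, which fails for any non-ASCII character. A minimal reproduction (Python 2; the German word is just an illustrative example):

# Python 2
cell_value = u'gro\xdfe'              # German 'große'; \xdf is ß
print cell_value.encode('utf-8')      # explicit encoding works
str(cell_value)                       # raises UnicodeEncodeError: 'ascii' codec
                                      #   can't encode character u'\xdf' in position 3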

Comments
