
I have this text in a file - Recuérdame (notice it's a Spanish word). When I read this file with a Python script, I get this text as Recu&#xE9;rdame.

I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it, or is my terminal playing tricks on me?


4 Answers


Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).

For example, if you know the encoding is UTF-8:

with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')   # -sig takes care of BOM if present
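If you don't know the encoding up front, a third-party detector such as chardet (an assumption here; it is not in the standard library) can make an educated guess from the raw bytes:

import chardet  # third-party: pip install chardet

with open('foo.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
contents = raw.decode(guess['encoding'])

Detection is heuristic, so treat the result as a guess rather than a guarantee.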

The text in your file does not seem to be encoded Unicode text at all, however; the accented character is apparently stored as the XML entity &#xE9;, which will have to be converted manually (tip of the hat to jleedev for the link).
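On Python 3 (3.4 and later), the standard library can do that conversion for you; a minimal sketch, assuming the file contents have already been read into a str:

import html

print(html.unescape('Recu&#xE9;rdame'))   # -> Recuérdame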


2 Comments

What's a BOM (in the context of -sig)?
@MovieYoda: Ah, check out this article. Basically, when it takes multiple bytes together to represent a single character (as can be the case with UTF-8), those bytes could be interpreted in a different order than intended (this order is called endianness). Because of this, a special unambiguous (and optional, in the case of UTF-8) mark is placed at the beginning of the file to indicate the endianness of the file. -sig removes the BOM if it's present so you don't get the marker appearing as part of your unicode string.

It is not a Unicode string. It's a string in whatever encoding it is encoded in. Hence it's a UTF-8 or a Latin-1 or something else string. In this case, &#xE9; is an HTML/XML entity representing é, specifically. It's an escaping mechanism used in HTML and XML to represent non-ASCII data.

To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
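To illustrate the point about encodings: the same character é corresponds to different bytes depending on the encoding, while the entity is just plain ASCII text (Python 3 syntax, only a sketch):

# One logical character, different byte representations:
b'\xe9'.decode('latin-1')       # -> 'é' (one byte in Latin-1)
b'\xc3\xa9'.decode('utf-8')     # -> 'é' (two bytes in UTF-8)
b'&#xE9;'.decode('ascii')       # -> '&#xE9;' (the entity itself is plain ASCII)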

2 Comments

Yes and no. It represents a numeric code point. You can’t say it’s an escaped UTF-8 character. It may be a Unicode character, but that’s something different.
Sure, all characters that exist in the set of Unicode characters are Unicode characters, of course. But with that definition, anything that can be decoded into Unicode is a Unicode string, including ASCII strings, and then the term "Unicode string" loses all meaning. A Unicode string is a string of Unicode data, and in Python, that's something held in a unicode object. Anything that is encoded should not be called a Unicode string, it just makes people confused.

It is HTML, and this construct is called an "entity". You can use

import re

def entity_decode(match):
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return unichr(int(entity, base))    # Python 2; on Python 3 use chr()

print re.sub("(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#xE9;rdame")

to decode all entities.

Edit: Yes, of course they are not Latin-1; now it should work with all entities.

2 Comments

No, there are entities that are not Latin-1, such as &Alpha;, a Greek Alpha. They are UCS-2, which is two bytes and quite tricky to combine with your technique.
It was a problem with your Latin-1 decoding technique, yes. Now you are using unichr, which works with numeric entities. It still, however, does not work with named entities. And once you add that, your code will be the same as effbot's code, which everyone else links to already. :-)
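To illustrate that last comment: a sketch (Python 3 syntax) of extending the regex approach to named entities via the standard library's name2codepoint table. The U+FFFD fallback for unknown names is my own choice, not anything from the answer:

import re
from html.entities import name2codepoint

def entity_decode(match):
    hash_, is_hex, body = match.groups()
    if not hash_:                        # named entity such as &Alpha;
        name = is_hex + body             # re-attach a leading 'x' (e.g. &xi;)
        return chr(name2codepoint.get(name, 0xFFFD))
    return chr(int(body, 16 if is_hex else 10))

print(re.sub("(?i)&(#?)(x?)([^;]+);", entity_decode, "Recu&#xE9;rdame &Alpha;"))
# -> Recuérdame Α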

Working with xlrd, I have a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose wrapping arguments in str() in certain calls with xlrd has specific encoding problems.
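For what it's worth, that error is classic Python 2 behaviour: str() on a unicode string implicitly encodes it as ASCII, which fails for any non-ASCII character. A minimal reproduction (Python 2; the German word is just an illustrative example):

# Python 2
cell_value = u'gro\xdfe'              # German 'große'; \xdf is ß
print cell_value.encode('utf-8')      # explicit encoding works
str(cell_value)                       # raises UnicodeEncodeError: 'ascii' codec
                                      #   can't encode character u'\xdf' in position 3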

Comments
