Python csv: UnicodeDecodeError

Question

I'm reading in a file with Python's csv module, and have Yet Another Encoding Question (sorry, there are so many on here).

In the CSV file, there are £ signs. After reading the row in and printing it, they have become \xa3.

Trying to encode them as Unicode produces a UnicodeDecodeError:

row = [unicode(x.strip()) for x in row]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)

I have been reading the csv documentation and the numerous other questions about this on StackOverflow. I think that £ becoming \xa3 in ASCII means that the original CSV file is in UTF-8.

(Incidentally, is there a quick way to check the encoding of a CSV file?)

If it's in UTF-8, then shouldn't the csv module be able to cope with it? It seems to be transforming all the symbols into ASCII, even though the documentation claims it accepts UTF-8.

I've tried adding a unicode_csv_reader function as described in the csv examples, but it doesn't help.

---- EDIT -----

I should clarify one thing. I have seen this question, which looks very similar. But adding the unicode_csv_reader function defined there produces a different error instead:

yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected code byte

So maybe my file isn't UTF8 after all? How can I tell?

riwalk · Accepted Answer · 2010-08-13 19:24:54Z

7

Try using "ISO-8859-1" as your encoding. It looks like you are dealing with a single-byte "extended ASCII" encoding, not UTF-8.

Edit:

Here's some simple code that deals with extended ASCII:

>>> s = "La Pe\xf1a"
>>> print s
La Pe±a
>>> print s.decode("latin-1")
La Peña
>>>

Even better, dealing with the exact character that is giving you problems:

>>> s = "12\xa3"
>>> print s.decode("latin-1")
12£
>>>
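For anyone on Python 3: the same idea is usually expressed by passing the encoding when opening the file, so the csv module receives already-decoded text. This is only a minimal sketch, assuming the file really is ISO-8859-1. The helper name read_latin1_csv and the sample data are made up for illustration and are not from the question.

```python
import csv
import os
import tempfile


def read_latin1_csv(path, encoding="latin-1"):
    """Return all rows of a CSV file, decoding its bytes with the given encoding."""
    # newline="" is what the csv documentation recommends for files passed to csv.reader.
    with open(path, newline="", encoding=encoding) as f:
        return [[cell.strip() for cell in row] for row in csv.reader(f)]


# Hypothetical sample file containing the raw byte 0xa3 (the pound sign in Latin-1).
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"item,price\r\nwidget,\xa312\r\n")
    sample_path = tmp.name

rows = read_latin1_csv(sample_path)
os.remove(sample_path)
print(rows)
```

If the file turns out to be in some other encoding, only the encoding argument should need to change.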

edited Aug 13, 2010 at 19:24

answered Aug 13, 2010 at 19:10

riwalk


8 Comments

AP257 Over a year ago

Do you mean use: yield [unicode(cell, 'ISO-8859-1') for cell in row] instead, in the unicode_csv_reader function? Unfortunately that doesn't help - back to the ordinal not in range(128) error again.

riwalk Over a year ago

It wouldn't make much sense to use a function called unicode() when dealing with ASCII. What I am saying is that you are dealing with a file that is encoded using a "ISO-8859-1" encoding. I didn't post any code, because I don't know how to do it off the top of my head, but your problem is that you need to decode it as ISO-8859-1, not Unicode.

AP257 Over a year ago

OK, thanks. I'll investigate. How did you know it was ISO-8859-1? In other words, is there a way for me to check encodings myself, rather than just ask dumb questions on StackOverflow :)

riwalk Over a year ago

Not a dumb question at all. I had to work on a project where we were working on a web scraping tool, and we needed to scrape international sites. I spent two full weeks immersing myself in the intricate details of encoding, and to this day I am one of the few at my workplace who has a firm grasp over them.

ryanjdillon Over a year ago

@AP257 This is old, but you can check the charset on linux/unix by using file -i filename. With eastern European languages, I've seen the enca command mentioned before.


John Machin · Answer · 2010-08-13 21:52:02Z

0

If you are on Windows, it is highly likely that the encoding that you should use is one of the cp125X family ... e.g. if you are in Western Europe or the Americas, it will be cp1252. Windows software often uses bytes in the range \x80 to \x9F inclusive to encode fancy punctuation characters whereas that range is reserved in ISO-8859-X for the rarely used "C1 Control Characters".
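To illustrate the difference in that \x80 to \x9F range, here is a small sketch in Python 3 syntax. The sample bytes are invented for the demonstration: both codecs agree on the pound sign, but only cp1252 turns 0x93 and 0x94 into curly quotes, while latin-1 maps them to C1 control characters.

```python
# Invented sample: cp1252 curly quotes (0x93, 0x94) around "12" and a pound sign (0xa3).
raw = b"\x9312\xa3\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# repr() makes the invisible control characters in the latin-1 result visible.
print(repr(as_cp1252))
print(repr(as_latin1))
```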

You can find out the usual encoding in your locale by running this at the command line:

python -c "import locale; print(locale.getpreferredencoding())"
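On the "how can I tell?" part of the question, one rough approach is to try a few candidate encodings and see which ones decode the bytes without error. The sketch below is in Python 3 syntax, and the function name guess_encodings and the candidate list are assumptions chosen for illustration. Note that latin-1 accepts every byte, so a successful decode there proves nothing by itself; the useful signal is mostly whether UTF-8 fails.

```python
import locale


def guess_encodings(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings under which data decodes without error."""
    working = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        working.append(name)
    return working


# The byte sequence from the question: a lone 0xa3 is not valid UTF-8.
sample = b"12\xa3"
print(locale.getpreferredencoding())
print(guess_encodings(sample))
```

A file that passes under several encodings still needs a human look at the decoded text to decide which result is sensible.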

answered Aug 13, 2010 at 21:52

John Machin


5 Comments

riwalk Over a year ago

He is having difficulty reading £ signs, and you're assuming that the file was conveniently saved on whatever settings his computer prefers? I would be careful making the assumption that the file is something that was saved using his machine.

John Machin Over a year ago

@Stargazer712: No, I'm not assuming anything. I'm suggesting that it is highly likely that the file was created on a machine in the same locale and using the same operating system as the machine the OP is using.

riwalk Over a year ago

My experience with encodings (as I mentioned earlier) came from scraping the web. I assure you it is not a safe assumption.

John Machin Over a year ago

@Stargazer712: Which part of "I'm not assuming anything" don't you understand? I'm suggesting that the OP should check whether cp125X might not be more appropriate, i.e. more future-proof.

riwalk Over a year ago

"I'm suggesting that it is highly likely that the file was created on a machine in the same locale..." -- That's an assumption, and I'm done talking about this.
