Encoding issue when reading CSV file with Python

Question

I have hit a roadblock when trying to read a CSV file with Python.

UPDATE: If you just want to skip the offending characters, you can open the file like this:

with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
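A small self-contained sketch of what errors="ignore" does when reading. The file name and sample bytes below are made up for illustration and are not taken from the real data: bytes that cannot be decoded as UTF-8 are silently dropped instead of raising UnicodeDecodeError.

```python
import csv
import os
import tempfile

# Hypothetical sample: valid UTF-8 text plus one stray byte (0x8f) that is
# not valid UTF-8 at that position.
raw = b"id,text\n1,caf\xc3\xa9 \x8f ok\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # errors="ignore" drops undecodable bytes instead of raising an exception.
    with open(path, "r", encoding="utf-8", errors="ignore", newline="") as data_file:
        rows = list(csv.reader(data_file))

# ascii() escapes non-ASCII characters, so this print works on any console.
print(ascii(rows))
```

Note that the dropped byte is lost for good, so this is only appropriate when the skipped characters do not matter.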

So far I have tried:

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print (row)

The error I am getting is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I have tried:

with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:

Error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>

Now, if I just print data_file, it reports the encoding as cp1252, but if I try

with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I also tried the recommended package.

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

The line I am trying to parse is:

2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT @WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None

Any thoughts or help would be appreciated.

cp1252, according to Google, is a Windows character encoding. What's your environment, and where did the files come from? If you open the CSV file in nano, for instance, does it say that it's in DOS format?
– Ogaday, Commented Dec 2, 2015 at 15:26
I don't understand what you mean by opening the file in nano... I am on a Windows machine.
– user3271518, Commented Dec 2, 2015 at 15:31
Oh, OK. I thought you might be on Unix. I've had trouble parsing DOS-formatted files on Linux before and thought it may have been a similar issue. Nano is an in-terminal text editor common on Linux systems.
– Ogaday, Commented Dec 2, 2015 at 15:33
Try: for row in reader: data = [unicode(i,'utf-8') for i in row] print data
– Learner, Commented Dec 2, 2015 at 15:38
I just tried it and I am getting the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 127: character maps to <undefined>
– user3271518, Commented Dec 2, 2015 at 15:57

Learner · Accepted Answer · 2015-12-02 18:09:47Z

1

I would use csvkit, which automatically detects the appropriate encoding and decodes accordingly, e.g.:

import csvkit
reader = csvkit.reader(data_file)

As discussed in the chat, the solution is:

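import csv
import os

# root_dir is the top-level folder to scan, as in the question.
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                # Drop non-ASCII characters so print() can encode every field.
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print(data)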
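The traceback is a UnicodeEncodeError, which suggests the failure may happen when print encodes a row for the console rather than when the file is read. Here is a minimal sketch under that assumption. The helper names to_ascii and to_console_safe are hypothetical, the sample row is invented, and cp437 is only a stand-in for a console code page that lacks the character.

```python
def to_ascii(field):
    """Drop every non-ASCII character, as the list comprehension above does."""
    return field.encode("ascii", "ignore").decode("ascii")


def to_console_safe(field, console_encoding):
    """Replace characters the given encoding cannot represent with '?'."""
    return field.encode(console_encoding, "replace").decode(console_encoding)


# Invented sample row containing U+2026 (horizontal ellipsis).
row = ["2015-11-28 22:23:58", "He was tall\u2026"]

# Encoding U+2026 for a code page that lacks it raises UnicodeEncodeError.
try:
    "\u2026".encode("cp437")
    raised = False
except UnicodeEncodeError:
    raised = True

ascii_row = [to_ascii(field) for field in row]
safe_row = [to_console_safe(field, "cp437") for field in row]
print(ascii_row)
print(safe_row)
```

On a real console, sys.stdout.encoding would be the natural value to pass as console_encoding. The "replace" variant keeps a visible placeholder where a character was lost, while the "ignore" variant removes it without a trace.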

edited Dec 2, 2015 at 18:09

answered Dec 2, 2015 at 15:12


9 Comments

user3271518 Over a year ago

Thanks, man. I don't have the ability to install packages in my environment currently.

Ogaday Over a year ago

Have you come across miniconda? It doesn't require admin privileges to use.

Learner Over a year ago

for row in reader: data = [i.encode('utf-8') for i in row] print data

Learner Over a year ago

Could you post the content of row?

user3271518 Over a year ago

Yeah, so it's really weird: the row, when I open it with Excel or a text editor, has no issues and is 2015-11-28 07:32:32,670581036858933248,3256765652, but from looking at the error it looks like there is just a character that doesn't exist.
