
I have hit a roadblock trying to read a CSV file with Python.

UPDATE: if you just want to skip the offending characters, you can open the file like this:

with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
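
As a minimal sketch of what errors="ignore" does when reading: bytes that are not valid UTF-8 are silently dropped during decoding, so the read does not raise, at the cost of losing those characters. The file name sample.csv and its contents are made up for illustration and are not taken from the real data.

```python
import csv
import os
import tempfile

# Hypothetical sample row containing one byte (0x8f) that is not valid UTF-8.
raw = b'2015-11-28,"caf\x8f latte"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # made-up file name
    with open(path, "wb") as f:
        f.write(raw)

    # errors="ignore" drops undecodable bytes instead of raising UnicodeDecodeError.
    with open(path, "r", encoding="utf-8", errors="ignore", newline="") as data_file:
        rows = list(csv.reader(data_file))

print(rows)
```

Note that this only affects decoding the file; it does not change how print() encodes text for the console.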

So far I have tried:

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print(row)

The error I am getting is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I have also tried:

with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:

Error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>

If I just print data_file, it reports the encoding as cp1252, but if I try

with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I also tried the recommended package.

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

The line I am trying to parse is:

2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT @WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None

Any thoughts or help are appreciated.

  • cp1252, according to Google, is a Windows character encoding. What's your environment, and where did the files come from? If you open the CSV file in nano, for instance, does it say that it's in DOS format? Commented Dec 2, 2015 at 15:26
  • I don't understand what you mean by opening the file in nano. I am on a Windows machine. Commented Dec 2, 2015 at 15:31
  • Oh, OK. I thought you might be on Unix. I've had trouble parsing DOS-formatted files on Linux before and thought it might be a similar issue. Nano is an in-terminal text editor common on Linux systems. Commented Dec 2, 2015 at 15:33
  • Try for row in reader: data = [unicode(i,'utf-8') for i in row] print data Commented Dec 2, 2015 at 15:38
  • I just tried it, and I am getting the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 127: character maps to <undefined> Commented Dec 2, 2015 at 15:57

1 Answer

I would use csvkit, which automatically detects the appropriate encoding and decodes accordingly, e.g.

import csvkit
reader = csvkit.reader(data_file)

As discussed in the chat, the solution is:

import csv
import os

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                # Drop non-ASCII characters so print() cannot fail on a console that cannot encode them
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print(data)
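
To illustrate what the list comprehension in that loop does, here is a small sketch with a made-up row (the text is hypothetical, not taken from the asker's file). Encoding to ASCII with 'ignore' discards every non-ASCII character, so the result can be printed on any console; the trade-off is that those characters are lost. The second half rests on an assumption the thread does not confirm: that the original UnicodeEncodeError came from print() writing to a console whose code page lacks the character. cp437 is used here only as a stand-in for such a code page.

```python
# Hypothetical row containing an ellipsis (U+2026) and an accented letter.
row = ["2015-11-28", "He was tall\u2026 caf\u00e9"]

# Same round trip as in the answer: non-ASCII characters are dropped.
cleaned = [field.encode("ascii", "ignore").decode("ascii") for field in row]
print(cleaned)

# Assumption: the console uses a narrow code page such as cp437.
# Encoding a character that the code page lacks raises UnicodeEncodeError.
try:
    "\u2026".encode("cp437")
    console_can_encode = True
except UnicodeEncodeError:
    console_can_encode = False
print(console_can_encode)
```

If losing characters is not acceptable, writing the rows to a file opened with encoding="utf-8" instead of printing them avoids the console's code page altogether.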

9 Comments

Thanks, but I don't currently have the ability to install packages in my environment.
Have you come across miniconda? It doesn't require admin privileges to use.
for row in reader: data = [i.encode('utf-8') for i in row] print data
Could you post the content of row?
Yeah, it's really weird: when I open the file with Excel or a text editor, the row has no issues and reads 2015-11-28 07:32:32,670581036858933248,3256765652, but judging from the error it looks like there is just a character that doesn't exist.