
I'm working on importing a tab-delimited file over HTTP in Python.

Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.

Whatever the encoding of the data is, MongoDB throws this exception:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8

In an endeavour to solve this problem, from the reading I've done it seems the fix is to convert the row's data to Unicode as early as possible, using the unicode() function. I have also tried calling decode() with "unicode" as the first parameter, but that raises:

LookupError: unknown encoding: unicode

From there, I can do my string manipulations, such as removing the slashes, ticks, and quotes. Then, just before inserting the data into MongoDB, I convert it back to UTF-8 with str.encode('utf-8').

Problem: When converting to Unicode, I am receiving the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)

With this error, I'm not exactly sure where to continue.
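Here's a minimal snippet that reproduces the error (Python 2; the byte value and text are made up to match the traceback):

raw_line = 'He said \x93hello\x94'    # bytes as read from the file
try:
    uc = unicode(raw_line)            # no encoding given, so ASCII is assumed
except UnicodeDecodeError as e:
    print e   # 'ascii' codec can't decode byte 0x93 in position 8: ordinal not in range(128)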

My question is this: How do I successfully import the data from a file without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?

Thanks Much!

  • But you said "importing a tab-delimited file over HTTP" ... where does "CSV" fit into that? Commented Jan 13, 2011 at 22:18

1 Answer


Try these in order:

(0) Check that your removal of the slashes/ticks/etc. is not butchering the data. What's a "tick"? Please show your code, and please show a sample of the raw data: use print repr(sample_raw_data) and copy/paste the output into an edit of your question.
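For example (file name and contents entirely hypothetical), the repr output makes the suspect bytes visible:

sample_raw_data = open('feed.tsv', 'rb').readline()
print repr(sample_raw_data)
# e.g. 'name\taddress\t\x93The Shop\x94\r\n'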

(1) There's an old maxim: "If the encoding of a file is unknown, or stated to be ISO-8859-1, it is cp1252." Where are you getting the file from? If it's coming from Western Europe, the Americas, or any English/French/Spanish-speaking country/territory elsewhere, and it's not valid UTF-8, then it's likely to be cp1252.

[Edit 2] Your error byte 0x93 decodes to U+201C LEFT DOUBLE QUOTATION MARK for all encodings cp1250 to cp1258 inclusive ... what language is the text written in? [/Edit 2]
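You can check that claim yourself in a couple of lines (Python 2):

# 0x93 maps to U+201C in every Windows codepage from cp1250 to cp1258:
for n in range(1250, 1259):
    print 'cp%d' % n, repr('\x93'.decode('cp%d' % n))
# every line prints ... u'\u201c'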

(2) Save the file (before tick removal), then open it in your browser: does it look sensible? What do you see when you click on View / Character Encoding?

(3) Try chardet
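A minimal chardet sketch (the file name is a placeholder; install with pip install chardet):

import chardet

raw = open('feed.tsv', 'rb').read()
guess = chardet.detect(raw)
print guess   # e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
if guess['encoding'] is not None:   # detect() returns None when it has no idea
    uc = raw.decode(guess['encoding'])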

Edit with some more advice:

Once you know what the encoding is (let's assume it's cp1252):

(1) convert your input data to unicode: uc = raw_data.decode('cp1252')

(2) process the data (remove slashes/ticks/etc) as unicode: clean_uc = manipulate(uc)

(3) you need to output your data encoded as utf8: to_mongo = clean_uc.encode('utf8')
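Putting those three steps together (manipulate() here is just a stand-in for whatever slash/tick/quote cleanup you're doing):

def manipulate(uc):
    # stand-in for your own cleanup -- do it all on unicode, not bytes
    return uc.replace(u'\\', u'').replace(u'`', u'').replace(u'"', u'')

raw_data = 'He said \x93hello\x94'     # bytes from the file
uc = raw_data.decode('cp1252')         # (1) bytes -> unicode
clean_uc = manipulate(uc)              # (2) manipulate as unicode
to_mongo = clean_uc.encode('utf8')     # (3) encode as UTF-8 for MongoDB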

Note 1: Your error message says "can't decode byte 0x93 in position 1258" ... 1258 bytes is a rather long chunk of text; is this reasonable? Have you had a look at the data it is complaining about? How? What did you see?

Note 2: Please consider reading the Python Unicode HOWTO and this article


3 Comments

@Joshua Burns: Thanks for accepting the answer, but future readers will, like me, be wondering what the outcome was ... cp1252, or something else?
@Joshua Burns: Sorry, I don't understand "indeed". I didn't say that it was Latin-1; I said it was likely to be cp125X. Latin-1 is not cp125X. Your error byte 0x93 is some weird never-seen-in-the-real-world control character when decoded as Latin-1.
The file was originally written in English and was provided by an external source. I'm guessing that somewhere down the line some of the data became corrupt and was never fixed. Decoding the text as Latin-1 resolved the issue in this scenario, even though it means a character that never appears in real text was let through.
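For the record, the difference the last two comments are describing (Python 2):

print repr('\x93'.decode('latin-1'))   # u'\x93'   -- C1 control character U+0093
print repr('\x93'.decode('cp1252'))    # u'\u201c' -- left double quotation mark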
