
I'm working on importing a tab-delimited file over HTTP in Python.

Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.

Whatever the encoding of the data is, MongoDB throws this exception:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8

In an endeavour to solve this problem, from the reading I've done it seems the fix is to convert the row's data to Unicode as early as possible, using the unicode() function. I have also tried calling decode() with "unicode" as the first parameter, but that raises:

LookupError: unknown encoding: unicode

From there, I can do my string manipulations, such as removing the slashes, ticks, and quotes. Then, just before inserting the data into MongoDB, I convert it back to UTF-8 with str.encode('utf-8').

Problem: When converting to Unicode, I am receiving the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)

With this error, I'm not exactly sure where to continue.
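Here's a minimal snippet that reproduces the error (Python 2; the byte value and text are made up to match the traceback):

raw_line = 'He said \x93hello\x94'    # bytes as read from the file
try:
    uc = unicode(raw_line)            # no encoding given, so ASCII is assumed
except UnicodeDecodeError as e:
    print e   # 'ascii' codec can't decode byte 0x93 in position 8: ordinal not in range(128)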

My question is this: How do I successfully import the data from a file without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?

Thanks Much!

  • But you said "importing a tab-delimited file over HTTP" ... where does "CSV" fit into that? Commented Jan 13, 2011 at 22:18

1 Answer


Try these in order:

(0) Check that your removal of the slashes/ticks/etc. is not butchering the data. What's a "tick"? Please show your code, and please show a sample of the raw data: use print repr(sample_raw_data) and copy/paste the output into an edit of your question.
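For example (file name and contents entirely hypothetical), the repr output makes the suspect bytes visible:

sample_raw_data = open('feed.tsv', 'rb').readline()
print repr(sample_raw_data)
# e.g. 'name\taddress\t\x93The Shop\x94\r\n'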

(1) There's an old maxim: "If the encoding of a file is unknown, or stated to be ISO-8859-1, it is cp1252." Where are you getting the file from? If it's coming from Western Europe, the Americas, or any English/French/Spanish-speaking country/territory elsewhere, and it's not valid UTF-8, then it's likely to be cp1252.

[Edit 2] Your error byte 0x93 decodes to U+201C LEFT DOUBLE QUOTATION MARK for all encodings cp1250 to cp1258 inclusive ... what language is the text written in? [/Edit 2]
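You can check that claim yourself in a couple of lines (Python 2):

# 0x93 maps to U+201C in every Windows codepage from cp1250 to cp1258:
for n in range(1250, 1259):
    print 'cp%d' % n, repr('\x93'.decode('cp%d' % n))
# every line prints ... u'\u201c'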

(2) Save the file (before tick removal), then open it in your browser: does it look sensible? What do you see when you click on View / Character Encoding?

(3) Try chardet
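A minimal chardet sketch (the file name is a placeholder; install with pip install chardet):

import chardet

raw = open('feed.tsv', 'rb').read()
guess = chardet.detect(raw)
print guess   # e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
if guess['encoding'] is not None:   # detect() returns None when it has no idea
    uc = raw.decode(guess['encoding'])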

Edit with some more advice:

Once you know what the encoding is (let's assume it's cp1252):

(1) convert your input data to unicode: uc = raw_data.decode('cp1252')

(2) process the data (remove slashes/ticks/etc) as unicode: clean_uc = manipulate(uc)

(3) you need to output your data encoded as utf8: to_mongo = clean_uc.encode('utf8')
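Putting those three steps together (manipulate() here is just a stand-in for whatever slash/tick/quote cleanup you're doing):

def manipulate(uc):
    # stand-in for your own cleanup -- do it all on unicode, not bytes
    return uc.replace(u'\\', u'').replace(u'`', u'').replace(u'"', u'')

raw_data = 'He said \x93hello\x94'     # bytes from the file
uc = raw_data.decode('cp1252')         # (1) bytes -> unicode
clean_uc = manipulate(uc)              # (2) manipulate as unicode
to_mongo = clean_uc.encode('utf8')     # (3) encode as UTF-8 for MongoDB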

Note 1: Your error message says "can't decode byte 0x93 in position 1258" ... 1258 bytes is a rather long chunk of text; is this reasonable? Have you had a look at the data it is complaining about? How? What did you see?

Note 2: Please consider reading the Python Unicode HOWTO and this article


3 Comments

@Joshua Burns: Thanks for accepting the answer, but future readers will, like me, be wondering what the outcome was ... cp1252, or something else?
@Joshua Burns: Sorry, I don't understand "indeed". I didn't say that it was Latin-1; I said it was likely to be cp125X. Latin-1 is not cp125X. Your error byte 0x93 is some weird never-seen-in-the-real-world control character when decoded as Latin-1.
The file was originally written in English and was provided by an external source. I'm guessing that somewhere down the line some of the data became corrupt and was never fixed. Decoding the text as Latin-1 resolved the issue in this scenario, even though it means a character that never appears in real text was let through.
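For the record, the difference the last two comments are describing (Python 2):

print repr('\x93'.decode('latin-1'))   # u'\x93'   -- C1 control character U+0093
print repr('\x93'.decode('cp1252'))    # u'\u201c' -- left double quotation mark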
