I'm importing a tab-delimited file over HTTP in Python.
Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.
Whatever the encoding of the data is, MongoDB throws this exception:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8
From the reading I've done, the way to solve this is to convert the row's data to Unicode as early as possible using the unicode() function. I have also tried calling decode() with "unicode" as the first argument, but that raises:
LookupError: unknown encoding: unicode
Once the data is Unicode, I can do my string manipulations, such as removing the slashes, ticks, and quotes. Then, just before inserting the data into MongoDB, I would convert it to UTF-8 using str.encode('utf-8').
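To make the plan concrete, here is a rough sketch of the pipeline I have in mind: decode early, clean the text, encode late. The function name `clean_row`, the `source_encoding` parameter, and the exact set of characters removed are placeholders for illustration; the real encoding of the file is unknown, which is the whole problem. The sketch is written with Python 3 syntax, where decoding is an explicit `bytes.decode()` call.

```python
# Sketch of the intended pipeline: decode early, clean as text, encode late.
# clean_row and source_encoding are hypothetical names; the characters
# stripped below are an assumed reading of "slashes, ticks and quotes".

def clean_row(raw_row, source_encoding):
    """Decode raw bytes, strip unwanted characters, return UTF-8 bytes."""
    text = raw_row.decode(source_encoding)      # bytes -> Unicode text
    for ch in ('/', '\\', '`', "'", '"'):
        text = text.replace(ch, '')             # drop slashes, ticks, quotes
    return text.encode('utf-8')                 # Unicode text -> UTF-8 bytes
```

This only works if `source_encoding` matches the bytes that actually arrive, so it does not answer how to pick that value.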
Problem: when converting to Unicode, I receive this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)
With this error, I'm not sure how to proceed.
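The error can be reproduced in isolation with a made-up row (not my real data) that contains the same 0x93 byte. This is a minimal sketch in Python 3 syntax, where an explicit `decode("ascii")` stands in for the ASCII codec that `unicode()` falls back to by default when no encoding is given; `sample_row` and `error_message` are names invented for the example.

```python
# Minimal reproduction with a hypothetical sample row.
# 0x93 and 0x94 are outside the ASCII range (0-127), so ASCII decoding fails.
sample_row = b"id\t\x93quoted\x94 value"

try:
    sample_row.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    # Keep the message so it can be compared with the traceback above.
    error_message = str(exc)

print(error_message)
```

The message reports the same byte and the same "ordinal not in range(128)" complaint, which suggests the failure comes from the ASCII default rather than from MongoDB itself.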
My question is this: How do I successfully import the data from a file without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?
Thanks much!