85

I have a problem with the encoding of the path variable when inserting it into an SQLite database. I tried to solve it with encode("utf-8"), which didn't help. Then I used unicode(), which gives me type unicode.

print type(path)                  # <type 'unicode'>
path = path.replace("one", "two") # <type 'str'>
path = path.encode("utf-8")       # <type 'str'> strange
path = unicode(path)              # <type 'unicode'>

I finally got the unicode type, but I still get the same error that occurred when the path variable was a str:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Could you help me solve this error and explain the correct usage of encode("utf-8") and unicode()? I often struggle with them.

This execute() statement raised the error:

cur.execute("update docs set path = :fullFilePath where path = :path", locals())

I forgot to change the encoding of the fullFilePath variable, which suffers from the same problem, but I'm quite confused now. Should I use only unicode(), only encode("utf-8"), or both?

I can't use

fullFilePath = unicode(fullFilePath.encode("utf-8"))

because it raises this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 32: ordinal not in range(128)

Python version is 2.7.2

  • Your exact question has already been answered: stackoverflow.com/questions/2392732/… Commented Apr 23, 2012 at 20:51
  • Have you converted both variables to unicode? Commented Apr 23, 2012 at 21:04
  • Learning how Python 3 handles text and data has really helped me understand everything. It is then easy to apply the knowledge to Python 2. Commented Apr 23, 2012 at 21:04
  • Here are the slides of a great talk about unicode in Python -- link Commented Aug 23, 2014 at 13:05

3 Answers

135

In Python 2, str holds text as bytes in some encoding, while unicode holds text as characters.

You decode bytes into unicode, and you encode unicode into bytes, each time with a specific encoding.

That is:

>>> 'abc'.decode('utf-8')  # str to unicode
u'abc'
>>> u'abc'.encode('utf-8') # unicode to str
'abc'

UPD Sep 2020: The answer was written when Python 2 was mostly used. In Python 3, str was renamed to bytes, and unicode was renamed to str.

>>> b'abc'.decode('utf-8') # bytes to str
'abc'
>>> 'abc'.encode('utf-8')  # str to bytes
b'abc'
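
With ASCII-only text like 'abc' the two forms look identical, so the difference is easier to see with non-ASCII characters. Here is a minimal sketch of the same round trip; the sample characters (U+0161 and U+0165) are chosen only for illustration. It is written so that it should run on Python 3 and on Python 2.7.

```python
# Text (characters) versus data (bytes), shown with two non-ASCII characters.
text = u"\u0161\u0165"               # 2 characters
data = text.encode("utf-8")          # text -> bytes
print(repr(data))                    # each character becomes 2 bytes in UTF-8
print(len(text), len(data))          # character count versus byte count

round_trip = data.decode("utf-8")    # bytes -> text
print(round_trip == text)            # decoding with the same codec restores the text
```

The byte count is larger than the character count because UTF-8 needs two bytes for each of these characters.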

4 Comments

Very good answer, straight to the point. I'd add that unicode speaks about letters or symbols, or more generically: runes while str represents a bytes string in a certain encoding, that you must decode (obviously in the correct encoding) to get the specific runes
Python 3.8 >> 'str' object has no attribute 'decode'
Do you have documentation for the change from unicode to str? I can't find it.
@cikatomo It's one of the key changes in Python 3: docs.python.org/3.0/whatsnew/…
88

You are using encode("utf-8") incorrectly. Python byte strings (str type) have an encoding, Unicode does not. You can convert a Unicode string to a Python byte string using uni.encode(encoding), and you can convert a byte string to a Unicode string using s.decode(encoding) (or equivalently, unicode(s, encoding)).

If fullFilePath and path are currently a str type, you should figure out how they are encoded. For example, if the current encoding is utf-8, you would use:

path = path.decode('utf-8')
fullFilePath = fullFilePath.decode('utf-8')
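
One thing to watch for: in Python 2, calling .decode() on a value that is already unicode first encodes it implicitly with the ASCII codec, which fails on non-ASCII characters. A small guard that decodes only byte strings avoids that. This is just a sketch: the helper name to_text is hypothetical, UTF-8 is an assumed default, and the sample path is made up.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is still a byte string."""
    if isinstance(value, bytes):     # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value                     # already text, leave it alone


raw = b"/docs/\xc5\xa1.txt"          # made-up UTF-8 encoded path
path = to_text(raw)                  # decoded exactly once
same = to_text(path)                 # calling it again changes nothing
print(repr(path), same == path)
```

Decoding once at the point where the bytes enter the program, and working with text everywhere after that, tends to keep this kind of error from spreading.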

If this doesn't fix it, the actual issue may be that you are not using a Unicode string in your execute() call. Try changing it to the following:

cur.execute(u"update docs set path = :fullFilePath where path = :path", locals())
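
For reference, here is a self-contained sketch of the whole update using text values throughout. The in-memory database, the one-column docs table, and the sample paths are assumptions made for illustration, and an explicit dict replaces locals() so the bound values are visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database for the example
cur = conn.cursor()
cur.execute(u"create table docs (path text)")

old_path = u"/tmp/one/\u0161\u0165.txt"                # text, not bytes
cur.execute(u"insert into docs (path) values (?)", (old_path,))

new_path = old_path.replace(u"one", u"two")
cur.execute(
    u"update docs set path = :fullFilePath where path = :path",
    {"fullFilePath": new_path, "path": old_path},      # both values are text
)
updated = cur.rowcount               # number of rows changed by the update
conn.commit()

stored = cur.execute(u"select path from docs").fetchone()[0]
print(updated, repr(stored))
conn.close()
```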

5 Comments

This statement fullFilePath = fullFilePath.decode("utf-8") still raises error UnicodeEncodeError: 'ascii' codec can't encode characters in position 32-34: ordinal not in range(128). fullFilePath is a combination of type str and string taken from text column of db table which should be utf-8 encoding.
According to this but it can be UTF-8, UTF-16BE or UTF-16LE. Can I find out which one somehow?
@xralf, If you are combining different str objects you may be mixing encodings. Can you show the result of print repr(fullFilePath)?
I can show it only before the call of decode(). The problematic characters are \u0161 and \u0165.
@xralf - So it is already unicode? Try changing the execute call to unicode: cur.execute(u"update docs set path = :fullFilePath where path = :path", locals())
1

Make sure your locale settings are correct before running the script from the shell, e.g.

$ locale -a | grep "^en_.\+UTF-8"
en_GB.UTF-8
en_US.UTF-8
$ export LC_ALL=en_GB.UTF-8
$ export LANG=en_GB.UTF-8

Docs: man locale, man setlocale.
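
To check from inside Python what those settings resolve to, something like the following can help. It uses only the standard library, and the values depend on your environment, so no expected output is shown.

```python
import locale
import sys

# Encoding Python derives from the locale settings (LC_ALL, LANG, ...).
preferred = locale.getpreferredencoding()
print(preferred)

# Encoding used when printing to the terminal; may be None if output is piped.
print(sys.stdout.encoding)

# Encoding used to convert file names between text and bytes.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)
```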
