I have this code, part of a function that replaces badly encoded foreign characters in a string:

odbc_str = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s)
# b"String from an old database with weird mixed encodings"

I need a "real" string here, not bytes. But when I try to decode it, I get an exception:

odbc_str = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s.decode("utf-8"))
# AttributeError: 'str' object has no attribute 'decode'
  • Do you know why s looks like bytes here?
  • Why can't I decode it to a real string?
  • Do you know how to do this the clean way? (Today I return s[2:][:-1]. It works but is very ugly, and I would like to understand this behavior.)

Thanks in advance!

EDIT:

pypyodbc on Python 3 uses Unicode everywhere by default, which confused me. When connecting, you can tell it to use ANSI instead:

con_odbc = pypyodbc.connect("DSN=GP", False, False, 0, False)

Then I can decode the returned values as cp850, which is the original code page of the database:

str(odbc_str, "cp850", "replace")
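As a minimal sketch of what that call does, assuming the driver hands back raw cp850 bytes (the sample value below is made up for illustration and does not come from the real database):

```python
# Hypothetical raw value, as an ANSI-mode driver might return it (cp850 bytes).
raw = b"caf\x82 cr\x8ame"

# str(bytes_obj, encoding, errors) is equivalent to bytes_obj.decode(encoding, errors).
text_a = str(raw, "cp850", "replace")
text_b = raw.decode("cp850", "replace")

print(text_a)            # café crème
print(text_a == text_b)  # True
```

The "replace" error handler substitutes a replacement character for any byte the codec cannot decode instead of raising an exception.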

There is no longer any need to manually replace each special character. Thank you very much, pepr.

2 Comments

  • str.decode no longer exists in 3.x. See docs.python.org/3/howto/unicode.html for dealing with strings and bytes in 3.x. (Commented Sep 24, 2014 at 10:19)
  • decode converts bytes into the abstract characters that make up a string. A Python 3 string is expected to contain only valid characters, which is why it has no .decode: there are no bytes in a Python 3 string. (Commented Sep 24, 2014 at 11:46)

1 Answer

The printed b"String from an old database with weird mixed encodings" is not the representation of a bytes object. It is the actual content of the string s. This happens because you did not pass the encoding argument to str() (see the doc https://docs.python.org/3.4/library/stdtypes.html#str):

If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).

This is what happened in your case. The leading b and opening quote (two characters) and the closing quote are actually part of the string content. You can also try:

s1 = 'String from an old database with weird mixed encodings'
print(type(s1), repr(s1))
by = bytes(s1, 'cp1252')
print(type(by), repr(by))
s2 = str(by)
print(type(s2), repr(s2))

and it prints:

<class 'str'> 'String from an old database with weird mixed encodings'
<class 'bytes'> b'String from an old database with weird mixed encodings'
<class 'str'> "b'String from an old database with weird mixed encodings'"

This is the reason why s[2:][:-1] works for you.
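The slicing trick only looks correct while the text is pure ASCII. A small sketch (the sample text is made up) shows what happens with accented characters, and how decoding the bytes with an explicit encoding avoids the problem:

```python
s1 = "déjà vu"
by = bytes(s1, "cp1252")   # same result as s1.encode("cp1252")

# Without an encoding argument, str(by) returns the repr text, escapes included.
as_repr = str(by)
print(as_repr)             # b'd\xe9j\xe0 vu'

# Slicing off b' and ' leaves literal backslash escapes, not the accented letters.
sliced = as_repr[2:][:-1]
print(sliced == s1)        # False

# Decoding the bytes with the encoding they were written in gives a real string.
decoded = by.decode("cp1252")
print(decoded == s1)       # True
```

str(by, "cp1252") gives the same result as by.decode("cp1252").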

If you think more about it, then (in my opinion) you want one of two things. Either you get bytes or bytearray from the database (if possible) and fix the bytes (see bytes.translate https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#bytes.translate), or you successfully get a string (being lucky that no exception was raised when it was constructed) and replace the wrong characters with the correct ones (see also str.translate() https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#str.translate).
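Both options can be sketched as follows. The byte values 0x82 and 0x8a and their intended letters come from the replace() calls in the question; the sample data and the variable names are made up for illustration:

```python
# Made-up sample: 0x82 should be 'é' and 0x8a should be 'è', as in the question.
raw = b"caf\x82 cr\x8ame"

# Option 1: fix the bytes first, then decode them.
# bytes.maketrans builds a 256-byte table mapping each source byte to a target byte.
byte_table = bytes.maketrans(b"\x82\x8a", "éè".encode("cp1252"))
fixed_from_bytes = raw.translate(byte_table).decode("cp1252")

# Option 2: the string was already built by decoding as cp1252,
# so 0x82 became U+201A and 0x8a became U+0160. Replace those characters.
wrong = raw.decode("cp1252")
char_map = str.maketrans({"\u201a": "é", "\u0160": "è"})
fixed_from_str = wrong.translate(char_map)

print(fixed_from_bytes)  # café crème
print(fixed_from_str)    # café crème
```

Either table would need one entry per affected character, so this only pays off when the set of broken characters is small and known.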

Possibly, the ODBC layer used the wrong encoding internally. (That is, the content of the database may be correct, but it was misinterpreted by ODBC, and you are not able to tell ODBC what the correct encoding is.) In that case you want to encode the string back to bytes using that wrong encoding, and then decode the bytes using the right encoding.
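A minimal sketch of that round trip, assuming the data was wrongly decoded as cp1252 while the real code page is cp850 (the helper name redecode and the sample text are hypothetical):

```python
def redecode(text, wrong="cp1252", right="cp850"):
    """Undo a wrong decoding: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)

# What cp850 bytes for "café crème" look like after being decoded as cp1252.
garbled = "caf\u201a cr\u0160me"

print(redecode(garbled))  # café crème
```

This only works when the wrong decoding lost no information. If some bytes were dropped or replaced on the way in, the original bytes cannot be recovered.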

4 Comments

  • I use the Visual FoxPro driver to access an xBase .dbf table. The character set seems to be cp1252 with all special characters in ASCII. I have tried a lot of different encodings, and I get the best results with these ones. Thank you very much for your help!
  • I do not know how you tell the Visual FoxPro driver the encoding. Was it different from cp1252?
  • Is there a way to convert a string like "b'hello'" back to bytes? I need to do this because I have a file containing Unicode data as a string representation, and to parse it I would need to convert the file text to bytes. Thank you in advance.
  • @skadoosh: Any file contains bytes, so it is difficult to answer your case without knowing what encoding was used when your file was written. A string representation in Python 3 means a Unicode string representation, so you should open the file in text mode with an explicitly given encoding. Only after reading can you get rid of the unwanted characters. It would be better to open a new question with a specific sample of your content.
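For the narrow case where the text is exactly the repr of a Python bytes object, one possible approach (not something proposed in the thread) is ast.literal_eval from the standard library, which parses Python literals without executing arbitrary code. The sample text below is made up:

```python
import ast

# Text that looks like the repr of a bytes object, e.g. read from a file.
text = "b'hello \\xe9'"

# literal_eval parses the literal and returns the actual bytes object.
value = ast.literal_eval(text)

print(type(value), value)  # <class 'bytes'> b'hello \xe9'
```

The resulting bytes still have to be decoded with the encoding they were originally written in.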
