Character encoding in Python

Question

I have a byte stream that looks like this '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

str_data is written to a text file using the following code:

file = open("test_doc","w")
file.write(str_data)
file.close()

If test_doc is opened in a web browser and character encoding is set to Japanese it just works fine.

I am using reportlab to generate a PDF with the following code:

from reportlab.pdfbase import pdfmetrics
from reportlab.pdfgen.canvas import Canvas
from reportlab.pdfbase.cidfonts import CIDFont


pdfmetrics.registerFont(CIDFont('HeiseiMin-W3','90ms-RKSJ-H'))
pdfmetrics.registerFont(CIDFont('HeiseiKakuGo-W5','90ms-RKSJ-H'))
c = Canvas('test1.pdf')
c.setFont('HeiseiMin-W3-90ms-RKSJ-H', 6)

message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'

message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'

c.drawString(100, 675,message1)
c.save()

Here I use the message1 variable, which gives output in Japanese. I need to use message3 instead of message1 to generate the PDF, but message3 produces garbage, probably because of improper encoding.

Could you rephrase the question... I am not sure what you are asking for
– Mike Pennington, Commented Apr 16, 2011 at 8:47

John Machin · Accepted Answer · 2011-04-16 11:44:45Z

2

Here is an answer:

message1 is encoded in shift_jis; message3 and str_data are encoded in UTF-8. All appear to represent Japanese text. See the following IDLE session:

>>> message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
>>> print message1.decode('shift_jis')
これは平成明朝です。
>>> message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
>>> print message3.decode('UTF-8')
テスト
>>> str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
>>> print str_data.decode('UTF-8')
日本語
>>>

Google Translate detects the language as Japanese and translates them to the English "This is the Heisei Mincho.", "Test", and "Japanese" respectively.

What is the question?
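If the goal is to draw message3 with the font registered under '90ms-RKSJ-H', one possible approach is to transcode it from UTF-8 to Shift_JIS first. The following is a minimal sketch in Python 3 syntax (the session above is Python 2), assuming that this font encoding expects Shift_JIS bytes, as the working message1 suggests. The helper name utf8_to_shift_jis is hypothetical, and the sketch covers only the transcoding step, not the reportlab call itself.

```python
# Python 3 sketch: the literals from the question, written as bytes objects.
message1 = b'\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
message3 = b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'


def utf8_to_shift_jis(data):
    """Hypothetical helper: re-encode UTF-8 bytes as Shift_JIS bytes."""
    text = data.decode('utf-8')       # bytes -> text (str)
    return text.encode('shift_jis')   # text -> Shift_JIS bytes


# message3 now uses the same byte encoding as message1.
message3_sjis = utf8_to_shift_jis(message3)
print(message3_sjis.decode('shift_jis'))
```

In Python 2 the equivalent one-liner would be message3.decode('UTF-8').encode('shift_jis'), using the same decode and encode methods shown in the session above.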


2 Comments

Anuj Over a year ago

oh i forgot to use print to check out the answer .. my bad .. in fact i tried decode('UTF-8') .. thanks for the help

John Machin Over a year ago

@drewk: I'm not a mind reader. Even after the OP's response, I don't have a clue what his problem was.

dawg · Accepted Answer · 2011-04-17 02:09:01Z

1

If you need to detect these encodings on the fly, you can take a look at Mark Pilgrim's excellent open source Universal Encoding Detector.

#!/usr/bin/env python

import chardet 
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
print chardet.detect(message1)
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
print chardet.detect(message3)
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
print chardet.detect(str_data)

Output:

{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
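When chardet is not available and the set of possible encodings is small and known, a much simpler technique is trial decoding: try each candidate in order and keep the first one that decodes without error. The sketch below is written for Python 3 and is not how chardet works; the function name guess_encoding and the candidate order are assumptions for illustration. Unlike chardet it gives no confidence value, and strict encodings such as UTF-8 should be tried first because they reject most foreign byte sequences.

```python
def guess_encoding(data, candidates=('utf-8', 'shift_jis')):
    """Return the first candidate encoding that decodes data without error.

    Returns None if no candidate fits. This is plain trial decoding,
    not statistical detection.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None


message1 = b'\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
message3 = b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
str_data = b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

for sample in (message1, message3, str_data):
    print(guess_encoding(sample))
```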

edited Apr 17, 2011 at 2:09

answered Apr 17, 2011 at 1:40


Achim · Accepted Answer · 2011-04-16 08:55:25Z

0

I guess you have to learn more about the encoding of strings in general. A string in Python has no encoding information attached, so it's up to you to use it in the right way or convert it appropriately. Have a look at unicode strings, the encode / decode methods and the codecs module. Also check whether c.drawString accepts a unicode string, which might make your life much easier.
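As a rough illustration of that advice, here is a minimal Python 3 sketch that decodes bytes with a known encoding and writes the resulting text to a file with an explicit encoding. The helper name save_as_utf8 and the temporary file location are assumptions made for the example; in Python 2, io.open or codecs.open would play the role of the built-in open used here.

```python
import os
import tempfile


def save_as_utf8(raw, source_encoding, path):
    """Decode raw bytes with a known encoding and write them out as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> text, encoding stated explicitly
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)              # text -> UTF-8 bytes on disk
    return text


# Shift_JIS bytes taken from the question.
message1 = b'\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'

path = os.path.join(tempfile.mkdtemp(), 'test_doc.txt')
save_as_utf8(message1, 'shift_jis', path)

# Read the file back, again naming the encoding instead of relying on defaults.
with open(path, encoding='utf-8') as handle:
    round_trip = handle.read()
print(round_trip)
```

Naming the encoding at every boundary (decode, write, read) keeps the program independent of the platform's default encoding.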
