
I have a byte stream that looks like this: '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

str_data is written to a text file using the following code:

with open("test_doc", "w") as f:
    f.write(str_data)

If test_doc is opened in a web browser and the character encoding is set to Japanese, it displays fine.

I am using reportlab to generate a PDF, with the following code:

from reportlab.pdfbase import pdfmetrics
from reportlab.pdfgen.canvas import Canvas
from reportlab.pdfbase.cidfonts import CIDFont


pdfmetrics.registerFont(CIDFont('HeiseiMin-W3','90ms-RKSJ-H'))
pdfmetrics.registerFont(CIDFont('HeiseiKakuGo-W5','90ms-RKSJ-H'))
c = Canvas('test1.pdf')
c.setFont('HeiseiMin-W3-90ms-RKSJ-H', 6)

message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'

message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'

c.drawString(100, 675, message1)
c.save()

Here I use the message1 variable, which produces Japanese output. I need to use message3 instead of message1 to generate the PDF, but message3 produces garbage, probably because of the wrong encoding.

  • Could you rephrase the question? I am not sure what you are asking for. Commented Apr 16, 2011 at 8:47

3 Answers


Here is an answer:

message1 is encoded in shift_jis; message3 and str_data are encoded in UTF-8. All appear to represent Japanese text. See the following IDLE session:

>>> message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
>>> print message1.decode('shift_jis')
これは平成明朝です。
>>> message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
>>> print message3.decode('UTF-8')
テスト
>>> str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
>>> print str_data.decode('UTF-8')
日本語
>>> 

Google Translate detects the language as Japanese and translates them to the English "This is the Heisei Mincho.", "Test", and "Japanese" respectively.

What is the question?
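
If the underlying question is how to get message3 into the same form as message1, one possible approach is to transcode it from UTF-8 to Shift_JIS before drawing. The sketch below is written for Python 3 (byte literals, print function) rather than the Python 2 used above, and utf8_to_sjis is a hypothetical helper name. It assumes the canvas expects Shift_JIS bytes for a font registered with 90ms-RKSJ-H, as the working message1 suggests; that assumption has not been checked against any particular reportlab version.

```python
def utf8_to_sjis(data):
    """Re-encode UTF-8 bytes as Shift_JIS bytes (hypothetical helper)."""
    text = data.decode('utf-8')        # bytes -> unicode text
    return text.encode('shift_jis')    # unicode text -> Shift_JIS bytes


message3 = b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'   # UTF-8 bytes from the question
sjis_message = utf8_to_sjis(message3)

# Round-trip check: decoding the new bytes as Shift_JIS gives back the same text.
print(sjis_message.decode('shift_jis') == message3.decode('utf-8'))
```

The resulting sjis_message could then be passed to c.drawString in place of message1. Text containing characters with no Shift_JIS equivalent would raise UnicodeEncodeError at the encode step.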


2 Comments

Oh, I forgot to use print to check the result, my bad. In fact I had tried decode('UTF-8'). Thanks for the help.
@drewk: I'm not a mind reader. Even after the OP's response, I don't have a clue what his problem was.

If you need to detect these encodings on the fly, you can take a look at Mark Pilgrim's excellent open source Universal Encoding Detector (the chardet module).

#!/usr/bin/env python

import chardet 
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
print chardet.detect(message1)
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
print chardet.detect(message3)
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
print chardet.detect(str_data)

Output:

{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
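
If chardet is not installed, a much cruder fallback is to try a short list of candidate encodings in order and keep the first one that decodes without error. This is not how chardet works; it is only a minimal sketch, written for Python 3, and guess_encoding and its candidate list are hypothetical. Order matters here: strict UTF-8 is tried first because it rejects most non-UTF-8 input, whereas UTF-8 bytes can sometimes also decode as Shift_JIS without error.

```python
def guess_encoding(data, candidates=('utf-8', 'shift_jis')):
    """Return the first candidate encoding that decodes data cleanly, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue   # not valid in this encoding, try the next one
        return name
    return None


message1 = b'\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
message3 = b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
str_data = b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

for sample in (message1, message3, str_data):
    print(guess_encoding(sample))
```

A clean decode only shows that the bytes are valid in that encoding, not that the encoding is the intended one, so this is best limited to a small set of encodings you already expect.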


I guess you have to learn more about the encoding of strings in general. A string in Python has no encoding information attached, so it's up to you to use it in the right way or convert it appropriately. Have a look at unicode strings, the encode / decode methods and the codecs module. And check whether c.drawString also accepts a unicode string, which might make your life much easier.
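
To make the encode / decode advice concrete, here is a minimal sketch, written for Python 3, where text (str) and bytes are separate types; in Python 2 the same idea applies to unicode versus str. It decodes the bytes once, works with text, and names the encoding explicitly when writing the file. The temporary directory and the choice of Shift_JIS as the output encoding are only for illustration.

```python
import io
import os
import tempfile

str_data = b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'   # UTF-8 bytes from the question
text = str_data.decode('utf-8')                      # decode once, at the boundary

# Write the text back out, naming the target encoding explicitly.
path = os.path.join(tempfile.mkdtemp(), 'test_doc')
with io.open(path, 'w', encoding='shift_jis') as f:
    f.write(text)

# Read the raw bytes back to confirm they decode to the same text.
with io.open(path, 'rb') as f:
    raw = f.read()
print(raw.decode('shift_jis') == text)
```

Keeping text decoded internally and encoding only when writing means the target encoding is chosen in one place instead of being implied by whatever bytes happen to be in a variable.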
