0

I'm a student learning Python and Scrapy (a crawler).

I want to convert a unicode string to str in Python. But this is not an ordinary unicode string: its content is Unicode escape sequences. Please see the code below.

# python 2.7
...
print(type(name[0]))
print(name[0])
print(type(keyword_name_temp))
print(keyword_name_temp)
...

When I run the script above, I see the following on the console:

$ <type 'unicode'>
$ 서용교 ## these words are Korean characters
$ <type 'unicode'>
$ u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'

I want to see keyword_name_temp as Korean, but I don't know how to do it.

I got the name list and keyword_name_temp from HTML code with an HTTP request.

The name list was originally in string format.

keyword_name_temp was originally in unicode format.

Can anybody help me?

3 Answers

1

u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4' contains real backslashes, each followed by u and four hex digits, not the actual Unicode characters U+C9C0 etc. that are commonly written with the \u escape sequence. (Backslash is the escape character in Python string literals, so the interpreter shows a real backslash inside a string as \\.) Does that string happen to come from a JSON object, perhaps?

You can construct a JSON string out of it and use json.loads() to transform it into a unicode string:

Example in Python 2.7:

>>> s1 = u'서용교'
>>> type(s1)
<type 'unicode'>
>>> s1
u'\uc11c\uc6a9\uad50'
>>> print(s1)
서용교
>>> 
>>> 
>>> s2 = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
>>> type(s2)
<type 'unicode'>
>>>
>>> # put that unicode string between double-quotes
>>> # so that json module can interpret it
>>> ts2 = u'"%s"' % s2
>>> ts2
u'"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
>>>
>>> import json
>>> json.loads(ts2)
u'\uc9c0\ubc29\uc790\uce58\ub2e8\uccb4'
>>> print(json.loads(ts2))
지방자치단체
>>> 

Another option is to turn it into a string literal and evaluate it with ast.literal_eval():

>>> import ast
>>>
>>> # construct a string literal, with the 'u' prefix
>>> s2_literal = u'u"%s"' % s2
>>> s2_literal
u'u"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
>>> print(ast.literal_eval(s2_literal))
지방자치단체
>>> 
>>> # also works with single-quotes string literals
>>> s2_literal2 = u"u'%s'" % s2
>>> s2_literal2
u"u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'"
>>> 
>>> print(ast.literal_eval(s2_literal2))
지방자치단체
>>> 
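A third route, not part of the answer above, is the built-in unicode_escape codec. This is only a minimal sketch: unescape_unicode is a hypothetical helper name, and it assumes the input holds nothing but ASCII characters (backslashes, u, hex digits and the like), otherwise the encode step raises an error. It should behave the same on Python 2.7 and Python 3.

```python
def unescape_unicode(text):
    """Turn literal backslash-u sequences into the characters they name.

    Hypothetical helper; assumes `text` contains ASCII characters only.
    """
    # encode() yields ASCII bytes; the unicode_escape codec then
    # interprets each \uXXXX sequence found in those bytes.
    return text.encode('ascii').decode('unicode_escape')


s2 = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
result = unescape_unicode(s2)

# The escaped form is 36 characters long, the decoded form only 6.
print(len(s2))
print(len(result))
```

Printing result on a terminal that can display Korean should show the same text as the json.loads() version.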

1

The simplest solution would be to switch to Python 3, where strings are Unicode by default.
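Note that switching alone probably does not fix the question's value: a string that contains literal backslashes still contains them in Python 3. A rough sketch for Python 3, reusing the json.loads() idea from the other answer (the variable names here are only for illustration):

```python
# Python 3: str is Unicode text, so no u'' prefix is needed.
import json

# Same content as keyword_name_temp in the question: real backslashes.
escaped = '\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'

print(type(escaped).__name__)  # the type is plain str in Python 3
print(len(escaped))            # six groups of backslash + u + 4 hex digits

# Wrap the text in double quotes so the json module reads it as a JSON string.
decoded = json.loads('"%s"' % escaped)
print(len(decoded))            # six actual characters
```

After this, print(decoded) should display the Korean text directly, provided the terminal encoding supports it.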


0

Your string is unicode, and if you know the encoding (UTF-8, for example), you can try:

print name[0].decode("utf-8")
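One caveat, as far as I can tell: decode() is meant for byte strings. In Python 2.7, calling it on a value that is already unicode first encodes it implicitly as ASCII, which fails for Korean characters. A small sketch of a guard, where to_text is a hypothetical helper name and UTF-8 is an assumed encoding:

```python
def to_text(value, encoding='utf-8'):
    """Return text: decode byte strings, pass unicode text through as-is.

    Hypothetical helper; the default encoding is an assumption.
    """
    # `bytes` is an alias for `str` on Python 2.7 and the real byte type
    # on Python 3, so this check works on both.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b'\xec\x84\x9c\xec\x9a\xa9\xea\xb5\x90'  # UTF-8 bytes of the name in the question
text = to_text(raw)

print(len(raw))   # nine bytes
print(len(text))  # three characters
```

With this guard, name[0] from the question (already unicode) would pass through unchanged instead of raising an error.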
