I have a string of the form "text: u'\u0644'". How can I extract the inner unicode string in Python, i.e. end up with u'\u0644'?

When I use split(), I get "u'\\u0644'", which is a plain string rather than a unicode object.

  • If this is JSON, use json.loads to decode it to Python data structures. Commented Aug 24, 2014 at 14:45
  • No, it is not JSON; it is raw text. Commented Aug 24, 2014 at 14:46
  • Where did it come from? Likely there's an easier way to get the information out of it. Commented Aug 24, 2014 at 14:47
  • The text was crawled from Facebook, in case that helps. Commented Aug 24, 2014 at 14:49
  • 1) Facebook has an API, which will be easier than screen-scraping their HTML, and 2) the text you are getting from their HTML is likely valid JSON. I strongly suggest that you back up and reconsider how you are approaching this. Commented Aug 24, 2014 at 14:50

1 Answer

You can use ast.literal_eval() to safely evaluate the extracted text as a Python literal (Python 2 session shown):

>>> from ast import literal_eval

>>> s = "text: u'\u0644'"

>>> unicode_part = s.split(':')[-1].strip()
>>> unicode_part
"u'\\u0644'"

>>> unicode_string = literal_eval(unicode_part)
>>> unicode_string
u'\u0644'
>>> print unicode_string
ل
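
For readers on Python 3, the same idea can be sketched as a small function. This is only an illustration, not part of the original answer: the name extract_literal is hypothetical, and it uses str.partition instead of split so that only the first ':' acts as the delimiter. The raw string stands in for text read from a file or a web page, where the backslash is a literal character.

```python
from ast import literal_eval


def extract_literal(line):
    """Return the Python literal that follows the first ':' in `line`.

    Hypothetical helper for illustration; assumes a "key: literal" layout.
    """
    # partition() splits on the first ':' only, so a colon inside the
    # value itself is left alone.
    _key, _sep, value = line.partition(":")
    # literal_eval() only accepts literals (strings, numbers, tuples, ...),
    # so it does not execute arbitrary code the way eval() would.
    return literal_eval(value.strip())


# Raw string: the text contains a backslash followed by "u0644".
s = r"text: u'\u0644'"
result = extract_literal(s)

# ascii() keeps the output readable on terminals without Arabic support.
print(ascii(result))
print(len(result))
```

In Python 3 the u prefix is accepted but has no effect, so the result is an ordinary str holding the single character U+0644.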
2 Comments

Why not just split on whitespace?
You can, and originally I did. But the string looks like some sort of key/value pair in which the key and value are delimited by a colon, hence splitting on ":". If you could be certain there was always a space, you could split on the space, or even on ": " if there was always exactly one space, and avoid the .strip(). Splitting on ":" and stripping is probably more robust.
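
The trade-off discussed in these comments can be checked with a short Python 3 sketch. The helper names and the sample lines are assumptions made for illustration; they only vary the spacing after the colon.

```python
from ast import literal_eval


def value_after_colon(line):
    # Hypothetical helper: text after the last ':' with whitespace trimmed.
    return line.split(":")[-1].strip()


def value_after_whitespace(line):
    # Hypothetical helper: the last whitespace-separated token.
    return line.split()[-1]


def try_eval(text):
    # Return the evaluated literal, or None if `text` is not a valid literal.
    try:
        return literal_eval(text)
    except (SyntaxError, ValueError):
        return None


samples = [
    r"text: u'\u0644'",    # one space after the colon
    r"text:u'\u0644'",     # no space after the colon
    r"text:   u'\u0644'",  # several spaces after the colon
]

for line in samples:
    by_colon = try_eval(value_after_colon(line))
    by_space = try_eval(value_after_whitespace(line))
    print(ascii(by_colon), ascii(by_space))
```

With no space after the colon, the whitespace split returns the whole line, which is not a valid literal, while the colon split still isolates the value. The colon split has its own limit: it would cut a value that itself contains ":".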
