I've got a variable:

age_expectations = dictionary['looking_for']['age']
print type(age_expectations), age_expectations

The output is:

<type 'unicode'> 22‑35

When I try to split it on the dash, I run into the following problem:

res = age_expectations.split('-')
print res

And the output looks like:

[u'22\u201135']

Instead of:

["22", "35"]

What is the problem? I've tried many combinations of encoding and decoding, but I don't really understand how they work. Does the problem come from the split?

  • Can you give more details? Are you reading from a file? Commented Dec 6, 2016 at 12:57
  • Yes, I'm reading from a file that contains data crawled from a website. Commented Dec 6, 2016 at 13:04

2 Answers

Split the unicode string on the character it actually contains, U+2011 (non-breaking hyphen), rather than on the ASCII '-':

>>> u_code = u'\u0032\u0032\u2011\u0033\u0035'
>>> print u_code
22‑35
>>> u_code.split('-')
[u'22\u201135']
>>> u_code.split(u'\u2011')
[u'22', u'35']
>>>
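
If the crawled data might use other dash-like characters as well, one option is to split on a small set of them with a regular expression. The following is only a sketch: the helper name `split_on_dash` and the particular set of characters are assumptions, not something taken from the question. It should behave the same on Python 2 and Python 3.

```python
import re

# Assumed set of dash-like characters (an illustrative choice):
# U+002D hyphen-minus, U+2010 hyphen, U+2011 non-breaking hyphen,
# U+2012 figure dash, U+2013 en dash, U+2014 em dash, U+2212 minus sign.
DASHES = re.compile(u'[\u002D\u2010\u2011\u2012\u2013\u2014\u2212]')


def split_on_dash(text):
    """Hypothetical helper: split text on any character in DASHES."""
    return DASHES.split(text)


parts = split_on_dash(u'22\u201135')
```

Here `parts` holds the two pieces on either side of the non-breaking hyphen, and the same call also handles a plain `u'22-35'`.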


As you can see from your output, the hyphen in your age_expectations variable is the Unicode character U+2011 (non-breaking hyphen), not the standard ASCII "-" (U+002D, hyphen-minus). You would have seen this from the start if you had printed the variable's representation instead:

>>> uu = u"22\u201135"
>>> print uu
22‑35
>>> print repr(uu)
u'22\u201135'
>>> 

So you need to either replace the u"\u2011" character with a plain hyphen before splitting (if your data may contain either kind), or simply split the string on u"\u2011" (if you're sure you'll always get this character as the delimiter).
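
A minimal sketch of the first option, replacing the non-breaking hyphen and then splitting on the plain one. The function name `parse_age_range` and the conversion to integers are illustrative assumptions; the question only asks for the two string parts.

```python
def parse_age_range(text):
    """Hypothetical helper: turn u'22\u201135' or u'22-35' into (22, 35)."""
    # Normalize the non-breaking hyphen (U+2011) to the ASCII hyphen-minus.
    normalized = text.replace(u'\u2011', u'-')
    # Assumes exactly one delimiter; anything else raises ValueError.
    low, high = normalized.split(u'-')
    return int(low), int(high)


age_range = parse_age_range(u'22\u201135')
```

Because the replacement happens first, the same code path handles both delimiters, at the cost of one extra pass over the string.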

1 Comment

If it's a non-breaking hyphen it might even be replaced by an empty string, but that depends on the context.
