Here are a few example (unicode) strings:

a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'
b = u'\u010deprav so mu doma\u010di in strici duhovniki odtegovali denarno pomo\u010d . Kljub temu mu je uspelo'
c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'

My end goal is to split on the backslashes (and spaces), so that it looks like this:

split_a = ['u03c3', 'u03c4', 'u03b7', 'u03bd', '', 'u03a0', 'u03bb', 'u03b1', 'u03c4', 'u03b5', 'u03af', 'u03b1', '', 'u03c4', 'u03bf', 'u03c5']
split_b = ['', 'u010deprav', 'so', 'mu', 'doma', 'u010di', 'in', 'strici', 'duhovniki', 'odtegovali', 'denarno', 'pomo', 'u010d', '.', 'Kljub', 'temu', 'mu', 'je', 'uspelo']
split_c = ['sovi', 'xe9ticas', 'excepto', 'Georgia', ',', 'inclusive', 'las', '3',  'rep', 'xfablicas', 'que', 'hab', 'xedan']

(The empty places where there is both a space and a backslash are totally fine).

When I try to split using this:

a.split("\\"), it doesn't change the string at all.

I saw this example here, which makes me think that I need to make my strings literal strings (using r). However, I don't know how to convert my large list of strings into all literal strings.

When I searched on that, I got here. However, my compiler throws an error when I run a.encode('latin-1').decode('utf-8'). The error it throws is: "'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)".

So, my question is: How can I take a list of unicode strings, programmatically iterate through them and make them string literals, and then split on a backslash?

  • Python is an interpreted language, so the Python interpreter throws the error. Commented May 10, 2016 at 16:01
  • I think you're a bit above my level here, but thanks for the info! Commented May 10, 2016 at 16:05

2 Answers

You have a Unicode string, which already holds one Unicode code point per string element. The '\\' is just part of the representation of the string that is printed to the console; it's not in the actual contents.
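
A quick way to check this is to inspect the string directly. The following is only a minimal sketch using the first sample string from the question; it shows that the string has one element per character and contains no backslash, which is why splitting on one returns the string unchanged.

```python
# First sample string from the question: Greek letters plus two spaces.
a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'

# Each \uXXXX escape in the literal is a single character in the string.
print(len(a))                 # 16

# There is no backslash character in the contents ...
print(u'\\' in a)             # False

# ... so splitting on it yields a one-element list holding the original string.
print(a.split(u'\\') == [a])  # True
```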

Making a list of numbers out of it is actually quite easy:

split_a = [ord(c) for c in a]

If you need to make a bunch of strings consisting of the letter u followed by the hex value, that's only slightly more complicated:

split_a = ', '.join('u' + ('%04x' % ord(c)) for c in a)
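
Note that the one-liner above joins everything into a single comma-separated string and also converts spaces to u0020. If a list is wanted instead, with plain ASCII characters left alone, something like the sketch below might do. The helper name escape_non_ascii is made up for illustration, and the choice of "below 128" as the cutoff is an assumption about what should stay unescaped.

```python
def escape_non_ascii(text):
    """Hypothetical helper: return one list item per character.

    ASCII characters (code point < 128) are kept as-is; everything else
    becomes 'u' followed by the code point as four lowercase hex digits.
    """
    return [c if ord(c) < 128 else 'u%04x' % ord(c) for c in text]


a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'
split_a = escape_non_ascii(a)
print(split_a[:6])  # ['u03c3', 'u03c4', 'u03b7', 'u03bd', ' ', 'u03a0']
```

With this formatting, a character such as \xe9 comes out as 'u00e9' rather than 'xe9', and runs of ASCII letters are returned one character at a time, so it does not reproduce the word-level output asked for in the edited question.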

3 Comments

The second one solved my problem for my example above. I've edited my question to include some more sample unicode strings, let me know if you have a solution for those other types of strings.
Was just about to push submit on a similar solution, so I'll just add a follow up comment - you'd have to do a bit more work to only display the values for characters that are unknown encodings. Specifically, in the OP's example, rendering the space character as " ", vs. "u0020".
@python_in_trouble wow, that's a completely different problem now, much more complex.

You can use the unicode_escape codec to translate a unicode string to its escaped representation.

split_a = a.encode('unicode_escape').split('\\')

outputs:

['',
 'u03c3',
 'u03c4',
 'u03b7',
 'u03bd ',
 'u03a0',
 'u03bb',
 'u03b1',
 'u03c4',
 'u03b5',
 'u03af',
 'u03b1 ',
 'u03c4',
 'u03bf',
 'u03c5']
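
Some items in that output still carry a trailing space. To split on backslashes and spaces in one pass, a regular expression could be used on the escaped text. This is only a sketch: the helper name split_escaped is made up for illustration. On Python 3, encode returns bytes, so the result is decoded back to text before splitting; the same code should also run on Python 2.

```python
import re


def split_escaped(text):
    """Hypothetical helper: escape non-ASCII characters, then split.

    The text is first converted to its unicode_escape form (for example
    '\\u03c3' or '\\xe9'), decoded back to a text string, and then split
    on every backslash or space.
    """
    escaped = text.encode('unicode_escape').decode('ascii')
    return re.split(r'[\\ ]', escaped)


c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'
print(split_escaped(c))
# ['sovi', 'xe9ticas', 'excepto', 'Georgia', ',', 'inclusive', 'las', '3',
#  'rep', 'xfablicas', 'que', 'hab', 'xedan']
```

Where a space is directly followed by an escaped character, the result contains an empty string at that position, which the question says is acceptable.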

1 Comment

This worked for me if I then iterated through the split_a list and further split on " " (space).
