4

I have the following string "◣⛭◣◃✺▲♢" and I want to turn it into "\u25E3\u26ED\u25E3\u25C3\u273A\u25B2\u2662", exactly as this site does: https://mothereff.in/js-escapes

I was wondering if this is possible in Python. I have tried a lot of things from the Python Unicode docs but failed miserably.

Example of what I tried before:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

f = open('js.js', 'r').read()

print(ord(f[:1]))

Help would be appreciated!

    try u"◣⛭◣◃✺▲♢".encode('unicode-escape') Commented Feb 13, 2016 at 18:28

2 Answers

4

Assuming you're using Python 3:

unicode_string = "◣⛭◣◃✺▲♢"
# encode() returns bytes; characters that ASCII cannot represent become backslash escapes
byte_string = unicode_string.encode('ascii', 'backslashreplace')
print(byte_string)

See the codecs module documentation for more information.
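
Since encode() returns a bytes object, printing it in Python 3 shows a b'...' literal with doubled backslashes. A minimal sketch, assuming Python 3, of getting an ordinary str back and of what happens to a character outside the Basic Multilingual Plane (the variable names escaped and astral are only illustrative):

```python
unicode_string = "◣⛭◣◃✺▲♢"

# Decode the ASCII-only bytes back into a str so it prints without the b'...' wrapper
escaped = unicode_string.encode("ascii", "backslashreplace").decode("ascii")
print(escaped)  # \u25e3\u26ed\u25e3\u25c3\u273a\u25b2\u2662

# A character above U+FFFF gets Python's 8-digit \U form, which is not a JavaScript escape
astral = "\U0001F4A9".encode("ascii", "backslashreplace").decode("ascii")
print(astral)  # \U0001f4a9
```

Note that the hex digits come out in lowercase, whereas the site linked in the question prints them in uppercase; both forms are valid in JavaScript.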

However, for JavaScript notation there is a dedicated module, json, which achieves the same thing:

import json
unicode_string = "◣⛭◣◃✺▲♢"
json_string = json.dumps(unicode_string)
print(json_string)
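
A short sketch, assuming Python 3, of what json.dumps actually returns here. The result is wrapped in double quotes, so the slice below strips them to leave only the escapes; that slice is safe for this string, but a string containing quotes or backslashes would have those escaped as well. The names quoted and bare are only illustrative.

```python
import json

unicode_string = "◣⛭◣◃✺▲♢"

# ensure_ascii=True is the default, so every non-ASCII character becomes \uXXXX
quoted = json.dumps(unicode_string)

# json.dumps returns a quoted JSON string literal; drop the surrounding quotes
bare = quoted[1:-1]
print(bare)  # \u25e3\u26ed\u25e3\u25c3\u273a\u25b2\u2662

# Outside the Basic Multilingual Plane, json.dumps emits a UTF-16 surrogate pair
pile = json.dumps("\U0001F4A9")
print(pile)  # "\ud83d\udca9"
```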

1 Comment

+1 for json.dumps: use the right escaper for the job. Python's unicode-escape is not the same syntax as JSON/JavaScript (it'll fail for characters outside the Basic Multilingual Plane: Python will say \U0001f4a9 where JS wants \uD83D\uDCA9)
0

If you're on Python 2, then I suspect you're getting something like this:

>>> s = "◣⛭◣◃✺▲♢"
>>> s[0]
'\xe2'

To get at the Unicode code points in a UTF-8 encoded file (or buffer), you need to decode it into a Python unicode object first (otherwise you'll see the individual bytes that make up the UTF-8 encoding).

>>> s_utf8 = s.decode('utf-8')
>>> s_utf8[0]
u'\u25e3'
>>> ord(s_utf8[0])
9699
>>> hex(ord(s_utf8[0]))
'0x25e3'
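
For comparison, a rough Python 3 sketch of the same bytes-versus-text distinction. In Python 3 a str literal is already Unicode, so the bytes only show up if you encode explicitly or read a file in binary mode; the names raw and text are only illustrative.

```python
# Encode to UTF-8 to see the raw bytes that a file on disk would contain
raw = "◣⛭◣◃✺▲♢".encode("utf-8")
print(raw[0])  # 226, i.e. 0xe2, the first byte of the three-byte sequence for U+25E3

# Decoding turns the bytes back into code points
text = raw.decode("utf-8")
print(ord(text[0]))       # 9699
print(hex(ord(text[0])))  # 0x25e3
```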

In your case, you can go straight from the ord() to a literal unicode escape with something like this:

>>> "\\u%04x" % ord(s_utf8[0])
'\\u25e3'

Or convert the entire string in one go with a list comprehension:

>>> ''.join(["\\u%04x" % (ord(c)) for c in s_utf8])
'\\u25e3\\u26ed\\u25e3\\u25c3\\u273a\\u25b2\\u2662'

Of course, when you do the conversion this way, every character in the string is turned into an escape. You'll have to decide which code points to escape, or the ABCs will be escaped too:

>>> ''.join(["\\u%04x" % (ord(c)) for c in u"ABCD"])
'\\u0041\\u0042\\u0043\\u0044'
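
One way to make that decision, sketched here for Python 3: leave ASCII alone, escape everything else, and split code points above U+FFFF into a UTF-16 surrogate pair, since that is what JavaScript's \u notation expects. The function name js_escape is hypothetical, and the sketch does not escape quotes, backslashes, or control characters.

```python
def js_escape(text):
    """Escape non-ASCII characters as JavaScript-style \\uXXXX sequences."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII passes through unchanged
            parts.append(ch)
        elif cp <= 0xFFFF:
            # Basic Multilingual Plane: a single four-digit escape
            parts.append("\\u%04X" % cp)
        else:
            # Above the BMP: split into a UTF-16 high/low surrogate pair
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            parts.append("\\u%04X\\u%04X" % (high, low))
    return "".join(parts)


print(js_escape("ABCD"))     # ABCD
print(js_escape("◣⛭◣◃✺▲♢"))  # \u25E3\u26ED\u25E3\u25C3\u273A\u25B2\u2662
```

The %04X format gives uppercase hex digits, matching the output shown in the question.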

Or just use georg's suggestion and let Python figure all that out for you.

3 Comments

This will fail for characters outside the Basic Multilingual Plane (on wide builds, including all Python 3.3+): ord(c) can take more than four hex digits.
If the target here is JavaScript, it probably doesn't matter. JS's "\u" escapes would require surrogate pairs outside the BMP, and this method won't make them. At that point you should be using json.dumps, i.e.: json.dumps("𐌀𐌁𐌂") -> "\ud800\udf00\ud800\udf01\ud800\udf02"
I.e., what you said on your comment to @nikita's answer. :)
