6

I need to load the third column of this text file as a hex string:

http://www.netmite.com/android/mydroid/1.6/external/skia/emoji/gmojiraw.txt

>>> open('gmojiraw.txt').read().split('\n')[0].split('\t')[2]
'\\xF3\\xBE\\x80\\x80'

How do I open the file so that I get the third column as a hex string, like this:

'\xF3\xBE\x80\x80'

I also tried binary mode and hex mode, with no success.

5 Answers

7

You can:

  1. Remove the \x-es
  2. Use .decode('hex') on the resulting string

Code:

>>> '\\xF3\\xBE\\x80\\x80'.replace('\\x', '').decode('hex')
'\xf3\xbe\x80\x80'

Note how the backslashes are interpreted. When the string representation is '\xf3', it is a single-byte string with the byte value 0xF3. When it is '\\xf3', which is your input, it is a string of 4 characters: \, x, f and 3.
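
As a rough Python 3 counterpart to the two steps above (where str.decode('hex') is not available), the same idea can be sketched with bytes.fromhex. The helper name hex_escapes_to_bytes is made up for illustration, and the sketch assumes the text consists only of \xNN escapes:

```python
def hex_escapes_to_bytes(text):
    # Drop every literal backslash-x pair, leaving only hex digits ("F3BE8080").
    hex_digits = text.replace('\\x', '')
    # bytes.fromhex turns each pair of hex digits into one byte.
    return bytes.fromhex(hex_digits)

raw = '\\xF3\\xBE\\x80\\x80'      # 16 characters, as read from the file
data = hex_escapes_to_bytes(raw)
print(data)                       # b'\xf3\xbe\x80\x80'
print(len(data))                  # 4
```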

4 Comments

Wow, thanks, that worked. Stack Overflow is not allowing me to accept that as an answer right now!
@kevin: I'm not sure why that would be, but don't hurry. People may come up with better answers than this. You can always accept it later (i.e. in a couple of days).
It said I have to wait at least 10 minutes before accepting an answer. OK, I will wait to accept it, but I doubt any other answer can better this.
decode('hex') doesn't work in Python 3, but if you need a Python 2 answer this is a good one.
7

Quick'n'dirty reply

your_string.decode('string_escape')

>>> a='\\xF3\\xBE\\x80\\x80'
>>> a.decode('string_escape')
'\xf3\xbe\x80\x80'
>>> len(_)
4
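
The 'string_escape' codec exists only in Python 2. One possible Python 3 sketch, assuming the input is pure ASCII, goes through 'unicode_escape' and then Latin-1, so that each code point below 256 maps back to a single byte:

```python
a = '\\xF3\\xBE\\x80\\x80'

# unicode_escape interprets the \xNN escapes and yields the code points
# U+00F3, U+00BE, U+0080, U+0080 as a 4-character str.
text = a.encode('ascii').decode('unicode_escape')

# Latin-1 maps code points 0-255 one-to-one onto bytes 0-255.
data = text.encode('latin-1')

print(data)       # b'\xf3\xbe\x80\x80'
print(len(data))  # 4
```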

Bonus info

>>> u='\uDBB8\uDC03'
>>> u.decode('unicode_escape')

Some trivia

Interestingly, I have Python 2.6.4 on Ubuntu Karmic Koala (sys.maxunicode==1114111) and Python 2.6.5 on Gentoo (sys.maxunicode==65535). On Ubuntu, the unicode_escape decode result is \uDBB8\uDC03, and on Gentoo it's u'\U000fe003', both correctly of length 2. Unless it's something fixed between 2.6.4 and 2.6.5, I'm impressed that the Gentoo build, which uses 2 bytes per Unicode character, reports the correct character.
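
The two spellings name the same character: \uDBB8\uDC03 is a UTF-16 surrogate pair for U+FE003. A small sketch of the standard UTF-16 arithmetic (the function name combine_surrogates is only illustrative):

```python
def combine_surrogates(high, low):
    # Standard UTF-16 decoding: each surrogate carries 10 bits of the
    # code point, and the result is offset by 0x10000.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xDBB8, 0xDC03)
print(hex(code_point))  # 0xfe003
```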

1 Comment

The \Uxxxxxxxx vs \uxxxx\uxxxx difference comes from a build-time option (narrow vs. wide Unicode builds), which predates Python 2.6. In "narrow builds", code points outside the BMP are represented as UTF-16 surrogate pairs. See, tangentially, issue #1477.
5

If you are using Python 2.6+, here is a safe alternative to eval:

>>> from ast import literal_eval
>>> item='\\xF3\\xBE\\x80\\x80'
>>> literal_eval("'%s'"%item)
'\xf3\xbe\x80\x80'
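
On Python 3 the same call returns a 4-character str, not bytes. A hedged variant, assuming the text contains no unescaped single quote or newline, wraps it in a bytes literal instead:

```python
from ast import literal_eval

item = '\\xF3\\xBE\\x80\\x80'

# Wrapping the text in b'...' makes literal_eval parse it as a bytes literal.
data = literal_eval("b'%s'" % item)
print(data)  # b'\xf3\xbe\x80\x80'

# Unescaped ASCII characters pass through unchanged.
mixed = literal_eval("b'%s'" % 'hello\\x00world')
print(mixed)  # b'hello\x00world'
```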

1 Comment

+1: For Python 3 support, plus I like how this also works if not all of the bytes are escaped, for example it will convert 'hello\\x00world' just fine.
1

After stripping out the "\x" as in Eli's answer, you can parse the remaining hex digits as an integer:

int("F3BE8080",16)
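
Note that this yields an integer, not a byte string. If the bytes are wanted, a Python 3 sketch could convert the integer back with int.to_bytes, assuming the value is 4 bytes long with the most significant byte first:

```python
value = int("F3BE8080", 16)   # an int, not a byte string

# Assumption: 4 bytes, big-endian, matching the order of the hex digits.
data = value.to_bytes(4, byteorder='big')
print(data)  # b'\xf3\xbe\x80\x80'
```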

0

If you trust the source, you can use eval('"%s"' % data)
