11

I want to read a file with data, coded in hex format:

01ff0aa121221aff110120...etc

The files contain more than 100,000 such bytes, some more than 1,000,000 (they come from DNA sequencing).

I tried the following code (and other similar):

filele=1234563
f=open('data.geno','r')
c=[]
for i in range(filele):
  a=f.read(1)
  b=a.encode("hex")
  c.append(b)
f.close()

This gives each byte separately ("aa", "01", "f1", etc.), which is perfect for me!

This works fine up to (in this case) byte number 905, which happens to be "1a". I also tried the ord() function, which stopped at the same byte.

Is there a simple solution?

  • When you say it stopped, did you get an exception, or what? Also to be clear, this is a binary file that you want to read as a sequence of hex encoded byte values? Commented Jan 8, 2016 at 23:05
  • If you're reading a binary file it is good practice to use 'rb' as your flags to open. Commented Jan 8, 2016 at 23:06
  • I can't come up with any reason this would fail assuming you're rendering the code accurately. Every discrete byte value (and the empty string for that matter) encodes as hex just fine for me (in Py2, the hex codec was removed from str.encode in Py3). Try it by itself for every possible character: for c in map(chr, range(256)): print c.encode('hex'). They all work. My answer optimizes to do most of the work at the C layer (in exchange for slightly higher peak memory usage), but your code as given can't break in any way that makes sense. Please give the exact exception or misbehavior. Commented Jan 8, 2016 at 23:28

4 Answers

32

Simple solution is binascii:

import binascii

# Open in binary mode (so you don't read two byte line endings on Windows as one byte)
# and use with statement (always do this to avoid leaked file descriptors, unflushed files)
with open('data.geno', 'rb') as f:
    # Slurp the whole file and efficiently convert it to hex all at once
    hexdata = binascii.hexlify(f.read())

This just gets you a str of the hex values, but it does it much faster than what you're trying to do. If you really want a bunch of length 2 strings of the hex for each byte, you can convert the result easily:

hexlist = map(''.join, zip(hexdata[::2], hexdata[1::2]))

which will produce the list of len 2 strs corresponding to the hex encoding of each byte. To avoid temporary copies of hexdata, you can use a similar but slightly less intuitive approach that avoids slicing by using the same iterator twice with zip:

hexlist = map(''.join, zip(*[iter(hexdata)]*2))
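On Python 3 the one-liners above need adjusting, because binascii.hexlify returns bytes (whose elements are ints) and map is lazy. A minimal sketch of the same idea for Python 3, assuming a list of two-character str objects is what you want; hex_pairs is a hypothetical helper name, not part of any library:

```python
import binascii

def hex_pairs(data):
    """Return the hex encoding of `data` as a list of two-character strings."""
    # hexlify returns bytes on Python 3; decode it to get an ordinary str
    hexdata = binascii.hexlify(data).decode('ascii')
    # Step through the string two characters (one byte) at a time
    return [hexdata[i:i + 2] for i in range(0, len(hexdata), 2)]

print(hex_pairs(b'\x01\xff\x0a\xa1'))  # ['01', 'ff', '0a', 'a1']
```

The slice-based comprehension makes one pass over the hex string and avoids the two strided copies that hexdata[::2] and hexdata[1::2] create.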

Update:

For people on Python 3.5 and higher, bytes objects spawned a .hex() method, so no module is required to convert from raw binary data to ASCII hex. The block of code at the top can be simplified to just:

with open('data.geno', 'rb') as f:
    hexdata = f.read().hex()
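
If the per-byte list is the goal, bytes.hex() also accepts a separator argument on Python 3.8 and later, so the split can be done without any slicing. A small sketch under that version assumption; read_hex_bytes is a hypothetical name used only for illustration:

```python
def read_hex_bytes(path):
    """Read a binary file and return one two-character hex string per byte."""
    with open(path, 'rb') as f:
        # hex(' ') puts a space between every byte (requires Python 3.8+),
        # and split() turns that into a list; an empty file gives []
        return f.read().hex(' ').split()
```

Calling read_hex_bytes('data.geno') would then return something like ['01', 'ff', '0a', ...] for the file in the question.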

Comments

3

Just an additional note to these: if you read the file in chunks inside a loop, make sure to break out once .read() returns no more data, or the loop will just keep going.

def HexView(filename):
    # Requires Python 3.5+ for bytes.hex()
    with open(filename, 'rb') as in_file:
        while True:
            hexdata = in_file.read(16).hex()     # I like to read 16 bytes in then new line it.
            if len(hexdata) == 0:                # breaks loop once no more binary data is read
                break
            print(hexdata.upper())               # I also like it all in caps.
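
The same chunked loop can be written without an explicit break by using the two-argument form of iter(), which calls a function repeatedly until it returns a sentinel value. This is only a sketch of an alternative; hex_lines and chunk_size are names chosen here for illustration, and io.BytesIO stands in for a real file opened in 'rb' mode:

```python
import io
from functools import partial

def hex_lines(stream, chunk_size=16):
    """Yield uppercase hex strings, one per chunk_size bytes read from a binary stream."""
    # iter(callable, sentinel) stops as soon as read() returns b'' (end of data)
    for chunk in iter(partial(stream.read, chunk_size), b''):
        yield chunk.hex().upper()

# 20 sample bytes produce one full 16-byte line and one 4-byte line
for line in hex_lines(io.BytesIO(bytes(range(20)))):
    print(line)
```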

Comments

2

If the file is encoded in hex format, shouldn't each byte be represented by 2 characters? So

c=[]
with open('data.geno','rb') as f:
    b = f.read(2)
    while b:
        c.append(b.decode('hex'))
        b=f.read(2)

or you can even do

with open('data.geno','rb') as f:
    c = list(f.read().decode('hex'))

For example, in Python 2.7.18 this works:

>>> list(b'404040'.decode('hex'))
['@', '@', '@']

This won't work in Python 3. In Python 3 you would use the codecs module:

import codecs
with open('data.geno','rb') as f:
    c = list(map(chr, codecs.decode(f.read(), 'hex')))

or (depending on whether you are looking for them as number or as characters)

import codecs
with open('data.geno','rb') as f:
    c = list(codecs.decode(f.read(), 'hex'))

because in Python 3,

>>> import codecs
>>> codecs.decode(b'404040', 'hex')
b'@@@'
>>> list(codecs.decode(b'404040', 'hex'))
[64, 64, 64]
>>> list(map(chr, codecs.decode(b'404040', 'hex')))
['@', '@', '@']

or even ''.join(map(chr, codecs.decode(f.read(), 'hex'))) if you want a string instead of a list.

>>> ''.join(map(chr, codecs.decode(b'404040', 'hex')))
'@@@'
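
Another Python 3 option for the same job is the built-in bytes.fromhex(), which takes a str of hex digits rather than bytes. A brief sketch, assuming the hex text has already been read as a str (for example from a file opened in text mode); decode_hex_text is a hypothetical helper name:

```python
def decode_hex_text(text):
    """Convert a string of hex digits into the raw bytes it represents."""
    # strip() drops a trailing newline or other surrounding whitespace first
    return bytes.fromhex(text.strip())

print(decode_hex_text('404040\n'))       # b'@@@'
print(list(decode_hex_text('01ff1a')))   # [1, 255, 26]
```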

3 Comments

The question's grammar is ambiguous; that opening sentence could also mean "I want to read the data and encode it as hex". The rest of the question states they want two character strings, which favors that interpretation. I'll admit it's rather confusing.
I understood the question the same way. +1
Seems that Python doesn't know what a hex decoding is: "'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs". I'm guessing that was a Python 2 thing?
0

Thanks for all interesting answers!

The simple solution that worked immediately was to change "r" to "rb", so:

f=open('data.geno','r')  # doesn't work
f=open('data.geno','rb')  # works fine
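
A likely explanation, assuming the file was read on Windows: in text mode the byte 0x1a (Ctrl-Z) is treated as an end-of-file marker, so reading stops at the first "1a". Binary mode passes every byte through unchanged. The sketch below only demonstrates the binary-mode round trip, using a temporary file and made-up sample bytes:

```python
import os
import tempfile

# Sample bytes including 0x1a and a CR/LF pair, both of which text mode may alter
payload = bytes([0x01, 0xff, 0x1a, 0x22, 0x0d, 0x0a, 0x33])

# Write the sample to a temporary file (opened in binary mode by default)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

# 'rb' returns the bytes exactly as stored, with no EOF or newline handling
with open(tmp.name, 'rb') as f:
    data = f.read()

os.remove(tmp.name)
print(data == payload)  # True
```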

The code in this case is actually only two binary bits per value, so one byte contains four data items, each one of the binary values 00, 01, 10, 11.
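
With four 2-bit values packed into each byte, the values can be pulled out with shifts and a mask. This is a sketch only: the order of the values within a byte is not stated here, so the code assumes the most significant pair comes first, and unpack_2bit is a hypothetical name:

```python
def unpack_2bit(data):
    """Split each byte into four 2-bit values (0..3), most significant pair first."""
    values = []
    for byte in data:
        # Assumed bit order: bits 7-6, then 5-4, then 3-2, then 1-0
        for shift in (6, 4, 2, 0):
            values.append((byte >> shift) & 0b11)
    return values

print(unpack_2bit(b'\x1b'))  # 0x1b is 00 01 10 11 in binary: [0, 1, 2, 3]
```

If the file format stores the pairs least significant first, reversing the shift tuple to (0, 2, 4, 6) would give that order instead.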

Yours!
