
I'm trying to read some fixed-width data from an IBM mainframe into Pandas. The fields are stored as a mix of EBCDIC text, binary numbers (e.g., 255 stored as 0xFF), and binary coded decimal (e.g., 255 stored as 0x0255). I know the field lengths and types ahead of time.

Can read_fwf deal with this kind of data? Are there better alternatives?

Example: I have an arbitrary number of records, each structured like the one below, that I'm trying to read in.

import tempfile

databin = 0xF0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908

#column 1 -- ten bytes, EBCDIC.  Should be 0315139925.
#column 2 -- four bytes, binary number.  Should be 180534149.
#column 3 -- ten characters, EBCDIC.  Should be 04/20/2018.
#column 4 -- twenty six characters, EBCDIC.  Should be 2018-03-23-15.45.54.592918.
#column 5 -- four bytes, packed binary coded decimal.  Should be 4908.  I know the scale ahead of time.

rawbin = databin.to_bytes((databin.bit_length() + 7) // 8, 'big') or b'\0'

with tempfile.TemporaryFile() as fp:
    fp.write(rawbin)

1 Answer

Most likely you will have to write some code to decode the data record by record; I think pandas is unlikely to handle it as it is. The decoding can be broken down into the steps below (the BCD part is shamelessly copied from "How to split a byte string into separate bytes in python"):

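def bcdDigits(chars):
    # chars: iterable of single-byte bytes objects, e.g. [b'\x49', b'\x08']
    for char in chars:
        char = ord(char)
        # Each byte holds two decimal digits: high nibble, then low nibble.
        for val in (char >> 4, char & 0xF):
            # A 0xF nibble marks the end of the digits.
            if val == 0xF:
                return
            yield val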


In [40]: B
Out[40]: b'\xf0\xf3\xf1\xf5\xf1\xf3\xf9\xf9\xf2\xf5\n\xc2\xbb\x85\xf0\xf4a\xf2\xf0a\xf2\xf0\xf1\xf8\xf2\xf0\xf1\xf8`\xf0\xf3`\xf2\xf3`\xf1\xf5K\xf4\xf5K\xf5\xf4K\xf5\xf9\xf2\xf9\xf1\xf8\x00\x00I\x08'

In [41]: import codecs

In [43]: codecs.decode(B[0:10], "cp500")
Out[43]: '0315139925'

In [44]: int.from_bytes(B[10:14], byteorder='big')
Out[44]: 180534149

In [45]: codecs.decode(B[14:24], "cp500")
Out[45]: '04/20/2018'

In [46]: codecs.decode(B[24:50], "cp500")
Out[46]: '2018-03-23-15.45.54.592918'

In [48]: list(bcdDigits([B[i: i+1] for i in range(50, 54)]))
Out[48]: [0, 0, 0, 0, 4, 9, 0, 8]

Note: for the last piece, if you want an integer in return:

In [63]: import numpy as np

In [64]: (list(bcdDigits([B[i: i+1] for i in range(50, 54)])) * (10 ** np.arange(8)[::-1])).sum()
Out[64]: 4908
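
Putting the pieces together, the record-by-record approach could be sketched roughly as below, using only the standard library. This is a sketch under assumptions, not a tested mainframe reader: the column names are made up for illustration, the binary field is assumed to be unsigned and big-endian, the BCD field is assumed to be unsigned with no sign nibble (as in the sample), and `scale` is the number of implied decimal places that you know ahead of time.

```python
import io
import struct
from decimal import Decimal

# Layout from the question: 10-byte EBCDIC, 4-byte big-endian binary number
# (assumed unsigned), 10-byte EBCDIC, 26-byte EBCDIC, 4-byte BCD.
RECORD = struct.Struct(">10sI10s26s4s")  # 54 bytes per record

# Hypothetical column names, chosen only for this illustration.
COLUMNS = ("id", "count", "date", "timestamp", "amount")


def bcd_to_int(raw):
    """Convert unsigned BCD bytes to int: each hex digit is one decimal digit.

    int() raises ValueError if a nibble is not 0-9 (e.g. a sign nibble).
    """
    return int(raw.hex())


def decode_record(raw, scale=0):
    """Decode one 54-byte record into a dict keyed by COLUMNS."""
    f1, f2, f3, f4, f5 = RECORD.unpack(raw)
    # Shift the decimal point left by `scale` places.
    amount = Decimal(bcd_to_int(f5)).scaleb(-scale)
    values = (f1.decode("cp500"), f2, f3.decode("cp500"),
              f4.decode("cp500"), amount)
    return dict(zip(COLUMNS, values))


def read_records(fp, scale=0):
    """Yield decoded records from a binary file object until it runs out."""
    while True:
        raw = fp.read(RECORD.size)
        if len(raw) < RECORD.size:
            break
        yield decode_record(raw, scale)


# The sample record from the question, repeated twice to mimic a file.
sample = bytes.fromhex(
    "F0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8"
    "F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F8"
    "00004908"
)
rows = list(read_records(io.BytesIO(sample * 2)))
```

`rows` is a list of dicts, so `pandas.DataFrame(rows)` should turn it into a DataFrame. If your packed fields do carry a sign nibble, `bcd_to_int` would need to strip and interpret it before converting.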

1 Comment

Ok, went ahead and came up with one. Binary data is tricky for code examples, but I did my best.
