Extract Text from a Binary File (using Python 2.7 on Windows 7)

Question

I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..

This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).

The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.

for example: assume ^@ ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.

^@^@^@^@^@^@^@^@^@^@^@BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
 ^@^@^@^@^@^@^@^@^@^@^@TTBBTT^X^X^X^X^X^X^X^X^X
   ^X^X^X blah blah blah... of control characters.. and then the message comes..
   MVT MESSAGE 2
   ED1123
   etc.

and so on.. for several messages.

Using Perl.. it is easy to do:

while (<>) {
  use regular expression to split messages
  m/   /


}

How would one do this in python easily..

How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.

In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.

Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.

Thanks.

Update 02Jan after reading your answers/comments

After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.

import fileinput
import sys
import re

strfile = r'C:\Users\' \
          r'\Learn\python\mvt\sitatex_test.msgs'

f = open(strfile, 'rb')

contents = f.read() # read whole file in contents

#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)

for msg in msgs:
    #loop through msgs.. to find the first msg then next and so on.
    print "## NEW MESSAGE STARTS HERE ##"

    #for each msg split the lines.. to read line by line
    # stored as list in msglines
    msglines = msg.splitlines()
    line = 0
#then process each msgline with a message
for msgline in msglines:
    line += 1
    #msgline = re.sub(r'[\x00]+', r' ', msgline)
    mystr = msgline
    print mystr
    textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)

So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pickup (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.

Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.

For example.. say I have this: assume ^@ and ^X are some control characters line1 = '^@UG^@^@^@^@^@^@^@^@^@^@BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).

When I print the line as it is in IDLE (print line1), it prints only the first 2 or 3 characters and ignores the rest, because it chokes on the control characters.

However, when I print with this: print re.findall(r'.*', line1)

['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
 002 010 180000 DEC 11', '']

It prints nicely with all the control characters converted to \xHH format.. and ascii text intact.. (just as I want it)..with one catch.. the list has two items.. with '' in the end.

What is the explanation for the empty string in the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.

Is re.findall(r'.*', line1) the only easy way to do this conversion.. or is there any other straightforward method.. to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace).

Also.. any other useful comments to get the lines out nicely.

Thanks.

Update 2Jan2012 - Part 2

I have found out that re.findall(r'.+', line1) strips to

['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
    x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
     002 010 180000 DEC 11']

without the extra blank '' item in the list. I found this after a lot of trial and error.

Still I will need assistance to eliminate the list altogether but return just a string. like this:

'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'

Added Info on 05Jan:

@John Machin

1) \xaaU is the delimiter between messages.. In the example.. I may have just left out in the samples. Please see below for one actual message that ends with \xaaU (but left out). Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')

I am trying to understand the binary format.. this is a typical message which is sent out the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.

The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.

As soon as all the.. \x00000000 finishes.. the sita header and subject starts. In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested in all the stuff between \x00.

and the body comes next.. after the final \x00 or any other control character... In this case... it is:

COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128 BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE PACKING LEADING TO \r\n SPACE PROBLEM

once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.

This is one complete message.

"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00 \x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02 \x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00 
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001 \x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128 BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE PACKING LEADING TO \r\n SPACE PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
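The layout described above can be sketched roughly as follows. This is only a guess built from the one sample message, not a description of the real SITATEX format: the address pattern (7 upper-case letters or digits, preceded by \x00 or \x01 and followed by \x00), the "longest readable run is the body" heuristic, and the names ADDRESS, TEXT_RUN and parse_message are all assumptions made for illustration. The header fields (QN, HX, 180001) are left out. The code uses bytes literals so it behaves the same on Python 2.7 and Python 3.

```python
import re

# Assumption from the single sample: an address is 7 upper-case letters or
# digits, preceded by \x00 or \x01 and followed by \x00.
ADDRESS = re.compile(br'[\x00\x01]([A-Z0-9]{7})(?=\x00)')
# Printable ASCII plus CR and LF, so a multi-line body stays in one piece.
TEXT_RUN = re.compile(br'[\r\n\x20-\x7e]+')


def parse_message(msg, own_address=b'JJJOWXH'):
    addresses = ADDRESS.findall(msg)
    sender = addresses[0] if addresses else None
    runs = TEXT_RUN.findall(msg)
    # Heuristic: treat the longest readable run as the message body.
    body = max(runs, key=len) if runs else b''
    return {
        'direction': 'SND' if sender == own_address else 'RCV',
        'sender': sender,
        'receivers': addresses[1:],
        'body': body,
    }


# Shortened, hand-made message in the style of the sample above.
msg = (b'\xaaU\x1c\x04\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH'
       b'\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x00QN\x00HX\x00180001 '
       b'\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\n SPACE PROBLEM\x00D-\xedN\x00')

parsed = parse_message(msg)
print(parsed['direction'], parsed['sender'], parsed['receivers'])
```

A real file would probably need the exceptions mentioned for 'RCV' messages handled separately, so treat the result as a starting point for inspection with repr rather than a finished parser.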

2) I am not using .+ anymore after the 'repr' is known.

3) each Message is multiline.. and i need to preserve all the control characters to make some sense of this proprietary format.. that is why i needed repr to see it up close.

Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.

Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.

Thank you.

Discard the perl. What have you tried so far in Python? We don't know how much you know, what you've tried or what confuses you. This is pretty trivial stuff (really). Python uses the re library package, for example, to do what Perl does with the m/.../ operator. — S.Lott
– S.Lott, Commented Dec 27, 2011 at 16:58
I have tried all.. findall.. re... sub.. replace... read line by line... character by character... i couldn't make it work... all i want is just the text in between two \xaa (Hex)... The file stops short due to the control characters including EOF and other gibberish characters which I have no clue... Please note that the file I am reading is a proprietary format... that has all the characters code.. and I am interested in just the interspersed TEXT in between those characters... Only after I get past the first hurdle of parsing the text... I can go through logic of picking stuff i need.
– ihightower, Commented Dec 28, 2011 at 15:11
"I have tried all..." Please post some code so we have a common basis on which to address this. Start with re, please. Focus on just that module. re.findall after reading the entire file in a single read(). Please post that code. — S.Lott
– S.Lott, Commented Dec 28, 2011 at 18:34
I have added further comments to my question.. and code... please check and advise. Thank you. — ihightower
– ihightower, Commented Jan 2, 2012 at 6:05
It would be great if you could get information about the file format. Then you could pull all of the data out, and keep all of the associated information. — Brad Gilbert
– Brad Gilbert, Commented Jan 3, 2012 at 1:35

Taymon · Accepted Answer · 2011-12-27 17:14:16Z

2

Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:

import re
with open('yourfile.pst') as f:
    contents = f.read()
textstrings = re.findall(r'[\x20-\x7E]+', contents)

That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.

Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.
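Since the file in question lives on Windows and contains \x1a bytes, a variant of the snippet above that opens the file in binary mode may be safer. The following is a minimal sketch, not a tested solution for the real file: the helper names read_bytes and printable_runs, the minimum run length of 3, and the sample bytes are all made up for illustration. It is written with bytes literals so it runs unchanged on Python 2.7 and Python 3.

```python
import re

# Runs of three or more printable ASCII bytes. The minimum length of 3 is an
# arbitrary choice for this sketch, picked so short lines like MVT survive.
PRINTABLE_RUN = re.compile(br'[\x20-\x7e]{3,}')


def read_bytes(path):
    # 'rb' keeps Windows from treating byte 0x1A as end-of-file and from
    # rewriting CR LF pairs, so the data arrives exactly as stored.
    with open(path, 'rb') as f:
        return f.read()


def printable_runs(data):
    # Return every run of printable ASCII found in the byte string.
    return PRINTABLE_RUN.findall(data)


# Made-up sample bytes loosely modelled on the question's example.
sample = b'\x00\x00BLLBBCC\x18\x18\x1aMVT\r\nEA1123 TEXT TEXT TEXT\r\nEND\r\n\xaa\x00'
runs = printable_runs(sample)
print(runs)
```

For a real file you would call printable_runs(read_bytes(path)). Raising the minimum length filters out more stray one- or two-byte matches from the binary parts, at the cost of dropping short text lines.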

answered Dec 27, 2011 at 17:14

Taymon



3 Comments

ihightower Over a year ago

The file has all sorts of control characters.. including EOF hex (1A). When I used your code with slight changes to read the file line by line and parse contents.. the script stops prematurely because of all these control characters... like EOF is reached... with in the first 3 lines... still 10,000+ lines to go. I am using Python 2.7.... In Perl.. when I do such looping through lines.. it mysteriously without any issues can go through all the lines... and match contents... As I am learning Python.. I want to try to do it Python way...

S.Lott Over a year ago

Is this Windows? If so, please include [Windows] tag in your question. Linux has no such "EOF" character. Please try something like open('yourfile.pst','rb') which may get past the Windows foolishness.

ihightower Over a year ago

@JohnM.. Thanks for your comments on above and top (the one with EOF part). The code with C:\Users\yada yada was just added today (2jan).. In the original posting (on 27dec)... this part of the question was not there. So, it is natural that S.Lott didn't know if I was on Windows or Unix. It is my fault ( i think). that I haven't specifically stated this.

odgrim · Accepted Answer · 2012-01-03 00:38:50Z

Q: How to read the file? binary and text interspersed

A: Open it in binary mode ('rb') and read the whole thing. In Python 2 you get back a plain str either way, so you can regex it directly:

fh = open('/path/to/my/file.ext', 'rb')
contents = fh.read()
fh.close()

On Windows, leaving out the b (text mode, 'r') makes Python 2 stop at the first \x1a (Ctrl-Z) byte and translate \r\n to \n, which is why a text-mode read of this kind of file ends early.

Q: Eliminate unnecessary control characters

A: Use the Python re module. Your next question sorta asks how.

Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')

A: re module has a findall function that works as you (mostly) expect.

import re

mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
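If the messages are separated by the two-byte marker \xaaU, as the question's update says, splitting on that marker is another option. The sketch below is only an illustration on made-up data, and split_messages is a hypothetical helper name. It also shows one thing to watch for with the look-ahead pattern from the question's update: a message with no delimiter after it (typically the last one in the file) is not matched. It assumes \xaaU never occurs by accident inside the binary parts of a message, which has not been verified for the real format.

```python
import re

DELIM = b'\xaaU'  # message delimiter named in the question


def split_messages(contents):
    # Anything before the first delimiter is not a message, so drop it,
    # then put the delimiter back on the front of each chunk.
    chunks = contents.split(DELIM)[1:]
    return [DELIM + chunk for chunk in chunks]


# Made-up data: three messages preceded by some leading junk.
contents = b'junk\xaaUfirst\r\nmsg\x00\xaaUsecond msg\x00\xaaUlast msg'

msgs = split_messages(contents)

# The look-ahead pattern only matches a message that is followed by another
# delimiter, so it finds one message fewer here.
lookahead_msgs = re.findall(br'\xaaU.*?(?=\xaaU)', contents, re.DOTALL)

print(len(msgs), len(lookahead_msgs))
```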

Q: print out the required stuff

A: There's a print statement...

print usefultext

Q: Loop through all the lines.. and more files.

fh = open('/some/file.ext', 'rb')

for line in fh.readlines():
    pass  # do stuff with each line here

I'll let you figure out the os module to figure out what files exist/how to iterate through them.
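As a starting point for the "more files" part, here is a minimal sketch using glob and os.path. The function name iter_message_files is made up, and the '*.msgs' pattern is only a guess based on the sitatex_test.msgs file name in the question, so adjust it to whatever the real files are called.

```python
import glob
import os


def iter_message_files(folder, pattern='*.msgs'):
    # Yield (path, raw bytes) for every file in folder matching pattern.
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # Binary mode, so \x1a and \r\n bytes come through unchanged.
        with open(path, 'rb') as f:
            yield path, f.read()
```

Usage would look like `for path, contents in iter_message_files(some_folder): ...`, with each contents value then passed to whatever regex processing you settle on.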

Brad Gilbert · Accepted Answer · 2012-01-03 01:29:12Z

1

You say:

Still I will need assistance to eliminate the list altogether but return just a string. like this

In other words, you have foo = [some_string] and you are doing print foo, which as a side effect does repr(some_string) but encloses it in square brackets, which you don't want. So just do print repr(foo[0]).
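A small sketch of both points, on a made-up line1 (the bytes are invented for illustration). It also shows where the trailing '' comes from: after '.*' has matched the whole line, the regex engine finds one more, empty, match at the very end, whereas '.+' needs at least one character and so cannot match there. The code uses bytes literals so it runs on Python 2.7 and Python 3. On Python 3 the repr output carries a leading b, on Python 2 it does not.

```python
import re

# Made-up line mixing control bytes and readable text (no newline in it).
line1 = b'\xaaUG\x02\x05\x00BLLBBCC\x00\x00N\x00 002 010 180000 DEC 11'

# '.*' matches the whole line, then matches the empty string at the end.
matches_star = re.findall(br'.*', line1)

# '.+' requires at least one character, so there is no empty match.
matches_plus = re.findall(br'.+', line1)

# repr() gives the escaped form directly as one string, no regex or list.
escaped = repr(line1)
print(escaped)
```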

There seem to be several things unexplained:

You say the useful text is bracketed by \xaaU, but in the sample file, instead of 2 occurrences of that delimiter, there is only \xaa (missing the U) near the start, and nothing else.
You say

I have found out that re.findall(r'.+', line1) strips to ...

That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
```
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
```
What have you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write

I need to parse the text line by line and word by word

but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.

Is it possible to clarify all of the above?

edited Jan 3, 2012 at 1:29

Brad Gilbert


answered Jan 2, 2012 at 9:57

John Machin


2 Comments

ihightower Over a year ago

@JohnM.. Wow!!! repr(some text with binary) returns string representation.. is superb. I have no idea why I can't find on the net anyone ever recommending this while I searched for it... they only talked about some bin2ascii.. hexlify.. struct.. blah blah..blah.. all of which are useless to my needs... Thank you so much... For the rest of the questions.. let me get back to you tomorrow with explanation... Thanks again.

ihightower Over a year ago

@JohnM.. I have added a further note to my original question... starting from "Added Info on 05Jan:"
