
I have a binary file from which I need to extract a few chunks of data using a Python regular expression.

I need to extract the sets of non-null characters that lie between sets of null characters.

For example this is the main character set:

\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56

The regex should extract the following character sets from the master set above:

\xff\xfe\xfe\x00\x00\x23\x41, \x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32 and \x56\x65\x00\x35\x56

One important point: a run of null bytes should be treated as a separator only if it contains more than 5 consecutive null bytes. Otherwise, the null bytes should be included in the non-null character set. As you can see in the example, a few null characters are also present in the extracted character sets.

If this doesn't make sense, please let me know and I will try to explain it better.

Thanks in advance,

  • Are you sure you're going to want to use a regex for this? Commented Apr 1, 2014 at 18:21
  • Why not just split on \000{5,} ? Commented Apr 1, 2014 at 18:22
  • @msvalkon Is there a better or more efficient option? Commented Apr 1, 2014 at 18:29
  • @sln Here the length of the separator is not fixed. The separator would be \x00*n, where we know n >= 5. Commented Apr 1, 2014 at 18:48
  • @sln Did you mean this? arr = re.split(r'\000{5,}', data) Commented Apr 1, 2014 at 18:57

3 Answers

You could split on \x00{5,}.
This matches 5 or more null bytes, which is the delimiter you specified.

In Perl, it looks something like this:

Perl test case

$strLangs =  "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";

# Remove leading zero's (5 or more)
$strLangs =~ s/^\x00{5,}//;

# Split on 5 or more 0's
@Alllangs = split /\x00{5,}/, $strLangs;

# Print each language characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
       printf( "%x,", ord($_)); 
    }
    print ">\n";

}

Output >>

<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
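
For readers working in Python, the Perl test case above might translate roughly as follows. This is only a sketch for Python 3, assuming the data is held as a bytes object; the variable names (`data`, `trimmed`, `chunks`) are illustrative and not taken from the answer.

```python
import re

# Same sample as the Perl test case, written compactly as bytes.
data = (b"\x00" * 10 + b"\xff\xfe\xfe\x00\x00\x23\x41"
        + b"\x00" * 8 + b"\x41\x49\x57\x00\x00\x00\x00\x32" * 2
        + b"\x00" * 15 + b"\x56\x65\x00\x35\x56")

# Remove a leading run of 5 or more null bytes, as the Perl s/// does.
trimmed = re.sub(rb"\A\x00{5,}", b"", data)

# Split on runs of 5 or more null bytes.
chunks = re.split(rb"\x00{5,}", trimmed)

# Print each chunk as comma-separated hex values, mirroring the Perl output.
for chunk in chunks:
    print("<" + "".join("%x," % byte for byte in chunk) + ">")
```

Iterating over a bytes object in Python 3 yields integers, so no ord() call is needed.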

1 Comment

@Raza: In your question you said "more than 5 null bytes continuously", so you probably want re.split(r'\000{6,}', data). Also, I get an extra zero-length item at the beginning with Python's re module using this pattern.
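
The two points in this comment can be illustrated with a small Python 3 sketch. The sample bytes below are made up for the illustration (they are not from the question), and filtering out empty items is one possible way to handle the extra entry, not necessarily what the commenter had in mind.

```python
import re

# Hypothetical sample: "AB", exactly 5 nulls, "CD", exactly 6 nulls, "EF".
sample = b"AB" + b"\x00" * 5 + b"CD" + b"\x00" * 6 + b"EF"

# "5 or more" treats the run of exactly 5 nulls as a separator.
at_least_5 = re.split(rb"\x00{5,}", sample)

# "more than 5" (6 or more) keeps the run of exactly 5 nulls inside a chunk.
at_least_6 = re.split(rb"\x00{6,}", sample)

# A separator at the very start produces an empty first item.
with_leading = re.split(rb"\x00{6,}", b"\x00" * 10 + b"AB")

# Dropping empty items is one way to get rid of it.
filtered = [chunk for chunk in with_leading if chunk]

print(at_least_5)
print(at_least_6)
print(with_leading)
print(filtered)
```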

You can use split and lstrip with a list comprehension:

s='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
# Split on runs of exactly five null bytes
sp=s.split('\x00\x00\x00\x00\x00')
# Strip leftover leading null bytes from each non-empty piece
print [i.lstrip('\x00')  for i in sp if i != ""]

Output:

['\xff\xfe\xfe\x00\x00#A', 'AIW\x00\x00\x00\x002AIW\x00\x00\x00\x002', 'Ve\x005V']
  1. Split the entire data on runs of 5 null bytes.
  2. For each element in the list, remove any null bytes at the start (this handles separators with a variable number of nulls).
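
The two steps above can be sketched for Python 3 with a bytes object as follows. One assumption differs from the snippet in this answer: empty pieces are filtered after stripping rather than before, so that a separator leaving only a few leftover nulls at the very end of the data (for example 8 trailing nulls) does not produce an empty item. The names `SEP`, `pieces` and `chunks` are illustrative.

```python
# Sample from the question, plus 8 trailing nulls to exercise the edge case.
data = (b"\x00" * 10 + b"\xff\xfe\xfe\x00\x00\x23\x41"
        + b"\x00" * 8 + b"\x41\x49\x57\x00\x00\x00\x00\x32" * 2
        + b"\x00" * 15 + b"\x56\x65\x00\x35\x56"
        + b"\x00" * 8)

SEP = b"\x00" * 5

# Step 1: split on runs of exactly five null bytes.
pieces = data.split(SEP)

# Step 2: strip the leftover nulls that a longer separator leaves
# at the start of the following piece.
chunks = [piece.lstrip(b"\x00") for piece in pieces]

# Drop pieces that were empty or consisted only of leftover nulls.
chunks = [chunk for chunk in chunks if chunk]

print(chunks)
```

Because bytes.split scans from left to right, any leftover nulls from a longer separator always end up at the start of the next piece, which is why lstrip alone is enough here.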


Here's how to do it in Python. I had to str.strip() off any leading and trailing nulls to prevent an extra empty string from appearing at the beginning of the list of results returned from re.split().

import re

data = ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
        '\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
        '\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
        '\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
        '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

chunks = re.split(r'\000{6,}', data.strip('\x00'))

# display results
print ',\n'.join(''.join('\\x'+ch.encode('hex_codec') for ch in chunk) 
                         for chunk in chunks),

Output:

\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32,
\x56\x65\x00\x35\x56
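
Since the question starts from a binary file, the same idea could be applied to data read from disk. The following is a minimal Python 3 sketch, assuming the whole file fits in memory; the helper `extract_chunks`, its `min_nulls` parameter, and the temporary demo file are all hypothetical and not part of the answer above.

```python
import os
import re
import tempfile


def extract_chunks(path, min_nulls=6):
    """Return the byte chunks separated by runs of at least min_nulls nulls."""
    with open(path, "rb") as handle:
        data = handle.read()
    separator = re.compile(rb"\x00{%d,}" % min_nulls)
    # Note: strip() removes every leading and trailing null byte,
    # including runs shorter than min_nulls.
    pieces = separator.split(data.strip(b"\x00"))
    return [piece for piece in pieces if piece]


# Demo: write the sample bytes to a temporary file, then read them back.
sample = (b"\x00" * 10 + b"\xff\xfe\xfe\x00\x00\x23\x41"
          + b"\x00" * 8 + b"\x41\x49\x57\x00\x00\x00\x00\x32" * 2
          + b"\x00" * 15 + b"\x56\x65\x00\x35\x56"
          + b"\x00" * 10)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    chunks = extract_chunks(tmp_path)
finally:
    os.remove(tmp_path)

# Display each chunk in \xNN form.
for chunk in chunks:
    print("".join("\\x%02x" % byte for byte in chunk))
```

For files too large to read at once, a streaming approach would be needed, because a separator run could straddle two reads; that case is not handled here.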
