1

I'm trying to extract the binary opcodes from a listing file generated with the /Fa flag in Visual Studio. The format looks like:

00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m

Here the first number is the address, followed by the opcode bytes and then the instruction mnemonic. In this case I'd like to get an array of:

8b 45 bc 3b 45 c8 73 19

First I split the line and then run the following regular expression to get the bytes:

HEX_BYTE = re.compile("\s*[\da-fA-F]{2}\s*", re.IGNORECASE)

But this regex matches everything. Does anyone have an idea how to do this in a simple way? Thanks, David

4
  • You may read it line by line and use ^\d{5}\s+([\da-fA-F]{2}(?:\s+[\da-fA-F]{2})*) to extract the opcodes into group(1) and then split with space and append the results to the list. Commented Feb 2, 2016 at 9:22
  • @WiktorStribiżew: There appear to be some whitespaces at the beginning in the second/third line. Commented Feb 2, 2016 at 10:09
  • @Jan: I changed the formatting of the question, and I am not sure if those spaces are really there. OP is keeping silent. Commented Feb 2, 2016 at 10:11
  • I don't think that the leading spaces matter - the file uses a fixed width field format anyway. Commented Feb 2, 2016 at 10:13
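
The regex suggested in the first comment can be sketched roughly as follows. This is only an illustration, not the commenter's exact pattern: it adds `^\s*` to tolerate leading whitespace, accepts hex digits in the address, and adds `\b` so that a mnemonic starting with two hex letters (such as `add`) is not picked up as a byte. The name `LINE_RE` and the inline sample are made up for the example.

```python
import re

# Adapted from the pattern in the comment (assumptions: optional leading
# whitespace, a 5-digit hex address, bytes delimited by word boundaries).
LINE_RE = re.compile(r'^\s*[\da-fA-F]{5}\s+([\da-fA-F]{2}\b(?:\s+[\da-fA-F]{2}\b)*)')

listing = """00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m"""

opcodes = []
for line in listing.splitlines():
    match = LINE_RE.match(line)
    if match:
        # group(1) holds the space-separated opcode bytes of this line
        opcodes.extend(match.group(1).split())
print(opcodes)
```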

4 Answers

3

Forget regexps; they are over-complicated for extracting data from fixed-width fields. The statements

line = '  00043 3b 45 c8     cmp     eax,'
print(line[7:19].split())

yield

['3b', '45', 'c8']

You might need to

line = line.expandtabs()

first if there are Tab characters in the input strings.
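
Putting the two steps together, a minimal sketch might look like this. The helper name `opcodes_from_line` is made up for illustration, and the `7:19` column range is taken from the sample line above, so it assumes the opcode bytes always fall inside those columns.

```python
def opcodes_from_line(line):
    """Return the opcode bytes of one listing line as hex strings."""
    # Expand tabs first so that the column offsets are stable.
    line = line.expandtabs()
    # Columns 7..18 hold the opcode field in the sample line (an assumption
    # about the listing layout, not something the format guarantees).
    return line[7:19].split()


sample = [
    '  00043 3b 45 c8     cmp     eax,',
    '  00046 73 19        jae     SHORT $LN1@unpacker_m',
]
opcodes = [op for line in sample for op in opcodes_from_line(line)]
print(opcodes)
```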


0

You could try this one: \s[\da-fA-F]{2}\s[\da-fA-F]{2}(\s[\da-fA-F]{2})?

It would return three results for your example:

" 8b 45 bc"

" 3b 45 c8"

" 73 19"

You would have to split them with space and then you have the same result as you described.
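
As a rough sketch of how that could be wired up in Python: the group is made non-capturing here (a small change to the pattern above) and `finditer` is used so that each whole match is available. The names `HEX_RUN` and `listing` are just for the example. Note that the pattern needs at least two bytes, so a one-byte instruction would be missed.

```python
import re

# Same pattern as above, but with a non-capturing group so the full
# match (two or three bytes) is what we get back.
HEX_RUN = re.compile(r'\s[\da-fA-F]{2}\s[\da-fA-F]{2}(?:\s[\da-fA-F]{2})?')

listing = """00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m"""

opcodes = []
for match in HEX_RUN.finditer(listing):
    # Each match looks like " 8b 45 bc"; split() drops the leading space.
    opcodes.extend(match.group(0).split())
print(opcodes)
```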

0

Looking at the file sample in the question it appears to consist of fixed width fields, so you should be able to extract those values using fixed offsets into each line:

with open('listing.txt') as listing:
    opcodes = [opcode for line in listing for opcode in line[8:16].split()]

>>> opcodes
['8b', '45', 'bc', '3b', '45', 'c8', '73', '19']

The above uses a list comprehension to pluck out the required fields, which in the sample sit in columns 8 through 15 (the slice line[8:16]), using nothing but a slice operation and a split(). This ought to be a great deal faster than a regular expression, and it is a great deal more readable.

If you want the opcodes as integers:

with open('listing.txt') as listing:
    opcodes = [int(opcode, 16) for line in listing for opcode in line[8:16].split()]

>>> opcodes
[139, 69, 188, 59, 69, 200, 115, 25]
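
If what you are after is the raw binary rather than a list, one option (a sketch, assuming Python 3) is to join the hex strings and hand them to `bytes.fromhex`. The variable names below are only illustrative.

```python
# Hex strings as produced by the list comprehension above.
hex_opcodes = ['8b', '45', 'bc', '3b', '45', 'c8', '73', '19']

# bytes.fromhex parses pairs of hex digits into a bytes object.
raw = bytes.fromhex(''.join(hex_opcodes))

print(raw.hex())   # the bytes as one hex string
print(list(raw))   # the same bytes as integers
```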

0

A Python example with the help of regular expressions:

import re
string = """00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m"""

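# Each match is one line's run of two-digit hex bytes; strip the trailing padding.
opcode_runs = [run.strip() for run in re.findall(r'((?:\b[\da-fA-F]{2}\b\s+)+)', string)]
print(opcode_runs)
# ['8b 45 bc', '3b 45 c8', '73 19']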
