Regex to get bytes between binary content

Question

I have the following 3 strings containing binary data.

s1="\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT")
s2=" \t0000013\t123\t9886\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\x00\x00\x00\x00\x02\xB2\x00\x00\x00\x00\b\xFEF".force_encoding("ASCII-8BIT")
s3=" \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\xA1\x02\x00\x00\x02\xB2\b\xFEF".force_encoding("ASCII-8BIT")

I have the following 3 similar regexes to get the bytes between *\t and something beginning with \ (e.g. \t, \x00, \xB2, \xFEF).

s1[/(?<=[A-Z]{4}\t ).+?(?=\t)/]
s2[/(?<=[0-9]{4}\t ).+?(?=\x00)/]
s3[/(?<=.+\t ).+?(?=\x..)/]

The first 2 regexes work for strings s1 and s2, but what would a more general regex that matches all 3 cases look like?

I tried the regex s3[/(?<=.+\t ).+?(?=\x..)/] but I get the error below.

irb(main):> s1[/(?<=[A-Z]{4}\t ).+?(?=\t)/]
=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"

irb(main):> s2[/(?<=[0-9]{4}\t ).+?(?=\x00)/]
=> "95 83 49 26 0E 82 00 A6 08 02 06 C0"

irb(main):> s3[/(?<=.+\t ).+?(?=\x..)/]
SyntaxError: (irb):4953: invalid hex escape
s3[/(?<=.+\t ).+?(?=\x..)/]
                    ^
invalid pattern in look-behind: /(?<=.+\t ).+?(?=..)/
        from /usr/bin/irb:11:in `<main>'

I think I only need the correct regex, or is there a better way to extract the desired values without using a regex?

Thanks for any help

(?<=.+\t ) is invalid because Ruby's lookbehinds cannot be variable length.
– Cary Swoveland, Commented Jan 26, 2019 at 8:57
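
A minimal sketch of two ways around that restriction, assuming the wanted bytes always follow a tab and a space: a fixed-length lookbehind, or \K, which drops everything matched so far from the result. The variable names are illustrative and not from the question; the sample is s3.

```ruby
s3 = " \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\xA1\x02\x00\x00\x02\xB2\b\xFEF".force_encoding("ASCII-8BIT")

# A variable-length lookbehind is rejected when the pattern is compiled.
begin
  Regexp.new('(?<=.+\t ).+')
  variable_ok = true
rescue RegexpError
  variable_ok = false
end

# Fixed-length lookbehind: exactly one tab and one space before the match.
fixed_lookbehind = /(?<=\t )\h{2}(?: \h{2})+/

# \K: match the tab and space, then keep only what follows.
keep_out = /\t \K\h{2}(?: \h{2})+/

fixed = s3[fixed_lookbehind]
kept  = s3[keep_out]
```

Both patterns require at least two hex pairs (+), so the leading "\t 28890" field in s1 is not picked up by mistake.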

Cary Swoveland · Accepted Answer · 2019-01-26 21:36:58Z

3

R = /\h{2}(?: \h{2})+/

def extract(str)
  str[R]
end

extract(s1)
  #=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B" 
extract(s2)
  #=> "95 83 49 26 0E 82 00 A6 08 02 06 C0" 
extract(s3)
  #=> "95 83 49 26 0E 82 00 A6 08 02 06 C0"

The regular expression reads, "match two hex digits (\h{2}), followed by a group made up of a space and two more hex digits, with that three-character group matched one or more times (+)." (?: \h{2}) is a non-capture group.
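
If the actual byte values are wanted rather than the hex text (an assumption; the question only shows the text form), the matched run can be converted afterwards. A minimal sketch using s1, with illustrative variable names:

```ruby
s1 = "\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT")

hex_run = /\h{2}(?: \h{2})+/

# The matched run of space-separated hex pairs (nil if there is no match).
hex_text = s1[hex_run]

# Turn each two-digit pair into its integer value.
byte_values = hex_text.split.map(&:hex)

# Or pack the pairs into a raw binary string.
raw = [hex_text.delete(" ")].pack("H*")
```

Here byte_values starts with 148, 35, 8 (0x94, 0x23, 0x08), and raw.bytes gives the same list.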

edited Jan 26, 2019 at 21:36

answered Jan 26, 2019 at 8:52


6 Comments

Cary Swoveland Over a year ago

Initially I had [\dA-F]{2} in my regex. I changed that to \h{2} after seeing that @Andrei had used the POSIX equivalent. I was not aware of those values.

Ger Cas Over a year ago

Excellent Cary. It works in all cases I tested so far. Many thanks for the explicit explanation

Ger Cas Over a year ago

I've switched to your solution as the better one since it is more general. It doesn't care whether the values I want are preceded by a \t or \x07. It focuses on the desired values. Great!

Andrei Odegov Over a year ago

For the input string \t 28890\tABGT20190127 11:52+0500\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00... the \h{2}(?: \h{2})+ regex will give 27 11. IMHO that's a somewhat ambiguous result. Check it here: rextester.com/ELDD86209

Cary Swoveland Over a year ago

@Andrei, your string, with the embedded time, bears little resemblance to the three examples given by the OP. Without an unambiguous specification of the possible structure of the strings we must assume the examples to be indicative. It's always easy to break a regex with a manufactured string. The opposite is a possibility as well: changing the string so matches outside of the substring(s) of interest fail. If we knew there would be at least, say, 5 pairs of hex digits, I could change + on my non-capture group to {4,}, locking in the match more tightly.
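
The tightening described here can be sketched as follows, assuming at least five pairs are always present (a hypothetical guarantee taken from the comment, not a known property of the data). The sample is the string with the embedded time, cut off after the first \x00 where the original comment elides the rest.

```ruby
tricky = "\t 28890\tABGT20190127 11:52+0500\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00".force_encoding("ASCII-8BIT")

loose = /\h{2}(?: \h{2})+/     # two or more pairs
tight = /\h{2}(?: \h{2}){4,}/  # five or more pairs

loose_match = tricky[loose]  # stops at the date/time fragment
tight_match = tricky[tight]  # skips it and finds the long run
```

With the looser pattern the first hit is "27 11"; with {4,} that fragment is too short, so the match moves on to the 14-pair run.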


Andrei Odegov · Answer · 2019-01-26 18:11:14Z

2

#ruby 2.3.1 

xs = ["\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT"),
      " \t0000013\t123\t9886\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\x00\x00\x00\x00\x02\xB2\x00\x00\x00\x00\b\xFEF".force_encoding("ASCII-8BIT"),
      " \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\xA1\x02\x00\x00\x02\xB2\b\xFEF".force_encoding("ASCII-8BIT"),
      "\t 28890\tABGT\tXYZW\t 94\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT"),
      " \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0".force_encoding("ASCII-8BIT")]

r = /
    (?<=                  # start of lookbehind: asserts that the current position is immediately preceded by
      [[:alnum:]]{4}\t[ ] # four alphanumeric characters, then a tab character, then a space character
    )                     # end of lookbehind
    [[:xdigit:]]{2}       # match two hex digits
    (?:                   # start non-capture group
      [ ]                 # match a space character
      [[:xdigit:]]{2}     # match two hex digits
    )*                    # end the non-capture group and match it zero or more times
    /x                    # free-spacing mode

xs.each { |x| p x[r] }

Output:

"94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"
"95 83 49 26 0E 82 00 A6 08 02 06 C0"
"95 83 49 26 0E 82 00 A6 08 02 06 C0"
"94"
"95 83 49 26 0E 82 00 A6 08 02 06 C0"

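Because the lookbehind pins the match to the field after the four-character code, the group can safely be repeated zero or more times (*), which is what lets the single-pair sample return "94". A small sketch of the contrast with the unanchored pattern from the other answer; the variable names are illustrative.

```ruby
# The single-pair sample from the list above.
one_pair = "\t 28890\tABGT\tXYZW\t 94\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT")

# Same pattern as above on one line; \h is shorthand for [[:xdigit:]].
anchored = /(?<=[[:alnum:]]{4}\t )\h{2}(?: \h{2})*/

# The unanchored pattern, which needs at least two pairs.
unanchored = /\h{2}(?: \h{2})+/

anchored_match   = one_pair[anchored]    # "94"
unanchored_match = one_pair[unanchored]  # nil
```
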
edited Jan 26, 2019 at 18:11

answered Jan 26, 2019 at 9:04


4 Comments

Cary Swoveland Over a year ago

I didn't know about [[:xdigit:]]. Good to know.

Cary Swoveland Over a year ago

Yes, I saw. Also, \p{XDigit} or \h.

Ger Cas Over a year ago

Thanks Andrei. It works very well. If you have the chance, maybe you could explain how the regex works.

Andrei Odegov Over a year ago

@CarySwoveland, thanks. I updated my answer to take into account your valuable criticism.
