I have the following 3 strings containing binary data.

s1="\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT")
s2=" \t0000013\t123\t9886\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\x00\x00\x00\x00\x02\xB2\x00\x00\x00\x00\b\xFEF".force_encoding("ASCII-8BIT")
s3=" \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\xA1\x02\x00\x00\x02\xB2\b\xFEF".force_encoding("ASCII-8BIT")

I have the following 3 similar regexes to get the bytes between the "\t " (a tab followed by a space) and the next character that is written with a backslash escape (e.g. \t, \x00, \xB2, \xFE).

s1[/(?<=[A-Z]{4}\t ).+?(?=\t)/]
s2[/(?<=[0-9]{4}\t ).+?(?=\x00)/]
s3[/(?<=.+\t ).+?(?=\x..)/]

The first 2 regexes work for strings s1 and s2, but what would be a more general regex that matches all 3 cases?

I tried the regex s3[/(?<=.+\t ).+?(?=\x..)/] but I get the error below.

irb(main):> s1[/(?<=[A-Z]{4}\t ).+?(?=\t)/]
=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"

irb(main):> s2[/(?<=[0-9]{4}\t ).+?(?=\x00)/]
=> "95 83 49 26 0E 82 00 A6 08 02 06 C0"

irb(main):> s3[/(?<=.+\t ).+?(?=\x..)/]
SyntaxError: (irb):4953: invalid hex escape
s3[/(?<=.+\t ).+?(?=\x..)/]
                    ^
invalid pattern in look-behind: /(?<=.+\t ).+?(?=..)/
        from /usr/bin/irb:11:in `<main>'

I think I only need the correct regex, but is there a better way to extract the desired values without using a regex?

Thanks for any help

  • Sth. like this: regex101.com/r/wR6yAK/1 ? Commented Jan 26, 2019 at 8:00
  • (?<=.+\t ) is invalid because Ruby's lookbehinds cannot be variable length. Commented Jan 26, 2019 at 8:57
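One possible way around the fixed-length restriction, sketched under the assumption that the hex field always follows the last tab-plus-space in the string: consume the variable-length prefix as ordinary pattern text and reset the start of the reported match with \K, which Ruby's regex engine supports. The constant name AFTER_PREFIX is made up for this illustration, and String#b (a binary copy of the literal) stands in for force_encoding("ASCII-8BIT").

```ruby
# AFTER_PREFIX is a hypothetical name used only in this sketch.
# .+\t[ ] consumes the variable-length prefix (greedy, so it ends at the last tab+space);
# \K drops everything matched so far from the reported match.
AFTER_PREFIX = /.+\t[ ]\K\h{2}(?:[ ]\h{2})*/

s1 = "\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".b
s3 = " \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\xA1\x02\x00\x00\x02\xB2\b\xFEF".b

p s1[AFTER_PREFIX]
  #=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"
p s3[AFTER_PREFIX]
  #=> "95 83 49 26 0E 82 00 A6 08 02 06 C0"
```

Note that no lookahead for the terminating byte is needed: the match simply stops where the run of space-separated hex pairs ends.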

2 Answers

R = /\h{2}(?: \h{2})+/

def extract(str)
  str[R]
end

extract(s1)
  #=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B" 
extract(s2)
  #=> "95 83 49 26 0E 82 00 A6 08 02 06 C0" 
extract(s3)
  #=> "95 83 49 26 0E 82 00 A6 08 02 06 C0" 

The regular expression reads: "match two hex digits (\h{2}), followed by a group consisting of a space and two more hex digits, with that group matched one or more times (+)". (?: \h{2}) is a non-capture group.
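To make the pieces easier to see, here is the same pattern spelled out in free-spacing mode. This is only an illustrative sketch: the constant name HEX_RUN is made up here, and String#b (a binary copy of the literal) stands in for force_encoding("ASCII-8BIT").

```ruby
# HEX_RUN is a hypothetical name; the pattern is equivalent to /\h{2}(?: \h{2})+/.
HEX_RUN = /
  \h{2}        # two hex digits
  (?:          # start of non-capture group
    [ ]        # a literal space (bare spaces are ignored in x mode)
    \h{2}      # two more hex digits
  )+           # the group must occur one or more times
/x

s1 = "\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".b

p s1[HEX_RUN]
  #=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"

# Because of the +, a single hex pair on its own is not matched.
p "\t 94\t"[HEX_RUN]
  #=> nil
```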

6 Comments

Initially I had [\dA-F]{2} in my regex. I changed that to \h{2} after seeing that @Andrei had used the POSIX equivalent. I was not aware of those values.
Excellent Cary. It works in all cases I tested so far. Many thanks for the explicit explanation
I switched to your solution as the better one since it is more general. It doesn't care whether the values I want are preceded by a \t or a \x07. It focuses on the desired values. Great!
For the input string \t 28890\tABGT20190127 11:52+0500\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00... the \h{2}(?: \h{2})+ regex will give 27 11. IMHO that is a somewhat ambiguous result. Check it here: rextester.com/ELDD86209
@Andrei, your string, with the embedded time, bears little resemblance to the three examples given by the OP. Without an unambiguous specification of the possible structure of the strings we must assume the examples to be indicative. It's always easy to break a regex with a manufactured string. The opposite is a possibility as well: changing the string so matches outside of the substring(s) of interest fail. If we knew there would be at least, say, 5 pairs of hex digits, I could change + on my non-capture group to {4,}, locking in the match more tightly.
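A small sketch of the tightening described above, assuming (as that comment does hypothetically) that the field of interest always has at least 5 hex pairs. The names LOOSE, STRICT and tricky are made up for this illustration, and the test string is only the visible part of the example from the earlier comment, cut off where that comment elides the rest.

```ruby
LOOSE  = /\h{2}(?: \h{2})+/     # the pattern from the answer: two or more pairs
STRICT = /\h{2}(?: \h{2}){4,}/  # assumed variant: five or more pairs

# Only the portion of the string shown in the comment; the rest is elided there.
tricky = "\t 28890\tABGT20190127 11:52+0500\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00".b

p tricky[LOOSE]
  #=> "27 11"
p tricky[STRICT]
  #=> "94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"
```

The stricter quantifier only helps if the minimum length really holds for all inputs; a shorter genuine field would no longer be matched.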
#ruby 2.3.1 

xs = ["\t 28890\tABGT\tXYZW\t 94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT"),
      " \t0000013\t123\t9886\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\x00\x00\x00\x00\x02\xB2\x00\x00\x00\x00\b\xFEF".force_encoding("ASCII-8BIT"),
      " \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0\xA1\x02\x00\x00\x02\xB2\b\xFEF".force_encoding("ASCII-8BIT"),
      "\t 28890\tABGT\tXYZW\t 94\t\x00\x00\x00\x00\x01\f".force_encoding("ASCII-8BIT"),
      " \t0000013\t123\t9HN3\t 95 83 49 26 0E 82 00 A6 08 02 06 C0".force_encoding("ASCII-8BIT")]

r = /
    (?<=                  # start of lookbehind: asserts that what immediately precedes the current position in the string are
      [[:alnum:]]{4}\t[ ] # four alphanumeric characters, then a tab character, then a space character
    )                     # end of lookbehind
    [[:xdigit:]]{2}       # match two hex digits
    (?:                   # start non-capture group
      [ ]                 # match a space character
      [[:xdigit:]]{2}     # match two hex digits
    )*                    # end the non-capture group and match it zero or more times
    /x                    # free-spacing mode

xs.each { |x| p x[r] }

Output:

"94 23 08 92 00 00 00 EC 02 10 00 E2 00 4B"
"95 83 49 26 0E 82 00 A6 08 02 06 C0"
"95 83 49 26 0E 82 00 A6 08 02 06 C0"
"94"
"95 83 49 26 0E 82 00 A6 08 02 06 C0"

4 Comments

I didn't know about [[:xdigit:]]. Good to know.
Yes, I saw. Also, \p{XDigit} or \h.
Thanks Andrei. It works very well. If you have the chance, maybe you could explain how the regex works.
@CarySwoveland, thanks. I updated my answer to take into account your valuable criticism.
