How to extract a pattern from string containing binary data

Question

I have this array that comes from a previous a=array.unpack("C*") command.

a = [9, 32, 50, 53, 56, 53, 57, 9, 73, 78, 70, 79, 9, 73, 78, 70, 79, 53, 9, 
     32, 55, 52, 32, 50, 51, 32, 48, 51, 32, 57, 50, 32, 48, 48, 32, 48, 48, 32, 
     48, 48, 32, 69, 67, 32, 48, 50, 32, 49, 48, 32, 48, 48, 32, 69, 50, 32, 48, 
     48, 32, 55, 55, 9, 0, 0, 0, 0, 1, 12, 1, 0, 0, 0, 57, 254, 70, 6, 1, 6, 0, 3, 
     0, 3, 198, 0, 2, 198, 31, 147, 23, 0, 226, 7, 12, 17, 18, 56, 55, 3, 101, 1, 
     1, 0, 134, 7, 145, 5, 148, 37, 150, 133, 241, 135, 5, 22, 109, 145, 53, 38, 
     171, 4, 3, 2, 6, 192, 173, 22, 160, 20, 48, 18, 6, 9, 42, 134, 58, 0, 137, 97, 
     58, 1, 0, 164, 5, 48, 3, 129, 1, 7, 225, 16, 2, 1, 1, 4, 11, 9, 1, 10, 10, 6, 
     2, 19, 105, 145, 103, 116, 226, 35, 48, 3, 194, 1, 242, 48, 3, 194, 1, 241, 48, 
     3, 194, 1, 246, 48, 3, 194, 1, 245, 48, 3, 194, 1, 244, 48, 3, 194, 1, 243, 48, 
     3, 194, 1, 247, 177, 13, 10, 1, 1, 4, 8, 10, 6, 2, 19, 105, 145, 103, 116, 0, 0, 
     42, 3, 0, 0, 48, 48, 48, 48, 48, 48, 48, 50, 9, 82, 101, 99, 101, 105, 118, 101, 
     9, 50, 51, 9, 77, 111, 110, 32, 32]

When I convert it to characters, it looks like this:

 irb(main):4392:0> a.map(&:chr).join
 => "\t 25859\tINFO\tINFO5\t 74 23 03 92 00 00 00 EC 02 10 00 E2 00 77\t\x00\x00\x00\x00
 \x01\f\x01\x00\x00\x009\xFEF\x06\x01\x06\x00\x03\x00\x03\xC6\x00\x02\xC6\x1F\x93\x17\x00
 \xE2\a\f\x11\x1287\x03e\x01\x01\x00\x86\a\x91\x05\x94%\x96\x85\xF1\x87\x05\x16m\x915&\xAB
 \x04\x03\x02\x06\xC0\xAD\x16\xA0\x140\x12\x06\t*\x86:\x00\x89a:\x01\x00\xA4\x050\x03\x81
 \x01\a\xE1\x10\x02\x01\x01\x04\v\t\x01\n\n\x06\x02\x13i\x91gt\xE2#0\x03\xC2\x01\xF20\x03
 \xC2\x01\xF10\x03\xC2\x01\xF60\x03\xC2\x01\xF50\x03\xC2\x01\xF40\x03\xC2\x01\xF30\x03\xC2
 \x01\xF7\xB1\r\n\x01\x01\x04\b\n\x06\x02\x13i\x91gt\x00\x00*\x03\x00\x000000..."

I would like to extract the hexadecimal values between INFO5\t and \t..., so the output would be

 "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

I tried the following, but it only removes the first unwanted part and leaves \n\n\x06...000 behind.

How can I fix this?

irb(main)>: a.map(&:chr).join.gsub(/(\t .*\t )|(\t.*)/,"")
=> "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77\n\n\x06\x02\x13i\x91gt\xE2#0
\x03\xC2\x01\xF20\x03\xC2\x01\xF10\x03\xC2\x01\xF60\x03\xC2\x01\xF50\x03\xC2
\x01\xF40\x03\xC2\x01\xF30\x03\xC2\x01\xF7\xB1\r\n\x01\x01\x04\b\n\x06\x02
\x13i\x91gt\x00\x00*\x03\x00\x0000000002"

Thanks for the help in advance.
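
A possible reading of why the tail survives, offered as a hedged note rather than something stated in the question: in Ruby, `.` does not match a newline unless the regex carries the `/m` flag, so `\t.*` stops at the first `\n` in the data. The sketch below reproduces the effect on a shortened, made-up string (an assumption, not the full data).

```ruby
# Shortened stand-in for the real data (an assumption, not the original bytes).
s = "\t 25859\tINFO5\t 74 23\t\x00\x01\n\n\x06\x02".b

# Without /m, "." stops at "\n", so everything after the first newline survives.
without_m = s.gsub(/(\t .*\t )|(\t.*)/, "")

# With /m, "." also matches "\n", so the trailing bytes are removed as well.
with_m = s.gsub(/(\t .*\t )|(\t.*)/m, "")

p without_m  # "74 23\n\n\x06\x02"
p with_m     # "74 23"
```

Adding `/m` is fragile, though: a later tab-plus-space pair in the binary tail would let the first alternative swallow the wanted field. Capturing the field directly, as the answers do, avoids that.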

UPDATE

A sample binary file is attached below.

input.dat

I suggest editing the title to be "How to extract a pattern from string containing binary data" (no need to mention "ruby" because it's already a tag)
– Kelvin, Commented Jan 25, 2019 at 18:28

Cary Swoveland · Accepted Answer · 2019-01-25 00:46:26Z

2

Here are two approaches (a below is abbreviated from that given in the question).

a = [9, 32, 50, 53, 56, 53, 57, 9, 73, 78, 70, 79, 9, 73, 78, 70, 79, 53, 9, 
     32, 55, 52, 32, 50, 51, 32, 48, 51, 32, 57, 50, 32, 48, 48, 32, 48, 48,
     32, 48, 48, 32, 69, 67, 32, 48, 50, 32, 49, 48, 32, 48, 48, 32, 69, 50,
     32, 48, 48, 32, 55, 55, 9, 0, 0]

Extract from the string that had been unpacked to create a

str = a.pack("C*")
  #=> "\t 25859\tINFO\tINFO5\t 74 23 03 92 00 00 00 EC 02 10 00 E2 00 77\t\x00\x00"

str[/(?<=INFO5\t).+?(?=\t)/].strip
  #=> "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

str is the string that had been converted to a (a = str.unpack("C*")), so it need not be computed.

(?<=INFO5\t) and (?=\t) are respectively a positive lookbehind and a positive lookahead. They must be matched but are not part of the match that is returned. The ("non-greedy") question mark in .+? ensures that the match terminates immediately before the first tab is encountered. By contrast, the greedy .+ runs on to the last tab:

"abc\td\tef"[/(?<=a).+(?=\t)/]
  #=> "bc\td"
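
For a side-by-side view, here is a minimal sketch on the same sample string comparing the greedy and non-greedy quantifiers. The negated character class in the last line is an alternative added for illustration and is not part of the answer.

```ruby
s = "abc\td\tef"

greedy  = s[/(?<=a).+(?=\t)/]      # runs to the last tab
lazy    = s[/(?<=a).+?(?=\t)/]     # stops before the first tab
negated = s[/(?<=a)[^\t]+(?=\t)/]  # cannot cross a tab at all

p greedy   # "bc\td"
p lazy     # "bc"
p negated  # "bc"
```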

Extract from a and convert to a string

def idx_start(a, arr)
  arr_size = arr.size
  # index of the first position in a at which the sub-array arr begins
  a.each_index.find { |i| a[i, arr_size] == arr }
end

pfix = "INFO5\t".unpack("C*")
  #=> [73, 78, 70, 79, 53, 9]
pfix_size = pfix.size
  #=> 6 
sfix = [pfix.last]
  #=> [9]
sfix_size = sfix.size
start = idx_start(a, pfix) + pfix_size
  #=> 19
a[start..idx_start(a[start..-1], sfix) + start - 1].pack("C*").strip
  #=> "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

edited Jan 25, 2019 at 0:46

answered Jan 24, 2019 at 23:05

Cary Swoveland


3 Comments

Ger Cas Over a year ago

Hi Cary. Thanks for the answer. When I try your solution I get ArgumentError: invalid byte sequence in UTF-8

Ger Cas Over a year ago

Hi Cary. Please see my update. I've attached a sample file containing the hex string. For me the errors happen only working with the binary file, but your solutions work if I apply them to a text string.

Ger Cas Over a year ago

Thanks so much for your updated solutions. You've given me more ideas to apply in my code.

Kelvin · Accepted Answer · 2019-01-25 18:20:55Z

1

I assume you mean a=str.unpack("C*") - you can unpack a string but not an array.

To get the result you want, you don't need to use unpack at all¹ - just perform a regex:

str.match(/INFO5\t(.*?)\t/).to_a[1]
# => " 74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

Note that there's a leading space in the result, but you can adjust the regex according to your needs; I'm not going to try to guess the specification of this format.

Tips:

The ? in .*? is needed to make the * non-greedy.
The to_a avoids raising an error in case the match finds nothing: match then returns nil, and nil.to_a is an empty array, so [1] yields nil.
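
A small sketch of the second tip, using a made-up string that does not contain the marker. The two extra spellings at the end (`String#[]` with a capture index, and the safe-navigation operator) are alternatives added here for illustration, not part of the answer.

```ruby
pattern  = /INFO5\t(.*?)\t/
no_match = "no marker in this string"

m    = no_match.match(pattern)   # nil, because nothing matched
safe = m.to_a[1]                 # nil.to_a is [], and [][1] is nil

unsafe = begin
  m[1]                           # nil has no [] method
rescue NoMethodError
  :no_method_error
end

# Two other nil-safe spellings:
via_index    = no_match[pattern, 1]
via_safe_nav = no_match.match(pattern)&.[](1)

# On a string that does match, the capture is returned as usual.
hit = "INFO5\t 74 23\t"[pattern, 1]

p safe, unsafe, via_index, via_safe_nav, hit
```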

EDIT

Your comment regarding "invalid byte sequence in UTF-8" indicates that your data is probably ASCII-8BIT (i.e. it's not compatible with UTF-8), but it's stored in a string whose encoding attribute is "UTF-8". It would help if you explain how you obtained that string, because the string's encoding appears to be wrong.

Solution 1 (this is ideal):

Read in the file as ASCII-8BIT:

str = File.read("input.dat", encoding: 'ASCII-8BIT')

Solution 2 (a workaround, if you can't control the input encoding):

# NOTE: this changes the encoding on `str`
str.force_encoding("ASCII-8BIT")

After you've done this, the .match should work.
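
To make the failure mode concrete, here is a minimal sketch with made-up bytes (an assumption, not the contents of input.dat). It builds a string that is tagged UTF-8 but holds a byte that is invalid in UTF-8, then matches against a binary copy. `String#b` is used here as a non-mutating alternative to `force_encoding`.

```ruby
# Made-up sample: readable text followed by two binary bytes (0xFE is invalid in UTF-8).
bytes = "INFO5\t 74 23\t".bytes + [0x00, 0xFE]
mislabeled = bytes.pack("C*").force_encoding("UTF-8")

p mislabeled.valid_encoding?   # false

# Matching the mislabeled string typically raises the error reported in the comments.
error = begin
  mislabeled.match(/INFO5\t(.*?)\t/)
  nil
rescue ArgumentError => e
  e.message
end
p error

# String#b returns a copy tagged ASCII-8BIT and leaves the original untouched.
binary = mislabeled.b
hex = binary.match(/INFO5\t(.*?)\t/).to_a[1].strip
p hex   # "74 23"
```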

Further Explanation

Your map(&:chr).join works because .chr produces either a US-ASCII or an ASCII-8BIT string (the latter for bytes above 127), never UTF-8.

When you join those strings, your result is in ASCII-8BIT if any byte was above 127. So this is effectively the same as calling force_encoding("ASCII-8BIT"), except that map/join doesn't modify the original string's encoding like force_encoding does.
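
The encodings described above can be checked directly. This is a small sketch with two arbitrary byte values chosen for illustration.

```ruby
low  = 65.chr    # byte below 128
high = 200.chr   # byte above 127

p low.encoding   # US-ASCII
p high.encoding  # ASCII-8BIT (printed as BINARY on newer Rubies)

# Joining the two pieces yields an ASCII-8BIT string with the same bytes as pack.
joined = [65, 200].map(&:chr).join
packed = [65, 200].pack("C*")

p joined.encoding
p joined == packed
```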

¹unpack is unnecessary because a.map(&:chr).join yields the same bytes as a.pack('C*'), which gives you the original str. Even if you had to unpack the string for another purpose, I recommend using the original string instead of re-packing the array. Maybe you can encapsulate this into a data structure, e.g.:

i_data = InfoData.new(str)
i_data.bytes  # array of bytes
i_data.hex_string  # "74 23 03 ..."

Note that the above code won't work as-is - you need to write the InfoData class yourself.
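
One way such a class could look is sketched below. Only the class name and the two method names come from the answer. The body, the HEX_FIELD constant and the choice of regex (borrowed from another answer in this thread) are assumptions for illustration.

```ruby
# Hypothetical wrapper: the body is an illustrative assumption, not the answerer's code.
class InfoData
  # Captures the text between "INFO5\t" (plus an optional space) and the next tab.
  HEX_FIELD = /INFO5\t ?([^\t]*)\t/

  def initialize(str)
    @raw = str.b   # binary (ASCII-8BIT) copy; the caller's string is left untouched
  end

  # Array of byte values, equivalent to str.unpack("C*").
  def bytes
    @raw.bytes
  end

  # The hex field as text, or nil when the marker is absent.
  def hex_string
    @raw[HEX_FIELD, 1]
  end
end

# Abbreviated array from the thread, packed back into a string.
a = [9, 32, 50, 53, 56, 53, 57, 9, 73, 78, 70, 79, 9, 73, 78, 70, 79, 53, 9,
     32, 55, 52, 32, 50, 51, 32, 48, 51, 32, 57, 50, 32, 48, 48, 32, 48, 48,
     32, 48, 48, 32, 69, 67, 32, 48, 50, 32, 49, 48, 32, 48, 48, 32, 69, 50,
     32, 48, 48, 32, 55, 55, 9, 0, 0]

i_data = InfoData.new(a.pack("C*"))
p i_data.hex_string   # "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"
p i_data.bytes == a   # true
```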

edited Jan 25, 2019 at 18:20

answered Jan 24, 2019 at 22:58

Kelvin


6 Comments

Ger Cas Over a year ago

Hi Kelvin, Thanks for your answer. When I try str.match(/INFO5\t(.*?)\t/).to_a[1] I get ArgumentError: invalid byte sequence in UTF-8 from (irb):4463:in match' from (irb):4463:in match' from (irb):4463:in block in irb_binding' from (irb):4454:in foreach' from (irb):4454

Ger Cas Over a year ago

When I try InfoData.new(str) I get this error i_data = InfoData.new(str) NameError: uninitialized constant InfoData

Ger Cas Over a year ago

It seems that even though a.map(&:chr).join = arr.pack('C*') = the original string, if I don't first do a=string.unpack("C*"), your code doesn't work and I receive invalid byte sequence in UTF-8 from

Ger Cas Over a year ago

Hi Kelvin. In my real file I read it using as line separator a byte sequence like this:

File.foreach("input.dat", sep="\x10\x12", encoding: 'ASCII-8BIT') do |line|
  next unless line =~ /INFO5/
  p line.match(/INFO5\t ?([^\t]*)\t/)[1]
end

so now with your explanation and solution 1 and 2 about ASCII-8BIT and UTF-8 I added encoding: ASCII-8BIT to the foreach like this File.foreach("input.dat", sep="\x10\x12", encoding: 'ASCII-8BIT') and now it works pretty fine. Many thanks for your help.

Ger Cas Over a year ago

One more question: to read this kind of file, where I use a byte sequence as the separator, is File.foreach() a good choice, or is there a better method?
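
The thread leaves this question open. As one hedged alternative, assuming the file fits comfortably in memory, `File.binread` followed by `split` gives the same records without any encoding option, since `binread` always returns an ASCII-8BIT string. The records and the temporary file below are made up for illustration.

```ruby
require "tempfile"

SEP = "\x10\x12".b   # record separator from the comment above, as a binary string

# Two made-up records; the first carries a byte (0xFE) that is invalid in UTF-8.
records = ["\t 1\tINFO\tINFO5\t 74 23 03\t\x00\xFE".b,
           "\t 2\tINFO\tINFO5\t EC 02 10\t\x01".b]

hex_fields = nil
Tempfile.create("input") do |f|
  f.binmode
  f.write(records.join(SEP))
  f.flush

  # Read everything as binary, split on the separator, keep records with a hex field.
  hex_fields = File.binread(f.path).split(SEP).filter_map do |rec|
    rec[/INFO5\t ?([^\t]*)\t/, 1]
  end
end

p hex_fields   # ["74 23 03", "EC 02 10"]
```

For very large files, the streaming `File.foreach` call with `encoding: 'ASCII-8BIT'` shown in the comment above avoids loading everything at once.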


mrzasa · Accepted Answer · 2019-01-24 23:55:35Z

1

I assume that you don't need the non-ASCII bytes, so as a first step I trim the array at the first null byte using take_while.
Then I convert the integers to a string using map(&:chr).join.
Finally, I match the string against the regex /INFO5\t ?([^\t]*)\t/, which assumes the interesting part lies between INFO5\t and the next \t.

--

a=array.unpack("C*")
a.take_while{|e| e > 0}.map(&:chr).join.match(/INFO5\t ?([^\t]*)\t/)[1]
# => "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

edited Jan 24, 2019 at 23:55

answered Jan 24, 2019 at 22:38

mrzasa


3 Comments

Ger Cas Over a year ago

Hi mrzasa, thanks for your help. It works only if I first do a=hexstring.unpack("C*") and then apply your code. If I try to apply the code to hexstring directly, I get ArgumentError: invalid byte sequence in UTF-8. This is what I run: hexstring.match(/INFO5\t ?([^\t]*)\t/)[1]

mrzasa Over a year ago

Please see the edit. I used the array you had in the question.

Ger Cas Over a year ago

Yes, what I mean is that hexstring looks the same as a.map(&:chr).join, but the code doesn't work if I apply it directly to hexstring. I tried that to find out whether I'm doing extra steps, but it seems I need to do a=hexstring.unpack("C*") first and then apply your code for it to work.
