I have this array that comes from a previous a=array.unpack("C*") command.

a = [9, 32, 50, 53, 56, 53, 57, 9, 73, 78, 70, 79, 9, 73, 78, 70, 79, 53, 9, 
     32, 55, 52, 32, 50, 51, 32, 48, 51, 32, 57, 50, 32, 48, 48, 32, 48, 48, 32, 
     48, 48, 32, 69, 67, 32, 48, 50, 32, 49, 48, 32, 48, 48, 32, 69, 50, 32, 48, 
     48, 32, 55, 55, 9, 0, 0, 0, 0, 1, 12, 1, 0, 0, 0, 57, 254, 70, 6, 1, 6, 0, 3, 
     0, 3, 198, 0, 2, 198, 31, 147, 23, 0, 226, 7, 12, 17, 18, 56, 55, 3, 101, 1, 
     1, 0, 134, 7, 145, 5, 148, 37, 150, 133, 241, 135, 5, 22, 109, 145, 53, 38, 
     171, 4, 3, 2, 6, 192, 173, 22, 160, 20, 48, 18, 6, 9, 42, 134, 58, 0, 137, 97, 
     58, 1, 0, 164, 5, 48, 3, 129, 1, 7, 225, 16, 2, 1, 1, 4, 11, 9, 1, 10, 10, 6, 
     2, 19, 105, 145, 103, 116, 226, 35, 48, 3, 194, 1, 242, 48, 3, 194, 1, 241, 48, 
     3, 194, 1, 246, 48, 3, 194, 1, 245, 48, 3, 194, 1, 244, 48, 3, 194, 1, 243, 48, 
     3, 194, 1, 247, 177, 13, 10, 1, 1, 4, 8, 10, 6, 2, 19, 105, 145, 103, 116, 0, 0, 
     42, 3, 0, 0, 48, 48, 48, 48, 48, 48, 48, 50, 9, 82, 101, 99, 101, 105, 118, 101, 
     9, 50, 51, 9, 77, 111, 110, 32, 32]

When I convert it to characters, it looks like this:

 irb(main):4392:0> a.map(&:chr).join
 => "\t 25859\tINFO\tINFO5\t 74 23 03 92 00 00 00 EC 02 10 00 E2 00 77\t\x00\x00\x00\x00
 \x01\f\x01\x00\x00\x009\xFEF\x06\x01\x06\x00\x03\x00\x03\xC6\x00\x02\xC6\x1F\x93\x17\x00
 \xE2\a\f\x11\x1287\x03e\x01\x01\x00\x86\a\x91\x05\x94%\x96\x85\xF1\x87\x05\x16m\x915&\xAB
 \x04\x03\x02\x06\xC0\xAD\x16\xA0\x140\x12\x06\t*\x86:\x00\x89a:\x01\x00\xA4\x050\x03\x81
 \x01\a\xE1\x10\x02\x01\x01\x04\v\t\x01\n\n\x06\x02\x13i\x91gt\xE2#0\x03\xC2\x01\xF20\x03
 \xC2\x01\xF10\x03\xC2\x01\xF60\x03\xC2\x01\xF50\x03\xC2\x01\xF40\x03\xC2\x01\xF30\x03\xC2
 \x01\xF7\xB1\r\n\x01\x01\x04\b\n\x06\x02\x13i\x91gt\x00\x00*\x03\x00\x000000..."

I would like to extract the hexadecimal values between INFO5\t and \t..., so the output would be

 "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"     

I'm trying the code below, but it only removes the first unwanted part and leaves \n\n\x06...000

How can I fix this?

irb(main)>: a.map(&:chr).join.gsub(/(\t .*\t )|(\t.*)/,"")
=> "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77\n\n\x06\x02\x13i\x91gt\xE2#0
\x03\xC2\x01\xF20\x03\xC2\x01\xF10\x03\xC2\x01\xF60\x03\xC2\x01\xF50\x03\xC2
\x01\xF40\x03\xC2\x01\xF30\x03\xC2\x01\xF7\xB1\r\n\x01\x01\x04\b\n\x06\x02\
x13i\x91gt\x00\x00*\x03\x00\x0000000002"

Thanks for the help in advance.

UPDATE

A sample binary file is attached below.

input.dat

  • I suggest editing the title to be "How to extract a pattern from string containing binary data" (no need to mention "ruby" because it's already a tag) Commented Jan 25, 2019 at 18:28
  • Done! Thanks so much for the suggestion Commented Jan 26, 2019 at 5:30

3 Answers

Here are two approaches (a below is abbreviated from that given in the question).

a = [9, 32, 50, 53, 56, 53, 57, 9, 73, 78, 70, 79, 9, 73, 78, 70, 79, 53, 9, 
     32, 55, 52, 32, 50, 51, 32, 48, 51, 32, 57, 50, 32, 48, 48, 32, 48, 48,
     32, 48, 48, 32, 69, 67, 32, 48, 50, 32, 49, 48, 32, 48, 48, 32, 69, 50,
     32, 48, 48, 32, 55, 55, 9, 0, 0]

Extract from the string that had been unpacked to create a

str = a.pack("C*")
  #=> "\t 25859\tINFO\tINFO5\t 74 23 03 92 00 00 00 EC 02 10 00 E2 00 77\t\x00\x00"

str[/(?<=INFO5\t).+?(?=\t)/].strip
  #=> "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77" 

str is the string that had been converted to a (a = str.unpack("C*")), so it need not be computed.

(?<=INFO5\t) and (?=\t) are respectively a positive lookbehind and a positive lookahead. They must be matched but are not part of the match that is returned. The ("non-greedy") question mark in .+? ensures that the match terminates immediately before the first tab is encountered; strip then removes the leading space. By contrast,

"abc\td\tef"[/(?<=a).+(?=\t)/]
  #=> "bc\td" 

Extract from a and convert to a string

# Returns the index in a at which the sub-array arr first occurs, or nil.
# Defined first so that it exists before the calls below run.
def idx_start(a, arr)
  arr_size = arr.size
  a.each_index.find { |i| a[i, arr_size] == arr }
end

pfix = "INFO5\t".unpack("C*")
  #=> [73, 78, 70, 79, 53, 9]
pfix_size = pfix.size
  #=> 6 
sfix = [pfix.last]
  #=> [9]
start = idx_start(a, pfix) + pfix_size
  #=> 19
a[start..idx_start(a[start..-1], sfix) + start - 1].pack("C*").strip
  #=> "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

3 Comments

Hi Cary. Thanks for the answer. When I try your solution I get ArgumentError: invalid byte sequence in UTF-8
Hi Cary. Please see my update. I've attached a sample file containing the hex string. For me the errors happen only when working with the binary file; your solutions work if I apply them to a text string.
Thanks so much for your updated solutions. You've given me more ideas to apply in my code.

I assume you mean a=str.unpack("C*") - you can unpack a string but not an array.

To get the result you want, you don't need to use unpack at all [1] - just perform a regex:

str.match(/INFO5\t(.*?)\t/).to_a[1]
# => " 74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"

Note that there's a leading space in the result, but you can adjust the regex according to your needs; I'm not going to try to guess the specification of this format.

Tips:

  • The ? in .*? is needed to make the * non-greedy.
  • The to_a avoids raising an error in case the match finds nothing (nil.to_a is [], so [1] returns nil).
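
Both tips can be seen in a small sketch. The sample string here is made up for illustration (an abbreviated field followed by a few stand-in bytes and a second tab); it is not the data from the question.

```ruby
# Abbreviated, made-up sample: the bytes after the field stand in for binary data.
str = "\t 25859\tINFO\tINFO5\t 74 23 03 92\t\x00\x01\tmore".b

# Greedy: .* runs on to the LAST tab it can reach.
greedy = str.match(/INFO5\t(.*)\t/).to_a[1]

# Non-greedy: .*? stops at the FIRST tab after INFO5\t.
non_greedy = str.match(/INFO5\t(.*?)\t/).to_a[1]

# No match: match returns nil, nil.to_a is [], and [][1] is nil (no exception).
no_match = str.match(/INFO9\t(.*?)\t/).to_a[1]

p greedy      # " 74 23 03 92\t\x00\x01"
p non_greedy  # " 74 23 03 92"
p no_match    # nil
```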

EDIT

Your comment regarding "invalid byte sequence in UTF-8" indicates that your data is probably ASCII-8BIT (i.e. it's not compatible with UTF-8), but it's stored in a string whose encoding attribute is "UTF-8". It would help if you explain how you obtained that string, because the string's encoding appears to be wrong.

Solution 1 (this is ideal):

Read in the file as ASCII-8BIT:

str = File.read("input.dat", encoding: 'ASCII-8BIT')

Solution 2 (a workaround, if you can't control the input encoding):

# NOTE: this changes the encoding on `str`
str.force_encoding("ASCII-8BIT")

After you've done this, the .match should work.
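
As a rough illustration of Solution 2, here is a minimal sketch. The bytes are invented for the example (a short field followed by 0xFE, which is not valid UTF-8), and the variable names are mine. It uses dup so the original string is left alone, which is an extra step not needed if you are happy to change the encoding in place.

```ruby
# Made-up data: a UTF-8 tagged string carrying a byte that is invalid in UTF-8.
raw = "INFO5\t 74 23\t\xFE\x00"

p raw.encoding         # UTF-8
p raw.valid_encoding?  # false

# Matching a regex against the broken UTF-8 string raises ArgumentError.
error = begin
  raw.match(/INFO5\t(.*?)\t/)
  nil
rescue ArgumentError => e
  e
end
p error

# Re-tag a copy as binary; no bytes are changed, only the encoding label.
fixed = raw.dup.force_encoding("ASCII-8BIT")
hex = fixed.match(/INFO5\t(.*?)\t/).to_a[1]
p hex                  # " 74 23"
```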

Further Explanation

Your map(&:chr).join works because .chr produces either a US-ASCII or an ASCII-8BIT string (the latter for bytes above 127), never UTF-8.

When you join those strings, your result is in ASCII-8BIT if any byte was above 127. So this is effectively the same as calling force_encoding("ASCII-8BIT"), except that map/join doesn't modify the original string's encoding like force_encoding does.
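
A quick way to check this yourself, as a small sketch (it assumes Encoding.default_internal is left at its default of nil, since chr consults it):

```ruby
low  = 65.chr    # byte in 0..127
high = 200.chr   # byte in 128..255

p low.encoding   # US-ASCII
p high.encoding  # ASCII-8BIT (shown as BINARY by newer Rubies)

# Joining a mix of the two yields a binary string with the same bytes as pack.
joined = [73, 78, 200].map(&:chr).join
packed = [73, 78, 200].pack("C*")

p joined.encoding   # ASCII-8BIT
p joined == packed  # true
```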


[1] unpack is unnecessary because a.map(&:chr).join produces the same bytes as a.pack('C*'), which gives you the original str. Even if you had to unpack the string for another purpose, I recommend using the original string instead of re-packing the array. Maybe you can encapsulate this into a data structure, e.g.:

i_data = InfoData.new(str)
i_data.bytes  # array of bytes
i_data.hex_string  # "74 23 03 ..."

Note that the above code won't work as-is - you need to write the InfoData class yourself.
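
One possible shape for such a class, as a minimal sketch only. InfoData is hypothetical, and the regex assumes the field sits between INFO5\t and the next tab with an optional leading space; adjust it to the real format.

```ruby
# Hypothetical wrapper; the class name and methods follow the usage shown above.
class InfoData
  # Assumption: the hex field lies between "INFO5\t" and the next tab.
  FIELD = /INFO5\t ?([^\t]*)\t/

  def initialize(str)
    # String#b returns a binary (ASCII-8BIT) copy, so the caller's string
    # keeps its own encoding and regex matching cannot hit invalid UTF-8.
    @str = str.b
  end

  # Array of byte values, equivalent to unpack("C*").
  def bytes
    @str.bytes
  end

  # The captured field, or nil when the marker is not present.
  def hex_string
    @str[FIELD, 1]
  end
end

# Made-up sample: "INFO5\t 74 23\t" followed by two binary bytes.
sample = [73, 78, 70, 79, 53, 9, 32, 55, 52, 32, 50, 51, 9, 0, 254].pack("C*")
i_data = InfoData.new(sample)
p i_data.hex_string      # "74 23"
p i_data.bytes.first(5)  # [73, 78, 70, 79, 53]
```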

6 Comments

Hi Kelvin, thanks for your answer. When I try str.match(/INFO5\t(.*?)\t/).to_a[1] I get ArgumentError: invalid byte sequence in UTF-8 from (irb):4463:in match' from (irb):4463:in match' from (irb):4463:in block in irb_binding' from (irb):4454:in foreach' from (irb):4454
When I try InfoData.new(str) I get this error: i_data = InfoData.new(str) NameError: uninitialized constant InfoData
It seems that even though a.map(&:chr).join = arr.pack('C*') = the original string, your code doesn't work unless I first do a=string.unpack("C*"); otherwise I receive invalid byte sequence in UTF-8
Hi Kelvin. In my real file I use a byte sequence as the line separator when reading: File.foreach("input.dat", sep="\x10\x12", encoding: 'ASCII-8BIT') do |line| next unless line =~ /INFO5/ p line.match(/INFO5\t ?([^\t]*)\t/)[1] end. With your explanation and solutions 1 and 2 about ASCII-8BIT and UTF-8, I added encoding: 'ASCII-8BIT' to the foreach, like this: File.foreach("input.dat", sep="\x10\x12", encoding: 'ASCII-8BIT'), and now it works fine. Many thanks for your help.
One more question: to read this kind of file, where I use a byte sequence as the separator, is File.foreach() a good choice, or is there a better method?
  1. I assume that you don't need the non-ASCII bytes, so in the first step I trim the array at the first null byte using take_while
  2. Then I convert the ints to a string using map(&:chr).join
  3. Finally I match the string against the regex /INFO5\t ?([^\t]*)\t/, which assumes the interesting part is between INFO5\t and the next \t

--

a=array.unpack("C*")
a.take_while{|e| e > 0}.map(&:chr).join.match(/INFO5\t ?([^\t]*)\t/)[1]
# => "74 23 03 92 00 00 00 EC 02 10 00 E2 00 77"
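
Note that .match(...)[1] raises NoMethodError when the marker is missing, because match returns nil. A variant that returns nil instead could look like the sketch below. It swaps in pack("C*") for map(&:chr).join and String#[] with a capture index for match; the array is a shortened, made-up sample, not the one from the question.

```ruby
# Shortened, made-up byte array: "\tINFO5\t 74 23\t" followed by binary bytes.
a = [9, 73, 78, 70, 79, 53, 9, 32, 55, 52, 32, 50, 51, 9, 0, 0, 1, 12]

# Keep everything before the first null byte and turn it back into a string.
text = a.take_while { |e| e > 0 }.pack("C*")

# String#[] with a regex and a group index returns the capture, or nil.
found   = text[/INFO5\t ?([^\t]*)\t/, 1]
missing = "no marker here"[/INFO5\t ?([^\t]*)\t/, 1]

p found    # "74 23"
p missing  # nil
```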

3 Comments

Hi mrzasa, thanks for your help. It works only if I first do a=hexstring.unpack("C*") and then apply your code. If I apply the code to hexstring directly, like this: hexstring.match(/INFO5\t ?([^\t]*)\t/)[1], I get ArgumentError: invalid byte sequence in UTF-8
Please see the edit. I used the array you had in the question
Yes, what I mean is that hexstring looks the same as a.map(&:chr).join, but the code doesn't work if I apply it directly to hexstring. I tried that to find out whether I'm doing extra steps, but it seems I need to do a=hexstring.unpack("C*") first and then apply your code for it to work.
