Grep - list files that start with regex binary byte sequence?

Question

I want to list files that start with a certain byte sequence. My ideas are failing with identical behavior:

grep -Rl $'\A\xff\xd8' .
grep -Rl \A$'\xff\xd8' .
grep -RlP "\A\xff\xd8" .

A test file starting with ff d8 is not found, while 3 other files that have the byte sequence elsewhere in the file are found. I confirmed the first few bytes of my test file with hexdump -C.

00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 01  |......JFIF......|

I found multiple "almost" answers. I've explored hexdump, but prefer the speed of directly grepping rather than a lot of piping and looping through recursive filenames, with wrap around text exceptions. A prior question from 2-1/2 years ago, "File carving with Bash can't find hex values FFD8 or FFD9 with grep", is very close, but LC_ALL=C doesn't change the behavior. Playing with -a and -b doesn't change the behavior either.

What is the right way to do this? I'm using GNU grep 3.1.

Further study makes me think grep may have a problem. The code below shows that the 2-byte sequence is not found when it's not at the beginning. Then the 2-byte sequence IS found when it IS at the beginning. Also, on a real jpg file, the match is found when it is at the beginning. So far, so good.

dell@DELL-E6440:~$ echo $'\xffThis is a short test file I\xff\xd8 made' > junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  ff 54 68 69 73 20 69 73  20 61 20 73 68 6f 72 74  |.This is a short|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
dell@DELL-E6440:~$ echo $'\xff\xd8This is a short test file I\xff\xd8 made' > junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  ff d8 54 68 69 73 20 69  73 20 61 20 73 68 6f 72  |..This is a shor|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
junk.txt
dell@DELL-E6440:~$ hexdump -C avoid-powered.jpg | head -n1
00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 01  |......JFIF......|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" avoid-powered.jpg
avoid-powered.jpg
dell@DELL-E6440:~$

So, why is it matched in a larger file when it's NOT at the beginning? First, I show that a file that does not start with the 2-byte sequence is matched. Then, I keep only the beginning of the real file, and the 2-byte sequence is properly not found.

dell@DELL-E6440:~$ cp 130913-SEMSA.pdf junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  25 50 44 46 2d 31 2e 34  0a 31 20 30 20 6f 62 6a  |%PDF-1.4.1 0 obj|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
junk.txt
dell@DELL-E6440:~$ dd if=130913-SEMSA.pdf bs=10 count=1 of=junk.txt
1+0 records in
1+0 records out
10 bytes copied, 0.0062894 s, 1.6 kB/s
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  25 50 44 46 2d 31 2e 34  0a 31                    |%PDF-1.4.1|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
dell@DELL-E6440:~$

What can possibly be in the full-size file that makes a false match? grep should be looking only at the first 2 bytes of the file with the \A anchor.

Responding to dash-o's answer...

I considered the grep v3.3 manual https://www.gnu.org/software/grep/manual/grep.html which says,

-P Interpret patterns as Perl-compatible regular expressions (PCREs)

and a perl regex guide https://www.tutorialspoint.com/perl/perl_regular_expressions.htm says,

\A Matches beginning of string.

Also, the \A idea works as it's supposed to for printable byte sequences, and no documentation makes an exception for certain byte values or suggests that "line oriented" should negate the idea. Looking at the file utility, it's pretty cool for identifying file types, but I see no easy way to recurse directories and get a path/filename printed out, one per line, if and only if the file has an arbitrary leading byte sequence. Lastly, I'm sort of a bash guy. Yeah, I need to go learn Perl and Python more, but I'd sure like the universal bash/grep combo to work as documented.

stevesliva - I see -z to interpret nulls as EOL instead of the normal LF as EOL, but I'm not sure why one or the other matters. The regex caret (^) anchors to the beginning of a line, but I do not want that. I want to anchor to the beginning of the entire sequence of characters (\A).
– Brian, Commented Nov 6, 2019 at 1:03
Consider using -a -b to get the byte offset for the match with the 'false positive'. LC_ALL=C grep -P -a -b "\A\xff\xd8" junk.txt. It should report the offset of the match, which might give you some clue. — dash-o
– dash-o, Commented Nov 7, 2019 at 4:55
Also, consider that for big binary files, grep will have to somehow split the file into multiple blocks and run the RE against each of them. Maybe the false positive comes from one of those blocks.
– dash-o, Commented Nov 7, 2019 at 4:57
dash-o, Thanks for your ideas! I looked into -a and -b, and they make some differences without addressing the core problem. I did look through the files and yes, some of them had the byte sequence later in the file, so I got the same info differently. But knowing that, uhh... same problem exists. — Brian
– Brian, Commented Nov 7, 2019 at 14:18

dash-o · Accepted Answer · 2019-11-06 13:46:45Z

1

According to the grep manual, there is no support for '\A' anchoring, only for '^' and '$':

3.4 Anchoring
=============
The caret ‘^’ and the dollar sign ‘$’ are meta-characters that
respectively match the empty string at the beginning and end of a line.
They are termed “anchors”, since they force the match to be “anchored”
to beginning or end of a line, respectively.

Also, recall that grep is a line-oriented search utility. It has a few options to handle binary files (--binary-files=binary, text, without-match). None of them changes the 'nature' of the search: it will still look for the regexp in lines.
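
To illustrate what "line oriented" means here, the sketch below builds a small file whose first bytes are not ff d8 but whose second line begins with them, then counts the lines that match an anchored pattern. It uses the plain caret anchor under LC_ALL=C rather than -P with \A; the helper name count_lines_starting_with_ffd8 and the temporary file are made up for the illustration. If grep applies a -P pattern to each line separately (an assumption, not something the manual excerpt above states), \A would behave the same way, and a line starting with ff d8 somewhere inside a PDF could explain the false match.

```shell
# Hypothetical helper: count the lines of FILE that begin with bytes ff d8.
count_lines_starting_with_ffd8() {
  pat=$(printf '\377\330')          # the two raw bytes ff d8 (octal escapes)
  LC_ALL=C grep -a -c "^$pat" "$1"  # -a: treat binary data as text
}

demo=$(mktemp)
# The first line is plain text; the SECOND line starts with ff d8.
printf 'header line\n\377\330 payload\n' > "$demo"

head -c 2 "$demo" | od -An -tx1         # first two bytes of the file: 68 65
count_lines_starting_with_ffd8 "$demo"  # still counts one matching line
rm -f "$demo"
```

The file does not start with ff d8, yet the anchored pattern matches, because the anchor is tested at the start of every line and not only at the start of the file.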

Two options to consider:

If you are looking for a search on 'file types' (JPEG, PDF), consider using the file utility. It uses the 'magic' database to examine the file content and determine the 'file type'. It includes JPEG, PDF and more types.
Use another utility (sed, perl) that allows more control over location (e.g., you can limit the search to the first line of the file, etc.). You will need to spend more effort on setting up those filters. Personally, I would go with Perl if you take this route.
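
As a rough sketch of the second option that stays in the shell, the function below reads only the leading bytes of each file with head -c, converts them to hex with od, and compares the result with the wanted sequence. The function name list_files_starting_with and its two arguments (a directory and a lowercase hex string such as ffd8) are hypothetical, and head -c is assumed to be available, as it is in GNU coreutils. It starts several processes per file, so it will be slower than a single grep run.

```shell
# Hypothetical helper: print every regular file under DIR whose first
# bytes equal HEX (lowercase hex digits without spaces, e.g. ffd8).
list_files_starting_with() {
  dir=$1
  hex=$2
  nbytes=$(( ${#hex} / 2 ))   # two hex digits per byte
  find "$dir" -type f -exec sh -c '
    hex=$1; nbytes=$2; shift 2
    for f in "$@"; do
      # Read only the leading bytes and render them as a hex string.
      lead=$(head -c "$nbytes" -- "$f" | od -An -tx1 | tr -d " \n")
      if [ "$lead" = "$hex" ]; then
        printf "%s\n" "$f"
      fi
    done
  ' sh "$hex" "$nbytes" {} +
}

# Example: list files under the current directory that begin with ff d8.
list_files_starting_with . ffd8
```

Because only the first bytes are examined, an ff d8 sequence later in a file cannot produce a match.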


4 Comments

Brian Over a year ago

dash-o, Thank you. In the post I showed links specifying the \A anchor. The file utility is cool, but it works kind of in reverse: I want to search files recursively for a certain byte sequence, whereas file tests many byte sequences against a single file or a limited set of files.

dash-o Over a year ago

I would note that file can work on an arbitrary list of files, and you can customize the list of tests to perform only one test. The magic search patterns are more powerful (vs. grep) for asserting on binary data. I'll post an example as a separate answer.

Brian Over a year ago

dash-o, are you thinking maybe generate a list of all recurse files at run-time, and maybe create my own magic search pattern in the config file? I need to go read about file magic search patterns...

dash-o Over a year ago

Exactly as you described.
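
A minimal sketch of that idea, under the assumption that the installed file(1) accepts a plain-text magic file through -m: the function writes a one-rule magic file that tests the 16-bit big-endian value at offset 0, then runs file on every regular file that find produces. The function name list_ffd8_with_file and the label starts-with-ffd8 are invented for this example and are not part of any standard magic database.

```shell
# Hypothetical helper: list files under DIR that begin with bytes ff d8,
# using file(1) with a custom one-rule magic file.
list_ffd8_with_file() {
  dir=$1
  magic=$(mktemp)
  # Rule fields: offset 0, big-endian short, value 0xffd8, arbitrary label.
  printf '0\tbeshort\t0xffd8\tstarts-with-ffd8\n' > "$magic"
  find "$dir" -type f -exec sh -c '
    magic=$1; shift
    for f in "$@"; do
      # -b prints only the description; -m selects the custom magic file.
      case $(file -b -m "$magic" -- "$f") in
        starts-with-ffd8*) printf "%s\n" "$f" ;;
      esac
    done
  ' sh "$magic" {} +
  rm -f "$magic"
}

# Example: list matching files under the current directory.
list_ffd8_with_file .
```

Since the rule is tied to offset 0, an ff d8 sequence elsewhere in a file should not trigger it.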
