I want to list files that start with a certain byte sequence. My ideas are failing with identical behavior:
grep -Rl $'\A\xff\xd8' .
grep -Rl \A$'\xff\xd8' .
grep -RlP "\A\xff\xd8" .
A test file starting with ff d8 is not found, while 3 other files that have the byte sequence elsewhere in the file are found. I confirmed the first few bytes of my test file with hexdump -C:
00000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01 |......JFIF......|
I found multiple "almost" answers. I've explored hexdump, but prefer the speed of directly grepping rather than a lot of piping and looping through recursive filenames, with wrap around text exceptions. A prior question from 2-1/2 years ago, "File carving with Bash can't find hex values FFD8 or FFD9 with grep", is very close, but LC_ALL=C doesn't change the behavior. Playing with -a and -b doesn't change it either.
What is the right way to do this? I'm using GNU grep 3.1.
Update: Further study makes me think grep may have a problem. The code below shows that the 2-byte sequence is not found when it's not at the beginning. Then the 2-byte sequence IS found when it IS at the beginning. Also, on a real jpg file, the match is found when it is at the beginning. So far, so good.
dell@DELL-E6440:~$ echo $'\xffThis is a short test file I\xff\xd8 made' > junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000 ff 54 68 69 73 20 69 73 20 61 20 73 68 6f 72 74 |.This is a short|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
dell@DELL-E6440:~$ echo $'\xff\xd8This is a short test file I\xff\xd8 made' > junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000 ff d8 54 68 69 73 20 69 73 20 61 20 73 68 6f 72 |..This is a shor|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
junk.txt
dell@DELL-E6440:~$ hexdump -C avoid-powered.jpg | head -n1
00000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01 |......JFIF......|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" avoid-powered.jpg
avoid-powered.jpg
dell@DELL-E6440:~$
So, why is it matched in a larger file when it's NOT at the beginning? First, I show that a file that does not start with the 2-byte sequence is matched. Then I keep only the first 10 bytes of that same file, and the 2-byte sequence is properly not found.
dell@DELL-E6440:~$ cp 130913-SEMSA.pdf junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000 25 50 44 46 2d 31 2e 34 0a 31 20 30 20 6f 62 6a |%PDF-1.4.1 0 obj|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
junk.txt
dell@DELL-E6440:~$ dd if=130913-SEMSA.pdf bs=10 count=1 of=junk.txt
1+0 records in
1+0 records out
10 bytes copied, 0.0062894 s, 1.6 kB/s
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000 25 50 44 46 2d 31 2e 34 0a 31 |%PDF-1.4.1|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
dell@DELL-E6440:~$
What can possibly be in the full-size file that causes a false match? With \A, grep should be looking only at the first 2 bytes of the file.
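One hypothesis worth testing here, stated as an assumption rather than something established in this thread: grep splits its input into newline-terminated lines and applies the pattern to each line separately, so under -P the \A anchor would match at the start of every line instead of only at byte 0 of the file. A binary file such as a PDF can contain a 0a byte followed by ff d8, for example at the start of an embedded JPEG stream. The sketch below builds a small synthetic file (demo.bin and line_anchor_demo are made-up names) in which ff d8 appears only right after a newline, then counts matching lines.

```shell
# Sketch: check whether \A under grep -P anchors per line rather than per file.
# Assumes a grep built with -P (PCRE) support; line_anchor_demo is a made-up name.
line_anchor_demo() {
    tmpdir=$(mktemp -d) || return 1
    # Three bytes of text, a newline, then ff d8: the sequence is NOT at byte 0.
    printf 'abc\n\377\330def\n' > "$tmpdir/demo.bin"
    # -c prints the number of matching lines in the file.
    LC_ALL=C grep -cP '\A\xff\xd8' "$tmpdir/demo.bin"
    rm -rf "$tmpdir"
}
```

If this prints a nonzero count, the full-size PDF most likely has a line that begins with ff d8 somewhere after the first byte, which would also explain why the 10-byte truncated copy no longer matches.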
Responding to dash-o's answer...
I considered the grep v3.3 manual https://www.gnu.org/software/grep/manual/grep.html which says,
-P Interpret patterns as Perl-compatible regular expressions (PCREs)
and a perl regex guide https://www.tutorialspoint.com/perl/perl_regular_expressions.htm says,
\A Matches beginning of string.
Also, the \A idea works as it's supposed to for printable byte sequences, and no documentation makes an exception for certain byte values or suggests that "line oriented" should negate the idea. Looking at the file utility, it's pretty cool for identifying file types, but I see no easy way to recurse directories and print a path/filename, one per line, if and only if the file has an arbitrary leading byte sequence. Lastly, I'm sort of a bash guy .. yea.. I need to go learn perl and python more .. but I'd sure like the universal bash/grep combo to work as documented.
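For the recursive listing described above, one alternative that sidesteps grep's line handling is to compare only the first bytes of each file. This is a minimal sketch, assuming find, head -c and od are available; starts_with_hex is a hypothetical helper name, and the signature is passed as lowercase hex without spaces.

```shell
# Sketch: print each regular file under a directory whose first bytes equal a
# hex signature. starts_with_hex is a hypothetical name, not an existing tool.
starts_with_hex() {
    dir=$1
    sig=$2                      # lowercase hex, no spaces, e.g. ffd8
    n=$(( ${#sig} / 2 ))        # number of leading bytes to compare
    find "$dir" -type f -exec sh -c '
        sig=$1; n=$2; shift 2
        for f do
            # Read only the first n bytes and render them as plain hex.
            got=$(head -c "$n" "$f" | od -An -tx1 | tr -d " \n")
            if [ "$got" = "$sig" ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$sig" "$n" {} +
}
```

Usage would look like `starts_with_hex . ffd8`, which prints one path per line. Because each file is read only up to the signature length, bytes later in the file cannot produce a match.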
grep -z '^'$'\xff\xd8'?
Try -a -b to get the byte offset for the match with the 'false positive': LC_ALL=C grep -P -a -b "\A\xff\xd8" junk.txt. It should report the offset of the match, which might give you some clue.