14

I have a bunch of files and I want to find which ones contain consecutive lines starting with a certain string.

For example, for the following file:

Aaaaaaaaaaaa
Baaaaaaaaaaa
Cxxxxxxxxx
Cyyyyyyyyy
Czzzzzzzzz
Abbbbbbbbbbb
Bbbbbbbbbbbb
Caaaaaa
Accccccccccc
Bccccccccccc
Cdddddd
Ceeeeee

There is more than one line starting with 'C', so I want this file to be found by the command.
Whereas for the following file:

Aaaaaaaaaaaa
Baaaaaaaaaaa
Cxxxxxxxxx
Abbbbbbbbbbb
Bbbbbbbbbbbb
Caaaaaa
Accccccccccc
Bccccccccccc
Cdddddd

There is always just one line starting with 'C', so I don't want this file. I thought of using grep or sed but I don't know exactly how to do it. Maybe using a regexp like ^C.*$^C or something like that. Any ideas?

11
  • There are two lines starting with C in your second example. Commented Mar 25, 2014 at 13:40
  • 5
    This question is unclear. Are you looking for files that have more than one consecutive line starting with C? Commented Mar 25, 2014 at 13:40
  • Yes this is what I want. Sorry for the misunderstanding. Commented Mar 25, 2014 at 14:03
  • 2
    @terdon, it looks like multi-line searches with -P worked until 2.5.4 and not anymore after that, though I can't find anything in the changelog that would explain why. Commented Mar 25, 2014 at 16:33
  • 1
    @Graeme you might want to undelete your answer, see Stephane's comment, apparently it does work for some older grep versions. Commented Mar 25, 2014 at 16:37

6 Answers

7

With pcregrep:

pcregrep -rMl '^C.*\nC' .

POSIXly:

find . -type f -exec awk '
  FNR==1 {last=0; printed=0}
  printed {next}
  /^C/ {if (last) {print FILENAME; printed=1; nextfile} else last=1; next}
  {last=0}' {} +

(though that means reading all the files fully with those awk implementations that don't support nextfile).
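As a quick sanity check, the same awk program can be run on a throwaway file (the path below is illustrative):

```shell
# Build a test file whose 2nd and 3rd lines start with C.
printf 'Axxx\nCyyy\nCzzz\n' > /tmp/demo.txt

# The program prints the file name on the first pair of consecutive C lines.
awk '
  FNR==1 {last=0; printed=0}
  printed {next}
  /^C/ {if (last) {print FILENAME; printed=1; nextfile} else last=1; next}
  {last=0}' /tmp/demo.txt
# prints: /tmp/demo.txt
```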


With versions of GNU grep up to 2.5.4:

grep -rlP '^C.*\nC' .

appears to work, but only by accident, and it is not guaranteed to.

Before it was fixed in 2.6 (by this commit), GNU grep had overlooked that the pcre searching function it was using would match on the whole buffer currently processed by grep, causing all sorts of surprising behavior. For instance:

grep -P 'a\s*b'

would match on a file containing:

bla
bla

This would match:

printf '1\n2\n' | grep -P '1\n2'

But this:

(printf '1\n'; sleep 1; printf '2\n') | grep -P '1\n2'

Or:

(yes | head -c 32766; printf '1\n2\n') > file; grep -P '1\n2' file

would not (as the 1\n2\n is across two buffers processed by grep).

That behaviour ended up being documented though:

15- How can I match across lines?

Standard grep cannot do this, as it is fundamentally line-based. Therefore, merely using the '[:space:]' character class does not match newlines in the way you might expect. However, if your grep is compiled with Perl patterns enabled, the Perl 's' modifier (which makes '.' match newlines) can be used:

     printf 'foo\nbar\n' | grep -P '(?s)foo.*?bar'

After it was fixed in 2.6, the documentation was not amended (I once reported it there).

4
  • Is there any reason not to use exit and -exec \; instead of nextfile? Commented Mar 25, 2014 at 14:48
  • @terdon, that would mean running one awk per file. You'd want to do that only if your awk doesn't support nextfile and you've got a large proportion of files that are large and have matching lines towards the beginning of the file. Commented Mar 25, 2014 at 14:51
  • How about this grep technique (I guess with more recent versions of GNU grep) that facilitates multiline matches by making the whole file look like a single string, setting the line terminator to NUL? Would you be aware of any limitations to it? Commented Mar 27, 2014 at 12:28
  • 1
    @1_CR, That would load the whole file in memory if there's no NUL character in there and that assumes lines don't contain NUL characters. Also note that older versions of GNU grep (which the OP has) can't use -z with -P. There's no \N without -P, you'd need to write it $'[\01-\011\013-\0377]' which would only work in C locales (see thread.gmane.org/gmane.comp.gnu.grep.bugs/5187) Commented Mar 27, 2014 at 13:52
3

With awk:

awk '{if (p ~ /^C/ && $1 ~ /^C/) print; p=$1}' afile.txt

This prints every line that both starts with C and follows a line starting with C. The expression (p ~ /^C/ && $1 ~ /^C/) compares successive lines: p holds the first field of the previous line and $1 the first field of the current one, so the condition is true when both start with C, and in that case the current line is printed.

In order to find all the files that have such a pattern, you can run the above awk through a find command:

find /your/path -type f -exec awk '{if (p ~ /^C/ && $1 ~ /^C/) {print FILENAME; exit;} p=$1}' {} \;

In this command, find + -exec goes through each of the files and runs the same awk filter on each one, printing its name via FILENAME if the expression ever evaluates to true. The exit statement stops awk at the first match, which avoids printing FILENAME multiple times for a single file with several matches (thanks @terdon).
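For instance, on the question's first sample (path is illustrative), the filter prints the second and third of the consecutive C lines:

```shell
printf 'Cxxx\nCyyy\nCzzz\nAbbb\n' > /tmp/sample.txt

# p holds the previous line's first field; a line is printed when both
# the previous and the current line start with C.
awk '{if (p ~ /^C/ && $1 ~ /^C/) print; p=$1}' /tmp/sample.txt
# prints:
# Cyyy
# Czzz
```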

3
  • My question was not clear enough: I want to know the names of the files with more than one consecutive line starting with C Commented Mar 25, 2014 at 14:09
  • @Jérémie I updated my answer. Commented Mar 25, 2014 at 14:31
  • Could you please add an explanation of how this works? Also, there's no need for flag, just exit instead. That way, you don't need to keep processing files after a match has been found. Commented Mar 25, 2014 at 14:41
2

Yet another option with GNU sed:

For a single file:

sed -n -- '/^C/{n;/^C/q 1}' "$file" || printf '%s\n' "$file"

(though it will also report the files it cannot read).
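For example (illustrative path, GNU sed assumed):

```shell
f=/tmp/two.txt
printf 'Cxxx\nCyyy\n' > "$f"

# sed quits with status 1 on two consecutive C lines, so || fires.
sed -n -- '/^C/{n;/^C/q 1}' "$f" || printf '%s\n' "$f"
# prints: /tmp/two.txt
```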

For find:

find . -type f ! -exec sed -n '/^C/{n;/^C/q 1}' {} \; -print

The problem with unreadable files being printed can be avoided by writing it:

find . -type f -size +2c -exec sed -n '$q1;/^C/{n;/^C/q}' {} \; -print
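The exit-status logic can be checked on two throwaway files (paths illustrative, GNU sed assumed):

```shell
printf 'Cxxx\nCyyy\nAzzz\n' > /tmp/two.txt   # has consecutive C lines
printf 'Cxxx\nAyyy\nAzzz\n' > /tmp/one.txt   # has none

# q exits 0 on a match; $q1 exits 1 when the last line is reached first.
sed -n '$q1;/^C/{n;/^C/q}' /tmp/two.txt && echo 'match'
sed -n '$q1;/^C/{n;/^C/q}' /tmp/one.txt || echo 'no match'
```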
3
  • Can you please detail the sed -n '$q1;/^C/{n;/^C/q}' ? Commented Mar 26, 2014 at 15:13
  • Anyone to explain me ? Commented Mar 28, 2014 at 7:47
  • @Jérémie $q1 forces sed to quit with exit status 1 if the last line is reached without the pattern having been found. It will also finish with an error if something is wrong with the file (it's unreadable or broken). So it quits with a 0 exit status only when the pattern is found, and only then is the file name passed to -print. The /^C/{n;/^C/q} part is pretty simple: if sed finds a line that starts with C, it reads the next line, and if that one also starts with C, it quits with a zero exit status. Commented Mar 28, 2014 at 10:28
1

Assuming your files are small enough to be read into memory:

perl -000ne 'print "$ARGV\n" if /^C[^\n]*\nC/sm' *

Explanation:

  • -000 : set the record separator to paragraph mode, which treats each paragraph (a block of lines separated by blank lines) as a single record.
  • -ne : apply the script given as an argument to -e to each record of the input file(s).
  • $ARGV : the file currently being processed.
  • /^C[^\n]*\nC/ : match C at the beginning of a line (see the description of the sm modifiers below for why this works here), followed by 0 or more non-newline characters, a newline, and then another C. In other words, find consecutive lines starting with C.
  • /sm : these match modifiers are documented as follows:

    • m : Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of line only at the left and right ends of the string to matching them anywhere within the string.

    • s: Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

You could also do something ugly like:

for f in *; do perl -pe 's/\n/%%/' "$f" | grep -q 'C[^%]*%%C' && echo "$f"; done

Here, the perl code replaces newlines with %% so, assuming you have no %% in your input file (big if of course), the grep will match consecutive lines starting with C.

1

SOLUTION:

( set -- *files ; for f ; do (
set -- $(printf %c\  `cat <$f`)
while [ $# -ge 1 ] ;do [ -z "${1#"$2"}" ] && {
    echo "$f"; break ; } || shift
done ) ; done )

DEMO:

First, we'll create a test base:

abc="a b c d e f g h i j k l m n o p q r s t u v w x y z" 
for l in $abc ; do { i=$((i+1)) h= c= ;
    [ $((i%3)) -eq 0 ] && c="$l" h="${abc%"$l"*}"
    line="$(printf '%s ' $h $c ${abc#"$h"})"
    printf "%s$(printf %s $line)\n" $line >|/tmp/file${i}
} ; done

The above creates 26 files in /tmp named file1-26. In each file there are 27 or 28 lines beginning with the letters a-z and followed by the rest of the alphabet. Every 3rd file contains two consecutive lines in which the first character is duplicated.

SAMPLE:

cat /tmp/file12
...
aabcdefghijkllmnopqrstuvwxyz
babcdefghijkllmnopqrstuvwxyz
cabcdefghijkllmnopqrstuvwxyz
...
kabcdefghijkllmnopqrstuvwxyz
labcdefghijkllmnopqrstuvwxyz
labcdefghijkllmnopqrstuvwxyz
mabcdefghijkllmnopqrstuvwxyz
...

And when I change:

set -- *files

to:

set -- /tmp/file[0-9]*

I get...

OUTPUT:

/tmp/file12
/tmp/file15
/tmp/file18
/tmp/file21
/tmp/file24
/tmp/file3
/tmp/file6
/tmp/file9

So, in brief, the solution works like this:

  • it sets a subshell's positional parameters to all of your files, and, for each one,
  • sets a nested subshell's positional parameters to the first character of each line in that file as it loops,
  • [ tests ] whether stripping $2 from the front of $1 leaves an empty string, which for these one-character parameters means $1 and $2 are equal, indicating a match, and if so
  • echoes the filename then breaks out of the current loop,
  • else shifts to the next single-character positional to try again.
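The [ -z "${1#"$2"}" ] test can be seen in isolation: stripping $2 from the front of $1 leaves an empty string exactly when the two one-character positionals are equal (values below are illustrative):

```shell
set -- C C A
# ${1#"$2"} removes $2 as a prefix of $1; an empty result means they match.
[ -z "${1#"$2"}" ] && echo 'first two positionals match'

set -- C A C
[ -z "${1#"$2"}" ] || echo 'first two positionals differ'
```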

0

This script uses grep and cut to obtain the line numbers of the matching lines, then checks whether any two of them are consecutive. It assumes a valid filename is passed as the first argument:

#!/bin/bash

checkfile () {
  echo "checking $1"
  prv=-2
  while read -r linenum; do
      # two consecutive line numbers mean two consecutive matching lines
      if [ "$linenum" -eq $((prv + 1)) ]; then return 0; fi
      prv=$linenum
  done < <(grep -n '^C' "$1" | cut -d: -f1)
  return 1
}

if checkfile "$1"; then
   echo "Consecutive matching lines found in file $1"
fi
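The grep | cut pipeline at the heart of the script just extracts the matching line numbers; two consecutive numbers then mean two consecutive matching lines (path is illustrative):

```shell
printf 'Axxx\nCyyy\nCzzz\n' > /tmp/sample

# Line numbers of lines starting with C:
grep -n '^C' /tmp/sample | cut -d: -f1
# prints:
# 2
# 3
```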
