
I have over forty files with the following structure:

file1 first 21 lines

8191 M0
139559 M1
79 M10
1 M10007
1 M1006
1 M10123

file2 first 21 lines

8584 M0
119837 M1
72 M10
1 M10003
1 M10045
1 M1014

file3 first 21 lines

9090 M0
137373 M1
73 M10
1 M10046
2 M101
1 M1039

where the first column is the number of occurrences of an M-pattern (the two columns are tab-separated). Now, the thing is that these M-patterns are partly shared across all files, and each file contains 700-800 of them in total.

What I wish to do is use AWK to extract only those M-patterns common to all forty-plus files (say ~600) along with their counts (the first column). Ideally, the final file would have forty-plus count columns plus one for the shared M-pattern, in no particular order, as I can then sort on the last (M-pattern) column. Something like this, I imagine:

file1 file2 file3 M-pattern
8191 8584 9090 M0
139559 119837 137373 M1
79 72 73 M10

In theory, since AWK processes files sequentially, I should be able to prepend a header afterward that reflects the order in which the files were added, using something like sed. Any help is much appreciated; thanks in advance!
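That post-hoc header step need not even involve sed — a sketch using plain printf and cat instead (merged.tsv and final.tsv are hypothetical names for the headerless awk output and the finished file):

```shell
# Prepend a header line matching the order the files were merged in,
# then append the merged counts from the headerless awk output.
{ printf 'file1\tfile2\tfile3\tM-pattern\n'; cat merged.tsv; } > final.tsv
```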


So far I have attempted the following, found in other related answers:

awk 'FNR==NR{a[$0];next} $0 in a'  one  two

but it does not act on column $2, where my M-patterns are, and I don't think I understood how to modify it to do so either...
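For what it's worth, that two-file idiom can be keyed on $2 instead of the whole line — a minimal sketch of the adjustment, using the file names one and two from the attempt above:

```shell
# Store each M-pattern's count from the first file keyed on $2, then,
# for patterns also present in the second file, print both counts.
awk 'FNR==NR { a[$2]=$1; next } $2 in a { print a[$2], $1, $2 }' one two
```

This only scales to two files, though, which is why the answers track a per-file counter instead.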

  • are all files guaranteed to have at least one row of data? Commented Jun 11 at 19:28
  • Are the M values unique? What if more than one M0 is found? Commented Jun 11 at 19:30
  • @markp-fuso yes, I already checked: they range from the low 700s to the high 800s. I'm updating the example with three more rows each for file2 and file3. Commented Jun 11 at 19:30
  • @dawg they are indeed unique; I already ran the appropriate sort and uniq -c to get the final output I wish to process. Commented Jun 11 at 19:31
  • @markp-fuso I just did it sorry, it took me a bit of time to do it manually... I hope I didn't miss anything. Commented Jun 11 at 19:54

6 Answers


Using GNU awk for arrays of arrays, ARGIND, and sorted_in:

$ cat tst.awk
{ cnts[$2][ARGIND] += $1 }
END {
    OFS = "\t"

    for ( key in cnts ) {
        if ( length(cnts[key]) == ARGIND ) {
            goodKeys[key]
        }
    }

    if ( length(goodKeys) ) {
        for ( fileNr=1; fileNr<=ARGIND; fileNr++ ) {
            printf "%s%s", ARGV[fileNr], OFS
        }
        print "M-pattern"

        PROCINFO["sorted_in"] = "@ind_str_asc"
        for ( key in goodKeys ) {
            for ( fileNr=1; fileNr<=ARGIND; fileNr++ ) {
                printf "%d%s", cnts[key][fileNr], OFS
            }
            print key
        }
    }
}

$ awk -f tst.awk file1 file2 file3
file1   file2   file3   M-pattern
8191    8584    9090    M0
139559  119837  137373  M1
79      72      73      M10

The above works even if a given M-pattern could occur multiple times in an input file; it would just sum the counts for that M-pattern in that file. It won't produce any output if no M-pattern exists in all files. It prints the file names in the order provided on the command line and sorts the M-patterns alphabetically in the output.


1 Comment

many thanks! I had the chance to test it and it works perfectly! Also, it's nice that it preserves the filenames as headers, so I don't have to work on them afterward.

Assumptions/understandings:

  • every file contains at least 1 data row
  • matching m-patterns have the same capitalization (ie, we do not have to worry about case-insensitive matching of m-patterns)
  • each file consists of a unique set of m-patterns (ie, an m-pattern will never show up more than once in a given file)
  • we do not need to worry about sorting the output

One awk approach:

awk '
BEGIN  { OFS = "\t" }                                              # define output field separator

FNR==1 { fcnt++                                                    # 1st row of each new file: keep track of number of files
         fnames[fcnt]=FILENAME                                     # save file name
       }

       { mcounts[$2]++                                             # for every data row: keep track of how many times we have seen this particular m-pattern
         occurrences[$2,fcnt]=$1                                   # store the occurrence data for this m-pattern + file (counter) combo
       }

END    { for (fnum=1; fnum<=fcnt; fnum++)                          # after processing all files: loop through list of files and ...
             printf "%s%s", fnames[fnum], OFS                      # print header line
         print "M-pattern"                                         # terminate printf output line

         for (mpat in mcounts)                                     # for each m-pattern ...
             if (mcounts[mpat] == fcnt) {                          # if we saw this m-pattern "fcnt" times then ...
                for (fnum=1; fnum<=fcnt; fnum++)                   # loop through our list of files and ...
                    printf "%s%s", occurrences[mpat,fnum], OFS     # print our line of occurrences
                print mpat                                         # terminate printf output line with actual m-pattern
             }
       }
' file1 file2 file3

This generates:

file1   file2   file3   M-pattern
79      72      73      M10
8191    8584    9090    M0
139559  119837  137373  M1

NOTE: actual output ordering may vary based on how awk hashes the various array indices
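If a fixed order is needed, the data rows can be sorted downstream while keeping the header line in place — a sketch assuming three input files (so the M-pattern is column 4), with merge.awk standing in for the script above saved to a file:

```shell
# Pass the header line through untouched, then sort the remaining
# rows on the M-pattern column (column 4 with three input files).
awk -f merge.awk file1 file2 file3 |
  { IFS= read -r hdr; printf '%s\n' "$hdr"; sort -k4,4; }
```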

1 Comment

thanks for the alternative version, as well as for the insights into what the various bits and pieces do. It will definitely be helpful in future cases!
awk -v OFS='\t' '
    BEGIN {
        for (i=1; i<ARGC; ++i)
            h = h ARGV[i] OFS
    }
    {
        v[$2] = v[$2] $1 OFS
        ++n[$2]
    }
    END {
        print h "M-pattern"
        for (m in n)
            if (n[m]==ARGC-1)
                print v[m] m
    }
' file1 file2 file3
  • build header h
  • build value rows v[m]
  • count times m-pattern has been seen n[m]
  • at end, output header and any full rows

Assumes m-patterns are unique within each file.
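If that assumption did not hold, one possible variant (a sketch, not the answerer's code, with the header building omitted for brevity) sums repeats per file and counts each file at most once per pattern:

```shell
# Hypothetical variant tolerating repeated M-patterns within a file:
# sum[] accumulates counts per file/pattern pair, seen[] ensures each
# file is counted once per pattern, and nf tracks the number of files.
awk -v OFS='\t' '
    FNR==1 { nf++ }
    { sum[$2, nf] += $1
      if ( !(($2, nf) in seen) ) { seen[$2, nf]; n[$2]++ }
    }
    END {
        for (m in n)
            if (n[m] == nf) {
                row = ""
                for (f = 1; f <= nf; f++) row = row sum[m, f] OFS
                print row m
            }
    }
' file1 file2 file3
```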

1 Comment

if any file is empty, it is impossible for any pattern to be common to all files. In that situation, only the header line will print

With any POSIX awk:

$ cat foo.awk
BEGIN { for(i = 1; i < ARGC; i++) s0 = s0 "\t" }

function insert(s, v, p) {
  if(! s) s = s0
  match(s, "^([^" FS "]*" FS "){" p - 1 "}")
  return substr(s, 1, RLENGTH) v substr(s, RLENGTH + 1)
}

FNR==1 { header = insert(header, FILENAME, ++nf) }

{
  line[$2] = insert(line[$2], $1, nf)
  cnt[$2] += 1
}

END {
  for(m in cnt) if(cnt[m] != nf) delete line[m]
  if(length(line)) print insert(header, "M-pattern", nf + 1)
  for(m in line) print insert(line[m], m, nf + 1)
}

$ awk -f foo.awk -F'\t' file1 file2 file3
file1   file2   file3   M-pattern
79      72      73      M10
8191    8584    9090    M0
139559  119837  137373  M1

Explanations:

Let N be the number of files passed to awk (3 in your example). The insert function takes 3 parameters:

  • A string s which is a tab-separated concatenation of N+1 fields (if s is empty, insert first initializes it with the s0 string computed in the BEGIN block, which contains only N tabs, that is, N+1 empty fields separated by N tabs).

  • The position p of an empty field, between 1 for the leftmost field and N+1 for the rightmost field.

  • A field value v.

Function insert replaces empty field number p in s with v and returns the result. We use insert to prepare the header line (variable header) and the other output lines (array line, indexed by the M-patterns), which we print in the END block.

Array cnt is used to print only the M-patterns found in all files and to print the header line only if at least one M-pattern was found in all files.



Here is a Ruby to do that:

ruby -e '
require "set"   # Set needs an explicit require on Ruby < 3.2
common=Hash.new{|h,k| h[k]=Set.new()}
ARGV.each{|fn|
    File.foreach(fn) { |line| common[fn] << line.split[1] }
}
# find the common elements of the list of sets with reduce(:&)
keys=common.values.reduce(:&).to_a
# you can then sort the keys array in the order you want the table
data=[]
ARGV.each { |fn|
    tmp=[]
    File.foreach(fn) { |line| 
        spl=line.split[..1]
        tmp<<spl if keys.include? spl[1]
    }
    data << [fn]+tmp.sort_by{|x,y| keys.index(y)}.map(&:first)
}
data << ["M-Pattern"]+keys
data.transpose.each{|a| puts a.join("\t")}
' file?

Given the example, prints:

file1   file2   file3   M-Pattern
8191    8584    9090    M0
139559  119837  137373  M1
79  72  73  M10

The advantage is that you are only storing common keys and you can easily sort the keys as desired.



Could be done with Raku/Sparrow

Say we have file1, file2, file3 in cwd

task.bash

# dump the content of each file along with its name
for i in file*; do
  echo "== $i";
  cat $i;
  echo "==";
done

task.check

# collect data for every file
between: { "==" \s "file" } {"=="}
  # collect file names first
  regexp: "==" \s (\S+) $$
  # collect counters for patterns 
  regexp: ^^ (\d+) \s+ ("M" \d+)
end:

code: <<HERE
!raku

  my %data;

  sub is-common($k) {
    for %data.values -> $v {
      return False if not $v{$k}:exists
    }
    return True
  }


  # convert matched data into Raku Hash
  for streams_array()<> -> @i { 
    my $f = shift @i;
    for @i -> $c {
      %data{$f}{$c[1]} += $c[0]; 
    };
  }
  # filter out only patterns common for 
  # all the files
  
  for %data.kv -> $f, %d {
    for %d.kv -> $k, $c {
        say "$f $k $c" if is-common($k);
    }
    say "==="
  }
HERE

Sample output:

$ s6 --task-run .
14:25:53 :: [sparrowtask] - run sparrow task .
14:25:53 :: [sparrowtask] - run thing .
[task run: task.bash - .]
[task stdout]
14:25:54 :: == file1
14:25:54 :: 8191 M0
14:25:54 :: 139559 M1
14:25:54 :: 79 M10
14:25:54 :: 1 M10007
14:25:54 :: 1 M1006
14:25:54 :: 1 M10123
14:25:54 :: ==
14:25:54 :: == file2
14:25:54 :: 8584 M0
14:25:54 :: 119837 M1
14:25:54 :: 72 M10
14:25:54 :: 1 M10003
14:25:54 :: 1 M10045
14:25:54 :: 1 M1014
14:25:54 :: ==
14:25:54 :: == file3
14:25:54 :: 9090 M0
14:25:54 :: 137373 M1
14:25:54 :: 73 M10
14:25:54 :: 1 M10046
14:25:54 :: 2 M101
14:25:54 :: 1 M1039
14:25:54 :: ==
[task check]
stdout match (r) <"==" \s (\S+) $$> True
stdout match (r) <^^ (\d+) \s+ ("M" \d+)> True
# file1 M0 8191
# file1 M1 139559
# file1 M10 79
# ===
# file2 M10 72
# file2 M1 119837
# file2 M0 8584
# ===
# file3 M10 73
# file3 M0 9090
# file3 M1 137373
# ===

