I have tab-delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder, and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command-line environment?

The solution can use any common command-line language like awk, perl, python, etc.

7 Answers

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr

fileA.txt

z    z    a
a    b    c
w    d    e

fileB.txt

t    r    e
z    d    a
a    g    c

fileC.txt

z    r    a
v    d    c
a    m    c

Result:

  3 d
  2 r
  1 z
  1 m
  1 g
  1 b
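
If you need this for different columns regularly, the pipeline can be wrapped in a small shell function; this is only a sketch with a hypothetical name (colfreq), not part of the answer above:

colfreq() {
    # usage: colfreq COLUMN FILE...  (frequency count for one column of tab-delimited files)
    local col=$1; shift
    awk -F '\t' -v c="$col" '{ print $c }' "$@" | sort | uniq -c | sort -nr
}

For example, colfreq 2 file*.txt reproduces the result above.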

Here is a way to do it in the shell:

FIELD=2
cut -f $FIELD * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.
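
If the files are comma-separated instead of tab-separated, the same pipeline works by giving cut an explicit delimiter (one of the comments below makes the same point); a minimal sketch:

FIELD=2
cut -d ',' -f $FIELD *.csv | sort | uniq -c | sort -nr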

3 Comments

The "sort" of thing... ar ar ar! :)
Kinda sorta unique thingy. :P (btw, use -d ',' to delimit fields by comma, or any other delimiter).
Quite a versatile puppy. To count the number of times print appears in Python files (rather than just flagging that it appears): egrep -iHn print *.py | cut -d ":" -f 1 | sort | uniq -c | sort -nr

The GNU site suggests this nice awk script, which prints both the words and their frequency.

Possible changes:

  • You can pipe through sort -nr (and reverse word and freq[word]) to see the result in descending order.
  • If you want a specific column, you can omit the for loop and simply write freq[$3]++, replacing 3 with the column number (a combined sketch follows the script below).

Here goes:

 # wordfreq.awk --- print list of word frequencies
 
 {
     $0 = tolower($0)    # remove case distinctions
     # remove punctuation
     gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
     for (i = 1; i <= NF; i++)
         freq[$i]++
 }
 
 END {
     for (word in freq)
         printf "%s\t%d\n", word, freq[word]
 }
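
Combining both changes above (count only one column, say column 2 of tab-delimited input, and sort by frequency), a one-line sketch might look like this; the column number is just an example:

awk -F '\t' '{ freq[$2]++ } END { for (w in freq) printf "%d\t%s\n", freq[w], w }' * | sort -nr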

2 Comments

Great example script. It demonstrates so much of the capability of awk.
This script was helpful for me to determine which rows in an Excel workbook I really needed to pay attention to :) (copied Excel contents to text file, use awk, and, voila!, I can make a pattern file for grep -n).

Perl

This code counts the occurrences of the values in every column, and prints a sorted report for each column:

# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a}  ||  $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}

Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*

Explanation

In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure

In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence

Results based on the sample input files provided by @Dennis

column 0:
 a 3
 z 3
 t 1
 v 1
 w 1
column 1:
 d 3
 r 2
 b 1
 g 1
 m 1
 z 1
column 2:
 c 4
 a 3
 e 2

.csv input

If your input files are .csv, change /\s+/ to /,/

Obfuscation

In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:

perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*

Ruby (1.9+)

#!/usr/bin/env ruby
# For every file in the current directory, count how often each
# tab-separated value occurs (across all columns), then print that
# file's counts in descending order as value:count.
Dir["*"].each do |file|
    h = Hash.new(0)
    open(file).each do |row|
        row.chomp.split("\t").each do |w|
            h[w] += 1
        end
    end
    h.sort { |a, b| b[1] <=> a[1] }.each { |x, y| print "#{x}:#{y}\n" }
end
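
A usage sketch (the paths and script name are hypothetical): keep the script outside the data folder and run it from inside that folder, since Dir["*"] picks up every file in the current directory, including the script itself if it lives there:

cd /path/to/data && ruby /path/to/count_columns.rb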

2 Comments

This is very interesting, both because I used it and it worked, and also because I am just amazed at how ugly ruby is.. I thought perl was bad!
In ruby's defence, this could be really neatened up. For instance, using each_with_object, among other things. In short, this is somewhat grossly written.

Here is a tricky approach that comes close to linear time (but is probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on... tee and wc instead: tee -a appends one line to a file named after each value, once per occurrence of that value, so wc -l then yields the per-value counts.

$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
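
Since the trick creates one file per distinct value, a throwaway directory keeps it self-contained and easy to clean up afterwards; a hedged variant of the same idea (the data path is a placeholder, and it assumes the values contain no spaces, slashes or leading dashes):

FIELD=2
workdir=$(mktemp -d)
( cd "$workdir" &&
  echo | tee -a $(cut -f $FIELD /path/to/data/*) > /dev/null &&
  wc -l * | sort -nr )
rm -rf "$workdir"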

Pure-Bash version:

FIELD=1                      # column index (0-based: bash arrays start at 0)
declare -A results
while read -r -a line; do    # -r keeps backslashes literal; the default IFS splits on tabs and spaces
    results[${line[$FIELD]:-(empty)}]=$((results[${line[$FIELD]:-(empty)}]+1))
done < file.txt
echo ${results[@]@A}

The key logic is to fill an associative array whose keys are the values found in the file and whose values are the number of occurrences:

  • $FIELD is the selected column number (0-based, since bash arrays are zero-indexed)
  • ${line[$FIELD]} is the column value from that line in the file
  • ${...:-(empty)} is a special case for empty values (what happens if a line has fewer columns than expected)

To have the output sorted in the expected OP format, a little more work is needed:

sort -rn < <(
    for k in "${!results[@]}"; do
        echo "${results[$k]} $k";
    done
)

Warning: it works well for tab-delimited and space-delimited files, but breaks for values that contain spaces.
