
I am currently working on a script and have a problem formatting the output. The index and input files look like this:

index
Pseudopropionibacterium propionicum
Kibdelosporangium phytohabitans
Steroidobacter denitrificans

File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
Olsenella sp. oral taxon 807    7323.0  oral bacterium
Steroidobacter denitrificans    6673.0  sludge bacterium

File 2
Pseudopropionibacterium propionicum 123.0
Caulobacteraceae bacterium OTSz_A_272   1019.0
Saccharopolyspora erythraea 939.0   soil bacterium
Rhodopseudomonas palustris  900.0   
Nitrospira moscoviensis 856.0   soil/water bacterium

File 3
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
Verrucosispora maris    391.0   deep-sea actinomycete
Tannerella forsythia    389.0   periodontal pathogen
Actinoplanes missouriensis  376.0   soil bacterium

The script uses the index to look for matches in each input file and prints fields one and two of every matching line. This is done for several input files (they all have the same layout), and I want the output of each input file to go into its own column.

My code so far:

#!/bin/bash

for file in ./*_TOP1000
do
    basename "$file" >> output
    awk 'BEGIN{FS="\t"}NR==FNR{a[$1]=$0;next}$1 in a{print $1,$2}' index "$file" >> output
done

And the output looks like:

File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
Steroidobacter denitrificans    6673.0
File 2
Pseudopropionibacterium propionicum 4326.0
File 3
Kibdelosporangium phytohabitans 1591.0
Pseudopropionibacterium propionicum 907.0

But I would like to have it like this:

File 1                                       File 2                                        File 3
Pseudopropionibacterium propionicum 1591.0   Pseudopropionibacterium propionicum 4326.0    Pseudopropionibacterium propionicum 907.0
Kibdelosporangium phytohabitans 907.0                                                      Kibdelosporangium phytohabitans 1591.0
Steroidobacter denitrificans    6673.0

with the matching results directly under each file name. The files can all have different matches.
I tried to solve it with the column command by inserting a separator, but that did not work. How can I achieve the desired output?

  • Can you give more lines in file2? containing same lines? Pseudopropionibacterium propionicum 1591.0 does not appear in file2 Commented Apr 11, 2017 at 11:08
  • Where are Sample1/2/3 coming from? Commented Apr 11, 2017 at 11:18
  • you could save each sample result in separate files like sample1.txt sample2.txt etc and then use paste command to join them vertically Commented Apr 11, 2017 at 11:31
  • If I were you, I would find the longest entry in each column with something like awk '{print $1}' | wc -L and use that length to pad the output with printf. If, say, your first column is at most 20 characters long, use printf %-20s. Commented Apr 11, 2017 at 11:50
  • I made some edits and hope it is clearer now. Sorry for the confusion with the sample and the file. Commented Apr 11, 2017 at 12:14
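
The paste suggestion from the comments above can be sketched roughly as follows. This is only a minimal illustration, assuming tab-separated input files and an index with one name per line; the function name paste_matches and the temporary file layout are made up for this example and do not come from the question.

```shell
# Hypothetical helper: one column per data file, joined side by side with tabs.
# Usage: paste_matches INDEX FILE...
paste_matches() (
    # The function body runs in a subshell, so these variables stay local.
    index=$1; shift
    tmpdir=$(mktemp -d) || exit 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        # Zero-padded names keep the columns in argument order for the glob below.
        col=$(printf '%s/%04d' "$tmpdir" "$n")
        {
            basename "$f"                      # column header
            awk 'BEGIN { FS = "\t" }
                 NR == FNR { seen[$1]; next }  # first file: remember index names
                 $1 in seen { print $1, $2 }   # data file: name and value
                ' "$index" "$f"
        } > "$col"
    done
    paste "$tmpdir"/*                          # shorter columns yield empty cells
    rm -rf "$tmpdir"
)
```

Called as paste_matches index ./*_TOP1000, this should print a tab-separated table that can then be aligned for display, for example with column -s$'\t' -t.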

3 Answers

$ cat tst.awk
BEGIN { OFS="\t" }
NR==FNR { indices[$1]; next }
FNR==1 { filenames[++numCols] = FILENAME }
$1 in indices {
    vals[numCols,++rowCnt[numCols]] = $1 FS $2 FS $3
    numRows = (rowCnt[numCols] > numRows ? rowCnt[numCols] : numRows)
}
END {
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", filenames[colNr], (colNr<numCols ? OFS : ORS)
    }
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", vals[colNr,rowNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk index file1 file2 file3 | column -s$'\t' -t
file1                                       file2                                      file3
Pseudopropionibacterium propionicum 1591.0  Pseudopropionibacterium propionicum 123.0  Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0                                                  Kibdelosporangium phytohabitans 907.0
Steroidobacter denitrificans 6673.0

The pipe to column is just to show the output in aligned columns rather than tab-separated.
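
If column is not available, the padding can also be done in awk itself, along the lines of the printf %-20s idea from the comments. The following is a rough sketch rather than part of the answer above; the helper name align_tsv and the two-space gutter are arbitrary choices for illustration. It reads the whole tab-separated table into memory, records the widest cell per column, and pads each cell to that width.

```shell
# Hypothetical helper: pad tab-separated input into aligned columns.
# Usage: some_command | align_tsv    or    align_tsv FILE...
align_tsv() {
    awk 'BEGIN { FS = "\t" }
    {
        if (NF > ncols) ncols = NF
        for (c = 1; c <= NF; c++) {
            cell[NR, c] = $c
            if (length($c) > width[c]) width[c] = length($c)  # widest cell so far
        }
    }
    END {
        for (r = 1; r <= NR; r++) {
            line = ""
            for (c = 1; c <= ncols; c++)
                line = line sprintf("%-" (width[c] + 2) "s", cell[r, c])
            sub(/ +$/, "", line)   # drop trailing padding
            print line
        }
    }' "$@"
}
```

With the script above, that would be awk -f tst.awk index file1 file2 file3 | align_tsv.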


It might be easier to rearrange the table with Perl than with Awk.

If you feed column with the data in the right order, it will format the columns correctly. Use the option -t and specify the column delimiter with -s.

#! /usr/bin/perl
use strict;
use warnings;

my $table;      # declares variables.
my $col = -1;
my $row = 0;

while (<DATA>)   # loop through the input line by line
{
  chomp;                              # remove end of line
  if (/^File/) { $col++; $row = 0; }  # increment col and init row if line starts with File
  $table->[$row++]->[$col] = $_;      # set value in two dimensional array and increment row
}

open (my $out, '|-', "column -s ^ -t");   # open pipe to columns
foreach (@$table)                         # loop over the rows of the table
{
  print $out join('^', map { $_ or ' ' } @$_), "\n";  # join the elements of a row with the delimiter ^ and replace undefined values with a space
}
close $out;

__DATA__
File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
File 2
Pseudopropionibacterium propionicum 4326.0
File 3
Kibdelosporangium phytohabitans 2019.0
Pseudopropionibacterium propionicum 1542.0

Prints the columns in this way:

File 1                                      File 2                                      File 3
Pseudopropionibacterium propionicum 1591.0  Pseudopropionibacterium propionicum 4326.0  Kibdelosporangium phytohabitans 2019.0
Kibdelosporangium phytohabitans 907.0                                                   Pseudopropionibacterium propionicum 1542.0

If you want to read standard input instead of Perl's data segment, change <DATA> to <STDIN> (or to <>, which also reads any files named on the command line).
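
For readers more at home in awk, the same rearrangement can be sketched there as well. This is a minimal illustration, not a translation endorsed by the answer: it assumes, like the Perl script, that every block starts with a line beginning with "File", and the helper name stack_to_columns is invented for this example. It writes tab-separated output, leaving alignment to a later step.

```shell
# Hypothetical helper: turn stacked "File N" blocks into tab-separated columns.
# Usage: stack_to_columns < output
stack_to_columns() {
    awk '
    /^File/ { col++; row = 0 }         # a header line starts a new column
    col > 0 {
        cell[++row, col] = $0          # store the line at (row, column)
        if (row > nrows) nrows = row
    }
    END {
        for (r = 1; r <= nrows; r++) {
            line = cell[r, 1]
            for (c = 2; c <= col; c++)
                line = line "\t" cell[r, c]   # missing cells come out empty
            print line
        }
    }' "$@"
}
```

The result can be piped to column -s$'\t' -t to get the aligned layout shown above.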

10 Comments

Thank you for your answer, but Perl is not my field, so I am not able to fully understand what is going on.
@ceving you clearly have some significant misunderstanding about what awk is. There's no reason to favor perl over awk for simple text manipulation like this.
@JFS31 I have added some comments. It is not that complicated.
@EdMorton I have about 25 years of experience with Awk, and my observation is that solving problems with Perl instead of Awk is always faster. The Awk code might be a bit shorter, but the Perl code is easier to write. This question is a very good example of how people (including myself) struggle with Awk, although the problem is not that complicated and can easily be solved in a more general language like Perl.
Ah, so it's an add-on AFTER the shell loop calling awk has run - got it, I didn't understand that before, thanks for clarifying. I'm not looking for any flame war either, I'm just trying to understand where you're coming from with your recent negative statements about awk and this seemed like a good opportunity for me to learn. Obviously my personal experience is that I can solve problems in Awk in a few minutes, while solutions in Perl can easily take hours but I recognize that is just my personal experience. The problem is undoubtedly with me, not with the tool.

Something like this, which requires GNU awk because it uses the third argument of match():

$ awk '
NR==FNR { a[$0]; next }               # read and hash index file to a
FNR==1 { print FILENAME }             # print filename at start of data files
{
    match($0,/^([^0-9]+)([0-9.]+)/,b) # get the name part and first value
    gsub(/^ +| +$/,"",b[1])           # trim name
    if(b[1] in a)                     # print indexed
        print b[1],b[2]
}' index file1 file1
file1
Pseudopropionibacterium propionicum 4326.0
Kibdelosporangium phytohabitans 3819.0
file1
Pseudopropionibacterium propionicum 4326.0
Kibdelosporangium phytohabitans 3819.0

The version that prints the results in columns also requires GNU awk, this time because of its true 2D arrays (arrays of arrays):

$ cat program.awk
NR==FNR { a[$0]; next }                 # read and hash index file to a
FNR==1 { c[++i][j=1]=FILENAME }         # store filename as first row of column i
{
    match($0,/^([^0-9]+)([0-9.]+)/,b)   # get the name part and first value
    gsub(/^ +| +$/,"",b[1])             # trim name
    if(b[1] in a) {                     # print indexed
        c[i][++j]=b[1] OFS b[2]
        if(m<j||m=="") m=j              # max row count over all files
        if(l[i]<=length(b[1] OFS b[2])||l[i]=="")
            l[i]=length(b[1] OFS b[2])  # this is for printf width
    }
}
END {
    for(k=1;k<=m;k++)
        for(j=1;j<=i;j++)
            printf "%-" l[j] "s %s", c[j][k], (j==i?ORS:OFS)  # l is indexed by file, so pad with l[j]
}

Test it:

$ awk -f program.awk index file1 file2 file3
file1                                       file2                                       file3
Pseudopropionibacterium propionicum 1591.0  Pseudopropionibacterium propionicum 123.0  Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0                                                   Kibdelosporangium phytohabitans 907.0
Steroidobacter denitrificans 6673.0

1 Comment

@EdMorton Yeah, it seems to have lived some.
