I have a tab-delimited text file like this:

contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data

And another like this:

contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N

I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:

contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data

I have little scripting experience, but I have done this with hashes/keys in the case where e.g. "contig11" occurs only once in one of the files. With repeated IDs in both files, I can't even begin to get my head around how to do this! I would really appreciate some help or hints on how to tackle this problem.

EDIT: I have tried ikegami's suggestion (see answers) with the script below. It produces the output I need, except that everything from the GO:100 column onwards ($rest in the script?) is missing. Any ideas what I'm doing wrong?

#!/usr/bin/env/perl

use warnings;

open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";

my %goterm;

while (<GOTERMS>)
{
    my($id, $rest) = /^(\S++)(,*)/s;
    push @{$goterm{$id}}, $rest;
}

while (my $row2 = <SNPS>)
{
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} })
    {
        print("$row2$rest\n");
    }
}

close GOTERMS;
close SNPS;
  • Is the order of the output lines important? If you use a hash for both files, the result will not be in the original order. Also, where do these lines come from: a file, or programs generating them? Commented Apr 15, 2013 at 18:44
  • @TrueY The order of the output is not important to me. They are just files but with potentially tens or hundreds of thousands of lines in each file. Commented Apr 15, 2013 at 18:48

2 Answers

Look at your output. It's clearly produced by

  • for each row of the second file,
    • for each row of the first file with the same id,
      • print out the combined rows

So the question is: How do you find the rows of the first file with the same id as a row of the second file?

The answer is: You store the rows of the first file in a hash indexed by the row's id.

my %file1;
while (<$file1_fh>) {
   my ($id, $rest) = /^(\S++)(.*)/s;
   push @{ $file1{$id} }, $rest;
}

So the earlier pseudocode resolves to

while (my $row2 = <$file2_fh>) {
   chomp($row2);
   my ($id) = $row2 =~ /^(\S+)/;
   for my $rest (@{ $file1{$id} }) {
      print("$row2$rest");
   }
}

Applied to your script, that gives:

#!/usr/bin/env perl

use strict;
use warnings;

# Three-argument open keeps odd characters in a filename from being read as a mode.
open(my $GOTERMS, '<', $ARGV[0])
     or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, '<', $ARGV[1])
     or die("Error opening SNP file \"$ARGV[1]\": $!\n");

my %goterm;
while (<$GOTERMS>) {
    my ($id, $rest) = /^(\S++)(.*)/s;
    push @{ $goterm{$id} }, $rest;
}

while (my $row2 = <$SNPS>) {
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} }) {
        print("$row2$rest");
    }
}
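
To try the approach without touching the real files, here is a minimal sketch that runs the same join on a few sample lines held in memory. The sub name join_on_first_column and the shortened sample data are made up for illustration and are not part of the script above; the `// []` guard assumes Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): joins two tab-delimited
# inputs on their first column and returns the combined lines.
sub join_on_first_column {
    my ($goterms_fh, $snps_fh) = @_;

    # Hash of arrays: id => [ remainder of each GO-term line, newline included ]
    my %goterm;
    while (my $line = <$goterms_fh>) {
        my ($id, $rest) = $line =~ /^(\S+)(.*)/s or next;   # skip blank lines
        push @{ $goterm{$id} }, $rest;
    }

    my @out;
    while (my $row2 = <$snps_fh>) {
        chomp $row2;
        my ($id) = $row2 =~ /^(\S+)/ or next;
        # "// []" avoids creating hash entries for ids that have no GO terms
        push @out, "$row2$_" for @{ $goterm{$id} // [] };
    }
    return @out;
}

# In-memory filehandles stand in for the two real files (sample data only).
my $goterms = "contig11\tGO:100\tdata\ncontig11\tGO:289\tdata\ncontig22\tGO:388\tdata\n";
my $snps    = "contig11\t3\tN\ncontig22\t1\tY\n";
open(my $go_fh,  '<', \$goterms) or die "Cannot open in-memory GO terms: $!";
open(my $snp_fh, '<', \$snps)    or die "Cannot open in-memory SNPs: $!";

print join_on_first_column($go_fh, $snp_fh);
```

Only the first file is held in memory; the second is streamed line by line, which should matter when both files run to hundreds of thousands of lines.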

3 Comments

I am trying this but not getting all the data from file1 - am I missing something? I have added my script to the question.
If you change ,* back to .* and you remove the \n you added, you get exactly the requested output.
Woops, what a stupid typo - thank you for your script and explanation - very helpful!
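
For anyone who hits the same symptom, the difference between the two patterns can be checked in isolation. This is a small sketch using one sample line from the question; the variable names are made up for the demonstration.

```perl
use strict;
use warnings;

my $line = "contig11\tGO:100\tother columns of data\n";

# Typo: ",*" means "zero or more commas", so it matches the empty string
# right after the id and captures nothing.
my ($id_bad, $rest_bad) = $line =~ /^(\S++)(,*)/s;

# Intended: ".*" with /s captures the rest of the line, newline included.
my ($id_ok, $rest_ok) = $line =~ /^(\S++)(.*)/s;

printf "typo:     id=%s, rest is %d characters long\n", $id_bad, length $rest_bad;
printf "intended: id=%s, rest is %d characters long\n", $id_ok,  length $rest_ok;
```

Because the intended pattern keeps the trailing newline inside $rest, adding another "\n" in the print statement would leave a blank line after every output row.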

I will describe how you can do this. Read each file into an array (each line is one array item), then compare the arrays as needed. You need two loops. The outer loop runs over each record of the array/file that holds the strings you want to compare against (in your example, the second file). Inside it, a second loop runs over each record of the other array/file. Check each record of one array against each record of the other and process the matches.

foreach my $record2 (@array2) {
    foreach my $record1 (@array1){
        if ($record2->{field} eq $record1->{field}){
            #here you need to create the string which you will show
            my $res_string = $record2->{field}.$record1->{field};
            print "$res_string\n";
        }
    }
}

Or don't use arrays at all: just read the files and compare each line of one with each line of the other. The general idea is the same.
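
The loop above assumes each record is a hash reference with a field key, but does not show how those records are built. Here is one possible sketch; the helpers parse_records and nested_join, the rest key, and the sample data are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each line into { field => first column, rest => remainder }.
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        my ($field, $rest) = $line =~ /^(\S+)\s*(.*)/ or next;   # skip blank lines
        push @records, { field => $field, rest => $rest };
    }
    return @records;
}

# Hypothetical helper: compare every record of the second array with every
# record of the first and collect the combined rows for matching ids.
sub nested_join {
    my ($array1, $array2) = @_;
    my @out;
    foreach my $record2 (@$array2) {
        foreach my $record1 (@$array1) {
            next unless $record2->{field} eq $record1->{field};
            push @out, join("\t", $record2->{field}, $record2->{rest}, $record1->{rest});
        }
    }
    return @out;
}

my @array1 = parse_records("contig11\tGO:100\tdata\ncontig22\tGO:388\tdata\n");
my @array2 = parse_records("contig11\t3\tN\ncontig11\t1\tY\ncontig22\t1\tY\n");
print "$_\n" for nested_join(\@array1, \@array2);
```

Note that the number of comparisons is the line count of one file multiplied by the line count of the other, so for files with hundreds of thousands of lines this will likely be far slower than a hash lookup.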
