I am trying to figure out a way to do this, and I know it should be possible. A little background first.

I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequence, and finally the collection location. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:

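# Assumes $fh is a filehandle already opened on the tab-delimited file.
while (my $line = <$fh>) {
  chomp $line;
  # Skip the "Table X" row and the column-heading row.
  next if $line =~ m/table|species|accession/i;
  my @csv = split /\t/, $line;
  print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
}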

I know that is messy; I typed it up from memory as something similar to what I have (the script is not on any of my computers at home, only at work).

That works fine for me right now because I know which columns hold the information I need (species, location, and ID number).

But is there a way (there must be) to find the columns holding the needed info dynamically? That is, no matter what order the columns are in, the correct info from the correct column goes to the right place.

The first row will usually be Table X (where X is the number of the table in the publication), and the next row will usually hold the column headings of interest, which are nearly universal in title. Nearly all tables will have standard headings to search for, and I can just use | in my pattern matching.

2 Answers


First off, I would be remiss if I didn’t recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme that Barmar referred to above.

That said, Barmar has the right approach, though it ignores the "Table X" row being a separate row entirely. I recommend taking an explicit approach, perhaps something like this (and this is going to have a bit more detail just to make things clear; I would probably write it more tightly in production code):

# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.

my $hdr_data = {};

while( <$csv_fh> ) {
  if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
    $hdr_data->{'table'} = $1;
    next;
  }
  if( ! $hdr_data->{'species'} && /species/i ) {
    # Strip the newline so the last column name matches cleanly.
    chomp;
    my $n = 0;
    # Takes the column headers as they come, creating
    # a map between the column name and column number.
    # Assumes that column names are case-insensitively
    # unique.
    my %columns = map { lc($_) => $n++ } split( /\t/ );
    # Now pick out exactly the columns we want.
    foreach my $thingy ( qw{ species accession country } ) {
      $hdr_data->{$thingy} = $columns{$thingy};
    }
    last;
  }
}

# Now process the rest of the lines.

while( <$csv_fh> ) {
  chomp;
  # split returns a list, so it must be stored in an array.
  my @col = split( /\t/ );
  printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
    $col[$hdr_data->{'species'}],
    $col[$hdr_data->{'country'}],
    $col[$hdr_data->{'accession'}];
}

Some variation of that will get you close to what you need.
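
For reference, here is a minimal self-contained sketch that combines the two loops above. It reads from an in-memory string rather than a real file, and the sequin_lines name, the sample rows, and the exact heading names are assumptions made up for illustration, not something taken from your data:

```perl
use strict;
use warnings;

# Hypothetical helper: returns one Sequin-style definition line per data row.
sub sequin_lines {
    my ($fh) = @_;
    my %want;    # column name => column index

    # Skip ahead to the heading row and record where each column lives.
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /species/i;
        my $n = 0;
        my %columns = map { lc($_) => $n++ } split /\t/, $line;
        for my $name (qw( species accession country )) {
            die "Missing column '$name'\n" unless defined $columns{$name};
            $want{$name} = $columns{$name};
        }
        last;
    }
    die "No heading row found\n" unless %want;

    # Everything after the heading row is treated as data.
    my @out;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        my @col = split /\t/, $line;
        push @out,
            sprintf ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]",
            @col[ @want{qw( species country accession )} ];
    }
    return @out;
}

# Made-up sample data; the columns are deliberately in a different order
# from the one the output line uses.
my $sample = join "\n",
    "Table 1",
    join( "\t", qw( Accession Country Species ) ),
    join( "\t", 'ABC123', 'Peru', 'Homo sapiens' ),
    '';

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for sequin_lines($fh);
```

Dying on a missing column is a design choice in this sketch; it seemed safer than silently printing the wrong field when a table uses an unexpected heading.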


4 Comments

The main reason for breaking out %columns and %{$hdr_data} is that you get a titch more flexibility. 'keys %{$hdr_data}' will always get you just the names of the columns you're interested in, for example; $hdr_data->{'bogus'} will always return undef even if there's a 'bogus' column in the data. It's almost always best to pare down your data to just what you'll need.
Text::CSV is great if you need to deal with quoting or escaping, but it's overkill if you're certain that won't be needed. Tab-delimited files don't normally use either; they just don't allow fields with tabs.
Can you point me to a website or book that has a good explanation of the mechanics of mapping? I am a biologist who knows enough Perl to barely get done what I want, but I lack in-depth knowledge of these things. I have O'Reilly's Learning Perl, Mastering Perl, Programming Perl, Beginning Perl for Bioinformatics, Mastering Perl for Bioinformatics, and a couple of other books. ~alphaa
@AlphaA: If you send me an email (the address is on my profile), I'll respond that way; StackOverflow comments don't have a way to format appropriately, and it gags on too many at-signs. :)

Create a hash that maps column headings to column numbers:

my %columns;
...

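if (/table|species|accession/i) {
  chomp;
  my @headings = split(/\t/);
  # Use separate names for the counter and the loop variable;
  # reusing $col for both would shadow the counter.
  my $n = 0;
  foreach my $heading (@headings) {
    # "\L..." lowercases the heading so lookups are case-insensitive.
    $columns{"\L$heading"} = $n++;
  }
}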

Then you can use $csv[$columns{'species'}].
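
Since the question mentions that headings are only "nearly universal", one possible extension is to match each wanted field against a pattern of alternatives instead of an exact name. This is only a sketch: the column_map name and every alias in %heading_pattern are hypothetical, so substitute the headings your tables actually use:

```perl
use strict;
use warnings;

# Hypothetical alias patterns; adjust them to the headings your tables use.
my %heading_pattern = (
    species   => qr/^(?:species|taxon)$/i,
    accession => qr/^(?:accession|specimen id|voucher)$/i,
    country   => qr/^(?:country|locality|location)$/i,
);

# Returns a hash mapping each wanted field to its column index.
sub column_map {
    my ($heading_line) = @_;
    chomp $heading_line;
    my @headings = split /\t/, $heading_line;
    my %columns;
    for my $i ( 0 .. $#headings ) {
        for my $field ( keys %heading_pattern ) {
            next if exists $columns{$field};    # keep the first match
            $columns{$field} = $i
                if $headings[$i] =~ $heading_pattern{$field};
        }
    }
    return %columns;
}

# Made-up heading and data rows for illustration.
my %columns = column_map("Voucher\tLocality\tTaxon\n");
my @csv     = split /\t/, "ABC123\tPeru\tHomo sapiens";
printf "%s from %s\n", $csv[ $columns{species} ], $csv[ $columns{country} ];
```

A field whose heading matches none of the aliases is simply absent from the returned hash, so it is worth checking with exists before using an index.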
