Parsing CSV files, finding columns and remembering them

Question

I am trying to figure out a way to do this; I know it should be possible. A little background first.

I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequences, and finally the location of the collection. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:

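# Assumes the input filehandle is $csv_fh and NEWFILE is open for writing.
while (my $line = <$csv_fh>) {
  chomp $line;
  # Skip the "Table X" row and the heading row.
  next if $line =~ m/table|species|accession/i;
  my @csv = split(/\t/, $line);
  print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
}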

I know that is messy, and I just typed up something similar to what I have from memory (I don't have the script on any of my computers at home, only at work).

Now that works for me fine right now because I know which columns the information I need (species, location, and ID number) are in.

But is there a way (there must be) for me to find the columns that are for the needed info dynamically? That is, no matter the order of the columns the correct info from the correct column goes to the right place?

The first row will usually be "Table X" (where X is the number of the table in the publication), and the next row will usually hold the column headings of interest, which are nearly universal in title. Nearly all tables will have standard headings to search for, so I can just use | in my pattern matching.

mcglk · Accepted Answer · 2013-04-30 03:31:24Z

3

First off, I would be remiss if I didn’t recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme that Barmar referred to above.

That said, Barmar has the right approach, though it ignores the "Table X" row being a separate row entirely. I recommend taking an explicit approach, perhaps something like this (and this is going to have a bit more detail just to make things clear; I would probably write it more tightly in production code):

# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.

my $hdr_data = {};

while( <$csv_fh> ) {
  chomp;  # otherwise the last column name keeps its trailing newline
  if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
    $hdr_data->{'table'} = $1;
    next;
  }
  # Match the heading row case-insensitively ("Species" or "species").
  if( ! $hdr_data->{'species'} && /species/i ) {
    my $n = 0;
    # Takes the column headers as they come, creating
    # a map between the column name and column number.
    # Assumes that column names are case-insensitively
    # unique.
    my %columns = map { lc($_) => $n++ } split( /\t/ );
    # Now pick out exactly the columns we want.
    foreach my $thingy ( qw{ species accession country } ) {
      $hdr_data->{$thingy} = $columns{$thingy};
    }
    last;
  }
}

# Now process the rest of the lines.

while( <$csv_fh> ) {
  chomp;
  # split returns a list, so it must be stored in an array.
  my @col = split( /\t/ );
  printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
    $col[$hdr_data->{'species'}],
    $col[$hdr_data->{'country'}],
    $col[$hdr_data->{'accession'}];
}

Some variation of that will get you close to what you need.
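
For anyone who wants to try the idea without a real input file, here is a rough, self-contained sketch of the same header mapping run on in-memory lines. The sub names find_columns and format_row and the sample rows are made up for illustration, and it assumes the headings are literally called species, accession and country.

```perl
use strict;
use warnings;

# Hypothetical helper: shift lines off the array until the heading row is
# found, then return a hash ref mapping each wanted name to its column index.
sub find_columns {
    my ($lines, @wanted) = @_;
    while (defined(my $line = shift @$lines)) {
        chomp $line;
        next unless $line =~ /species/i;   # skips the "Table X" row
        my $n = 0;
        my %columns = map { ( lc($_) => $n++ ) } split /\t/, $line;
        my %picked;
        $picked{$_} = $columns{$_} for @wanted;
        return \%picked;
    }
    return;   # no heading row found
}

# Hypothetical helper: build one Sequin-style definition line from a data row.
sub format_row {
    my ($cols, $line) = @_;
    chomp $line;
    my @field = split /\t/, $line;
    return sprintf ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]",
        @field[ @{$cols}{qw(species country accession)} ];
}

# Made-up sample data; note the columns are not in the "usual" order.
my @lines = (
    "Table 2\n",
    "Country\tSpecies\tAccession\n",
    "Chile\tAus bus\tAB-001\n",
);

my $cols = find_columns(\@lines, qw(species accession country));
print format_row($cols, $_), "\n" for @lines;   # only data rows remain
```

Because the lookup goes through the name-to-index hash, reordering the columns in the input only changes the indexes stored in $cols, not the output.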


4 Comments

mcglk Over a year ago

The main reason for breaking out %columns and %{$hdr_data} is because you have a titch more flexibility. 'keys %{$hdr_data}' will always get you just the names of the columns you're interested in, for example; $hdr_data->{'bogus'} will always return undef even if there's a 'bogus' column in the data. It's almost always best to pare down your data to just what you'll need.

cjm Over a year ago

Text::CSV is great if you need to deal with quoting or escaping, but it's overkill if you're certain that won't be needed. Tab-delimited files don't normally use either; they just don't allow fields with tabs.

AlphaA Over a year ago

Can you point me to a website or book that has a good explanation of the mechanics of mapping? I am a biologist who knows just enough Perl to get done what I want, but I lack in-depth knowledge of these things. I have O'Reilly's Learning Perl, Mastering Perl, Programming Perl, Beginning Perl for Bioinformatics, Mastering Perl for Bioinformatics, and a couple of other books. ~alphaa

mcglk Over a year ago

@AlphaA: If you send me an email (the address is on my profile), I'll respond that way; StackOverflow comments don't have a way to format appropriately, and it gags on too many at-signs. :)

Barmar · Accepted Answer · 2013-04-30 02:34:20Z

1

Create a hash that maps column headings to column numbers:

my %columns;
...

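if (/table|species|accession/i) {
  chomp;  # keep the newline out of the last heading
  my @headings = split(/\t/);
  my $col = 0;
  # Use a separate loop variable so the counter $col is not shadowed.
  foreach my $heading (@headings) {
    $columns{"\L$heading"} = $col++;
  }
}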

Then you can use $csv[$columns{'species'}].
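
The same heading-to-index map can also be built more compactly with a hash slice. The following is only a sketch: the column_map sub and the sample headings and row are invented for illustration, and it assumes the heading row has already been read into a string.

```perl
use strict;
use warnings;

# Hypothetical helper: map lower-cased heading names to column indexes.
sub column_map {
    my ($header) = @_;
    chomp $header;
    my @headings = map { lc($_) } split /\t/, $header;
    my %columns;
    @columns{@headings} = 0 .. $#headings;   # hash slice assignment
    return %columns;
}

# Made-up heading row and data row.
my %columns = column_map("Species\tSpecimen ID\tCountry\n");
my @csv     = split /\t/, "Aus bus\tAB-001\tChile";
print $csv[ $columns{'species'} ], "\n";
```

Lower-casing the headings up front means lookups such as $columns{'species'} work regardless of how the table capitalizes them.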
