Combining multiple files by year (column in data) into one line and add file name as first column in Perl or Python

Question

I have over 3000 files that need to be combined based on specific attributes in the files. For example, each file is named by its ID (Eg. 101567AD_Mly.txt).These IDs correspond to latitude and longitude values that I will concatenate to the file after. I need to add a leading column in each output file with the ID. I would then like to merge lines in the input files so that the output looks like IDENTIFICATION, YEAR, JANUARY MONTH CODE (1), JANUARY PRECIPITATION, FEBRUARY MONTH CODE (2), FEBRUARY PRECIPITATION, MARCH MONTH CODE (3), MARCH PRECIPITATION......
Currently, the files are separating each month into a new line like the example below where it is currently structured as YEAR, MONTH, UNKNOWN VALUE, PRECIPITATION AMOUNT.I would like to combine the 3000+ files based on the ID for each, which is their filename. Ideally, I would have a file for each year of data (1979,1980, 1981, etc) so that all files with those years of data (eg. 1981) would all be in one file with each line representing a different ID. The filenames describe the identification of where this data came from so I would like it as the leading column in each file. Here is an example of what the files look like:

YEAR, MONTH, REMOVE THIRD COLUMN, PRECIPITATION 
1980 1 0 112.106
1980 2 0 131.909
1980 3 0 58.842
1980 4 0 42.075
1980 5 0 45.268
1980 6 0 126.168
1980 7 0 30.159
1980 8 0 32.168
1980 9 0 39.48
1980 10 0 46.117
1980 11 0 234.089
1980 12 0 180.227
1981 1 0 62.795
1981 2 0 131.817
1981 3 0 73.429
1981 4 0 77.282
1981 5 0 54.224
1981 6 0 81.217
1981 7 0 18.469
1981 8 0 12.695
1981 9 0 83.4
1981 10 0 130.957
1981 11 0 151.07
1981 12 0 190.582
1982 1 0 210.604
1982 2 0 143.732
1982 3 0 26.124
1982 4 0 45.952
1982 5 0 10.38
1982 6 0 33.602
1982 7 0 42.218
1982 8 0 9.541
1982 9 0 39.356
1982 10 0 87.037

The third column can just be removed. Any help to combine all these using perl or python would be great. Either a one-liner or script. I am using linux to do this.

The guy who used to do this work and based everything around Perl has retired leaving a lot of gaps in the knowledge transfer. I am struggling to piece some of it together.

The output format should look like:

IDENTIFICATION, YEAR, MONTH CODE, PRECIPITATION FOR THAT MONTH
ADB103884_Mly, 1989, 1, 123.56, 2, 56.23, 3, 58.9, 4, -99.99, 5, 6.9, 6, 48.2, 7, 89,1, 8, 85.3, 9, 98.1, 10, 190.2, 11, 283.9, 12, -99.99

The output file can be called the year_mly.txt

If there are missing values for a specific month, it should be given a value of -99.99. This way, every ID has a value for each month of the specified year.

HI Kaitlin! This is a tiny bit too broad, because we're not a free code-writing service. We love to help, though! Can you explain what you've tried and where you're stuck? To me, this looks like if you want it in Python, you are probably somewhat familiar in Python and could easily read in the lines from a file, split each line across the space characters, and then just write the elements you need into the right file. Have you tried something similar? — Marcus Müller
– Marcus Müller, Commented Oct 22, 2024 at 15:58
(1) Rather than showing 34 lines of input data that are (AFAICT) identical, features-wise, it would be better to show a sufficient representative sample of input data and show what you want to get as output. (1a) In particular, what do you want the 2024 file to look like (given that you, presumably, have data for some, but not all, months of the current year)? In general, what do you want to happen if some month(s) are missing? (2) What do you mean by “each initial file”? (3) Do any of your data values include space(s)? … (Cont’d) — G-Man Says 'Reinstate Monica'
– G-Man Says 'Reinstate Monica', Commented Oct 22, 2024 at 17:04
(Cont’d) … (4) Are you saying that you have 3000+ files of approximately 537 lines each (one per month, January 1980 through the present), and you want to end up with 45 files (one per year) of 3000+ lines each? Please clarify. — G-Man Says 'Reinstate Monica'
– G-Man Says 'Reinstate Monica', Commented Oct 22, 2024 at 17:04
Yes, I have 3000+ files that I need to pare down to around 45 files (one per year of data). Each file would include the following columns in each line. IDENTIFICATION, YEAR, JANUARY MONTH CODE (1), JANUARY PRECIPITATION, FEBRUARY MONTH CODE (2), FEBRUARY PRECIPITATION, MARCH MONTH CODE (3), MARCH PRECIPITATION...... — Kaitlin DeBoer
– Kaitlin DeBoer, Commented Oct 22, 2024 at 17:19
Please edit your question and include everything people asked for in the comments. We need to see representative input, and the output you want from that example input. We also need to see the file names so we don't waste your time or ours giving solutions that don't work. — terdon
– terdon ♦, Commented Oct 22, 2024 at 17:45

choroba · Accepted Answer · 2024-10-22 18:30:10Z

2

I created fake intput data like this:

#!/usr/bin/perl
use warnings;
use strict;
for my $id ('aaa' .. 'eod') {
    open my $out, '>', "file.$id" or die "file.$id: $!";
    for my $year (1979 .. 2023) {
        for my $month (1 .. 12) {
            printf {$out} "%d %d 0 %f\n", $year, $month, rand() / (1 + rand);
        }
    }
}

And then processed them by the following Perl script:

#!/usr/bin/perl
use warnings;
use strict;
use feature 'say';

my %by_year;

for my $file (glob 'file.???') {
    my $id = substr $file, -3;
    open my $in, '<', $file or die "$file: $!";
    while (my $line = <$in>) {
        my ($year, $month, undef, $precipitation) = split ' ', $line;
        die "Duplicate $id $year $month" if exists $by_year{$year}{$id}{$month};
        $by_year{$year}{$id}{$month} = $precipitation;
    }
}

for my $year (keys %by_year) {
    open my $out, '>', "$year.out" or die "$year.out: $!";
    for my $id (sort keys %{ $by_year{$year} }) {
        say {$out} join ' ', $id, map { $_, $by_year{$year}{$id}{$_} } 1 .. 12;
    }
}

You didn't show how exactly your files are named, so you'll have to adjust the line that does the glob as well as the one that extracts the id (using substr here).

edited Oct 22, 2024 at 18:30

answered Oct 22, 2024 at 16:08

choroba

49.6k7 gold badges92 silver badges119 bronze badges

The input files are all named like 101C0ME_A.txt with varying letter and number combinations.

Kaitlin DeBoer
– Kaitlin DeBoer

2024-10-22 17:30:50 +00:00
Commented Oct 22, 2024 at 17:30
I edited the original question to help answer some questions asked. Thank you for your help!!!

Kaitlin DeBoer
– Kaitlin DeBoer

2024-10-22 17:40:47 +00:00
Commented Oct 22, 2024 at 17:40
You can try tweaking my solution to your real data. Ask a new question if you get stuck.

choroba
– choroba

2024-10-22 19:38:44 +00:00
Commented Oct 22, 2024 at 19:38

Add a comment |

user3408541 · Accepted Answer · 2025-10-29 06:06:03Z

Ahoy!

I was able to sort through it. The steps I followed were as follows.

A separate script to create 3000 files of the format "YEAR, MONTH, REMOVE THIRD COLUMN, PRECIPITATION"
The files created have a random name, an underscore, the order they were created, and a .txt extension i.e. 17920_2624.txt would be file number 2624
The precipitation in the randomly created files is a random number between 0-250, and a decimal between 0-999 i.e. 229.370
The main script will parse each of these 3000 files, and store the data in hash %filenameYears. The hash key is each filename, the hash value is a pointer to %allYearsInFile. %allYearsInFile will become an anonymous hash containing precipitation data for each year in the file
The anonymous hash key is each year, the anonymous hash value is the output string of data for that year in the condensed formatted you requested
If the precipitation data for any month is missing, substitute the missing value with -99.99

The scipt to create the output files looks like this...

#!/usr/bin/perl -w

my $minimum = 1980;
my $maximum = 2025;
my $nfiles = shift or die("no command line arg");
my $count = 0;
my @header = ("YEAR,", "MONTH,", "REMOVE THIRD COLUMN,", "PRECIPITATION");
my @lengthHeaders;           #for printf table formatting
for(@header){
  $lengthHeaders[$count] = "%-" . length($_) . "s";
  $count++;
}

for $count (1 .. $nfiles) {
  my $filename = int(rand(99999));
  $filename .="_${count}.txt";
  open my $out, '>', "$filename" or die "$filename: $!";
  printf($out "@lengthHeaders\n",@header);
  for my $year ($minimum .. $maximum) {
        for my $month (1 .. 12) {
      printf($out "@lengthHeaders\n", $year, $month, ,0, int(rand(250)) . "." .  int(rand(999)));
        }
    }
}

#run this command to print the files in the order they were created
#perl -e 'print "$_\n" for(sort { ($a =~ /_(\d+)\.txt/)[0] <=> ($b =~ /_(\d+)\.txt/)[0] } @ARGV)' *.txt

This script will generate however many files you specify. To create 3000 files, run the following command...

$ perl create.files.pl 3000

Here is the code to parse the input files and put them in the condensed format you described...

#!/usr/bin/perl -w

my @headers = ("IDENTIFICATION,", "YEAR,", "MONTH CODE,", "PRECIPITATION FOR THAT MONTH");
my @lengthHeaders;                                   #find length of each header for printf table formatting
my ($file,$filecount,$count) = ("",0,0);
for(@headers){
  $lengthHeaders[$count] = "%-" . length($_) . "s";
  $count++;
}
my %filenameYears;                                    #key is filename, value is a pointer to an anonymous hash containing data from all years in the file
my %allYearsInFile;                                    #key is year, value is output string of data for that year

while(<>){
  if($file ne $ARGV){                                   #filename being processed has just changed
    $filecount++;
    $file = $ARGV;                                       #update new filename
    next;                                                 #skip header lines
  }
  my %line;
  @line{("year","month","remove","precipitation")} = split(/ +/);
  if(!defined($allYearsInFile{$line{year}})){              #start a new year value
    #$allYearsInFile{$line{year}} = $line{year};
    $allYearsInFile{$line{year}} = sprintf("$lengthHeaders[1]", $line{year} . ",");
  }
  if( $line{precipitation} !~ /\d/){                        #if precipitation value is missing, set value to -99.99
    #DEBUG: warn "missing value $_";
    $line{precipitation} = -99.99;
  }
  #$allYearsInFile{$line{year}} .= ", $line{month}, $line{precipitation}";
  $allYearsInFile{$line{year}} .= sprintf( " %-4s%-8s", $line{month} . ",", $line{precipitation} . ",");
  if($line{month} == 12){
    $allYearsInFile{$line{year}} =~ s/, *$//;                #remove trailing comma
  }
  if(eof){                                                    #when file ends, save hash and start a new one for the next file
    $filenameYears{$file} = {%allYearsInFile};                 #save old hash as anonymous hash by filename
    %allYearsInFile = ();                                       #empty old hash for next year
  }
}

printf("@lengthHeaders\n",@headers);                             #done processing files, print headers

#DEBUG: print "Processed $filecount files\n";
for my $filename (sort keys %filenameYears){                      #dereference hash and print output string in formatted printf table
  my $hashref = $filenameYears{$filename};
  for my $year ( sort keys %$hashref ){
    #print "$filename, $$hashref{$year}\n";
    printf("@lengthHeaders[0..1]\n", $filename . ",", $$hashref{$year});
  }
}

To parse the 3000 files and put them in the condensed format you described, run the following command...

$ perl parse.precipitation.files.pl *.txt 

IDENTIFICATION, YEAR, MONTH CODE, PRECIPITATION FOR THAT MONTH
84406_1.txt,    1980, 1,  187.288, 2,  22.298,  3,  175.23,  4,  41.606,  5,  104.842, 6,  176.260, 7,  207.896, 8,  143.67,  9,  57.9,    10, 69.99,   11, 146.85,  12, 49.121
84406_1.txt,    1981, 1,  128.77,  2,  242.826, 3,  49.836,  4,  115.318, 5,  79.676,  6,  2.585,   7,  109.714, 8,  100.613, 9,  123.566, 10, 218.599, 11, 115.717, 12, 76.219
84406_1.txt,    1982, 1,  227.123, 2,  155.287, 3,  95.521,  4,  17.647,  5,  176.328, 6,  95.766,  7,  106.289, 8,  90.45,   9,  93.676,  10, 142.85,  11, 141.379, 12, 109.357
<cut>

This will parse all 3000 files sequentially and print the output in the format you requested. If you want to save this output in its own file, run the following command...

$ perl parse.precipitation.files.pl *.txt >  year_mly.txt

For all 3000 files it should take around 5 seconds to finish.

$ time perl parse.precipitation.files.pl *.txt > year_mly.txt

real    0m4.345s
user    0m4.282s
sys     0m0.061s

To separate this large file into files by year, i.e. one file containing all readings from all files for the year 1980, you can run the following script...

#!/usr/bin/perl -w

my $headerLine = 1;
my @headers = ("IDENTIFICATION", "YEAR", "MONTH CODE", "PRECIPITATION FOR THAT MONTH");
my @lengthHeaders;                                   #find length of each header for printf table formatting
my ($file,$filecount,$count) = ("",0,0);
for(@headers){
  $lengthHeaders[$count] = "%-" . length($_) . "s";
  $count++;
}

my %allFilesByYear;
while(<>){
  if($headerLine){
    $headerLine = 0;
    next;
  }
  my %line;
  @line{@headers} = split(/, +/);
  push(@{$allFilesByYear{$line{YEAR}}},$_);           #store each line in a hash where the key is each year, and the value is the row of data.
}

for $k (sort keys(%allFilesByYear)){                   #dereference hash
  my $arrayref = $allFilesByYear{$k};
  open my $out,'>',"${k}_mly.txt" or die "$!";          #create new file for each year
  for( @{$allFilesByYear{$k}} ){
    print $out "$_";                                     #output data for each year
  }
}

Run this script with the following command...

$ perl files.by.year.pl year_mly.txt

This will automatically separate all the data into smaller files based on the year. So all the readings from 1980 will be in the file 1980_mly.txt and so on for each year. You can also accomplish this manually using something like grep. To see all data for the year 1980 manually, you could run something like this...

more year_mly.txt | grep -i ' 1980,'

And to put this data in a file you could run something like this...

more year_mly.txt | grep -i ' 1980,' > data_from_1980.txt

That should be exactly what you are looking for. If the requirements arent all met just let me know what you need with some sample data.

Good Luck!

Stack Exchange Network

Combining multiple files by year (column in data) into one line and add file name as first column in Perl or Python

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Combining multiple files by year (column in data) into one line and add file name as first column in Perl or Python

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions