Ahoy!
I was able to sort through it. Here are the steps I followed.
A separate script creates 3000 test files with the columns "YEAR, MONTH, REMOVE THIRD COLUMN, PRECIPITATION"
Each file name is a random number, an underscore, the order in which the file was created, and a .txt extension, e.g. 17920_2624.txt would be file number 2624
The precipitation in the randomly created files is a random whole number between 0-250, followed by a decimal part between 0-999, e.g. 229.370
The main script parses each of these 3000 files and stores the data in the hash %filenameYears. Each key is a filename, and each value is a reference to an anonymous hash copied from %allYearsInFile, which holds the precipitation data for each year in that file
In that anonymous hash, each key is a year and each value is the output string of data for that year in the condensed format you requested
If the precipitation value for any month is missing, it is replaced with -99.99
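As a rough illustration of that nested structure (not the parsing script itself), here is a minimal sketch. It stores a month => precipitation hash per year instead of a prebuilt output string, which is an assumption made for the example, and condensed_row is a hypothetical helper name. The sample values are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MISSING = -99.99;    # placeholder for a missing reading

# Hypothetical helper: build one condensed row for a year from a
# month => precipitation hash. A month that is absent, or whose value
# contains no digit, is replaced with $MISSING.
sub condensed_row {
    my ($year, $by_month) = @_;
    my @fields = ($year);
    for my $month (1 .. 12) {
        my $value = $by_month->{$month};
        $value = $MISSING unless defined $value && $value =~ /\d/;
        push @fields, $month, $value;
    }
    return join(", ", @fields);
}

# filename => { year => { month => precipitation } }
my %filenameYears = (
    '17920_2624.txt' => {
        1980 => { 1 => '187.288', 2 => '', 3 => '175.23' },
    },
);

for my $file (sort keys %filenameYears) {
    for my $year (sort keys %{ $filenameYears{$file} }) {
        print "$file, ", condensed_row($year, $filenameYears{$file}{$year}), "\n";
    }
}
```

Keeping the months in a hash means a month whose row is absent from the input still gets a -99.99 entry, at the cost of building the output string at print time.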
The script to create the test input files looks like this...
#!/usr/bin/perl -w
use strict;
my $minimum = 1980; #first year to generate
my $maximum = 2025; #last year to generate
my $nfiles = shift or die("no command line arg");
my $count = 0;
my @header = ("YEAR,", "MONTH,", "REMOVE THIRD COLUMN,", "PRECIPITATION");
my @lengthHeaders; #for printf table formatting
for(@header){
    $lengthHeaders[$count] = "%-" . length($_) . "s";
    $count++;
}
for $count (1 .. $nfiles) {
    my $filename = int(rand(99999)); #random prefix
    $filename .="_${count}.txt"; #creation order and extension
    open my $out, '>', $filename or die "$filename: $!";
    printf($out "@lengthHeaders\n",@header);
    for my $year ($minimum .. $maximum) {
        for my $month (1 .. 12) {
            #third column is a dummy 0; precipitation is a random whole part, a dot, and a random decimal part
            printf($out "@lengthHeaders\n", $year, $month, 0, int(rand(250)) . "." . int(rand(999)));
        }
    }
    close $out or die "$filename: $!";
}
#run this command to print the files in the order they were created
#perl -e 'print "$_\n" for(sort { ($a =~ /_(\d+)\.txt/)[0] <=> ($b =~ /_(\d+)\.txt/)[0] } @ARGV)' *.txt
This script will generate however many files you specify. To create 3000 files, run the following command...
$ perl create.files.pl 3000
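One thing to be aware of: the generator joins int(rand(250)) and int(rand(999)) with a dot, so a decimal part of 9 is written as ".9" rather than ".009". If you would rather have every value carry exactly three decimal places, a minimal sketch could look like this. The helper name random_precipitation is hypothetical, and the 0-999 range for the decimal part is my assumption about the intent.

```perl
use strict;
use warnings;

# Hypothetical helper: random precipitation with exactly three decimals.
sub random_precipitation {
    my $whole    = int(rand(250));     # 0 .. 249
    my $fraction = int(rand(1000));    # 0 .. 999
    return sprintf("%d.%03d", $whole, $fraction);    # zero-pad the decimal part
}

print random_precipitation(), "\n" for 1 .. 5;
```

The values are random, so the five printed lines will differ on every run.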
Here is the code to parse the input files and put them in the condensed format you described...
#!/usr/bin/perl -w
use strict;
my @headers = ("IDENTIFICATION,", "YEAR,", "MONTH CODE,", "PRECIPITATION FOR THAT MONTH");
my @lengthHeaders; #find length of each header for printf table formatting
my ($file,$filecount,$count) = ("",0,0);
for(@headers){
    $lengthHeaders[$count] = "%-" . length($_) . "s";
    $count++;
}
my %filenameYears; #key is filename, value is a reference to an anonymous hash containing data from all years in the file
my %allYearsInFile; #key is year, value is output string of data for that year
while(<>){
    if($file ne $ARGV){ #filename being processed has just changed
        $filecount++;
        $file = $ARGV; #update new filename
        next; #skip the header line of each file
    }
    my %line;
    @line{("year","month","remove","precipitation")} = split(/ +/);
    if(!defined($allYearsInFile{$line{year}})){ #start a new year value
        #$allYearsInFile{$line{year}} = $line{year};
        $allYearsInFile{$line{year}} = sprintf("$lengthHeaders[1]", $line{year} . ",");
    }
    if(!defined($line{precipitation}) || $line{precipitation} !~ /\d/){ #if precipitation value is missing, set value to -99.99
        #DEBUG: warn "missing value $_";
        $line{precipitation} = -99.99;
    }
    #$allYearsInFile{$line{year}} .= ", $line{month}, $line{precipitation}";
    $allYearsInFile{$line{year}} .= sprintf( " %-4s%-8s", $line{month} . ",", $line{precipitation} . ",");
    if($line{month} == 12){
        $allYearsInFile{$line{year}} =~ s/, *$//; #remove trailing comma
    }
    if(eof){ #when file ends, save hash and start a new one for the next file
        $filenameYears{$file} = {%allYearsInFile}; #save a copy of the hash as an anonymous hash by filename
        %allYearsInFile = (); #empty old hash for next file
    }
}
printf("@lengthHeaders\n",@headers); #done processing files, print headers
#DEBUG: print "Processed $filecount files\n";
for my $filename (sort keys %filenameYears){ #dereference hash and print output string in formatted printf table
    my $hashref = $filenameYears{$filename};
    for my $year ( sort keys %$hashref ){
        #print "$filename, $$hashref{$year}\n";
        printf("@lengthHeaders[0..1]\n", $filename . ",", $$hashref{$year});
    }
}
To parse the 3000 files and put them in the condensed format you described, run the following command...
$ perl parse.precipitation.files.pl *.txt
IDENTIFICATION, YEAR, MONTH CODE, PRECIPITATION FOR THAT MONTH
84406_1.txt, 1980, 1, 187.288, 2, 22.298, 3, 175.23, 4, 41.606, 5, 104.842, 6, 176.260, 7, 207.896, 8, 143.67, 9, 57.9, 10, 69.99, 11, 146.85, 12, 49.121
84406_1.txt, 1981, 1, 128.77, 2, 242.826, 3, 49.836, 4, 115.318, 5, 79.676, 6, 2.585, 7, 109.714, 8, 100.613, 9, 123.566, 10, 218.599, 11, 115.717, 12, 76.219
84406_1.txt, 1982, 1, 227.123, 2, 155.287, 3, 95.521, 4, 17.647, 5, 176.328, 6, 95.766, 7, 106.289, 8, 90.45, 9, 93.676, 10, 142.85, 11, 141.379, 12, 109.357
<cut>
This will parse all 3000 files sequentially and print the output in the format you requested. If you want to save this output in its own file, run the following command...
$ perl parse.precipitation.files.pl *.txt > year_mly.txt
For all 3000 files it should take around 5 seconds to finish.
$ time perl parse.precipitation.files.pl *.txt > year_mly.txt
real 0m4.345s
user 0m4.282s
sys 0m0.061s
To separate this large file into files by year, e.g. one file containing all readings from all files for the year 1980, you can run the following script...
#!/usr/bin/perl -w
use strict;
my $headerLine = 1;
my @headers = ("IDENTIFICATION", "YEAR", "MONTH CODE", "PRECIPITATION FOR THAT MONTH");
my %allFilesByYear; #key is year, value is a reference to an array of rows for that year
while(<>){
    if($headerLine){ #skip the single header line of the combined file
        $headerLine = 0;
        next;
    }
    my %line;
    @line{@headers} = split(/, +/); #only the first four fields are named; the rest are ignored here
    push(@{$allFilesByYear{$line{YEAR}}},$_); #store each line in a hash where the key is each year, and the value is the row of data.
}
for my $k (sort keys(%allFilesByYear)){ #write one output file per year
    open my $out,'>',"${k}_mly.txt" or die "${k}_mly.txt: $!"; #create new file for each year
    for( @{$allFilesByYear{$k}} ){
        print $out $_; #output data for each year
    }
    close $out or die "${k}_mly.txt: $!";
}
Run this script with the following command...
$ perl files.by.year.pl year_mly.txt
This will automatically separate all the data into smaller files based on the year. So all the readings from 1980 will be in the file 1980_mly.txt and so on for each year. You can also accomplish this manually using something like grep. To see all data for the year 1980 manually, you could run something like this...
grep ' 1980,' year_mly.txt
And to put this data in a file you could run something like this...
grep ' 1980,' year_mly.txt > data_from_1980.txt
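The grep above matches the text " 1980," anywhere in the line. If you would rather compare against the year column only, a small Perl sketch could look like the following. The helper name rows_for_year is hypothetical, and the two sample rows are shortened versions of the output shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only rows whose second comma-separated
# field (the YEAR column) equals $year.
sub rows_for_year {
    my ($year, @rows) = @_;
    return grep {
        my (undef, $row_year) = split /,\s*/, $_;
        defined $row_year && $row_year eq $year;
    } @rows;
}

my @sample = (
    "84406_1.txt,    1980,  1,  187.288, 2,  22.298\n",
    "84406_1.txt,    1981,  1,  128.77,  2,  242.826\n",
);
print rows_for_year(1980, @sample);    # prints only the 1980 row
```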
That should be exactly what you are looking for. If the requirements aren't all met, just let me know what you need, along with some sample data.
Good Luck!