
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from.

I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.

head file1.csv file2.csv

==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889


==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775

The id column of each file might be a different "datatype", but in every file the id matches the line number. For example, line 2 is always id 2870316.
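Since everything hinges on rows lining up by line number, a quick sanity check before merging is cheap insurance. A minimal sketch (the file names and contents below are illustrative stand-ins, not the real data):

```shell
# Verify every CSV has the same line count before pasting them side by side;
# paste pads short files with empty fields, which would silently misalign ids.
mkdir -p /tmp/csv_align && cd /tmp/csv_align
printf 'id,90\n2870316.0,111.7\n2870317.0,63.8\n' > file1.csv
printf 'ogc_fid,id,_sum\n"1","2870316",9.98\n"2","2870317",12.3\n' > file2.csv
wc -l *.csv   # every file should report the same count
```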

Anticipated output:

file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.

Edit:

I figured out the first part.

paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced.
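For the remaining header part, here is one hedged sketch (the filenames and sample data are illustrative, and the sed prefixing assumes the header line itself contains no quoted commas): build the prefixed header with a small loop, then append the paste output minus each file's header row.

```shell
# Build a merged header by prefixing each file's column names with its
# filename, then let paste produce the body (dropping the per-file headers).
mkdir -p /tmp/csv_merge && cd /tmp/csv_merge
printf 'id,90\n2870316.0,111.7\n' > file1.csv
printf 'ogc_fid,id,_sum\n"1","2870316",9.98\n' > file2.csv

header=
for f in file1.csv file2.csv; do
    prefix=${f%.csv}
    # prefix the first column name, then every column name after a comma
    h=$(head -n 1 "$f" | sed "s/^/${prefix}_/; s/,/,${prefix}_/g")
    header=${header:+$header,}$h
done
echo "$header" > merged.csv
paste -d, file1.csv file2.csv | tail -n +2 >> merged.csv
```

The loop only ever reads one line per file, so memory use stays flat regardless of file size.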


1 Answer

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {                         # header line of each input file
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)     # strip extension: "file1.csv" -> "file1"
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i        # prefix every column name with the file name
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }  # append this file's fields to row FNR
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}

$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

To use minimal memory in awk:

$ cat tst.awk
BEGIN {
    FS=OFS=","
    # read only the first line of each file, so memory use stays tiny
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)     # strip extension
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i        # prefix each column name
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row                            # the combined, prefixed header
    exit
}

$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

3 Comments

awk runs for a good 8 minutes, then runs out of RAM. Is there a way to only process and output the header with awk? paste seems to join the files correctly without using much RAM.
Creating temp files which only contained the header worked for me: for i in *; do head -n 1 "$i" > heads/$i; done
I added an alternative that'll use very little memory to the awk script, since almost all of the work is done by paste.
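The temp-header trick from the comments can be sketched end to end; everything below (paths, scratch dir, sample data) is illustrative rather than the OP's actual files:

```shell
# Extract only the header of each file into a scratch dir, prefix those
# one-line headers with awk, and let paste handle the (large) bodies.
mkdir -p /tmp/csv_heads && cd /tmp/csv_heads
printf 'id,90\n2870316.0,111.7\n' > file1.csv
printf 'ogc_fid,id,_sum\n"1","2870316",9.98\n' > file2.csv
mkdir -p heads
for i in file1.csv file2.csv; do head -n 1 "$i" > "heads/$i"; done
# join the one-line header files into a single prefixed header row
awk 'BEGIN { FS=OFS=","; ORS="" }
     {
         f = FILENAME
         sub(/^heads\//, "", f); sub(/\.[^.]+$/, "", f)   # "heads/file1.csv" -> "file1"
         for (i = 1; i <= NF; i++) $i = f "_" $i
         print (NR == 1 ? "" : ",") $0
     }
     END { print "\n" }' heads/*.csv > merged.csv
paste -d, file1.csv file2.csv | tail -n +2 >> merged.csv
```

awk only ever sees the one-line header files, so peak memory is a single header row no matter how big the data files are.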
