
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from.

I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.

head file1.csv file2.csv

==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889


==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775

The id column of each file might be a different "datatype", but in every file the id matches the line number. For example, line 2 is always id 2870316.
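Since everything hinges on rows lining up by line number, a quick sanity check before merging is cheap insurance. A minimal sketch (the file names and contents below are illustrative stand-ins, not the real data):

```shell
# Verify every CSV has the same line count before pasting them side by side;
# paste pads short files with empty fields, which would silently misalign ids.
mkdir -p /tmp/csv_align && cd /tmp/csv_align
printf 'id,90\n2870316.0,111.7\n2870317.0,63.8\n' > file1.csv
printf 'ogc_fid,id,_sum\n"1","2870316",9.98\n"2","2870317",12.3\n' > file2.csv
wc -l *.csv   # every file should report the same count
```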

Anticipated output:

file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.

Edit:

I figured out the first part.

paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced.
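For the remaining header part, here is one hedged sketch (the filenames and sample data are illustrative, and the sed prefixing assumes the header line itself contains no quoted commas): build the prefixed header with a small loop, then append the paste output minus each file's header row.

```shell
# Build a merged header by prefixing each file's column names with its
# filename, then let paste produce the body (dropping the per-file headers).
mkdir -p /tmp/csv_merge && cd /tmp/csv_merge
printf 'id,90\n2870316.0,111.7\n' > file1.csv
printf 'ogc_fid,id,_sum\n"1","2870316",9.98\n' > file2.csv

header=
for f in file1.csv file2.csv; do
    prefix=${f%.csv}
    # prefix the first column name, then every column name after a comma
    h=$(head -n 1 "$f" | sed "s/^/${prefix}_/; s/,/,${prefix}_/g")
    header=${header:+$header,}$h
done
echo "$header" > merged.csv
paste -d, file1.csv file2.csv | tail -n +2 >> merged.csv
```

The loop only ever reads one line per file, so memory use stays flat regardless of file size.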


1 Answer

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {                         # header line of each input file
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)     # strip extension: "file1.csv" -> "file1"
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i        # prefix every column name with the file name
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }  # append this file's fields to row FNR
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}

$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

To use minimal memory in awk:

$ cat tst.awk
BEGIN {
    FS=OFS=","
    # read only the first line of each file, so memory use stays tiny
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)     # strip extension
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i        # prefix each column name
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row                            # the combined, prefixed header
    exit
}

$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

3 Comments

awk runs for a good 8 minutes, then runs out of RAM. Is there a way to only process and output the header with awk? paste seems to join the files correctly without using much RAM.
Creating temp files which only contained the header worked for me: for i in *; do head -n 1 "$i" > heads/$i; done
I added an alternative that'll use very little memory to the awk script, since almost all of the work is done by paste.
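The temp-header trick from the comments can be sketched end to end; everything below (paths, scratch dir, sample data) is illustrative rather than the OP's actual files:

```shell
# Extract only the header of each file into a scratch dir, prefix those
# one-line headers with awk, and let paste handle the (large) bodies.
mkdir -p /tmp/csv_heads && cd /tmp/csv_heads
printf 'id,90\n2870316.0,111.7\n' > file1.csv
printf 'ogc_fid,id,_sum\n"1","2870316",9.98\n' > file2.csv
mkdir -p heads
for i in file1.csv file2.csv; do head -n 1 "$i" > "heads/$i"; done
# join the one-line header files into a single prefixed header row
awk 'BEGIN { FS=OFS=","; ORS="" }
     {
         f = FILENAME
         sub(/^heads\//, "", f); sub(/\.[^.]+$/, "", f)   # "heads/file1.csv" -> "file1"
         for (i = 1; i <= NF; i++) $i = f "_" $i
         print (NR == 1 ? "" : ",") $0
     }
     END { print "\n" }' heads/*.csv > merged.csv
paste -d, file1.csv file2.csv | tail -n +2 >> merged.csv
```

awk only ever sees the one-line header files, so peak memory is a single header row no matter how big the data files are.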
