I have a CSV file with 28,000 columns and I want to select certain columns based on headers listed in another CSV file, using a Unix shell script. I cannot use tools like csvkit, as I am working on a server and don't have admin rights to install new tools. I have read many posts on this but can't get any of them to work, possibly because the previous answers assume tab-delimited or space-delimited text, not CSV. I am new to shell scripting (and won't have to use it much, as I usually work in R or Python).
The header file looks like this:
$ cat headers.csv
eid
ABCD001
ABCD005
ABCD021
ABCD022
... etc (I need about 50 columns, not in sequence)
The data file is a CSV file with data in a variety of formats (numeric, character) and 28,000 columns, including all of the 50 columns I need. Its first row is a header row using the same names as in the header file.
I tried the code from this post: https://www.unix.com/shell-programming-and-scripting/269610-find-columns-file-based-header-print-new-file.html
$ awk 'NR==FNR{a[$0]=NR;next}{for (i in a) printf "%s ", $a[i];print ""}' headers.csv data_file.csv > selected_data_file.csv
But it doesn't work, probably because it expects tab-delimited or space-delimited text and my file is comma-delimited. It produces a huge output file rather than just the 50 columns I need.
I also read this post: "Create CSV from specific columns in another CSV using shell scripting". But I can't use column indices; I need to use the headers from the other file, as there are so many columns in the input data file.
Any suggestions for how this code can be modified to produce a file containing all rows of data_file.csv, but only the 50 columns I need, would be really appreciated. Please note, I cannot use csvkit.
The output should be something like this:
$ cat selected_data_file.csv
eid,ABCD001,ABCD005,ABCD021,ABCD022
AB1, 1, 1, 0.5556, XXXX
AB2, 2, 2, 0.7687, YYYY
AB3, 1, 0, 0.5362, ZZZ
corresponding to all the rows for the columns whose headers I have selected in the headers.csv file.
I hope that makes sense. All help appreciated!
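One possible approach using only awk, which is normally available without admin rights, is sketched below. It sets the field separator to a comma and looks columns up by header name rather than by position. The function name select_columns is hypothetical, and the sketch assumes that no field contains a quoted, embedded comma or newline; plain awk does not parse quoted CSV fields, so data like that would need a real CSV parser.

```shell
# select_columns is a hypothetical helper name, not an existing tool.
# Usage: select_columns HEADERS_FILE DATA_FILE
# Assumption: fields never contain embedded commas or newlines.
select_columns() {
    awk -F',' -v OFS=',' '
        # Strip a trailing carriage return in case a file has Windows line endings.
        { sub(/\r$/, "") }

        # First file: remember the wanted header names in the order listed.
        NR == FNR {
            if ($0 != "") want[++n] = $0
            next
        }

        # Header row of the data file: map each column name to its index,
        # then resolve the wanted names to column numbers.
        FNR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            m = 0
            for (j = 1; j <= n; j++) {
                if (want[j] in col) idx[++m] = col[want[j]]
                else print "warning: header not found: " want[j] > "/dev/stderr"
            }
        }

        # Every row of the data file, header included: print the selected columns.
        {
            line = ""
            for (j = 1; j <= m; j++) line = line (j > 1 ? OFS : "") $idx[j]
            print line
        }
    ' "$1" "$2"
}
```

With the files from the question it would be called like this:

$ select_columns headers.csv data_file.csv > selected_data_file.csv

Columns come out in the order given in headers.csv, not the order in the data file, and any name in headers.csv that is missing from the data file is skipped with a warning on standard error.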