
I have a csv file with 28,000 columns and I want to select certain columns based on headers in another csv file, using a unix shell script. I cannot use tools like csvkit as I am working on a server and don't have admin rights to install new tools. I have read many posts on this but can't get what I want to work, possibly because the previous answers use tab-delimited or space-delimited text, not csv. I am new to shell script (and won't have to use it much, as I usually work in R or Python).

The header file looks like this:

$ cat headers.csv
eid
ABCD001
ABCD005
ABCD021
ABCD022

... etc (I need about 50 columns, not in sequence)

The data file is a csv file with data in a variety of formats (numeric, characters) with 28,000 columns, including all of the 50 columns I need, and a header row whose names match the entries in the header file.

I tried this post: https://www.unix.com/shell-programming-and-scripting/269610-find-columns-file-based-header-print-new-file.html and this code in the post:

$ awk 'NR==FNR{a[$0]=NR;next}{for (i in a) printf "%s ", $a[i];print ""}' headers.csv data_file.csv > selected_data_file.csv

But it doesn't work, probably because it expects tab- or space-delimited text and I have a csv file. It produces a huge output file, so it is not doing the job.

I also read this post: Create CSV from specific columns in another CSV using shell scripting. But I can't use column indices; I need to use the headers from the other file, as there are so many columns in the input data file.

Any suggestions for how this code can be modified to produce the file of all rows of the data_file but just for the 50 columns I need would be really appreciated. Please note, I cannot use csvkit.

The output should be something like this:

$ cat selected_data_file.csv
eid,ABCD001,ABCD005,ABCD021,ABCD022
AB1, 1, 1, 0.5556, XXXX
AB2, 2, 2, 0.7687, YYYY
AB3, 1, 0, 0.5362, ZZZ

corresponding to all the rows for the columns whose headers I have selected in the headers.csv file.

I hope that makes sense, all help appreciated!

  • don't have admin rights to install new tools : What speaks against installing tools locally in your home directory? Commented Jun 22, 2023 at 12:09

1 Answer


You're pretty close. What you need to do after you've read the headers file is to scan the first line of the data file and select the column numbers that match the headers. Also, whitespace is not a precious resource; it's OK to use more.

awk '
    NR == FNR {wanted[$0] = 1; next}    # first file: remember the wanted header names
    FNR == 1 {                          # header row of the data file: map names to column numbers
        ncol = 0
        for (i = 1; i <= NF; i++)
            if ($i in wanted)
                columns[++ncol] = i
    }
    {
        for (i = 1; i <= ncol; i++)
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv
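As the comments below point out, with comma-separated, double-quoted fields you'll also want `BEGIN {FS = OFS = ","}` and to strip the quotes before matching header names. A self-contained sketch of that variant, using small sample files in place of the real `headers.csv` / `data_file.csv` (the sample data here is made up for illustration; the quote-stripping via a copy of each header field is one way to do it, not the only way):

```shell
# Sample header file: one wanted column name per line (no quotes).
cat > headers.csv <<'EOF'
eid
ABCD001
ABCD005
EOF

# Sample data file: every field double-quoted, extra columns to be dropped.
cat > data_file.csv <<'EOF'
"eid","ABCD001","skip","ABCD005"
"AB1","1","x","0.5556"
"AB2","2","y","0.7687"
EOF

awk '
    BEGIN {FS = OFS = ","}              # comma-separated input and output
    NR == FNR {gsub(/"/, ""); wanted[$0] = 1; next}   # headers file: store names, quotes stripped
    FNR == 1 {                          # data header row: match names with quotes removed from a copy
        ncol = 0
        for (i = 1; i <= NF; i++) {
            h = $i
            gsub(/"/, "", h)
            if (h in wanted)
                columns[++ncol] = i
        }
    }
    {
        for (i = 1; i <= ncol; i++)     # every row (incl. header): print only the selected columns
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv
```

Note the output fields keep their original quotes, which is still valid CSV; this simple `gsub` approach assumes no field contains embedded commas or escaped quotes, which plain awk cannot parse reliably.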

Comments

Hi @glenn-jackman really appreciate you taking a look at this. However it is not working :-( maybe there's something wrong with the data? I've noticed that when I open the data_file.csv in TextEdit, each string in the file is enclosed in quotes, would that make a difference? I tried enclosing the column headers (in the headers.csv) in quotes, but that doesn't work either :-(
Yes I thought you might say that :-) Ok I'll try again: header file is perfect, I literally cut and paste that from the terminal output. First line of data file looks like this: "eid","123456-0.1","123456-0.2","123456-0.3","132605-0.0". However when I print contents to the terminal not all the numeric columns appear to have quotes. If I add quotes to the header file, that doesn't work either.
LOL. Yes, the quotes will matter. Also, since these are CSV files, you might want to add BEGIN {FS = OFS = ","}
What's the easiest way to do this, post the question again?
