
I have a csv file with 28,000 columns and I want to select certain columns based on headers in another csv file, using a unix shell script. I cannot use tools like csvkit as I am working on a server and don't have admin rights to install new tools. I have read many posts on this but can't get what I want to work, possibly because the previous answers use tab-delimited or space-delimited text, not csv. I am new to shell script (and won't have to use it much, as I usually work in R or Python).

The header file looks like this:

$ cat headers.csv
eid
ABCD001
ABCD005
ABCD021
ABCD022

... etc (I need about 50 columns, not in sequence)

The data file is a csv file with data in a variety of formats (numeric, characters) with 28,000 columns, including all of the 50 columns I need, and a header row whose names match the entries in the header file.

I tried this post: https://www.unix.com/shell-programming-and-scripting/269610-find-columns-file-based-header-print-new-file.html and this code in the post:

$ awk 'NR==FNR{a[$0]=NR;next}{for (i in a) printf "%s ", $a[i];print ""}' headers.csv data_file.csv > selected_data_file.csv

But it doesn't work, probably because it expects tab- or space-delimited text and I have a csv file. It produces a huge output file, so it is not doing the job.

I also read this post: Create CSV from specific columns in another CSV using shell scripting. But I can't use column indices; I need to use the headers from the other file, as there are so many columns in the input data file.

Any suggestions for how this code can be modified to produce the file of all rows of the data_file but just for the 50 columns I need would be really appreciated. Please note, I cannot use csvkit.

The output should be something like this:

$ cat selected_data_file.csv
eid,ABCD001,ABCD005,ABCD021,ABCD022
AB1, 1, 1, 0.5556, XXXX
AB2, 2, 2, 0.7687, YYYY
AB3, 1, 0, 0.5362, ZZZ

corresponding to all the rows for the columns whose headers I have selected in the headers.csv file.

I hope that makes sense, all help appreciated!

  • don't have admin rights to install new tools : What speaks against installing tools locally in your home directory? Commented Jun 22, 2023 at 12:09

1 Answer


You're pretty close. What you need to do after you've read the headers file is to scan the first line of the data file and select the column numbers that match the headers. Also, whitespace is not a precious resource; it's OK to use more.

awk '
    NR == FNR {wanted[$0] = 1; next}    # first file: remember the wanted header names
    FNR == 1 {                          # header row of the data file: map names to column numbers
        ncol = 0
        for (i = 1; i <= NF; i++)
            if ($i in wanted)
                columns[++ncol] = i
    }
    {
        for (i = 1; i <= ncol; i++)
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv
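As the comments below point out, with comma-separated, double-quoted fields you'll also want `BEGIN {FS = OFS = ","}` and to strip the quotes before matching header names. A self-contained sketch of that variant, using small sample files in place of the real `headers.csv` / `data_file.csv` (the sample data here is made up for illustration; the quote-stripping via a copy of each header field is one way to do it, not the only way):

```shell
# Sample header file: one wanted column name per line (no quotes).
cat > headers.csv <<'EOF'
eid
ABCD001
ABCD005
EOF

# Sample data file: every field double-quoted, extra columns to be dropped.
cat > data_file.csv <<'EOF'
"eid","ABCD001","skip","ABCD005"
"AB1","1","x","0.5556"
"AB2","2","y","0.7687"
EOF

awk '
    BEGIN {FS = OFS = ","}              # comma-separated input and output
    NR == FNR {gsub(/"/, ""); wanted[$0] = 1; next}   # headers file: store names, quotes stripped
    FNR == 1 {                          # data header row: match names with quotes removed from a copy
        ncol = 0
        for (i = 1; i <= NF; i++) {
            h = $i
            gsub(/"/, "", h)
            if (h in wanted)
                columns[++ncol] = i
        }
    }
    {
        for (i = 1; i <= ncol; i++)     # every row (incl. header): print only the selected columns
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv
```

Note the output fields keep their original quotes, which is still valid CSV; this simple `gsub` approach assumes no field contains embedded commas or escaped quotes, which plain awk cannot parse reliably.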

Comments

Hi @glenn-jackman really appreciate you taking a look at this. However it is not working :-( maybe there's something wrong with the data? I've noticed that when I open the data_file.csv in TextEdit, each string in the file is enclosed in quotes, would that make a difference? I tried enclosing the column headers (in the headers.csv) in quotes, but that doesn't work either :-(
Yes I thought you might say that :-) Ok I'll try again: header file is perfect, I literally cut and paste that from the terminal output. First line of data file looks like this: "eid","123456-0.1","123456-0.2","123456-0.3","132605-0.0". However when I print contents to the terminal not all the numeric columns appear to have quotes. If I add quotes to the header file, that doesn't work either.
LOL. Yes, the quotes will matter. Also, since these are CSV files, you might want to add BEGIN {FS = OFS = ","}
What's the easiest way to do this, post the question again?
