Unix: filtering CSV columns according to external file

Question

I am working on a large CSV file (millions of rows and 80 thousand columns). I want to extract all rows, but only the columns listed in an external text file, and save the result in a new file. For instance:

Source data file

id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10
sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB
sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB
sampl3,AA,BB,AB,BB,BB,AA,AA,BB,BB,BB
sampl4,AA,BB,AA,BB,AB,AA,BB,BB,BB,BB
sampl5,AA,BB,AB,BB,AB,AA,AA,BB,BB,BB
sampl6,AA,BB,AB,BB,BB,AA,AB,BB,BB,BB
sampl7,AA,BB,BB,AB,AB,AA,AB,BB,BB,BB

External file with list of columns to keep

snp3
snp6
snp7
snp10

Resulting (new) file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

Is there an efficient approach to do that using awk?

What have you tried? What have you considered? It looks eminently doable and not very hard. What should happen if the list of columns includes names that aren't found in the header line of the main data file? Does the code need to worry about quotes around, or spaces or commas within, column names?
– Jonathan Leffler, Commented Nov 21, 2017 at 19:57
Thank you. I did not know how to extend the approach of extracting specific columns (awk -F "\",\"" '{print $1,$3}' myfile.csv) to read the column names from an external file. I am a beginner on Unix. If the external list includes names that are not in the data file, the resulting file should include only the matches, which is what the solution provided by @karakfa produces.
– July, Commented Nov 21, 2017 at 20:44

karakfa · Accepted Answer · 2017-11-21 20:08:02Z

2

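A non-awk solution (bash, because of the process substitution; -w keeps a listed name such as snp1 from also matching snp10):

$ cut -d, -f1,$(grep -Fwf columns <(sed 1q file | tr ',' '\n' | nl -w1) | cut -f1 | paste -sd,) file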

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
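
To see what the one-liner above does, it may help to split it into steps. The sketch below is one way to do that, assuming a shortened copy of the sample data from the question written to temporary files (the temporary paths and the variable names numbered, fields and result are illustrative, not part of the answer). It uses a plain pipeline instead of process substitution, and grep -w so that a listed name such as snp1 would not also select snp10.

```shell
# Step-by-step version of the cut/grep one-liner (illustrative sketch).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Shortened sample data, copied from the question.
printf '%s\n' \
  'id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10' \
  'sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB' \
  'sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB' > "$tmp/file"
printf '%s\n' snp3 snp6 snp7 snp10 > "$tmp/columns"

# 1. Take the header only, put one column name per line, and number the
#    lines, giving "4<TAB>snp3", "7<TAB>snp6", and so on.
numbered=$(sed 1q "$tmp/file" | tr ',' '\n' | nl -w1)

# 2. Keep the numbered lines whose name appears in the list (-F fixed
#    strings, -w whole words, -f patterns read from a file).
# 3. Keep only the numbers and join them with commas.
fields=$(printf '%s\n' "$numbered" | grep -Fwf "$tmp/columns" | cut -f1 | paste -sd, -)

# 4. Cut field 1 (id) plus the selected field numbers from every row.
result=$(cut -d, -f"1,$fields" "$tmp/file")

printf '%s\n' "$fields" "$result"
```

With this sample the field list comes out as 4,7,8,11. Note that the matching is on raw header text, so this approach assumes column names contain no quotes or embedded commas.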

or

awk to the rescue!

$ awk 'NR==FNR {cols[$1]; next}
       FNR==1  {for(i=2;i<=NF;i++) if($i in cols) colin[i]}
               {line=$1;
                for(i=1;i<=NF;i++) if(i in colin) line=line FS $i; 
                print line}' columns FS=, file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB


Thanks a lot @karakfa. It works perfectly, also handling the mismatches between the columns in the data file and the external list of columns to keep.
– July, Commented over a year ago
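
On the mismatch case mentioned in the comments: the awk answer silently skips listed names that are not in the header. If a report of those names is wanted, something along these lines might work. This is only a sketch in the same spirit as the answer, not the answerer's code; the sample files, the array names want, keep and seen, and the warning text are all made up for illustration.

```shell
# Sketch: keep listed columns and warn on stderr about names not found.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' \
  'id,snp1,snp2,snp3' \
  'sampl1,AA,BB,AB' \
  'sampl2,AA,BB,BB' > "$tmp/file"
# "snp99" is deliberately absent from the header.
printf '%s\n' snp3 snp99 > "$tmp/columns"

result=$(awk -F, '
  NR==FNR { if ($0 != "") want[$0]; next }   # first file: names to keep
  FNR==1  {                                  # header of the data file
    n = 0
    keep[++n] = 1                            # always keep the id column
    for (i = 2; i <= NF; i++)
      if ($i in want) { keep[++n] = i; seen[$i] }
    for (name in want)                       # report names never matched
      if (!(name in seen))
        print "warning: column not found: " name > "/dev/stderr"
  }
  {                                          # every line, header included
    line = $1
    for (j = 2; j <= n; j++) line = line FS $(keep[j])
    print line
  }' "$tmp/columns" "$tmp/file" 2> "$tmp/warnings")
warnings=$(cat "$tmp/warnings")

printf '%s\n' "$result"
```

Storing the kept field numbers once, while reading the header, means each data row only loops over the selected columns rather than all 80 thousand. The NR==FNR idiom assumes the list file is not empty.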

Olivier Lasne · Answer · 2017-11-21 20:49:01Z

1

I would recommend using csvkit. csvkit is built for this job and works properly even if some of the data are double-quoted strings containing ','.

Install:

sudo apt install python3-csvkit

Use:

csvcut source.csv -c $(cat cols.txt | tr '\n' ',' | sed 's/,$//')

The -c option takes the names of the columns. tr replaces each '\n' with a ',', and since we don't want the argument to end with a ',', sed removes the trailing one.
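
As a possible simplification of the tr and sed step, paste can join the lines directly without leaving a trailing comma. The sketch below assumes the cols.txt contents from the question, written to a temporary file. The csvcut call is guarded so the example still runs where csvkit is not installed, and source.csv is the data file name used in this answer. Since csvcut -c prints only the named columns, id is added to the list explicitly here.

```shell
# Sketch: build the comma-separated column list with paste.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' snp3 snp6 snp7 snp10 > "$tmp/cols.txt"

# paste -s joins all lines into one; -d, uses a comma as the separator,
# so there is no trailing comma to strip afterwards.
cols=$(paste -sd, "$tmp/cols.txt")
printf '%s\n' "$cols"

# Run csvcut only if it is installed and the data file exists.
if command -v csvcut >/dev/null 2>&1 && [ -f source.csv ]; then
  csvcut -c "id,$cols" source.csv
fi
```

Quoting "id,$cols" keeps the list as a single argument even if a column name contains spaces.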


Ed Morton · Answer · 2017-11-21 21:56:44Z

0

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    list["id"]
    list[$0]
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in list) {
            f[++nf] = i
        }
    }
}
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}

$ awk -f tst.awk list file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
