I am working with a large CSV file (millions of rows and 80 thousand columns). I want to extract all rows, but only the columns listed in an external text file, and save the result to a new file. For instance:

Source data file

id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10
sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB
sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB
sampl3,AA,BB,AB,BB,BB,AA,AA,BB,BB,BB
sampl4,AA,BB,AA,BB,AB,AA,BB,BB,BB,BB
sampl5,AA,BB,AB,BB,AB,AA,AA,BB,BB,BB
sampl6,AA,BB,AB,BB,BB,AA,AB,BB,BB,BB
sampl7,AA,BB,BB,AB,AB,AA,AB,BB,BB,BB

External file with the list of columns to keep

snp3
snp6
snp7
snp10

Resulting (new) file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

Is there an efficient approach to do that using awk?

2
  • What have you tried? What have you considered? It looks eminently doable and not very hard. What should happen if the list of columns includes names that aren't found in the list of columns in line 1 of the main data file? Does the code need to worry about quotes around and spaces or commas within column names? Commented Nov 21, 2017 at 19:57
  • Thank you. I did not know how to extend the approach of extracting specific columns (awk -F "\",\"" '{print $1,$3}' myfile.csv) to read the column names from an external file. I am a beginner on Unix. If the external list includes names that are not among the columns of the data file, the resulting data file should include only the matches, which is the result I get from the solution provided by @karafka. Commented Nov 21, 2017 at 20:44

3 Answers

2

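A non-awk solution (the -w flag makes grep match whole names only, so that a listed snp3 does not also select snp30 or snp300):

$ cut -d, -f1,$(grep -Fwf columns <(sed 1q file | tr ',' '\n' | nl -w1) | cut -f1 | paste -sd,) file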

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

or

awk to the rescue!

$ awk 'NR==FNR {cols[$1]; next}
       FNR==1  {for(i=2;i<=NF;i++) if($i in cols) colin[i]}
               {line=$1;
                for(i=1;i<=NF;i++) if(i in colin) line=line FS $i; 
                print line}' columns FS=, file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

1 Comment

Thanks a lot @karafka. It works perfectly considering also the mismatches between initial and external list of columns to keep.
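
To illustrate the mismatch behaviour mentioned in this comment, here is a small self-contained sketch. It reuses the awk program from the answer on a cut-down sample; the scratch directory, the output name result.csv, and the column name snp99 (deliberately absent from the header) are assumptions made up for the illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Tiny sample data, modelled on the question (hypothetical subset).
cat > file <<'EOF'
id,snp1,snp2,snp3
sampl1,AA,BB,AB
sampl2,AA,BB,BB
EOF

# snp99 is not in the header and is expected to be silently ignored.
printf '%s\n' snp3 snp99 > columns

# First file: remember the wanted names.
# Header of second file: remember which field numbers carry a wanted name.
# Every line of second file: print field 1 plus the remembered fields.
awk 'NR==FNR {cols[$1]; next}
     FNR==1  {for (i=2; i<=NF; i++) if ($i in cols) colin[i]}
             {line=$1
              for (i=1; i<=NF; i++) if (i in colin) line=line FS $i
              print line}' columns FS=, file > result.csv

cat result.csv
```

Only id and snp3 end up in result.csv; the unknown name produces neither an error nor an empty column.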
1

I would recommend using csvkit. csvkit is built for that job, and it works properly even if some of the data are double-quoted strings containing ','.

Install:

sudo apt install python3-csvkit

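Use:

 csvcut source.csv -c "id,$(cat cols.txt | tr '\n' ',' | sed 's/,$//')"

The -c option takes the names of the columns to keep, separated by commas. tr replaces each '\n' character with a ',', and since we don't want the argument to end with a ',', we use sed to remove the trailing one. The id column is not listed in cols.txt, so it is prepended explicitly to match the requested output.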
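
As a side note, the same comma-separated list can probably be built more simply with paste, which joins lines with a delimiter and leaves no trailing comma to strip. The sketch below only builds and prints the list, so it does not need csvkit; cols.txt is recreated in a scratch directory with the column names from the question, and the variable names are made up for the illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Column names from the question, one per line.
printf '%s\n' snp3 snp6 snp7 snp10 > cols.txt

# Approach from the answer: newlines become commas, then the last comma is dropped.
with_tr=$(cat cols.txt | tr '\n' ',' | sed 's/,$//')

# Alternative: paste -s joins all lines into one, -d, sets the delimiter.
with_paste=$(paste -sd, cols.txt)

echo "$with_tr"
echo "$with_paste"
```

Both lines print snp3,snp6,snp7,snp10, so either form can be used inside the command substitution.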

0
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    list["id"]
    list[$0]
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in list) {
            f[++nf] = i
        }
    }
}
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}

$ awk -f tst.awk list file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
