Select columns based on their names from a file using AWK

Question

I have a file that contains a header and information under it.

zcat majorfile.gz | head -n 3 | cut -d ' ' -f1-10

marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C071898_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0 0.9996 0.0004 0 1
LR761571.1_34285 G A 0.9934 0.0066 0 0.9999 0.0001 0 0.9435

I'd like to subset that file based on the column names:

cat header.subset.txt | head
marker
alleleA
alleleB
FINCH_WB_633_splitMerged
FINCH_WB_ES1B002_splitMerged
FINCH_WB_JH1417_splitMerged
FINCH_WB_JH1452_splitMerged
FINCH_WB_JH1495_splitMerged
FINCH_WB_JP000_splitMerged
FINCH_WB_JP004_splitMerged

I have multiple "header.subset.txt" files so I'm going to loop through them.

for file1 in header.subset.txt 
do 
awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b[i]=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}' \
  $file1 \
  majorfile.gz > majorfile_sub.gz
done

The awk command works for a file with tab-separated fields, but not with space-separated fields (as in this case).

For the example above, the desired output is:

marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0
LR761571.1_34285 G A 0.9934 0.0066 0

EDIT: here's the awk code above formatted by gawk -o- to be much easier to read (but obviously still lacking meaningful variable names):

NR == FNR {
        a[$1]++
        next
}

{
        if (FNR == 1) {
                for (i = 1; i <= NF; i++) {
                        if (a[$i]) {
                                printf $i " "
                                b[i] = $i
                        }
                }
        } else {
                printf "\n"
                for (j = 1; j <= NF; j++) {
                        if (b[j]) {
                                printf $j " "
                        }
                }
        }
}

END {
        printf "\n"
}
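
A side note on the `printf $i " "` calls above: the field itself is used as the format string, so any `%` in the data is interpreted rather than printed. A small sketch of the difference, using a made-up field value `a%%b` (hypothetical, not from majorfile.gz):

```shell
# Hypothetical field value containing percent signs.
field='a%%b'

# Field used as the format string: "%%" is read as an escaped "%".
as_format=$(printf '%s\n' "$field" | awk '{ printf $1 "\n" }')

# Field passed as data to a fixed "%s" format: printed unchanged.
as_data=$(printf '%s\n' "$field" | awk '{ printf "%s\n", $1 }')

printf 'as format: %s\nas data:   %s\n' "$as_format" "$as_data"
```

Writing `printf "%s ", $i` instead keeps the data out of the format string.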

and the question is? (The reader on stackoverflow should NOT need to guess what the real question is)
– Luuk, Commented Nov 15, 2022 at 19:53
Maybe you should read something about "FS" (field separator) and "OFS" (output field separator) in the manual?
– Luuk, Commented Nov 15, 2022 at 19:56
"I'd like to subset that file based on the column names:" So I'd like to 'extract' the columns that are in majorfile.gz, based on the lines in header.subset.txt. So 'marker', 'alleleA', 'alleleB' and all other matching columns will be selected. Is this clear? The last chunk of code shows a small example (here, FINCH_WB_C049985_splitMerged and FINCH_WB_C071898_splitMerged will be dropped)
– M. Beausoleil, Commented Nov 15, 2022 at 19:58
If you're going to ask people for help with your code, don't cram it all onto 1 line with almost no white space like awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b[i]=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}'. Instead format it in a way that's legible. Also use meaningful variable names. Make it as easy as possible for us to help you, not almost impossible.
– Ed Morton, Commented Nov 15, 2022 at 20:04
I'm surprised your script works when feeding it majorfile.gz (or is the suffix misleading, ie, file really isn't gzip'd?)
– markp-fuso, Commented Nov 15, 2022 at 20:24

markp-fuso · Accepted Answer · 2022-11-15 21:00:09Z


A variation on OP's current code:

awk '
#BEGIN  { FS=OFS="\t" }                             # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
        { sep=""
          for (i=1; i<=NF; i++) {
              if (FNR==1 && ($i in headers)) {
                 fldids[i]
              }
              if (i in fldids) {
                 printf "%s%s",sep,$i
                 sep=OFS                            # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
              }
          }
          print ""
        }
' header.subset.txt <(zcat majorfile.gz)

This generates:

marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0
LR761571.1_34285 G A 0.9934 0.0066 0
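
The question also loops over several header files and redirects into a `.gz` name without compressing. A minimal sketch of that loop follows, assuming the subset files match the glob `header.subset*.txt` and that output names of the form `majorfile_<subset>.gz` are acceptable (both are assumptions, not from the question). The `mktemp` sample data is invented and only there so the block runs on its own:

```shell
set -eu

# Throwaway sample data so the sketch runs on its own (not the real files).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 'marker alleleA S1 S1 S2' 'm1 G 0.9 0.1 1' | gzip > majorfile.gz
printf '%s\n' marker S1 > header.subset1.txt
printf '%s\n' marker alleleA S2 > header.subset2.txt

# Hypothetical naming scheme: header.subsetN.txt -> majorfile_header.subsetN.gz
for hdr in header.subset*.txt; do
    out="majorfile_${hdr%.txt}.gz"
    # awk does not decompress gzip data itself, so decompress on the fly
    # and recompress the selected columns.
    gzip -dc majorfile.gz |
    awk '
        FNR==NR { headers[$1]; next }          # 1st file: wanted column names
        FNR==1  { for (i=1; i<=NF; i++)        # header row: remember positions
                      if ($i in headers) keep[++n] = i }
                { for (k=1; k<=n; k++)         # every row: print kept columns
                      printf "%s%s", $(keep[k]), (k<n ? OFS : ORS) }
    ' "$hdr" - | gzip > "$out"
done
```

Here `-` tells awk to read the decompressed stream from standard input as its second file, which avoids the bash-only `<( )` process substitution. As in the answer above, add `BEGIN { FS=OFS="\t" }` if the fields are tab delimited.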

edited Nov 15, 2022 at 21:00

answered Nov 15, 2022 at 20:47

markp-fuso


micans · Answer · 2023-08-22 23:37:25Z

To help with exactly this sort of problem, I made a tool for working with columns and rows in tabular data that has column names. The task above would be solved by

zcat majorfile.gz | pick $(cat header.subset.txt)

although pick only accepts tab-separated files. For anything else, the user is responsible for changing the crazy format (looking at you, comma-separated files) into something sane. Normally I don't flog scripts on the internet (much), but this lack of functionality in Unix has bugged me for a long time. Pick can do a lot of things (selecting, changing, combining, and computing new columns, as well as filtering rows), and I use it on a daily basis. It does not answer the question in the narrow sense, but I will add that abstraction layers and domain-specific languages are the tools of the trade.
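
For tab-only tools like this one, a space-delimited file such as the one in the question has to be converted first. A minimal sketch, assuming single spaces separate the fields and no field contains a space or tab itself; the sample line is copied from the question:

```shell
# One data line from the question, space-delimited.
line='LR761571.1_34273 G C 0.9955 0.0045 0'

# Option 1: tr swaps every space for a tab (assumes no spaces inside fields).
with_tr=$(printf '%s\n' "$line" | tr ' ' '\t')

# Option 2: awk re-joins the fields with OFS; assigning $1=$1 forces awk to
# rebuild the record, and runs of blanks collapse into one separator.
with_awk=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print }')

printf '%s\n' "$with_tr"
```

On the real data, the same `tr ' ' '\t'` step can sit between `zcat majorfile.gz` and the tab-only tool in the pipeline.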
