
I have a file that contains a header and information under it.

zcat majorfile.gz | head -n 3 | cut -d ' ' -f1-10

marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C071898_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0 0.9996 0.0004 0 1
LR761571.1_34285 G A 0.9934 0.0066 0 0.9999 0.0001 0 0.9435

I'd like to subset that file based on the column names:

cat header.subset.txt | head
marker
alleleA
alleleB
FINCH_WB_633_splitMerged
FINCH_WB_ES1B002_splitMerged
FINCH_WB_JH1417_splitMerged
FINCH_WB_JH1452_splitMerged
FINCH_WB_JH1495_splitMerged
FINCH_WB_JP000_splitMerged
FINCH_WB_JP004_splitMerged

I have multiple "header.subset.txt" files so I'm going to loop through them.

for file1 in header.subset.txt 
do 
awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b[i]=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}' \
  $file1 \
  majorfile.gz > majorfile_sub.gz
done 

The awk command works for a file with tab-separated fields, but not with space-separated fields (as in this case).

For the example above, the desired output is:

marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0
LR761571.1_34285 G A 0.9934 0.0066 0

EDIT: here's the awk code above formatted by gawk -o- to be much easier to read (but obviously still lacking meaningful variable names):

NR == FNR {
        a[$1]++
        next
}

{
        if (FNR == 1) {
                for (i = 1; i <= NF; i++) {
                        if (a[$i]) {
                                printf $i " "
                                b[i] = $i
                        }
                }
        } else {
                printf "\n"
                for (j = 1; j <= NF; j++) {
                        if (b[j]) {
                                printf $j " "
                        }
                }
        }
}

END {
        printf "\n"
}
  • and the question is? (The reader on stackoverflow should NOT need to guess what the real question is) Commented Nov 15, 2022 at 19:53
  • Maybe you should read something about "FS" (field separator) and "OFS" (output field separator) in the manual? Commented Nov 15, 2022 at 19:56
  • "I'd like to subset that file based on the column names:" So I'd like to extract the columns of majorfile.gz whose names match the lines in header.subset.txt. So 'marker', 'alleleA', 'alleleB' and all other matching columns will be selected. Is this clear? The last chunk of code shows a small example (here, FINCH_WB_C049985_splitMerged and FINCH_WB_C071898_splitMerged will be dropped). Commented Nov 15, 2022 at 19:58
  • If you're going to ask people for help with your code, don't cram it all onto 1 line with almost no white space like awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b[i]=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}'. Instead format it in a way that's legible. Also use meaningful variable names. Make it as easy as possible for us to help you, not almost impossible. Commented Nov 15, 2022 at 20:04
  • I'm surprised your script works when feeding it majorfile.gz (or is the suffix misleading, i.e., the file really isn't gzip'd?) Commented Nov 15, 2022 at 20:24
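
The last comment's point can be checked directly: awk reads a file's bytes as they are and does not decompress anything, so it matters whether majorfile.gz is really gzip data. A minimal sketch of such a check, using `gzip -t` (integrity test) on two throwaway files; the file names and the `is_gzipped` helper are hypothetical and only for illustration:

```shell
set -eu

# Hypothetical demo files: one real gzip file, one plain-text file
# that merely carries a .gz suffix.
chkdir=$(mktemp -d)
printf 'marker alleleA\n' > "$chkdir/plain.txt"
gzip -c "$chkdir/plain.txt" > "$chkdir/real.gz"
cp "$chkdir/plain.txt" "$chkdir/fake.gz"

# Succeeds only if the file is valid gzip data.
is_gzipped() {
    gzip -t "$1" 2>/dev/null
}

for f in "$chkdir/real.gz" "$chkdir/fake.gz"; do
    if is_gzipped "$f"; then
        echo "$(basename "$f"): gzip data, decompress it before handing it to awk"
    else
        echo "$(basename "$f"): not gzip data, awk can read it directly"
    fi
done
```

If the file is genuinely compressed, it has to be decompressed (for example with zcat) before awk sees it.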

2 Answers


A variation on OP's current code:

awk '
#BEGIN  { FS=OFS="\t" }                             # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
        { sep=""
          for (i=1; i<=NF; i++) {
              if (FNR==1 && ($i in headers)) {
                 fldids[i]
              }
              if (i in fldids) {
                 printf "%s%s",sep,$i
                 sep=OFS                            # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
              }
          }
          print ""
        }
' header.subset.txt <(zcat majorfile.gz)

This generates:

marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0
LR761571.1_34285 G A 0.9934 0.0066 0
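
The question also mentions several header subset files and a gzip'd output. A minimal sketch of how the same column-selection logic might be wrapped in a loop that writes one compressed output per subset file. The demo data, the `header.subset1.txt` / `header.subset2.txt` names, and the output naming scheme are assumptions made for illustration, not taken from the question; `gzip -dc` is used as a portable stand-in for zcat, and `-` tells awk to read the decompressed data from stdin:

```shell
set -eu

# Hypothetical demo data shaped like the question's file
# (space-separated, repeated sample names in the header).
workdir=$(mktemp -d)
printf '%s\n' \
    'marker alleleA alleleB S1 S1 S1 S2 S2 S2' \
    'm1 G C 0.9 0.1 0 1 0 0' \
    'm2 G A 0.8 0.2 0 0.5 0.5 0' | gzip > "$workdir/majorfile.gz"
printf '%s\n' marker alleleA alleleB S1 > "$workdir/header.subset1.txt"
printf '%s\n' marker S2 > "$workdir/header.subset2.txt"

for subset in "$workdir"/header.subset*.txt; do
    name=$(basename "$subset" .txt)
    # One output per subset file, so later iterations do not overwrite earlier ones.
    out="$workdir/majorfile_${name}.gz"

    gzip -dc "$workdir/majorfile.gz" |
    awk '
    FNR == NR { wanted[$1]; next }            # first file: wanted column names
    FNR == 1  {                               # header of the data: remember field numbers
        for (i = 1; i <= NF; i++)
            if ($i in wanted) keep[i]
    }
    {                                         # every data line, header included
        sep = ""
        for (i = 1; i <= NF; i++)
            if (i in keep) { printf "%s%s", sep, $i; sep = OFS }
        print ""
    }
    ' "$subset" - | gzip > "$out"
done
```

Each result can be inspected with `gzip -dc` (or zcat) on the corresponding output file.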

To help with exactly this sort of problem, I made a tool for working with the columns and rows of tabular data that has column names. The problem above would be solved by:

zcat majorfile.gz | pick $(cat header.subset.txt)

Note that pick only accepts tab-separated files. For anything else, the user is responsible for converting the crazy format (looking at you, comma-separated files) into something sane. Normally I don't flog scripts on the internet (much), but this lack of functionality in Unix has bugged me for a long time. pick can do a lot of things (selecting, changing, combining, and computing new columns, as well as filtering rows), and I use it on a daily basis. It does not answer the question in the narrow sense, but I will add that abstraction layers and domain-specific languages are the tools of the trade.
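
Because the file in the question is space-separated, it would need converting before any tab-only tool could read it. A minimal sketch of that conversion step with `tr`, assuming fields are delimited by exactly one space and contain no embedded spaces; the demo file names are hypothetical, and pick itself is not invoked here:

```shell
set -eu

# Hypothetical space-separated, gzip'd demo input.
tabdir=$(mktemp -d)
printf '%s\n' 'marker alleleA S1' 'm1 G 0.9' | gzip > "$tabdir/spaces.gz"

# Decompress and turn every space into a tab.
gzip -dc "$tabdir/spaces.gz" | tr ' ' '\t' > "$tabdir/tabs.tsv"
```

The converted stream could equally be piped straight into the next command instead of being written to a file.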
