I have a file that contains a header line with data rows under it:
zcat majorfile.gz | head -n 3 | cut -d ' ' -f1-10
marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C049985_splitMerged FINCH_WB_C071898_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0 0.9996 0.0004 0 1
LR761571.1_34285 G A 0.9934 0.0066 0 0.9999 0.0001 0 0.9435
I'd like to subset that file based on the column names:
cat header.subset.txt | head
marker
alleleA
alleleB
FINCH_WB_633_splitMerged
FINCH_WB_ES1B002_splitMerged
FINCH_WB_JH1417_splitMerged
FINCH_WB_JH1452_splitMerged
FINCH_WB_JH1495_splitMerged
FINCH_WB_JP000_splitMerged
FINCH_WB_JP004_splitMerged
I have multiple "header.subset.txt" files so I'm going to loop through them.
for file1 in header.subset.txt
do
    awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b[i]=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}' \
        $file1 \
        majorfile.gz > majorfile_sub.gz
done
The awk command works for a file with tab-separated fields, but not for one with space-separated fields (as in this case).
For the example above, the result should be:
marker alleleA alleleB FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged FINCH_WB_633_splitMerged
LR761571.1_34273 G C 0.9955 0.0045 0
LR761571.1_34285 G A 0.9934 0.0066 0
EDIT: here's the awk code above formatted by gawk -o- to be much easier to read (but obviously still lacking meaningful variable names):
NR == FNR {
    a[$1]++
    next
}
{
    if (FNR == 1) {
        for (i = 1; i <= NF; i++) {
            if (a[$i]) {
                printf $i " "
                b[i] = $i
            }
        }
    } else {
        printf "\n"
        for (j = 1; j <= NF; j++) {
            if (b[j]) {
                printf $j " "
            }
        }
    }
}
END {
    printf "\n"
}
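For comparison, here is a minimal sketch of the same column selection with descriptive variable names. It rests on an assumption that the post does not confirm: that majorfile.gz really is gzip-compressed. awk cannot read compressed data, so the sketch decompresses the stream first. The function name subset_columns and the output file naming in the commented loop are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: keep the columns of a gzip'd, whitespace-separated table
# whose header names appear (one per line) in a names file.
# Assumes the names file is not empty (NR == FNR would otherwise also
# match the table). subset_columns is a hypothetical helper name.
subset_columns() {
    names_file=$1
    major_gz=$2
    # Decompress on the fly; awk reads the table from stdin ("-").
    gzip -dc "$major_gz" |
    awk '
        NR == FNR { wanted[$1]; next }    # first input: wanted column names
        FNR == 1 {                        # header line of the table
            for (col = 1; col <= NF; col++)
                if ($col in wanted) keep[++nkeep] = col
        }
        {                                 # every table line, header included
            line = ""
            for (k = 1; k <= nkeep; k++)
                line = line (k > 1 ? " " : "") $(keep[k])
            print line
        }
    ' "$names_file" -
}

# Possible use in a loop (glob and output names are assumptions):
# for names_file in header.subset*.txt; do
#     subset_columns "$names_file" majorfile.gz |
#         gzip > "majorfile_sub.${names_file%.txt}.gz"
# done
```

Writing one output file per names file avoids each loop iteration overwriting the same majorfile_sub.gz, and piping through gzip keeps the .gz suffix accurate.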
FINCH_WB_C049985_splitMerged and FINCH_WB_C071898_splitMerged will be dropped)
awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b[i]=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}'. Instead format it in a way that's legible. Also use meaningful variable names. Make it as easy as possible for us to help you, not almost impossible.
majorfile.gz (or is the suffix misleading, i.e., the file really isn't gzip'd?)