Parallelize an awk script with multiple input files and changing the name of the output file

Question

I have a series of text files in a folder, sub.yr_by_yr, which I pass to a for loop to subset a Beagle file by its header values. The subsetting itself is done by my subbeagle.awk script, and I want to parallelize it. I use the name of each text file to name the output: bash suffix removal (file11=${file1%.subbeagle.txt}) gives the stem for the desired output file (MM.beagle.${file11}.gz).

for file1 in $(ls sub.yr_by_yr)
do 
echo -e  "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
       ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done

The for loop works but takes forever, hence the need for parallelization. The folder sub.yr_by_yr contains more than 10 files with names like sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt, ...

I've tried

parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt

But it gives me 'bad substitution'

How could I use the awk script in parallel and rename the files accordingly?

Content of subbeagle.awk:

# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk

BEGIN  { FS=OFS="\t" }                             # input and output fields are tab-delimited
FNR==NR { headers[$1]; next }
        { sep=""
          for (i=1; i<=NF; i++) {
              if (FNR==1 && ($i in headers)) {
                 fldids[i]
              }
              if (i in fldids) {
                 printf "%s%s",sep,$i
                 sep=OFS                            # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
              }
          }
          print ""
        }

Content of MajorMinor.beagle.gz

marker      allele1  allele2  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID2_splitMerged  FINCH_WB_ID2_splitMerged
chr1_34273  G        C        0.79924                   0.20076                   3.18183e-09               0.940649                      0.0593509
chr1_34285  G        A        0.79924                   0.20076                   3.18183e-09               0.969347                      0.0306534
chr1_34291  G        C        0.666111                  0.333847                  4.20288e-05               0.969347                      0.0306534
chr1_34299  C        G        0.000251063               0.999498                  0.000251063               0.996035                      0.00396529

UPDATE:

I was able to get this from this source:

parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

The only thing left to remove is the .subbeagle part of the input file name...

Consider making it easy for folk to help you... Just show your input as <(zcat XXX.gz) and your output as gzip > YYY${file11}.gz. Then show the first three filenames in your subdirectory and the first 3 commands you want GNU Parallel to run with the parameters.
– Mark Setchell, Commented Dec 15, 2022 at 19:52
Please read mywiki.wooledge.org/BashFAQ/001, mywiki.wooledge.org/DontReadLinesWithFor and mywiki.wooledge.org/ParsingLs
– Paul Hodges, Commented Dec 15, 2022 at 20:01
OK, for file1 in sub.yr_by_yr/*.txt would be better in this case. But it doesn't answer the parallel part.
– M. Beausoleil, Commented Dec 15, 2022 at 20:11
fldids[i]?? Don't you want something like fldids[i]=$i? Good luck.
– shellter, Commented Dec 15, 2022 at 20:14
I see you solved your problem with the best tool for the job. And yes, I appreciated that you wanted your scripts to run in parallel. If your system allows 22 jobs in the background (many now do; in the 32-bit days it was often 12-15), changing that one line of code to awk -f subbeagle.awk ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz & (note the trailing &, which means run in the background) would have executed all of your script "instances" with different values for $file1 all at "once" (I think!). Good luck.
– shellter, Commented Dec 16, 2022 at 1:51
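
The background-job idea from the last comment could look roughly like the sketch below. It reuses the loop from the question with a glob instead of `ls`, as suggested in the comments. The function name `run_subsets`, the `mkdir -p`, and the final `wait` are additions for illustration and are not part of the original comment. Note that it starts every job at once with no limit, so it assumes the machine can cope with that many concurrent `zcat | awk | gzip` pipelines.

```shell
#!/usr/bin/env bash
# Sketch: run each subset job in the background, then wait for all of them.
# Paths and file names are taken from the question; run_subsets is a
# hypothetical wrapper name.
run_subsets() {
    mkdir -p sub.yr_by_yr_beagle.files

    for path in sub.yr_by_yr/*.subbeagle.txt
    do
        [ -e "$path" ] || continue        # skip if the glob matched nothing
        file1=${path##*/}                 # strip the directory part
        file11=${file1%.subbeagle.txt}    # strip the .subbeagle.txt suffix
        echo "Doing sub-samples: $file1"
        awk -f subbeagle.awk "$path" <(zcat ../MajorMinor.beagle.gz) \
            | gzip > "sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" &
    done

    wait    # block until every background job has finished
}

run_subsets
```

Unlike GNU Parallel, this gives no control over how many jobs run at the same time, which is one reason the `parallel` solution is usually more comfortable when there are many input files.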

M. Beausoleil · Accepted Answer · 2022-12-15 22:55:34Z

So the parallel tutorial helped me here:

parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

Let's break this down:

--rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'

--rpl will "define a shorthand replacement string" (see the parallel tutorial).
{mymy} is my new replacement string; the Perl expressions after it are applied to each input to produce its value.
s:.*/::; is the definition of {/}, which strips the directory part (see the parallel tutorial, search for "Perl expression replacement string"; the last part of that section shows the definitions of the 7 default replacement strings).
s:\.[^.]+$::;s:\.[^.]+$::; removes 2 extensions (here .subbeagle.txt, where .txt is the first extension and .subbeagle is the second).
```
"awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"
```
is the subsetting and compressing part of the script. Note that {mymy} is where the replacement will take place, and {} is replaced by the input string (the file path). The rest is unchanged!
::: sub.yr_by_yr/*.subbeagle.txt passes all the matching files to parallel as input.
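
To see what the three substitutions in `{mymy}` do to a single input path without running any jobs, the same regular expressions can be applied with `sed`. This is only an illustration: `strip_like_mymy` is a hypothetical helper name, and GNU Parallel itself evaluates the expressions in Perl rather than in `sed`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the same three substitutions as {mymy} to a path.
strip_like_mymy() {
    printf '%s\n' "$1" | sed -E \
        -e 's:.*/::' \
        -e 's:\.[^.]+$::' \
        -e 's:\.[^.]+$::'
}

# Example input name taken from the question.
input=sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt
echo "sub.yr_by_yr_beagle.files/MM.beagle.$(strip_like_mymy "$input").gz"
# prints: sub.yr_by_yr_beagle.files/MM.beagle.sp.yrseries.site1.1.gz
```

The first expression drops the directory, the second drops `.txt`, and the third drops `.subbeagle`, leaving `sp.yrseries.site1.1` as the stem of the output name.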

With the loop, it took ~2 hours to get through ~5 files; using 22 cores, I could process all the files in a fraction of the time (~20 minutes)!

edited Dec 15, 2022 at 22:55

answered Dec 15, 2022 at 21:07

M. Beausoleil


Ole Tange Over a year ago

--plus defines {/..}.
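
Following this comment, the custom `--rpl` definition could probably be dropped in favour of `--plus`, which makes `{/..}` (the basename with two extensions removed) available. The sketch below has not been run against the original data; the guard around the call and the `--dry-run` flag are additions so that the generated commands and output names can be inspected before anything is executed.

```shell
#!/usr/bin/env bash
# Sketch: same job as the accepted answer, using --plus and {/..}
# instead of a custom --rpl. Remove --dry-run to actually run the jobs.
if parallel --version 2>/dev/null | grep -qi 'gnu parallel'; then
    parallel --plus --dry-run \
        "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/..}.gz'" \
        ::: sub.yr_by_yr/*.subbeagle.txt
else
    echo "GNU Parallel not found" >&2
fi
```

For an input such as sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt, `{/..}` should expand to sp.yrseries.site1.1, which matches what `{mymy}` produces in the answer.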
