I have a series of text files in a folder, sub.yr_by_yr, which I pass to a for loop to subset a Beagle file by its header. Each text file lists the header values to keep, and the subsetting itself is done by my subbeagle.awk script. I want to parallelize this loop. I build each output file name from the name of the text file using bash suffix removal (file11=${file1%.subbeagle.txt}) to get the desired output (MM.beagle.${file11}.gz).

for file1 in $(ls sub.yr_by_yr)
do 
echo -e  "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
       ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done

The for loop works, but takes forever, hence the need for parallelization. The folder sub.yr_by_yr contains >10 files with names like sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt...

I've tried

parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt

But it gives me 'bad substitution'

How could I use the awk script in parallel and rename the files accordingly?

Content of subbeagle.awk:

# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk

BEGIN  { FS=OFS="\t" }                             # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
        { sep=""
          for (i=1; i<=NF; i++) {
              if (FNR==1 && ($i in headers)) {
                 fldids[i]
              }
              if (i in fldids) {
                 printf "%s%s",sep,$i
                 sep=OFS                            # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
              }
          }
          print ""
        }

Content of MajorMinor.beagle.gz

marker      allele1  allele2  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID2_splitMerged  FINCH_WB_ID2_splitMerged
chr1_34273  G        C        0.79924                   0.20076                   3.18183e-09               0.940649                      0.0593509
chr1_34285  G        A        0.79924                   0.20076                   3.18183e-09               0.969347                      0.0306534
chr1_34291  G        C        0.666111                  0.333847                  4.20288e-05               0.969347                      0.0306534
chr1_34299  C        G        0.000251063               0.999498                  0.000251063               0.996035                      0.00396529

UPDATE:

I was able to get this from this source:

parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

The only fancy thing that still needs to be removed is the .subbeagle part of the input file name...

  • Consider making it easy for folk to help you... Just show your input as <(zcat XXX.gz) and your output as gzip > YYY${file11}.gz. Then show the first three filenames in your subdirectory and the first 3 commands you want GNU Parallel to run with the parameters. Commented Dec 15, 2022 at 19:52
  • Please read mywiki.wooledge.org/BashFAQ/001, mywiki.wooledge.org/DontReadLinesWithFor and mywiki.wooledge.org/ParsingLs Commented Dec 15, 2022 at 20:01
  • OK, for file1 in sub.yr_by_yr/*.txt would be better in this case. But it doesn't answer the parallel part. Commented Dec 15, 2022 at 20:11
  • fldids[i] ?? Don't you want something like fldids[i]=$i? Good luck. Commented Dec 15, 2022 at 20:14
  • I see you solved your problem with the best tool for the job. And yes, I appreciated that you wanted your scripts to run in parallel. If your system allows 22 jobs in the background (many now do; in the 32-bit days it was often 12-15), changing that one line of code to awk -f subbeagle.awk ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz & (note the ending char, &, which means run in the background) would have executed all of your script "instances" with different values for $file1 all at "once" (I think!). Good luck. Commented Dec 16, 2022 at 1:51
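
The background-job idea from the last comment can be sketched roughly as below. This is only a minimal sketch, not something from the thread: the function name subset_in_background and its four arguments are hypothetical, and it assumes bash (for the process substitution) plus awk, zcat and gzip on the PATH.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): run one awk subset per
# header list, each as a background job, then wait for all of them.
# Arguments: <awk script> <dir with *.subbeagle.txt> <beagle .gz> <output dir>
subset_in_background() {
    local script=$1 indir=$2 beagle=$3 outdir=$4 path base
    for path in "$indir"/*.subbeagle.txt; do
        [ -e "$path" ] || continue          # skip if the glob matched nothing
        base=${path##*/}                    # strip the directory part
        base=${base%.subbeagle.txt}         # strip the double extension
        echo "Doing sub-samples: $base"
        # The trailing & sends the whole pipeline to the background.
        awk -f "$script" "$path" <(zcat "$beagle") |
            gzip > "$outdir/MM.beagle.${base}.gz" &
    done
    wait                                    # block until every job has finished
}

# Example call with the paths from the question (commented out on purpose):
# subset_in_background subbeagle.awk sub.yr_by_yr ../MajorMinor.beagle.gz sub.yr_by_yr_beagle.files
```

Note that this starts every job at once, with no limit on how many run together, whereas GNU Parallel by default caps the number of simultaneous jobs at the number of CPU cores.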

1 Answer

So the parallel tutorial helped me here:

parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

Let's break this down:

--rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'
  • --rpl will "define a shorthand replacement string" (see parallel tutorial and another example here)

  • {mymy} is the name of my 'new' replacement string; wherever it appears in the command, the Perl expressions that follow it are applied to the input value.

  • s:.*/::; is the definition of {/}, i.e. it strips the directory part of the path (see parallel tutorial, search for "Perl expression replacement string", the last part of that section shows the definition of 7 'default' replacement strings)

  • s:\.[^.]+$::;s:\.[^.]+$::; removes 2 extensions (so for .subbeagle.txt, .txt is the first extension removed and .subbeagle is the second)

"awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"
  • This is the subsetting and compressing part of the script. Note that {mymy} is where the stripped file name will be substituted, while {} is replaced by the full input path. The rest is unchanged from the for loop!

  • ::: sub.yr_by_yr/*.subbeagle.txt will pass all the files to parallel as input.
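
To see what {mymy} ends up producing, the same three stripping steps can be mimicked in plain bash. This is just an illustrative sketch: the function mymy below is a hypothetical stand-in, not part of GNU Parallel, and bash's ${name%.*} is only an approximate equivalent of the Perl expression s:\.[^.]+$:: (the two agree for file names like the ones in the question).

```bash
# Hypothetical helper mimicking the {mymy} replacement string in plain bash.
mymy() {
    local name=$1
    name=${name##*/}    # like s:.*/::       drop everything up to the last /
    name=${name%.*}     # like s:\.[^.]+$::  drop the last extension (.txt)
    name=${name%.*}     # same again         drop the next one (.subbeagle)
    printf '%s\n' "$name"
}

mymy sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt   # prints sp.yrseries.site1.1
```

The output file for that input would therefore be sub.yr_by_yr_beagle.files/MM.beagle.sp.yrseries.site1.1.gz.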

The for loop took ~2 hours to get through ~5 files, but with parallel on 22 cores I could process all the files in a fraction of that time (~20 minutes)!

1 Comment

--plus defines {/..}.
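
Building on that comment, the custom --rpl definition could probably be dropped in favour of --plus and {/..} (basename with two extensions removed). The sketch below is an untested assumption rather than something confirmed in the thread; it uses --dry-run so the generated command is only printed, and the sample file name is simply the one mentioned in the question.

```bash
# Sample input name taken from the question; with --dry-run the file
# does not need to exist.
sample=sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt

# Only attempt this when GNU Parallel is actually installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --plus enables {/..}; --dry-run prints the command instead of running it.
    parallel --plus --dry-run \
        "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/..}.gz'" \
        ::: "$sample"
else
    echo "GNU Parallel not found; nothing to show" >&2
fi
```

Once the printed command looks right, removing --dry-run and passing sub.yr_by_yr/*.subbeagle.txt after ::: would run it for real.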
