  • I am using the following script:
    #!/usr/bin/awk -f
    BEGIN {
        FS = "[_.]"
    }
    
    function display() {
        if (length(gene_ids) > 1)
            for (j=0; j <= i; j++)
                print a[j]
    }
    
    {
        if (/^>Cluster /) {
            display()
            delete a
            delete gene_ids
            a[i=0] = $0
        } else {
            a[++i] = $0
            gene_ids[$7] = 1
        }
    }
    
    END {
        display()
    }
  • To process the following file:
>Cluster 0
0   3843aa, >9606_9d1c13f4f2796e1bc5d9c034d256608e_ENSP00000478752_3843_318_ENST00000621744_ENSG00000286185... *
1   3843aa, >9606_9d1c13f4f2796e1bc5d9c034d256608e_ENSP00000498781_3843_318_ENST00000651566_ENSG00000271383... at 1:3843:1:3843/100.00%
>Cluster 17
0   1388aa, >9606_e3f5b4b466cd2bae95842b586d4d5ff5_ENSP00000419786_1388_4_ENST00000465301_ENSG00000243978... *
1   1388aa, >9606_e3f5b4b466cd2bae95842b586d4d5ff5_ENSP00000441452_1388_4_ENST00000540313_ENSG00000243978... at 1:1388:1:1388/100.00%
>Cluster 34
0   1150aa, >9606_c6fca1c116a00dbb0d2e8930f4056625_ENSP00000353655_1150_26_ENST00000360468_ENSG00000196547... *
1   1150aa, >9606_c6fca1c116a00dbb0d2e8930f4056625_ENSP00000452948_1150_26_ENST00000559717_ENSG00000196547... at 1:1150:1:1150/100.00%
>Cluster 39
0   1072aa, >9606_64cead9c681fd594c83c17cc06748bb6_ENSP00000315112_1072_50_ENST00000324103_ENSG00000092098... *
1   1072aa, >9606_64cead9c681fd594c83c17cc06748bb6_ENSP00000457512_1072_50_ENST00000558468_ENSG00000259529... at 1:1072:1:1072/100.00%
>Cluster 271
0       551aa, >9606_95dbfd3f219d32f1cc1074a79bfc576d_ENSP00000415200_551_42_ENST00000429354_ENSG00000268500... *
1       551aa, >9606_95dbfd3f219d32f1cc1074a79bfc576d_ENSP00000470259_551_42_ENST00000599649_ENSG00000268500... at 1:551:1:551/100.00%
2       551aa, >9606_95dbfd3f219d32f1cc1074a79bfc576d_ENSP00000473238_551_42_ENST00000534261_ENSG00000105501... at 1:551:1:551/100.00%
>Cluster 284
0       547aa, >9606_8ed59e1e16a1229b55495ff661b5aa66_ENSP00000354675_547_9_ENST00000361229_ENSG00000198908... *
1       547aa, >9606_8ed59e1e16a1229b55495ff661b5aa66_ENSP00000361820_547_9_ENST00000372735_ENSG00000198908... at 1:547:1:547/100.00%
2       547aa, >9606_8ed59e1e16a1229b55495ff661b5aa66_ENSP00000391722_547_9_ENST00000448867_ENSG00000198908... at 1:547:1:547/100.00%
3       547aa, >9606_8ed59e1e16a1229b55495ff661b5aa66_ENSP00000403226_547_9_ENST00000457056_ENSG00000198908... at 1:547:1:547/100.00%
4       547aa, >9606_8ed59e1e16a1229b55495ff661b5aa66_ENSP00000405893_547_9_ENST00000447531_ENSG00000198908... at 1:547:1:547/100.00%
  • Which results in the following output:
>Cluster 0
0   3843aa, >9606_9d1c13f4f2796e1bc5d9c034d256608e_ENSP00000478752_3843_318_ENST00000621744_ENSG00000286185... *
1   3843aa, >9606_9d1c13f4f2796e1bc5d9c034d256608e_ENSP00000498781_3843_318_ENST00000651566_ENSG00000271383... at 1:3843:1:3843/100.00%
>Cluster 39
0   1072aa, >9606_64cead9c681fd594c83c17cc06748bb6_ENSP00000315112_1072_50_ENST00000324103_ENSG00000092098... *
1   1072aa, >9606_64cead9c681fd594c83c17cc06748bb6_ENSP00000457512_1072_50_ENST00000558468_ENSG00000259529... at 1:1072:1:1072/100.00%
>Cluster 271
0       551aa, >9606_95dbfd3f219d32f1cc1074a79bfc576d_ENSP00000415200_551_42_ENST00000429354_ENSG00000268500... *
1       551aa, >9606_95dbfd3f219d32f1cc1074a79bfc576d_ENSP00000470259_551_42_ENST00000599649_ENSG00000268500... at 1:551:1:551/100.00%
2       551aa, >9606_95dbfd3f219d32f1cc1074a79bfc576d_ENSP00000473238_551_42_ENST00000534261_ENSG00000105501... at 1:551:1:551/100.00%
  • The script works like a charm on my test machine (running GNU Awk 5.1.0, API: 3.0). But when I run it on my production machines (running either GNU Awk 5.1.0 or GNU Awk 4.1.4), it fails with the following error:
(FILENAME=test_cluster FNR=1) fatal: attempt to use scalar `gene_ids' as an array
  • I have tested whether the error is related to length(array) by running the following:
awk 'BEGIN{a[1]=10;a[2]=20;print length(a)}'

as suggested here

  • But this gives me the expected result on all my machines.

  • I have also checked the state of the posix shell option, using the following command:

set -o | grep posix
  • But this test gives me the same result (off) on all my machines.

  • Given that my production machines are all running Ubuntu server 18.01, I have also tested using AWK on an Ubuntu 20.01 server machine, but the result was the same (not successful).

  • Also, given that my test machine (running GNU Awk 5.1.0) is a macOS machine with AWK installed via MacPorts, I have tried compiling AWK on my Ubuntu machines using the same configuration command. The compilation worked, but running the script with this newly compiled AWK gave me the same error.

  • I would appreciate any help identifying the origin of the problem and possible solutions.

1 Answer

length(gene_ids) declares gene_ids as a scalar if gene_ids was previously unused, because historically length() was used only on strings. (That behavior will change in an upcoming gawk release, so that length() won't set the type of its argument if it was previously unset.)

Add delete gene_ids to the BEGIN section to declare it as an array regardless of the order in which the existing lines of your script get hit, which is driven by your input data:

$ awk 'BEGIN{ length(gene_ids); gene_ids[1] }'
awk: cmd. line:1: fatal: attempt to use scalar `gene_ids' as an array

$ awk 'BEGIN{ delete gene_ids; length(gene_ids); gene_ids[1] }'
$
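
Applied to the script from the question, the fix might look like the sketch below. It is wrapped in a shell function, filter_clusters, a name made up here for illustration, and the sample records are invented, shortened stand-ins with the same field layout as the real input. It assumes an awk that accepts length() on an array and delete on a whole array, as gawk does. The only changes from the original script are the delete gene_ids line in BEGIN and making j a local variable of display().

```shell
#!/bin/sh
# filter_clusters is a hypothetical name; it prints only the clusters
# that contain more than one distinct gene ID (field 7 when splitting on _ or .).
filter_clusters() {
    awk '
    BEGIN {
        FS = "[_.]"
        delete gene_ids          # declare gene_ids as an array before length() sees it
    }

    function display(    j) {    # j is local to the function
        if (length(gene_ids) > 1)
            for (j = 0; j <= i; j++)
                print a[j]
    }

    /^>Cluster / {
        display()                # flush the previous cluster, if it qualifies
        delete a
        delete gene_ids
        a[i = 0] = $0
        next
    }

    {
        a[++i] = $0
        gene_ids[$7] = 1
    }

    END {
        display()                # flush the last cluster
    }
    ' "$@"
}

# Invented sample input: Cluster 0 has two gene IDs, Cluster 1 has one.
printf '%s\n' \
    '>Cluster 0' \
    '0   10aa, >9606_abc_ENSP1_10_1_ENST1_ENSG1... *' \
    '1   10aa, >9606_abc_ENSP2_10_1_ENST2_ENSG2... at 1:10:1:10/100.00%' \
    '>Cluster 1' \
    '0   20aa, >9606_def_ENSP3_20_2_ENST3_ENSG3... *' \
    '1   20aa, >9606_def_ENSP4_20_2_ENST4_ENSG3... at 1:20:1:20/100.00%' |
    filter_clusters
```

With this input, only the three lines of Cluster 0 should be printed, since Cluster 1 contains a single gene ID. The function also accepts file names, for example filter_clusters test_cluster.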