
I have an array, let's call it ensembldb, that contains the following lines:

rs2799070   ENST00000379389 ENSG00000187608 ISG15   inframe_insertion   NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NM_005101.3    NP_005092
rs2799070   ENST00000458555 ENSG00000224969 AL645608.2  missense_variant    NA  NA  antisense   NA  NULL    NULL
rs2799070   ENST00000624652 ENSG00000187608 ISG15   inframe_deletion    NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL
rs2799070   ENST00000624697 ENSG00000187608 ISG15   frameshift_variant  NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL

and another ordered array, let's call it ordered_array:

frameshift_variant
missense_variant
inframe_insertion
inframe_deletion

I would like to sort ensembldb so that its rows follow the order given in ordered_array. The expected output is:

rs2799070   ENST00000624697 ENSG00000187608 ISG15   frameshift_variant  NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL
rs2799070   ENST00000458555 ENSG00000224969 AL645608.2  missense_variant    NA  NA  antisense   NA  NULL    NULL
rs2799070   ENST00000379389 ENSG00000187608 ISG15   inframe_insertion   NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NM_005101.3    NP_005092
rs2799070   ENST00000624652 ENSG00000187608 ISG15   inframe_deletion    NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL

I checked this question, but it doesn't answer mine because it deals with a multidimensional array. How can I order my array ensembldb according to ordered_array?

Thank you.

Edit 1: Adding code as requested by @anubhava

declare -a ordered_array  # indexed array; an associative array (-A) does not guarantee iteration order
ordered_array[0]="frameshift_variant"
ordered_array[1]="missense_variant"
ordered_array[2]="inframe_insertion"
ordered_array[3]="inframe_deletion"

while read -r LINE; do
    chrom=$(echo -e "$LINE" | cut -f1 -d$'\t' | sed 's/^chr//g')
    pos=$(echo -e "$LINE" | cut -f2 -d$'\t')
    ref=$(echo -e "$LINE" | cut -f3 -d$'\t')
    alt=$(echo -e "$LINE" | cut -f4 -d$'\t')
    LINE=$(echo -e "$LINE" | sed 's/^chr//g')
    ensembldb=$(echo "PREPARE stmt1 FROM 'SELECT Annotated_ID, Transcript, Gene_ID, Gene_name, Consequence, Swissprot_ID, AA_change, Biotype, Gene_description, RefSeq_mRNA, RefSeq_peptide FROM SNP_annot.37_annot_ensembl_89_full_descr where chrom = \"$chrom\" and Start = \"$pos\" and Local_alleles = \"$ref/$alt\"'; execute stmt1;" | mariadb -A -N)
    readarray -t array <<< "$ensembldb"
    pos19=$(echo "PREPARE stmt2 FROM 'select hg19_pos from SNP_annot.mut_convert_pos where chrom = \"$chrom\" and hg38_pos = \"$pos\"'; execute stmt2;" | mariadb -A -N)
    hits=$(echo -e "$ensembldb" | wc -l)
    [ ! -z "$pos19" ] && awk -v line="$LINE" -v pos="$pos19" -v ensembl="$ensembldb" -v hit="$hits" 'BEGIN {print line"\t"ensembl"\t"hit"\t"pos}'
done

1. The variable LINE holds rows like these:

CHROM   POS REF ALT QUAL    DP  Genotype
chr1    16495   G   C   1722.77 252 G/C
chr1    16719   T   A   145.77  189 T/A
chr1    16841   G   T   701.77  521 G/T
chr1    17626   G   A   154.77  124 G/A

2. The variable ensembldb holds the result of a MySQL query that returns multiple rows, which is then converted to an array. I want to sort those rows according to ordered_array and pick the first row that matches.

  • @anubhava I added some code. Hopefully it's clear. Commented Jan 18, 2019 at 9:52
  • @Law Some feedback on my answer would be nice. Doesn't it do what you want? :) Commented Jan 18, 2019 at 9:54
  • @mickp I am trying it right now, I will let you know asap Commented Jan 18, 2019 at 9:57

1 Answer


This awk might work for you:

awk 'FNR==NR{a[$5]=$0;next}{print a[$1]}' file_a file_b
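To unpack the one-liner: awk reads file_a first, where FNR==NR holds, and stores each row under its fifth field; it then reads file_b and prints the stored row for each key. Here is a minimal, self-contained sketch of that idiom. The temporary directory, the rows abbreviated to five columns, the -F'\t' option and the `$1 in row` guard are assumptions added for illustration and are not part of the command above.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)

# file_a: rows to reorder (abbreviated to the first five tab-separated columns).
# printf reuses the format string, so this writes four lines.
printf '%s\t%s\t%s\t%s\t%s\n' \
  rs2799070 ENST00000379389 ENSG00000187608 ISG15 inframe_insertion \
  rs2799070 ENST00000458555 ENSG00000224969 AL645608.2 missense_variant \
  rs2799070 ENST00000624652 ENSG00000187608 ISG15 inframe_deletion \
  rs2799070 ENST00000624697 ENSG00000187608 ISG15 frameshift_variant > "$tmpdir/file_a"

# file_b: the desired order, one consequence per line.
printf '%s\n' frameshift_variant missense_variant inframe_insertion inframe_deletion > "$tmpdir/file_b"

sorted=$(awk -F'\t' '
  FNR == NR { row[$5] = $0; next }   # while reading file_a: index each row by column 5
  $1 in row { print row[$1] }        # while reading file_b: print the stored row, if any
' "$tmpdir/file_a" "$tmpdir/file_b")

printf '%s\n' "$sorted"
rm -rf "$tmpdir"
```

Note that if several rows share the same consequence, only the last one read is kept, because each assignment to row[$5] overwrites the previous one.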

If a and b are really arrays:

readarray -t a < <(awk 'FNR==NR{a[$5]=$0;next}{print a[$1]}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}"))
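Going by the edit to the question, ensembldb is a single newline-separated string rather than an array, and only the first row in priority order is needed. A minimal pure-bash sketch under those assumptions follows. The sample rows are abbreviated to five columns, and the names rows and best are hypothetical.

```bash
#!/usr/bin/env bash
# Requires bash 4+ for readarray.

# Indexed array, so "${ordered_array[@]}" expands in the order written.
ordered_array=(frameshift_variant missense_variant inframe_insertion inframe_deletion)

# Stand-in for the mariadb output: one string, rows separated by newlines
# (abbreviated to the first five tab-separated columns).
ensembldb=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  rs2799070 ENST00000379389 ENSG00000187608 ISG15 inframe_insertion \
  rs2799070 ENST00000458555 ENSG00000224969 AL645608.2 missense_variant \
  rs2799070 ENST00000624652 ENSG00000187608 ISG15 inframe_deletion \
  rs2799070 ENST00000624697 ENSG00000187608 ISG15 frameshift_variant)

# Split the string into one array element per row.
readarray -t rows <<< "$ensembldb"

best=""
for consequence in "${ordered_array[@]}"; do
  for row in "${rows[@]}"; do
    # Column 5 holds the consequence; cut keeps empty fields in place.
    row_consequence=$(cut -f5 <<< "$row")
    if [[ $row_consequence == "$consequence" ]]; then
      best=$row
      break 2   # stop at the first row matching the highest-priority consequence
    fi
  done
done

printf '%s\n' "$best"
```

The outer loop walks the priorities and the inner loop scans the rows, so the first hit is by construction the highest-ranked consequence present. If no row matches any entry of ordered_array, best stays empty.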

7 Comments

Could you please explain the command? Thank you in advance.
Also, shouldn't you be passing the arrays to awk as awk variables with the -v argument?
First of all, does the solution work for you? :) No point in explaining if it doesn't work.
No, the solution is not working for me, sorry. I tried readarray -t a < <(awk 'FNR==NR{ensembldb[$5]=$0;next}{print ensembldb[$1]}' <(printf '%s\n' "${ensembldb[@]}") <(printf '%s\n' "${ordered_array[@]}")), but echo "$a" returns nothing.
Show the input properly in your question. For example what your a variable contains. It is not an array as I can see now after your edit BTW.