0

I've essentially created a function which inputs a string, and then adds the integers which appear before each unique letter. (e.g. 21M4D35M, would provide 56M and 4D).

import re
import itertools

# example input: X is "21M4D35M1I84M9S15=92X"

def CIGAR(X):
    m = list(map(int, (re.findall(r"(\d+)M", X)))) #need to create a loop for these.
    d = list(map(int, (re.findall(r"(\d+)D", X))))
    n = list(map(int, (re.findall(r"(\d+)N", X))))
    i = list(map(int, (re.findall(r"(\d+)I", X))))
    s = list(map(int, (re.findall(r"(\d+)S", X))))
    h = list(map(int, (re.findall(r"(\d+)H", X))))
    x = list(map(int, (re.findall(r"(\d+)X", X))))
    equals = list(map(int, (re.findall(r"(\d+)=", X))))
    
    query = itertools.chain(m,n,d,i,s,h,x,equals)
    reference = itertools.chain(m,n,d,i,s,h,x,equals)
    
    query = m + i + s + equals + x #according to samtools, these operations cause the alignment to step along the query sequence 
    reference = m + d + n + equals + x
    
    print(sum(m), "Exact Matches\n", 
          sum(d), "Deletions\n", 
          sum(n), "Skipped region from the reference\n",
          sum(i), "Insertions\n",
          sum(s), "Soft Clippings\n", 
          sum(h), "Hard Clippings\n",
          sum(x), "sequence match\n",
          sum(equals), "sequence mismatch\n",
          sum(query), "bases in query sequence\n",
          sum(reference), "bases in the reference sequence") 

However, I'm trying to implement a more efficient way to perform the first part of the function (lines 7-14). I'm trying to do something like, but don't know exactly how to implement this

acronyms = [m,d,i,s,h,x,"="]
for i in acronyms:
i = list(map(int, (re.findall(r"(\d+)capitalize[i]", X))))

As Always, any help is greatly appreciated and if something isn't clear let me know!

1
  • 2
    Use string formatting to create the regex and put the appropriate letter in it. Instead of trying to figure out which letter to use from the variable name, don't have those variables in the first place - instead, use a dict that maps from the letter to the corresponding integers. Functionally, that means you have two questions, both of which are duplicates. Commented Nov 8, 2021 at 17:58

1 Answer 1

2

An efficient solution would be to used collections.Counter:

def CIGAR(cigar_string):
    from collections import Counter

    c = Counter()
    for n, k in re.findall('(\d+)(\D)', cigar_string):
        c[k] += int(n)


    keys = {'M': 'Exact Matches', 
            'D': 'Deletions', 
            'N': 'Skipped region from the reference',
            'I': 'Insertions',
            'S': 'Soft Clippings', 
            'H': 'Hard Clippings',
            'X': 'sequence match',
            '=': 'sequence mismatch',
            }

    for k in keys:
        print(f'{c[k]:>5}: {keys[k]}')

    query = sum(c[x] for x in 'MIS=X')
    ref   = sum(c[x] for x in 'MDN=X')
    print(f'{query:>5}: bases in query sequence')
    print(f'{ref:>5}: bases in reference sequence')

    return c

Example:

>>> CIGAR('21M4D35M')
   56: Exact Matches
    4: Deletions
    0: Skipped region from the reference
    0: Insertions
    0: Soft Clippings
    0: Hard Clippings
    0: sequence match
    0: sequence mismatch
   56: bases in query sequence
   60: bases in reference sequence

Counter({'M': 56, 'D': 4})
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.