3

My data looks like this:

Forward primer
CGAAGCCTGGGGTGCCCGCGATTT Plus 24 1 24 71.81 66.67 4.00 2.00 

Reverse primer
AAATCGGTCCCATCACCTTCTTAT Minus 24 420 397 59.83 41.67 5.00 2.00 

Product length
420 


Products on potentially unintended templates


>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


product length = 2946
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1927603  .......C....C..T..T...G.  1927626

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


>CP046728.2 Mycobacterium tuberculosis strain TCDC11 chromosome, complete genome product length = 420
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2150761  ........................  2150784

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2151180  ........................  2151157


product length = 2595
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2148586  .......C....C..T..T...G.  2148609

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2151180  ........................  2151157


>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621

What I need is

>CP049108.1 = 495   1930054 1930548 
>CP046728.2 = 420   2150761 2151180
>CP047258.1 = 345   2166300 2166644

I am a microbiologist and a Python beginner. I tried:

import re
file = open(r"C:\\Users\\Lab\\Desktop\\amplicons\\ETRA", "r")
handle = file.read()
file.close()

pattern1 = re.compile(r'>.{5,10}\.\d')
matches1 = pattern1.finditer(handle)

for match1 in matches1:
    print(match1.group(0))

but I also need specific terms that come after my accession number (for example, the accession number is >CP049108.1). I will adapt what I learn here to my other work too.

I appreciate your help. Thank you in advance.

5 Answers

3

Here's what I came up with: >([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)

Let's look at an example with only one record. You can feed in all the data you pasted and still get every result, because Python's re.finditer and re.findall return all non-overlapping matches (Python has no separate global flag to set).

>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525

The first group will be - CP049108.1

The second group will be - 495

The third group will be - 1930054

The fourth (and final) group will be - 1930548

Of course, you can now restructure the data however you want. If you're reading the data from a text file, you can use this code snippet:

import re

with open('test.txt', 'r') as file:
    content = file.read()

pattern = re.compile(r'>([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)')

for match in pattern.finditer(content):
    # Three spaces after the product length, as in the requested output
    output = '>{} = {}   {} {}'.format(match.group(1), match.group(2), match.group(3), match.group(4))
    print(output)

If I feed in exactly the data set you provided to test.txt, I get this output-

>CP049108.1 = 495   1930054 1930548
>CP046728.2 = 420   2150761 2151180
>CP047258.1 = 345   2166300 2166644

Regex Explanation

>(\w+\.\d+) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)

  • Let's analyze the first line first: >(\w+\.\d+) .+= (\d+)\n

    \w+ matches CP049108 and stops at the . (dot), \. matches the dot, and \d+ matches the digits that follow (here, 1). Together they capture CP049108.1 in a single group. (\w+\.\d+ is a simpler equivalent of the [\w\d]*?\.\d*? form, since \w already includes digits.) The " .+= " part then skips the rest of the description, up to the last = on the line.

    (\d+) then grabs the digits right after the = (here, 495), and \n moves on to the next line.

  • Time for the second line: .+\n

    The second line (Forward primer) is simply skipped.

  • Now, the third line: .*?(\d+).+\n{2}

    It skips everything up to the first run of digits, captures it, then consumes the rest of the line and the blank line that follows (2 newlines). In this case the result is 1930054.

  • Now, the fourth line: .+\n

    This line (Reverse primer) is also skipped.

  • Finally, the last line: .*?(\d+)

    This works exactly like the third line; the result is 1930548.
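The line-by-line breakdown above can be checked with a small sketch. It assumes the record is laid out exactly as in the question (no trailing spaces, one blank line between the two primer blocks); the variable names are only illustrative.

```python
import re

# One record copied from the question's sample data.
record = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24\n"
    "Template        1930054  ........................  1930077\n"
    "\n"
    "Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24\n"
    "Template        1930548  ........................  1930525\n"
)

# The same pattern, split into one piece per line of the record.
pattern = re.compile(
    r">(\w+\.\d+) .+= (\d+)\n"  # header line: accession and product length
    r".+\n"                     # Forward primer line, skipped
    r".*?(\d+).+\n{2}"          # first Template line, then the blank line
    r".+\n"                     # Reverse primer line, skipped
    r".*?(\d+)"                 # second Template line: first number
)

match = pattern.search(record)
accession, length, fwd_start, rev_start = match.groups()
print(accession, length, fwd_start, rev_start)
```

Adjacent raw string literals are concatenated by Python, so this is the same regex as the one-line version, only easier to read next to the explanation.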

Check out the demo!


2 Comments

@ChooseelBunsuwansakul Welcome to stackoverflow! You should mark an answer as accepted by clicking on the tick mark next to the answer. This ensures the next person searching up the question, knows immediately what worked :)
In this part >([\w\d]*?\.\d*?) the \w also matches \d in the character class. You don't have to make the quantifiers non greedy as \w and \d can not cross the dot or the space boundary. You could use >(\w+\.\d+)
2

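You can match the following regular expression and then extract the contents of the four capture groups:

r'^(>CP\d{6}\.\d).+?\bproduct +length += +(\d+).*?^Template +(\d+).*?^Template +(\d+)'

with the global, multiline and single-line (dot matches newline) modes set. In Python that corresponds to the flags re.MULTILINE | re.DOTALL, with re.finditer or re.findall taking the place of the global flag.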

Demo

This can be made self-documenting by employing Python's VERBOSE (aka X) flag.

This is derived from Perl's free-spacing mode, which I'll use because I'm not familiar with Python. Readers who don't know Perl will be able to follow this just fine.

/
^                        # match beginning of line        
(>CP\d{6}\.\d)           # match '>CP', 6 digits, '.', 1 digit in cap group 1  
.+?                      # match 1+ characters, lazily (`?`)
\b                       # match a word break               
product\ +length\ +=\ +  # match 'product', 1+ spaces, 'length', 1+ spaces,
                         #   '=', 1+ spaces 
(\d+)                    # match 1+ digits in cap group 2
.*?                      # match 0+ characters, lazily (`?`)
^Template\ +             # match beginning of line, 'Template', 1+ spaces 
(\d+)                    # match 1+ digits in cap group 3
.*?                      # match 0+ characters, lazily (`?`)
^Template\ +             # match beginning of line, 'Template', 1+ spaces 
(\d+)                    # match 1+ digits in cap group 4                
/xgms                    # free-spacing, global, multiline, single-line modes

The meanings of the different modes are given at the link. In free-spacing mode, non-escaped spaces outside of character classes are removed before the regex is parsed. Spaces that are part of the expression, such as those between "product" and "length", must therefore be protected. I've chosen to escape them here, but other options are to put each space in a character class ([ ]), use the Unicode property \p{Space} or the POSIX class [[:space:]] or, if appropriate, \s (a whitespace character).
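For readers who want the same thing in Python, here is a minimal sketch of the free-spacing pattern using re.VERBOSE. It assumes the input is laid out as in the question; the sample text is a shortened copy of that data and the variable names are only illustrative. Python has no global flag, so re.finditer is used to walk over all matches, and the spaces are protected by escaping them, which Python's verbose mode also accepts.

```python
import re

# Shortened copy of the question's data: two records, plus one secondary
# "product length" block that has no accession line and must be skipped.
sample = """\
>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


product length = 2946
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1927603  .......C....C..T..T...G.  1927626

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621
"""

pattern = re.compile(
    r"""
    ^(>CP\d{6}\.\d)            # group 1: '>CP', 6 digits, '.', 1 digit
    .+?                        # 1+ characters, lazily
    \bproduct\ +length\ +=\ +  # literal text; spaces are escaped in verbose mode
    (\d+)                      # group 2: product length
    .*?                        # 0+ characters, lazily (may span lines)
    ^Template\ +(\d+)          # group 3: first number on the first Template line
    .*?                        # 0+ characters, lazily (may span lines)
    ^Template\ +(\d+)          # group 4: first number on the second Template line
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

results = [m.groups() for m in pattern.finditer(sample)]
for accession, length, start, end in results:
    print(f"{accession} = {length}   {start} {end}")
```

re.MULTILINE makes ^ match at the start of every line, and re.DOTALL lets . cross newlines, which is what allows the lazy .*? parts to reach the next Template line.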


0

Here's one way to do it. The regex search is rather dependent on the text being consistent with the file you provided.

rows = re.findall(r'(>.{5,10}).*length( = \d*).*\s*.*\s*Template\s*(   \d*).*\s*.*\s*Template\s*( \d*)', handle)
print('\n'.join(''.join(x) for x in rows))


0

This is what you need:

matches = re.findall(r'(>.{5,10}\.\d).*?( = \d+).*?Template.*?( \d+).*?Template.*?( \d+)', handle, re.DOTALL)
["".join(x) for x in matches]

The regex declares four groups using parentheses. Each group captures one of the four parts of your output, together with its leading space, so joining the groups gives your desired line. Template is matched twice so that the fourth group is taken from the second Template line rather than from the end coordinate of the first one.
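To make the joining step concrete, here is a small sketch of what re.findall returns when a pattern has several groups. It uses a variant of the pattern that anchors on Template twice, and it assumes the record layout from the question; text is a hypothetical stand-in for the file contents.

```python
import re

# One record from the question, standing in for the file contents.
text = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24\n"
    "Template        1930054  ........................  1930077\n"
    "\n"
    "Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24\n"
    "Template        1930548  ........................  1930525\n"
)

pattern = r'(>.{5,10}\.\d).*?( = \d+).*?Template.*?( \d+).*?Template.*?( \d+)'

# With more than one group, findall returns one tuple of strings per match.
matches = re.findall(pattern, text, re.DOTALL)
print(matches)

# Each group keeps its leading space, so a plain join rebuilds the line.
lines = ["".join(groups) for groups in matches]
print("\n".join(lines))
```

Because the separators are captured inside the groups, the spacing of the output is decided by the pattern itself; capturing only the digits and formatting them afterwards would be an equally valid choice.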


-1

I would suggest first thinking through the logic you need, and then writing regex rules that implement it.

Here is a useful book that I used to learn Regex. It covers the basic rules you can use to get the results you need: http://diveinto.org/python3/regular-expressions.html

1 Comment

This should be a comment.
