How to use Python regex to retrieve specific terms in a pattern

Question

My input looks like this:

Forward primer
CGAAGCCTGGGGTGCCCGCGATTT Plus 24 1 24 71.81 66.67 4.00 2.00 

Reverse primer
AAATCGGTCCCATCACCTTCTTAT Minus 24 420 397 59.83 41.67 5.00 2.00 

Product length
420 


Products on potentially unintended templates


>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


product length = 2946
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1927603  .......C....C..T..T...G.  1927626

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


>CP046728.2 Mycobacterium tuberculosis strain TCDC11 chromosome, complete genome product length = 420
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2150761  ........................  2150784

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2151180  ........................  2151157


product length = 2595
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2148586  .......C....C..T..T...G.  2148609

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2151180  ........................  2151157


>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621

What I need is

>CP049108.1 = 495   1930054 1930548 
>CP046728.2 = 420   2150761 2151180
>CP047258.1 = 345   2166300 2166644

I am a microbiologist and a Python beginner. I tried:

import re
file = open(r"C:\Users\Lab\Desktop\amplicons\ETRA", "r")
handle = file.read()
file.close()

pattern1 = re.compile(r'>.{5,10}\.\d')
matches1 = pattern1.finditer(handle)

for match1 in matches1:
    print(match1.group(0))

But I also need specific terms that come after my accession number (for example, the accession number is >CP049108.1). I will adapt what I learn here to my other work too.

I appreciate your help; thank you in advance.

Chase · Accepted Answer · 2020-03-06 10:34:38Z

3

Here's what I came up with: >([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)

Let's look at an example with only one set of data. You can feed in the whole data you pasted and still get every result: Python has no global flag to set, because re.finditer and re.findall already return all non-overlapping matches.

>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525

The first group will be - CP049108.1

The second group will be - 495

The third group will be - 1930054

The fourth (and final) group will be - 1930548

Of course, you can now restructure the data however you want. If you're reading the data from a text file, you can use this code snippet:

import re

with open('test.txt', 'r') as file:
    content = file.read()

pattern = re.compile(r'>([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)')

for match in pattern.finditer(content):
    output = '>{} = {} {} {}'.format(match.group(1), match.group(2), match.group(3), match.group(4))
    print(output)

If I feed exactly the data set you provided into test.txt, I get this output:

>CP049108.1 = 495 1930054 1930548
>CP046728.2 = 420 2150761 2151180
>CP047258.1 = 345 2166300 2166644

Regex Explanation

>(\w+\.\d+) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)

This is the same pattern with the first group written more simply as \w+\.\d+; it captures the same text here.

Let's analyze the first line first: >(\w+\.\d+) .+= (\d+)\n

First, \w+ matches CP049108 and stops at the . (dot). Then \.\d+ matches the dot and the digits after it, in this case 1. Together they are captured as CP049108.1 in a single capture group.

Next, .+= skips ahead to the last = on the line, and (\d+) grabs the digits right after it, in this case 495. The \n then moves to the next line.

Time for the second line: .+\n

The second line is simply skipped.

Now, the third line: .*?(\d+).+\n{2}

It skips everything up to the first run of digits, captures those, and then skips past the rest of the line and the blank line after it (two newlines). In this case the result is 1930054.

Now, the fourth line: .+\n

This line is also skipped.

Finally, the last line: .*?(\d+)

This works exactly the same as the third line; the result is 1930548.
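
As a rough sketch of the walkthrough above, the same regex can be written with named groups so each captured value is labelled. The group names (accession, length, fwd_start, rev_start) and the SAMPLE string are assumptions made for illustration; the sample is the first record from the question plus its header-less second hit, which the pattern skips because it has no > line.

```python
import re

# First record from the question, followed by a header-less hit.
SAMPLE = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24\n"
    "Template        1930054  ........................  1930077\n"
    "\n"
    "Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24\n"
    "Template        1930548  ........................  1930525\n"
    "\n"
    "\n"
    "product length = 2946\n"
    "Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24\n"
    "Template        1927603  .......C....C..T..T...G.  1927626\n"
    "\n"
    "Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24\n"
    "Template        1930548  ........................  1930525\n"
)

# Same regex as in the explanation; only the group names are new.
PATTERN = re.compile(
    r">(?P<accession>\w+\.\d+) .+= (?P<length>\d+)\n"  # header line
    r".+\n"                                             # forward primer line, skipped
    r".*?(?P<fwd_start>\d+).+\n{2}"                     # first number on the Template line
    r".+\n"                                             # reverse primer line, skipped
    r".*?(?P<rev_start>\d+)"                            # first number on the second Template line
)

# One dict per matched record; the header-less hit produces no match.
records = [m.groupdict() for m in PATTERN.finditer(SAMPLE)]
for rec in records:
    print(">{accession} = {length} {fwd_start} {rev_start}".format(**rec))
```

Named groups make it harder to mix up group 3 and group 4 when the pattern is adapted to other report layouts.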

edited Mar 6, 2020 at 10:34

answered Mar 6, 2020 at 5:21

Chase

2 Comments

Chase Over a year ago

@ChooseelBunsuwansakul Welcome to Stack Overflow! You should mark an answer as accepted by clicking the tick mark next to it. This ensures that the next person who finds this question knows immediately what worked :)

The fourth bird Over a year ago

In this part >([\w\d]*?\.\d*?) the \w already matches digits, so \d is redundant in the character class. You don't have to make the quantifiers non-greedy either, as \w and \d cannot cross the dot or the space boundary. You could use >(\w+\.\d+)
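
A small sketch of the point made in this comment, assuming only the header line from the question: both versions of the first group capture the same accession, and \w on its own already accepts a digit.

```python
import re

# Header line copied from the question's data.
header = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495"
)

# Original group: lazy quantifiers, with \d repeated inside the class.
original = re.match(r">([\w\d]*?\.\d*?) ", header)
# Simplified group suggested in the comment.
simplified = re.match(r">(\w+\.\d+)", header)

print(original.group(1), simplified.group(1))

# \w covers digits, so [\w\d] is equivalent to \w.
digit_is_word_char = re.fullmatch(r"\w", "7") is not None
```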

Cary Swoveland · Accepted Answer · 2020-03-06 18:27:17Z

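You can match the following regular expression and then extract the contents of its four capture groups. It needs the global, multiline and single-line (dot matches newline) modes, written g, m and s in PCRE-style tools; in Python that corresponds to re.finditer with re.MULTILINE | re.DOTALL.

^(>CP\d{6}\.\d).+?\bproduct +length += +(\d+).*?^Template +(\d+).*?^Template +(\d+)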

This can be made self-documenting by employing Python's VERBOSE (aka X) flag.

This is derived from Perl's free-spacing mode, which I'll use because I'm not familiar with Python. Readers who don't know Perl will be able to follow this just fine.

/
^                        # match beginning of line        
(>CP\d{6}\.\d)           # match '>CP', 6 digits, '.', 1 digit in cap group 1  
.+?                      # match 1+ characters, lazily (`?`)
\b                       # match a word break               
product\ +length\ +=\ +  # match 'product', 1+ spaces, 'length', 1+ spaces,
                         #   '=', 1+ spaces 
(\d+)                    # match 1+ digits in cap group 2
.*?                      # match 0+ characters, lazily (`?`)
^Template\ +             # match beginning of line, 'Template', 1+ spaces 
(\d+)                    # match 1+ digits in cap group 3
.*?                      # match 0+ characters, lazily (`?`)
^Template\ +             # match beginning of line, 'Template', 1+ spaces 
(\d+)                    # match 1+ digits in cap group 4                
/xgms                    # free-spacing, global, multiline, single-line modes

The meanings of the different modes are given at the link. In free-spacing mode, non-escaped spaces outside of character classes are removed before the regex is parsed. Spaces that are part of the expression, such as those between "product" and "length", must therefore be protected. I've chosen to escape them here, but other options are to put each space in a character class ([ ]), use the Unicode expression \p{Space} or the POSIX class [[:space:]] or, if appropriate, \s (a whitespace character).
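
For readers who want the Python form, here is a minimal sketch of the same expression using re.VERBOSE together with re.MULTILINE and re.DOTALL. The SAMPLE string is a shortened copy of the question's data and is only there to make the block runnable. Note that >CP\d{6}\.\d assumes accessions of the form CP plus six digits and a one-digit version, as in the question.

```python
import re

PATTERN = re.compile(
    r"""
    ^(>CP\d{6}\.\d)            # group 1: '>CP', 6 digits, '.', 1 digit
    .+?                        # 1+ characters, lazily
    \bproduct\ +length\ +=\ +  # 'product length = ' with flexible spacing
    (\d+)                      # group 2: product length
    .*?                        # 0+ characters, lazily (may span lines)
    ^Template\ +(\d+)          # group 3: first number on the first Template line
    .*?                        # 0+ characters, lazily (may span lines)
    ^Template\ +(\d+)          # group 4: first number on the second Template line
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

# Two records from the question; the first also has a header-less hit.
SAMPLE = """\
>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


product length = 2946
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1927603  .......C....C..T..T...G.  1927626

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621
"""

# finditer plays the role of the "global" mode: it yields every match.
rows = [m.groups() for m in PATTERN.finditer(SAMPLE)]
for row in rows:
    print("{} = {} {} {}".format(*row))
```

Group 1 already includes the leading >, so the format string does not add one.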

alec · Accepted Answer · 2020-03-06 05:16:23Z

0

Here's one way to do it. The regex search is rather dependent on the text being consistent with the file you provided.

rows = re.findall(r'(>.{5,10}).*length( = \d*).*\s*.*\s*Template\s*(   \d*).*\s*.*\s*Template\s*( \d*)', handle)
print('\n'.join(''.join(x) for x in rows))

answered Mar 6, 2020 at 5:16

alec

jawad-khan · Accepted Answer · 2020-03-06 05:29:58Z

0

This is what you need:

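# The last group is anchored on the second Template line so that it captures
# the reverse primer's start rather than the end of the first Template line.
matches = re.findall(r'(>.{5,10}\.\d).*?( = \d+).*?Template.*?( \d+).+?Template.*?( \d+)', handle, re.DOTALL)
print("\n".join("".join(x) for x in matches))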

As you can see, four groups are declared in the regex using round brackets. Each group captures one of the four parts of your output, together with the space that precedes it. Joining the groups of each match gives your desired output.
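
A minimal sketch of how this works: when a pattern has several groups, re.findall returns one tuple of group strings per match, and joining each tuple rebuilds an output line. The pattern below is a variant that anchors its fourth group on the second Template line, and SAMPLE is one record copied from the question; both are assumptions made to keep the block self-contained.

```python
import re

# One record from the question's data.
SAMPLE = """\
>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621
"""

# Four groups; groups 2 to 4 each capture their leading separator as well.
matches = re.findall(
    r"(>.{5,10}\.\d).*?( = \d+).*?Template.*?( \d+).+?Template.*?( \d+)",
    SAMPLE,
    re.DOTALL,  # let '.' cross line breaks
)

# matches is a list of 4-tuples; joining a tuple yields one output line.
lines = ["".join(groups) for groups in matches]
print("\n".join(lines))
```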

answered Mar 6, 2020 at 5:29

jawad-khan

cyneo · Accepted Answer · 2020-03-06 05:20:16Z

-1

I would suggest thinking through the logic you need first and then writing rules to implement it.

Here is a useful book that I used to learn Regex. It covers the basic rules you can use to get the results you need: http://diveinto.org/python3/regular-expressions.html

answered Mar 6, 2020 at 5:20

cyneo

1 Comment

Cary Swoveland Over a year ago

This should be a comment.
