Python regular expression: extract text between patterns

Question

How can I get all the values between 'uniprotkb:' and '(gene name)' in the string 'str' below?

str = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'

The expected result is:

HIST1H3D
HIST1H3A
HIST1H3B
HIST1H3C
HIST1H3E
HIST1H3F
HIST1H3G
HIST1H3H
HIST1H3I
HIST1H3J

Please don't name a variable 'str'; doing so hides the built-in string class.
– Ian Clelland, Commented Oct 2, 2012 at 14:51
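To illustrate the comment above, here is a minimal sketch of what goes wrong when a name called 'str' is bound to a string. The function name shadowing_demo is hypothetical, and the assignment is kept inside a function so the built-in is only hidden locally.

```python
def shadowing_demo():
    # Binding the name 'str' hides the built-in class inside this scope.
    str = 'uniprotkb:HIST1H3D(gene name)'
    try:
        # This now tries to call a string instance, not the built-in class.
        str(42)
    except TypeError as exc:
        return type(exc).__name__
    return None


print(shadowing_demo())  # prints "TypeError"
```

Using a different name, such as sstr or s in the answers below, avoids the problem.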

Ian Clelland · Accepted Answer · 2012-10-02 14:54:55Z

8

Using re.findall(), you can get all parts of a string that match a regular expression:

>>> import re
>>> sstr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)' 
>>> re.findall(r'uniprotkb:([^(]*)\(gene name\)', sstr)
['HIST1H3D', 'HIST1H3A', 'HIST1H3B', 'HIST1H3C', 'HIST1H3E', 'HIST1H3F', 'HIST1H3G', 'HIST1H3H', 'HIST1H3I', 'HIST1H3J']
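For readers unfamiliar with the pattern, the following sketch spells out the same expression with re.VERBOSE so each part can carry a comment. The names GENE_NAME, sample, and names are illustrative, and the sample string is shortened to two entries.

```python
import re

# Same pattern as in the answer, written in verbose mode.
# In verbose mode unescaped whitespace is ignored, so the space in
# "gene name" has to be escaped.
GENE_NAME = re.compile(r"""
    uniprotkb:        # literal prefix
    ([^(]*)           # capture group: any run of characters that are not an opening parenthesis
    \(gene\ name\)    # literal suffix, with the parentheses escaped
""", re.VERBOSE)

sample = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)'

# With exactly one capture group, findall returns the captured text only.
names = GENE_NAME.findall(sample)
print(names)  # prints ['HIST1H3D', 'HIST1H3A']
```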

answered Oct 2, 2012 at 14:54

Ian Clelland

Michael · Accepted Answer · 2016-08-02 13:00:33Z

0

Here is a one-liner:

astr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
[pt.split('(')[0] for pt in astr.strip().split('uniprotkb:')][1:]

Gives:

['HIST1H3D',
 'HIST1H3A',
 'HIST1H3B',
 'HIST1H3C',
 'HIST1H3E',
 'HIST1H3F',
 'HIST1H3G',
 'HIST1H3H',
 'HIST1H3I',
 'HIST1H3J']

I don't recommend regex solutions if runtime matters.
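Whether the regex is actually slower depends on the input and the Python version, so it may be worth measuring. Below is a rough sketch using timeit; the synthetic data string, the function names with_regex and with_split, and the repeat count are assumptions chosen for illustration, and no particular result is implied.

```python
import re
import timeit

# Synthetic input in the same shape as the question's string (made-up names).
data = '|'.join(f'uniprotkb:GENE{i}(gene name)' for i in range(1000))

PATTERN = re.compile(r'uniprotkb:([^(]*)\(gene name\)')


def with_regex(s):
    # The re.findall approach, with the pattern compiled once.
    return PATTERN.findall(s)


def with_split(s):
    # The split-based one-liner from this answer.
    return [pt.split('(')[0] for pt in s.strip().split('uniprotkb:')][1:]


# Both approaches should agree before their timings are compared.
assert with_regex(data) == with_split(data)

for fn in (with_regex, with_split):
    seconds = timeit.timeit(lambda: fn(data), number=200)
    print(f'{fn.__name__}: {seconds:.4f}s')
```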

edited Aug 2, 2016 at 13:00

answered Oct 2, 2012 at 14:58

Michael

Benjamin Hodgson · Accepted Answer · 2012-10-02 15:01:45Z

-1

I wouldn't bother with a regular expression:

s = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)'  # etc

gene_names = []
for substring in s.split('|'):
    removed_first = substring.partition('uniprotkb:')[2]  # remove the first part of the substring
    removed_second = removed_first.partition('(gene name)')[0]  # remove the second part
    gene_names.append(removed_second)  # put it on the list

This should do the trick. You could even write it as a one-liner; the above is equivalent to:

gene_names = [substring.partition('uniprotkb:')[2].partition('(gene name)')[0] for substring in s.split('|')]
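One thing to be aware of with the partition approach: a field that lacks the 'uniprotkb:' prefix yields an empty string rather than being skipped. The sketch below keeps only fields where both markers were found. The function name extract_gene_names and the 'ensembl:' field in the sample are hypothetical and used only to show the behaviour.

```python
def extract_gene_names(s, prefix='uniprotkb:', suffix='(gene name)'):
    names = []
    for field in s.split('|'):
        # partition returns (before, separator, after); the separator
        # is an empty string when it was not found in the field.
        _, found_prefix, rest = field.partition(prefix)
        name, found_suffix, _ = rest.partition(suffix)
        if found_prefix and found_suffix:
            names.append(name)
    return names


# The middle field is made up to show a non-matching entry being skipped.
mixed = 'uniprotkb:HIST1H3D(gene name)|ensembl:ENSG0000(gene id)|uniprotkb:HIST1H3A(gene name)'
print(extract_gene_names(mixed))  # prints ['HIST1H3D', 'HIST1H3A']
```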

edited Oct 2, 2012 at 15:01

answered Oct 2, 2012 at 14:55

Benjamin Hodgson
