Regex Python [python-2.7]

Question

I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):

1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.

I can't seem to figure out a regex that will work to match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way but I cannot think of how to use them.

Edit: Some real data:

1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.

2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.

3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.

4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.

Sorry it took so long for me to respond. Some real input data: 1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America. 2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America. 3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last. 4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.
– Nick Vitha, Commented Sep 21, 2015 at 22:59

Ilya Peterov · Accepted Answer · 2015-09-21 23:26:53Z

4

You probably don't need a regex for this one. If the order and the number of the words you need are always the same, you can just split each line into a list of substrings and take the third (genus) and the fourth (species) elements of that list. The code would look something like this:

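# Assumes the common name is a single word, e.g. "1. =Grebe= Genus Species ...".
with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        words = line.split()
        if len(words) < 4:
            continue  # skip blank or too-short lines
        genus, species = words[2], words[3]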

It just looks a little more "pythonic" to me.

If the common name can consist of multiple words, the code above will return an incorrect result. To get the right result in that case too, you can use this code:

with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        parts = line.split('=')
        if len(parts) < 3:
            continue  # skip blank lines and lines without "=Common Name="
        # Index 2 is the text after the second "="; change it if other "=" signs can come first.
        words = parts[2].split()
        genus, species = words[0], words[1]
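
As a self-contained illustration of the same split-on-"=" idea, here is a minimal sketch run against two of the sample lines from the question. The helper name `genus_species` is hypothetical, and stripping the trailing period is an assumption based on the sample data, where the scientific name ends with a ".".

```python
def genus_species(line):
    """Return (genus, species) from '1. =Common name.= GENUS SPECIES. ...'.

    Hypothetical helper: returns None when the line does not have the
    expected '=Common name=' part (for example, a blank line).
    """
    parts = line.split('=')
    if len(parts) < 3:
        return None
    words = parts[2].split()
    if len(words) < 2:
        return None
    # Assumption: drop the period that ends the scientific name.
    return words[0].rstrip('.'), words[1].rstrip('.')


sample = [
    "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north.",
    "",
    "4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident.",
]
results = [genus_species(line) for line in sample]
for result in results:
    print(result)
```

Note that for a three-part name such as the last sample line, this keeps only the first two words and ignores the subspecies.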

edited Sep 21, 2015 at 23:26

answered Sep 21, 2015 at 22:23

Ilya Peterov


3 Comments

red Over a year ago

Agreed. Regex is overkill for this.

m.cekiera Over a year ago

But I suppose the common name could be more than one word (I assume this is about living species). However, using the first = as a delimiter and then splitting by words could work with such input.

Ilya Peterov Over a year ago

@m.cekiera Yes, I didn't think of that, I'll edit my answer, thank you.

m.cekiera · Accepted Answer · 2015-09-21 22:21:04Z

1

If it is enough to capture the words in groups (and you don't want a direct match), you can try:

(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))

The desired values will be in the groups named genus and species. The whole regex is a positive lookahead, so it matches a zero-width position at the beginning of each entry, but it still captures content into the groups.

(?=\d\.\s*=[^=]+=\s - a digit and a dot, followed by some content between equals signs and a whitespace character,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - capture the first word into the genus group and the second word into the species group.
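
To show how the named groups can be read back in Python, here is a minimal sketch using two of the sample entries from the question. The `re.UNICODE` flag is my addition: on Python 2.7 it is what lets `\w` match a letter such as "Æ" (on Python 3 this is already the default for text patterns). The variable names are only illustrative.

```python
import re

# The lookahead pattern from above, compiled with re.UNICODE so that \w
# also matches non-ASCII letters on Python 2.7.
pattern = re.compile(
    r'(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))',
    re.UNICODE,
)

# Sample entries from the question; "\u00c6" is the letter "Æ".
text = (
    u"1. =Western grebe.= \u00c6CHMOPHORUS OCCIDENTALIS. Rare migrant.\n"
    u"\n"
    u"4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident.\n"
)

# finditer visits every position where the lookahead succeeds; the match
# itself is empty, so the values are read from the named groups.
pairs = [(m.group('genus'), m.group('species')) for m in pattern.finditer(text)]
for genus, species in pairs:
    print(genus + u" " + species)
```

Because `\w` does not match ".", the period after the species name is left out of the captured group.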

answered Sep 21, 2015 at 22:21

m.cekiera


rcbevans · Accepted Answer · 2015-09-21 22:25:14Z

You can try something like:

import re

txt='1. =Common Name= Genus Species some other words that I don\'t want.'

re1='.*?'   # Non-greedy match on filler
re2='(?:[a-z][a-z]+)'   # Uninteresting: word
re3='.*?'   # Non-greedy match on filler
re4='(?:[a-z][a-z]+)'   # Uninteresting: word
re5='.*?'   # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?'   # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2

rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print("("+word1+")"+"("+word2+")"+"\n")

For the test input shown in txt, this will print

(Genus)(Species)
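
One caveat worth checking: the pattern skips exactly two words before capturing, so it relies on the common name being two words long. The sketch below rebuilds the same pattern in a compact form and compares it with a hypothetical variant (`after_equals`, my own naming and not part of the generated code) that skips whatever sits between the equals signs instead.

```python
import re

# Same idea as the pattern above: skip two words, then capture the next two.
word = r'[a-z][a-z]+'
skip_two = re.compile(
    r'.*?(?:%s).*?(?:%s).*?(%s).*?(%s)' % (word, word, word, word),
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical variant: skip the "=...=" part, then capture two words.
after_equals = re.compile(r'=[^=]+=\s*(\w+)\s+(\w+)', re.UNICODE)

two_words = "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant."
three_words = "4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident."

print(skip_two.search(two_words).groups())
print(skip_two.search(three_words).groups())
print(after_equals.search(three_words).groups())
```

With the three-word common name, the word-counting pattern captures the last word of the common name and the genus, while the variant anchored on the equals signs still captures the genus and species.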

You can use this awesome site to help build regexes like this!

Hope this helps
