
I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):

1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.

I can't seem to figure out a regex that will work to match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way but I cannot think of how to use them.

Edit: Some real data:

1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.

2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.

3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.

4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.
  • What do you want as output for your example? Commented Sep 21, 2015 at 22:16
  • Genus Species (beginlocation, endlocation) Commented Sep 21, 2015 at 22:19
  • Can you show us your attempt at solving this problem? Commented Sep 21, 2015 at 22:20
  • Can you show us some real input data? Commented Sep 21, 2015 at 22:20
  • Sorry it took so long for me to respond. Some real input data: 1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America. 2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America. 3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last. 4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species. Commented Sep 21, 2015 at 22:59

3 Answers

4

You probably don't need a regex for this one. If the order and the number of the words are always the same, you can split each line into a list of substrings and take the third (genus) and fourth (species) elements of that list. The code would look something like this:

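with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        words = line.split()
        if len(words) < 4:
            continue  # skip blank or short lines
        # words[0] is the entry number, words[1] the one-word common name
        genus, species = words[2], words[3]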

It just looks a little more "pythonic" to me.

If the common name can consist of multiple words, the code above will return an incorrect result. To handle that case as well, split on the equals signs first:

with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        parts = line.split('=')
        if len(parts) < 3:
            continue  # no =Common Name= field on this line
        # parts: [text before first "=", common name, rest]; index 2 holds while the only "=" are the pair around the name
        words = parts[2].split()
        genus, species = words[0], words[1]
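
Applied to the real data from the question, the split-based approach also has to cope with the blank lines between entries and the period that follows each name. A minimal sketch, assuming the format shown in the question; the helper name genus_species is made up for illustration:

```python
def genus_species(line):
    """Return (genus, species) for one entry, or None if the line does not fit.

    Hypothetical helper: assumes the '1. =Common name.= GENUS SPECIES. ...' layout.
    """
    # Split on the first two "=" only: [number, common name, everything after].
    parts = line.split('=', 2)
    if len(parts) < 3:
        return None  # blank line or no =Common Name= field
    words = parts[2].split()
    if len(words) < 2:
        return None
    # Drop the punctuation that trails each name in the real data.
    return words[0].rstrip('.,;'), words[1].rstrip('.,;')


sample = [
    "1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species.",
    "",
    "4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident.",
]
for line in sample:
    print(genus_species(line))
```

For the four-word entry this keeps only the first two words after the common name, so the subspecies (CALIFORNICUS) is dropped.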

3 Comments

Agreed. Regex is overkill for this.
But I suppose the common name could be more than one word (I assume this is about living species). However, splitting on = first and then splitting into words could work with such input.
@m.cekiera Yes, I didn't think of that, I'll edit my answer, thank you.
1

If it is enough to capture the words in groups (and you don't need them to be the direct match), you can try:

(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))


The desired values will be in the groups genus and species. The whole regex is a positive lookahead, so it matches a zero-width position at the start of the entry, but it still captures content into the groups.

  • (?=\d\.\s*=[^=]+=\s - a digit and a period, followed by some content between equals signs and a whitespace character,
  • (?:(?P<genus>\w+)\s(?P<species>\w+))) - captures the first word into the genus group and the second word into the species group.
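
As a rough sketch of how this pattern could be used from Python, the snippet below reads the named groups and their positions, since the question asks for (beginlocation, endlocation). The function name find_names and the sample text are assumptions for illustration; in Python 3, \w is Unicode-aware for str patterns, so it also matches the Æ in ÆCHMOPHORUS.

```python
import re

# The lookahead pattern from above, compiled once.
PATTERN = re.compile(
    r'(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))'
)


def find_names(text):
    """Return (genus, species, begin, end) for every entry found in text.

    Hypothetical helper; begin/end are offsets into text spanning
    'GENUS SPECIES'.
    """
    results = []
    for m in PATTERN.finditer(text):
        results.append(
            (m.group('genus'), m.group('species'),
             m.start('genus'), m.end('species'))
        )
    return results


text = (
    "1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant.\n"
    "\n"
    "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant.\n"
)
for genus, species, begin, end in find_names(text):
    print(genus, species, (begin, end))
```

The match itself is empty because the whole expression is a lookahead, so the positions have to come from the groups (m.start('genus'), m.end('species')) rather than from m.span().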


0

You can try something like:

import re

txt='1. =Common Name= Genus Species some other words that I don\'t want.'

re1='.*?'   # Non-greedy match on filler
re2='(?:[a-z][a-z]+)'   # Uninteresting: word
re3='.*?'   # Non-greedy match on filler
re4='(?:[a-z][a-z]+)'   # Uninteresting: word
re5='.*?'   # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?'   # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2

rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print("(" + word1 + ")" + "(" + word2 + ")" + "\n")

In your test input as shown in txt, this will print

(Genus)(Species)
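
One thing worth checking before relying on this pattern is how it behaves on the real data from the question. It skips a fixed number of words (two) before capturing, so it assumes a two-word common name. The sketch below rebuilds the same expression and runs it on two of the real entries; the helper name first_two_captures is made up for illustration.

```python
import re

# Same building blocks as above: two skipped words, then two captured words.
word = r'(?:[a-z][a-z]+)'
pattern = re.compile(
    r'.*?' + word + r'.*?' + word + r'.*?(' + word + r').*?(' + word + r')',
    re.IGNORECASE | re.DOTALL,
)


def first_two_captures(line):
    """Return the two captured words, or None if the pattern does not match."""
    m = pattern.search(line)
    return m.groups() if m else None


two_word = "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last."
three_word = "4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident."

print(first_two_captures(two_word))    # ('COLYMBUS', 'AURITUS')
print(first_two_captures(three_word))  # ('grebe', 'COLYMBUS')
```

With a three-word common name the captures shift by one word, so anchoring on the equals signs is likely the more robust choice for this file.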

You can use this awesome site to help build regexes like this!

Hope this helps
