How to extract specific strings using Python Regex

Question

I have some very challenging strings that I have been struggling with.
For example,

str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

Each string starts with an integer percentage, optionally followed by "for", and then the name of a pokemon. Next comes a comma (,) or an & sign, then another integer percentage, and finally another pokemon name. (All names start with a capital letter.)
I want to extract the two pokemon names. For example, the results should be:

['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']

I could create a list of all pokemon and then check membership with the in operator, but that is not the best way (in case more pokemon are added). Is it possible to extract the names using regex?
Thanks in advance!
EDIT
As requested, I am adding my code,

str_list = [str1, str2, str3, str4, str5]

for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
print(temp_list)

I know it is not a regex. The expression I tried is \d+, to extract the leading integer, but I have no idea how to go on from there.
EDIT2
@b_c raised a good edge case, so I am adding it here:

edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'

result

['Pikachu', 'Pika Pika Pikachu']

Does your regex need to support Mr. Mime, Mime Jr., Porygon2 or Type: Null? (Other pokemon names for those unfamiliar)
– Sayse, Commented Jan 9, 2020 at 16:15

b_c · Accepted Answer · 2020-01-09 16:33:04Z

2

Hopefully I didn't over engineer this, but I wanted to cover the edge cases of the slightly-more-complicated named pokemon, such as "Mr. Mime", "Farfetch'd", and/or "Nidoran♂" (only looking at the first 151).

The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).

For a general summary, I'm looking for:

1+ digits followed by a %
A space or the word "for" at least once
(To start the capture) A starting capital letter
At least one of (ending the capture group):
- a word character, a period, the male/female symbols, or an apostrophe
  - Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
- OR a space, but only if followed by a capital letter
A comma, space, or ampersand, any number of times
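
The breakdown above can be sketched as a commented pattern using re.VERBOSE. This is only an illustration: the name POKEMON_PATTERN is mine, and I have dropped the outer non-capturing wrappers of the original pattern on the assumption that they do not change the result for inputs like the ones in the question.

```python
import re

# Verbose-mode sketch of the pattern described above.
# Literal spaces are written as [ ] because verbose mode ignores bare whitespace.
POKEMON_PATTERN = re.compile(r"""
    \d+%                  # one or more digits followed by a percent sign
    (?:[ ]|for)+          # a space or the word "for", at least once
    (                     # start of the capture group (the name)
        [A-Z]             # a starting capital letter
        (?:
            [\w.♀♂']      # a word character, period, gender symbol, or apostrophe
            |
            [ ](?=[A-Z])  # OR a space, but only if a capital letter follows
        )+
    )
    [, &]*                # trailing commas, spaces, or ampersands
""", re.VERBOSE)

names = POKEMON_PATTERN.findall("95% for Pikachu, 92% for Mr. Mime")
print(names)  # ['Pikachu', 'Mr. Mime']
```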

Unless it's changed, Python's builtin re module doesn't support repeated capture groups (which I believe I did correctly), so I just used re.findall and organized them into pairs (I replaced a couple names from your input with the complicated ones):

import re

str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'

pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"

# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]

# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])

for pair in pairs:
    print(pair)

This then prints out:

('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
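
Pairing with zip assumes that every string yields exactly two names. As a hedged alternative, the sketch below keeps one list of matches per input string, so a string with one name (or three) does not shift the pairs that follow. The names PATTERN and strings, and the single-name sample string, are my own additions for illustration.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*")

strings = [
    '95% for Pikachu, 92% for Mr. Mime',
    '100% for Pikachu, 29% Pika Pika Pikachu',  # edge case from the question
    '50% for Mew',                              # hypothetical single-name input
]

# One list of names per input string instead of one flat list.
names_per_string = [PATTERN.findall(s) for s in strings]

for names in names_per_string:
    print(names)
```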

Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)

answered Jan 9, 2020 at 16:33

b_c


2 Comments

jayko03 Over a year ago

What if I don't want to filter out ♀♂ ?? Can I use r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.']|(?: (?=[A-Z])))+))+)[, &]*" ??

b_c Over a year ago

You can leave those out if you're not interested in them. It looks like the pattern will still grab the name (whether it's listed 1st or 2nd in your strings), but leave off the gender marker. An important side effect of that is that it will stop the match when it hits those characters, so if there's anything else following them, they will also be ignored.
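
A small sketch of the behaviour described in this comment, assuming the trimmed pattern from the question above (the variable names are mine): the name is still captured, but the match stops at the gender symbol.

```python
import re

# Pattern from the comment above, without the ♀♂ symbols in the character class.
pattern_no_symbols = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.']|(?: (?=[A-Z])))+))+)[, &]*"

found = re.findall(pattern_no_symbols, '30% Metopod & 99% Nidoran♂')
print(found)  # ['Metopod', 'Nidoran']
```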

LeoE · Accepted Answer · 2020-01-09 15:54:54Z

1

Since there seems to be no other upper-case letter in your strings you can simply use [A-Z]\w+ as regex. See regex101

Code:

import re

str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

str_list = [str1, str2, str3, str4, str5]
# Raw string, so \w reaches the regex engine instead of being read as a string escape
regex = re.compile(r'[A-Z]\w+')
pokemon_list = []
for x in str_list:
    pokemon_list.append(regex.findall(x))
print(pokemon_list)

Output:

[['Pikachu', 'Sandshrew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]

answered Jan 9, 2020 at 15:54

LeoE


3 Comments

Sayse Over a year ago

There are pokemon names that include more than just the alphabet, not sure if the op expects to match those too (although I can't think of any that don't start with a capital letter)

LeoE Over a year ago

I'm not that good with pokemon, so I did not know that, but if this is the case the example from OP is rather badly chosen...

Sayse Over a year ago

It might be better to use ([A-Z])[^,&\n]+ since the OP explicitly mentions they terminate with a comma or & (and then rstrip any trailing space)
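
To illustrate the difference discussed in these comments, here is a rough sketch comparing the two patterns on a string with multi-word names. The sample text is made up from names mentioned in this thread. Note that I removed the parentheses around [A-Z] from the suggested pattern, because re.findall returns only the captured group when the pattern contains one, which would give just the first letters.

```python
import re

text = "95% for Mr. Mime, 29% Pika Pika Pikachu"

# The simple pattern splits multi-word names and drops the period.
simple = re.findall(r"[A-Z]\w+", text)
print(simple)  # ['Mr', 'Mime', 'Pika', 'Pika', 'Pikachu']

# Match from a capital letter up to the next comma, ampersand, or newline,
# then strip the whitespace left in front of an "&" separator.
until_separator = [m.strip() for m in re.findall(r"[A-Z][^,&\n]+", text)]
print(until_separator)  # ['Mr. Mime', 'Pika Pika Pikachu']
```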

Matt M · Accepted Answer · 2020-01-09 16:37:58Z

An alternate method if you don't want to use regex and you don't want to rely on capitalization:

def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    for word in wordList:
        if not set('[~!@#$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
            pokeList.append(word.replace(',', ''))
    return pokeList

This skips words that contain special characters, as well as any word that contains "for". It then strips commas from the words it keeps.

Calling pokeFinder(str3) returns ['Diglett', 'Dugtrio']

EDIT In light of the fact that there are apparently Pokemon with two words and special chars, I made this slightly more convoluted version of the above code

def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    prevWasWord = False
    for word in wordList:
        if not set('%&').intersection(word) and 'for' not in word:
            clnWord = word.replace(',', '')
            if prevWasWord:  # consecutive name words belong to the same pokemon
                pokeList[-1] = pokeList[-1] + ' ' + clnWord
            else:
                pokeList.append(clnWord)
                prevWasWord = True
        else:
            prevWasWord = False
    return pokeList

If there's no "three word" pokemon, and the rules OP set remain constant, this should always work. 2 poke matches in a row adds to the previous pokemon.

So printing a string of '30% for Mr. Mime & 20% for Type: Null' gets ['Mr. Mime', 'Type: Null']
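
One thing to watch in the code above: 'for' not in word is a substring test, so a name that happens to contain "for" (Castform, for example) would be skipped. A minimal sketch of a variant that compares whole words instead follows. The function name poke_finder and its variable names are mine, and it otherwise follows the same rules as the version above.

```python
def poke_finder(text):
    """Collect names by skipping percentages, '&' and the whole word 'for'."""
    names = []
    prev_was_name = False
    for word in text.split():
        # Whole-word comparison, so names that merely contain "for" are kept
        is_separator = word == 'for' or '%' in word or '&' in word
        if is_separator:
            prev_was_name = False
            continue
        cleaned = word.rstrip(',')
        if prev_was_name:
            # Consecutive name words belong to the same pokemon
            names[-1] += ' ' + cleaned
        else:
            names.append(cleaned)
        prev_was_name = True
    return names


print(poke_finder('30% for Mr. Mime & 20% for Castform'))
# ['Mr. Mime', 'Castform']
```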

Libra · Accepted Answer · 2020-01-09 16:46:35Z

0

Use a positive lookbehind; this works regardless of capitalization.

(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+

EDIT: Changed it to work in Python.
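
A short sketch of why the pattern is split into two alternatives, assuming the standard re module: Python rejects a lookbehind whose width can vary, so each alternative uses a fixed number of digits. The variable-width pattern below is a hypothetical example of mine, not necessarily what was posted before the edit. The sketch also shows that this pattern only finds names preceded by "for".

```python
import re

# A quantifier inside the lookbehind makes its width variable, which re rejects.
try:
    re.compile(r"(?<=\d+% for )[A-Za-z]+")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print(exc)

# Each alternative below has a lookbehind of fixed width, so it compiles.
fixed = re.compile(r"(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+")

with_for = fixed.findall('95% for Pikachu, 92% for Sandshrew')
without_for = fixed.findall('99% Diglett, 40% Dugtrio')
print(with_for)     # ['Pikachu', 'Sandshrew']
print(without_for)  # []
```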

edited Jan 9, 2020 at 16:46

answered Jan 9, 2020 at 15:59

Libra


4 Comments

LeoE Over a year ago

Does not work: "+ A quantifier inside a lookbehind makes it non-fixed width". Python needs fixed-width lookbehinds. See regex101. And as OP stated, "All start with capital case alphabet".

Libra Over a year ago

Could you clarify what you mean? This works for me @LeoE

LeoE Over a year ago

Could you show the code? If I try to run your regex in the code I posted in my answer, I get an error: raise error("look-behind requires fixed-width pattern") sre_constants.error: look-behind requires fixed-width pattern

Libra Over a year ago

@LeoE You're right, I was using Regxr to test it, try that one.
