
I have some very challenging strings that I have been struggling with.
For example,

str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

Each string starts with an integer percentage, which may or may not be followed by "for", and then the name of a pokemon. Next there is either a comma (,) or an & sign, then another integer percentage (again optionally followed by "for"), and finally another pokemon name. (All names start with a capital letter.)
I want to extract the two pokemon names. For example, the results should be:

['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']

I could create a list of all pokemon and then use the in operator, but that is not the best way (in case more pokemon are added). Is it possible to extract the names using a regex?
Thanks in advance!
EDIT
As requested, I am adding my code,

str_list = [str1, str2, str3, str4, str5]

for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
print(temp_list)

I know this is not a regular expression. The expression I tried is \d+, to match the leading integer, but I have no idea how to continue from there.
EDIT2
@b_c raised a good edge case, so I am adding it here:

edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'

Expected result:

['Pikachu', 'Pika Pika Pikachu']
  • Please post the code that you tried to solve this with. Commented Jan 9, 2020 at 15:49
  • Does your regex need to support Mr. Mime, Mime Jr., Porygon2 or Type: Null? (Other pokemon names for those unfamiliar) Commented Jan 9, 2020 at 16:15

4 Answers

2

Hopefully I didn't over-engineer this, but I wanted to cover the edge cases of the slightly-more-complicated pokemon names, such as "Mr. Mime", "Farfetch'd", and "Nidoran♂" (only looking at the first 151).

The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).

For a general summary, I'm looking for:

  • 1+ digits followed by a %
  • A space or the word "for" at least once
  • (To start the capture) A starting capital letter
  • At least one of (ending the capture group):
    • a word character, a period, the male/female symbols, or an apostrophe
      • Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
    • OR a space, but only if followed by a capital letter
  • A comma, space, or ampersand, any number of times
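
The breakdown above can also be written out as a commented pattern. This is only a sketch: it uses re.VERBOSE, drops the outermost repetition group of the original pattern for readability, and the names NAME_PATTERN and find_names are made up for illustration.

```python
import re

# Commented variant of the pattern (hypothetical names, simplified outer group).
# In verbose mode a literal space must be written as [ ] to be kept.
NAME_PATTERN = re.compile(r"""
    \d+%                  # one or more digits followed by a percent sign
    (?:[ ]|for)+          # a space or the word "for", at least once
    (                     # start of the captured name
        [A-Z]             # a starting capital letter
        (?:
            [\w.♀♂']      # word character, period, gender symbol, apostrophe
          |
            [ ](?=[A-Z])  # or a space, but only if a capital letter follows
        )+
    )                     # end of the captured name
    [, &]*                # comma, space, or ampersand, any number of times
""", re.VERBOSE)


def find_names(s):
    """Return all names captured by NAME_PATTERN in the string s."""
    return NAME_PATTERN.findall(s)


print(find_names('95% for Pikachu, 92% for Mr. Mime'))
```

Writing the spaces as [ ] is the main thing to watch for here, since verbose mode otherwise ignores whitespace in the pattern.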

Unless it's changed, Python's built-in re module doesn't keep every match of a repeated capture group, only the last one (and I believe I wrote the repetition correctly), so I just used re.findall and organized the results into pairs (I replaced a couple of names from your input with the complicated ones):

import re

str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'

pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"

# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]

# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])

for pair in pairs:
    print(pair)

This then prints out:

('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
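
If one list per input string is preferred (the shape shown in the question), re.findall can be called on each string separately instead of flattening and re-pairing. A minimal sketch using the same pattern; the helper name extract_names is hypothetical, and the last input is the edge case from the question's second edit:

```python
import re

pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"


def extract_names(s):
    """Return every pokemon name the pattern captures in one string."""
    return re.findall(pattern, s)


strings = [
    '95% for Pikachu, 92% for Mr. Mime',
    '70% for Paras & 100% Arcanine',
    '100% for Pikachu, 29% Pika Pika Pikachu',
]

# One result list per input string, no pairing step needed
for s in strings:
    print(extract_names(s))
```

This also avoids mis-paired results if one of the strings happens to contain more or fewer than two names.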

Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)


2 Comments

What if I don't want to filter out ♀♂ ?? Can I use r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.']|(?: (?=[A-Z])))+))+)[, &]*" ??
You can leave those out if you're not interested in them. It looks like the pattern will still grab the name (whether it's listed 1st or 2nd in your strings), but leave off the gender marker. An important side effect of that is that it will stop the match when it hits those characters, so if there's anything else following them, they will also be ignored.
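
To illustrate the side effect described in the reply above, here is a small sketch that runs the shortened pattern (without ♀♂ in the character class) on one of the sample strings. The variable names are illustrative only.

```python
import re

# Same pattern as in the answer, minus the gender symbols in the character class
pattern_no_gender = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.']|(?: (?=[A-Z])))+))+)[, &]*"

names = re.findall(pattern_no_gender, '30% Metopod & 99% Nidoran♂')

# The name is still found, but the capture stops right before the ♂ symbol
print(names)
```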
1

Since there seems to be no other upper-case letter in your strings, you can simply use [A-Z]\w+ as the regex. See regex101.

Code:

import re

str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

str_list = [str1, str2, str3, str4, str5]
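regex = re.compile(r'[A-Z]\w+')  # raw string so \w is not read as a string escape
pokemon_list = []
for x in str_list:
    # findall returns every capitalized word in the string
    pokemon_list.append(regex.findall(x))
print(pokemon_list)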

Output:

[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]

3 Comments

There are pokemon names that include more than just the alphabet, not sure if the op expects to match those too (although I can't think of any that don't start with a capital letter)
I'm not that good with pokemon, so I did not know that, but if this is the case the example from OP is rather badly chosen...
It might be better to use ([A-Z])[^,&\n]+ since the OP explicitly mentions they terminate with a comma or & (and then rstrip any trailing space)
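
The suggestion in the last comment could look roughly like the sketch below. One caveat: with a capture group around [A-Z], re.findall would return only the captured first letters, so this version leaves the parentheses out. The helper name names_until_separator is made up for this example.

```python
import re

# A capital letter, then everything up to the next comma, ampersand, or newline
NAME = re.compile(r"[A-Z][^,&\n]+")


def names_until_separator(s):
    """Return the matched names with any trailing whitespace stripped."""
    return [match.rstrip() for match in NAME.findall(s)]


print(names_until_separator('70% for Paras & 100% Arcanine'))
print(names_until_separator('100% for Pikachu, 29% Pika Pika Pikachu'))
```

This relies on the names being the only text that starts with a capital letter and on every name ending at a comma, an ampersand, or the end of the string.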
0

An alternate method if you don't want to use regex and you don't want to rely on capitalization:

def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    for word in wordList:
        if not set('[~!@#$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
            pokeList.append(word.replace(',', ''))
    return pokeList

This won't add words that contain special characters, and it won't add words that contain "for". It then removes commas from the words it keeps.

Printing pokeFinder(str3) returns ['Diglett', 'Dugtrio']


EDIT: In light of the fact that there are apparently pokemon with two-word names and special characters, I made this slightly more convoluted version of the above code:

def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    prevWasWord = False
    for word in wordList:
        if not set('%&').intersection(word) and 'for' not in word:
            clnWord = word.replace(',', '')
            if prevWasWord:  # consecutive name words belong to the same pokemon
                pokeList[-1] = pokeList[-1] + ' ' + clnWord
            else:
                pokeList.append(clnWord)
                prevWasWord = True
        else:
            prevWasWord = False
    return pokeList

As long as the rules OP set remain constant, this should always work: each word that directly follows another name word is appended to the previous pokemon, so names of two or more words stay together.

So printing a string of '30% for Mr. Mime & 20% for Type: Null' gets ['Mr. Mime', 'Type: Null']

0

Use a positive lookbehind; this will work regardless of capitalization.

(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+

EDIT: Changed it to work in Python.

4 Comments

Does not work: "+ A quantifier inside a lookbehind makes it non-fixed width". Python needs fixed-width lookbehinds. See regex101. And as OP stated, "All start with capital case alphabet".
Could you clarify what you mean? This works for me @LeoE
Could you show the code? I get an error if I try to run your regex in the code I posted in my answer: raise error("look-behind requires fixed-width pattern") sre_constants.error: look-behind requires fixed-width pattern
@LeoE You're right, I was using Regxr to test it, try that one.
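
As the comments point out, Python's re module only accepts fixed-width lookbehinds. One way around that, sketched here as an alternative rather than as this answer's method, is to match the percentage prefix normally and capture the name in a group. The optional "for" and the name lookbehind_free are assumptions added for the example.

```python
import re

# Match the "NN% " or "NN% for " prefix normally and capture only the name,
# so no lookbehind is needed and the number may have any digit count.
lookbehind_free = re.compile(r"\d+%(?: for)? ([A-Za-z]+)")

for s in ['70% for Paras & 100% Arcanine', '99% Diglett, 40% Dugtrio']:
    # With a single capture group, findall returns just the captured names
    print(lookbehind_free.findall(s))
```

Like the lookbehind version, this only captures plain letters, so a name such as "Mr. Mime" would be cut off at the period.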
