How to apply regex function to dataframe column to return value

Question

I'm trying to apply a regex function to a column of a dataframe to determine gender pronouns. This is what my dataframe looks like:

    name                                            Descrip
0  Sarah           she doesn't like this because her mum...
1  David                 he does like it because his dad...
2    Sam  they generally don't like it because their par...

This is the code I ran to create that dataframe:

import pandas as pd

list_label = ["Sarah", "David", "Sam"]
list_descriptions = ["she doesn't like this because her mum...", "he does like it because his dad...", "they generally don't like it because their parent..."]

data3 = {'name':list_label, 'Descrip':list_descriptions}
test_df = pd.DataFrame(data3)

I'm trying to determine the gender of each person by applying a regex function to the "Descrip" column. Specifically, these are the patterns I want to implement:

"male":"(he |his |him )",
"female":"(she |her |hers )",
"plural, or singular non-binary":"(they |them |their )"

The full code I've written is as follows:

This function attempts to match each pattern and returns the name of the gender pronoun mentioned most often in a row's description. Each gender pronoun has several key words in a pattern string (e.g. him, her, they). The idea is to determine max_gender, the gender associated with the pattern group mentioned most often in a Descrip value. Thus, max_gender can take on one of three values: male | female | plural, or singular non-binary. If none of the patterns match a Descrip value, "unknown" is returned instead.

import re
def get_pronouns(text):
    patterns = {
        "male":"(he |his |him )",
        "female":"(she |her |hers )",
        "plural, or singular non-binary":"(they |them |their )"
    }
    max_gender = "unknown"
    max_gender_count = 0
    for gender in patterns:
        pattern = re.compile(gender)
        mentions = re.findall(pattern, text)
        count_mentions = len(mentions)
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns)
print(test_df)

However, when I run the code, it obviously fails to determine the gender pronoun. This is shown in the following output:

    name                                            Descrip  pronoun
0  Sarah           she doesn't like this because her mum...  unknown
1  David                 he does like it because his dad...  unknown
2    Sam  they generally don't like it because their par...  unknown

Does anyone know what is wrong with my code?

Can you explain the algorithm you're using in get_pronouns()? I'll try figuring it out, there are also a few other things I would change. I'm also confused as to why the descriptions are cut off.
– AMC, Commented Nov 21, 2019 at 23:49
I've added some comments -- thanks! This is an example case, hence why the descriptions are cut off. I'm super open to substantially different approaches.
– Benjamin Png, Commented Nov 22, 2019 at 0:07
Sounds good, I'm almost done. If you are looking for different approaches, it might help if you talk a bit more about your actual situation, not just the example.
– AMC, Commented Nov 22, 2019 at 0:16
I'm doing a web-scraping exercise. Specifically, I'm extracting the entire text source of a long list of Wikipedia biography articles. So I have a dataframe with two columns. Column one is the name of the article. Column two is the entire source content of that article. I need to make a third column, the one that determines if the person the article is about is a male, female, or plural. This is why I need this regex code, to apply it to column two (content) to create column three (gender pronoun).
– Benjamin Png, Commented Nov 22, 2019 at 0:22
I'm starting to question whether a DataFrame is the right data structure for this. Of course if you intend to manipulate the results, you could have the article text column hold a key to a dictionary which contains all the texts.
– AMC, Commented Nov 22, 2019 at 0:41

Community · Accepted Answer · 2020-06-20 09:12:55Z

If you want to discover why your code isn't working, add a print statement to your function like so:

    for gender in patterns:
        print(gender)
        pattern = re.compile(gender)
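
The print makes the cause visible: iterating over a dictionary yields its keys, so gender is the label (for example male), and re.compile(gender) searches for the literal text "male" instead of the pronoun pattern. Below is a minimal sketch of the difference, reusing the question's dictionary unchanged. The helper names count_mentions_buggy and count_mentions_fixed are made up for this illustration and do not appear in the original code.

```python
import re

# The question's patterns, copied as-is (only the lookup is changed below).
patterns = {
    "male": "(he |his |him )",
    "female": "(she |her |hers )",
    "plural, or singular non-binary": "(they |them |their )",
}


def count_mentions_buggy(text):
    # Compiles the dictionary KEY, so it looks for the literal word "male", etc.
    return {gender: len(re.findall(re.compile(gender), text)) for gender in patterns}


def count_mentions_fixed(text):
    # Compiles the dictionary VALUE, i.e. the actual pronoun pattern.
    return {gender: len(re.findall(pattern, text)) for gender, pattern in patterns.items()}


text = "he does like it because his dad..."
print(count_mentions_buggy(text))
# {'male': 0, 'female': 0, 'plural, or singular non-binary': 0}
print(count_mentions_fixed(text))
# {'male': 2, 'female': 0, 'plural, or singular non-binary': 0}
```

Iterating over patterns.items() repairs the lookup, but the patterns themselves still over-match, which is the next point.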

Your regex also needs some tweaks. For example, in the first line of the song "Breathe" by Pink Floyd, "Breathe, breathe in the air", your regex finds two male pronouns.

There may be other problems too, I'm not sure.

Here is a solution quite similar to yours. The regexes are fixed, the dictionary is replaced by a list of tuples, etc.

Solution code

import pandas as pd
import numpy as np
import re
import operator as op

names_list = ['Sarah', 'David', 'Sam']
descs_list = ["she doesn't like this because her mum...", 'he does like it because his dad...',
              "they generally don't like it because their parent..."]

df_1 = pd.DataFrame(data=zip(names_list, descs_list), columns=['Name', 'Desc'])

pronoun_re_list = [('male', re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)),
                   ('female', re.compile(r"\b(?:she|her|hers)\b", re.IGNORECASE)),
                   ('plural/nb', re.compile(r"\b(?:they|them|their)\b", re.IGNORECASE))]


def detect_pronouns(str_in: str) -> str:
    match_results = ((curr_pron, len(curr_patt.findall(str_in))) for curr_pron, curr_patt in pronoun_re_list)
    max_pron, max_counts = max(match_results, key=op.itemgetter(1))
    if max_counts == 0:
        return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
    else:
        return max_pron


df_1['Pronouns'] = df_1['Desc'].map(detect_pronouns)

Explanations

Code

match_results is a generator expression. curr_pron stands for "current pronoun", and curr_patt for "current pattern". It might make things clearer if I rewrite it as a for loop which creates a list:

    match_results = []
    for curr_pron, curr_patt in pronoun_re_list:
        match_counts = len(curr_patt.findall(str_in))
        match_results.append((curr_pron, match_counts))

for curr_pron, curr_patt in ... is taking advantage of something which goes by a few different names, usually multiple assignment or tuple unpacking. You can find a nice article on it here. In this case, it's just a different way of writing:

    for curr_tuple in pronoun_re_list:
        curr_pron = curr_tuple[0]
        curr_patt = curr_tuple[1]

RegEx

Time for everyone's favorite subject: regex! I use a wonderful website called RegEx101, where you can mess around with the patterns; it makes things so much easier to understand. I have set up a page with some test data and the regex I'll be covering below: https://regex101.com/r/Y1onRC/2.

Now, let's take a look at the regex I used: \b(?:he|his|him)\b.

The he|his|him part is exactly like in yours: it matches the words 'he', 'his' or 'him'. In your regex, that part is surrounded by plain parentheses, while mine also includes ?: after the opening parenthesis. (pattern stuff) is a capturing group, which, as the name implies, captures whatever it matches. Since we don't actually care about the contents of the matches here, only how many there are, we add ?: to create a non-capturing group, which doesn't capture (or save) the contents.
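
To make the capturing versus non-capturing distinction concrete, here is a small sketch using re.search on a made-up sentence (the sentence and variable names are only for illustration). Both patterns match the same text; they differ only in whether the match is also stored as a numbered group.

```python
import re

text = "I saw him yesterday"

capturing = re.search(r"(he|his|him)", text)
non_capturing = re.search(r"(?:he|his|him)", text)

# Both patterns find the same text...
print(capturing.group(0))      # him
print(non_capturing.group(0))  # him

# ...but only the capturing version saves it as a group.
print(capturing.groups())      # ('him',)
print(non_capturing.groups())  # ()
```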

I said that the he|his|him part of the regex is the same as yours, but that isn't exactly true. You include a space after each pronoun, presumably to avoid matching the letters he in the middle of a word. Unfortunately, as I mentioned above, it still finds two matches in the sentence "Breathe, breathe in the air": the he at the end of breathe and the he at the end of the. Our saviour is \b, which matches word boundaries. It also means we catch the he in "Words words words he.", whereas (he |his |him ) doesn't.
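
The two sentences mentioned above can be checked directly. This is a quick sketch comparing the question's trailing-space pattern with the word-boundary pattern, leaving out the case flag so that only the boundary handling differs; the variable names are illustrative.

```python
import re

spaced = re.compile(r"(he |his |him )")      # the question's pattern
bounded = re.compile(r"\b(?:he|his|him)\b")  # word-boundary version, no flags

lyric = "Breathe, breathe in the air"
print(spaced.findall(lyric))   # ['he ', 'he ']  (from "breathe " and "the ")
print(bounded.findall(lyric))  # []

sentence = "Words words words he."
print(spaced.findall(sentence))   # []  ("he" is followed by a period, not a space)
print(bounded.findall(sentence))  # ['he']
```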

Finally, we compile the patterns with the re.IGNORECASE flag, which I don't think requires much explanation, although please do let me know if I'm wrong.
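
In case a quick illustration helps, here is a sketch of what the flag changes, using a made-up sentence in which the pronouns are capitalized:

```python
import re

text = "He left early. His bag is still here."

case_sensitive = re.compile(r"\b(?:he|his|him)\b")
case_insensitive = re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)

print(case_sensitive.findall(text))    # []
print(case_insensitive.findall(text))  # ['He', 'His']
```

Note that the he at the start of "here" is not counted in either case, because no word boundary follows it.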

Here is how I would describe the two patterns in plain English:

(he |his |him ) matches the letters he followed by a space, his followed by a space, or him followed by a space, and returns the full match plus a group.
\b(?:he|his|him)\b with the re.IGNORECASE flag matches the words he, his, or him, regardless of case, and returns the full match.

Hope that was clear enough, let me know!

Result output

    Name    Desc                                                  Pronouns
--  ------  ----------------------------------------------------  ----------
 0  Sarah   she doesn't like this because her mum...              female
 1  David   he does like it because his dad...                    male
 2  Sam     they generally don't like it because their parent...  plural/nb

Let me know if you have any questions :)

edited Jun 20, 2020 at 9:12

answered Nov 22, 2019 at 0:38

AMC

5 Comments

AMC Over a year ago

@BenjaminPng Was everything clear? Did you discover why your code wasn't working?

Benjamin Png Over a year ago

Actually could you please elaborate a little on your detect_pronouns function? My knowledge of regex doesn't extend beyond pandas and I found it rather confusing. For example, what does curr_pron and curr_patt mean? It would be great if you could point me in the right direction with regards to learning more about these concepts. I haven't been able to find anything informative on google. Thanks!

AMC Over a year ago

@BenjaminPng I made an edit, let me know what you think :)

Benjamin Png Over a year ago

Thanks Alexander! Super helpful. I really appreciate you taking the time.

Benjamin Png Over a year ago

Nope! You've explained everything perfectly. Thanks!
