I'm trying to apply a regex function to a column of a dataframe to determine gender pronouns. This is what my dataframe looks like:
    name                                            Descrip
0  Sarah           she doesn't like this because her mum...
1  David                 he does like it because his dad...
2    Sam  they generally don't like it because their par...
This is the code I ran to build that dataframe:
import pandas as pd

list_label = ["Sarah", "David", "Sam"]
list_descriptions = ["she doesn't like this because her mum...", "he does like it because his dad...", "they generally don't like it because their parent..."]
data3 = {'name':list_label, 'Descrip':list_descriptions}
test_df = pd.DataFrame(data3)
I'm trying to determine each person's gender by applying a regex function to the "Descrip" column. Specifically, these are the patterns I want to implement:
"male":"(he |his |him )",
"female":"(she |her |hers )",
"plural, or singular non-binary":"(they |them |their )"
The function tries each pattern against a row's description and returns the label of the pronoun group mentioned most often. Each group has several keywords in its pattern string (e.g. him, her, they). The idea is to determine max_gender, the label of the pattern group with the most matches in that row's Descrip value. max_gender can therefore take one of three values: male | female | plural, or singular non-binary. If none of the patterns match the row's Descrip value, "unknown" is returned instead.
The full code I've written is as follows:
import re

def get_pronouns(text):
    patterns = {
        "male":"(he |his |him )",
        "female":"(she |her |hers )",
        "plural, or singular non-binary":"(they |them |their )"
    }
    max_gender = "unknown"
    max_gender_count = 0
    for gender in patterns:
        pattern = re.compile(gender)
        mentions = re.findall(pattern, text)
        count_mentions = len(mentions)
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender
test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns)
print(test_df)
However, when I run the code, it fails to determine the pronoun: every row comes back as "unknown", as the following output shows:
    name                                            Descrip  pronoun
0  Sarah           she doesn't like this because her mum...  unknown
1  David                 he does like it because his dad...  unknown
2    Sam  they generally don't like it because their par...  unknown
Does anyone know what is wrong with my code?
get_pronouns()? I'll try to figure it out; there are also a few other things I would change. I'm also confused about why the descriptions are cut off.
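Looking at the loop in get_pronouns(), it compiles `gender`, which is the dictionary key (for example "male"), rather than the pattern string stored under that key. `re.findall` therefore searches each description for the literal label, finds nothing, and the function returns "unknown" for every row. Below is a minimal sketch of one way to fix it. The `\b` word boundaries and the `re.IGNORECASE` flag are my own assumptions, not part of the original patterns. I added them because the original "he " also matches inside "she ", and the trailing space misses a pronoun at the end of a string or before punctuation. The module-level name `PATTERNS` is likewise just for illustration.

```python
import re

# Label -> pattern. The \b word boundaries are an assumption added here so that
# "he" is not counted inside "she", "they", or "their".
PATTERNS = {
    "male": r"\b(?:he|his|him)\b",
    "female": r"\b(?:she|her|hers)\b",
    "plural, or singular non-binary": r"\b(?:they|them|their)\b",
}

def get_pronouns(text):
    """Return the label whose pattern matches most often, or "unknown"."""
    max_gender = "unknown"
    max_gender_count = 0
    for gender, pattern in PATTERNS.items():
        # Search with the pattern (the dict value), not the label (the dict key).
        count_mentions = len(re.findall(pattern, text, flags=re.IGNORECASE))
        # Strict ">" keeps the first label on a tie, as in the original loop.
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

print(get_pronouns("she doesn't like this because her mum..."))
print(get_pronouns("he does like it because his dad..."))
```

The existing line test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns) should work unchanged with this version. As for the descriptions being cut off, that is most likely just pandas truncating long strings when it prints the frame; the underlying data should be intact, and something like pd.set_option("display.max_colwidth", None) should show the full text.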