function returns only None values when replacing pandas column values by regex match

Question

Goal: replace the values in column que_text with the text captured by a re.search pattern, or None when the pattern does not match.

Problem: the new que_text_new column contains only None values, although the regex pattern has been thoroughly tested.

import re

def override(s):
    # Group 5 is the (.*) part that follows "Staatsregierung".
    x = re.search(r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?', str(s), flags=re.DOTALL | re.MULTILINE)
    if x:
        return x.group(5)
    return None

df2['que_text_new'] = df2['que_text'].apply(override)

What am I doing wrong? Removing return None doesn't help. I assume there is some structural error in my function.

Will you please post a sample of your dataframe? It's nearly impossible to help without that.
– user17242583, Commented Nov 6, 2021 at 14:30
I don't know; I assumed s is supposed to be an arbitrary placeholder, just like in loops?
– id345678, Commented Nov 6, 2021 at 15:17
No, you need an input string as the 2nd argument to the re.search method. Before editing the question, you had str(s) (that is why I mention it in my answer).
– Wiktor Stribiżew, Commented Nov 6, 2021 at 15:23

Wiktor Stribiżew · Accepted Answer · 2021-11-06 15:15:20Z

You can use a pattern with a single capturing group and then simply use Series.str.extract. Rows that do not match already come back as NaN; the chained .fillna(np.nan) only makes that explicit:

pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'
df2['que_text_new'] = df2['que_text'].astype(str).str.extract(pattern).fillna(np.nan)

Not sure you need .astype(str), but there is str(s) in your code, so it might be safer with this part.
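To see what .astype(str) changes, here is a small sketch on a made-up Series (the values are assumptions, not data from the question). It shows that missing values and numbers become ordinary text before the regex runs, and that rows without a match still end up as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one matching string, two kinds of missing value, one number.
s = pd.Series(["an die Staatsregierung: Text", None, np.nan, 42], dtype=object)

# astype(str) turns every element into text, including None and NaN.
as_text = s.astype(str).tolist()
print(as_text)  # ['an die Staatsregierung: Text', 'None', 'nan', '42']

pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# expand=False returns a Series because the pattern has exactly one group.
extracted = s.astype(str).str.extract(pattern, expand=False)
print(extracted.tolist())
```

The strings 'None' and 'nan' simply fail to match this pattern, so those rows are NaN in the result either way.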

Here,

Capturing groups with single char alternatives are converted to character classes, e.g. (i|ı) -> [iı]
Other capturing groups are converted to non-capturing ones, i.e. ( -> (?:.
To make np.nan work do not forget to import numpy as np.
(?s) is an in-pattern re.DOTALL option.
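The points above can be tried on a tiny example. The question did not include any rows, so the sample texts below are invented for illustration; the pattern is the one from the answer. This sketch uses expand=False so that str.extract returns a Series rather than a one-column DataFrame, which is a choice made here and not part of the original answer:

```python
import pandas as pd

# Hypothetical sample data; not taken from the question.
df2 = pd.DataFrame({
    "que_text": [
        "Ich frage die Staatsregierung: Wie viele Stellen gibt es?",
        "Kleine Anfrage an die Staatsregierung\nThema: Verkehr",
        "Text ohne passende Einleitung",
    ]
})

pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# One capturing group -> one extracted value per row; non-matching rows become NaN.
df2["que_text_new"] = df2["que_text"].astype(str).str.extract(pattern, expand=False)

result = df2["que_text_new"].tolist()
print(result)
```

Because of (?s), the captured text in the second row includes the newline and everything after it.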

edited Nov 6, 2021 at 15:15

answered Nov 6, 2021 at 14:34


15 Comments

id345678 Over a year ago

Thanks, Wiktor. Do you have an idea why I would get ValueError: Wrong number of items passed 7, placement implies 1 on my test dataset? I will add representative data in a sec.

Wiktor Stribiżew Over a year ago

You did not use my solution. Use a regex with just one capturing group.

id345678 Over a year ago

OK, I see. I assume there is no solution with multiple capturing groups? Then I will give you the actual regex. I updated the question text. I need .group(5) of that regex.

Wiktor Stribiżew Over a year ago

I assume you did not quite understand why I suggested this kind of solution. 1) Series.str.extract in Pandas is handy to extract (parts of) matches into new column(s) and to do that, you need to use capturing groups in the pattern around those parts you want to extract. 2) Those groupings you do not need to extract into new column(s) should be non-capturing ((?:...)). 3) Only "capturing groups with single char alternatives are converted to character classes". If you have (\.|,|\s+) you can't use [.,\s+], it would match a different thing.
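A short sketch of the three points in this comment, using an invented one-row Series. It counts the capturing groups in both patterns, shows that str.extract produces one column per capturing group (seven for the question's regex, which fits the "7" in the ValueError), and checks that a character class is not equivalent to an alternation containing \s+:

```python
import re
import pandas as pd

original = r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?'
single = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# Number of capturing groups in each compiled pattern.
n_original = re.compile(original).groups
n_single = re.compile(single).groups

# Hypothetical text; str.extract returns one column per capturing group.
s = pd.Series(["ich frage die Staatsregierung nach dem Stand"])
wide = s.str.extract(original)
narrow = s.str.extract(single)
print(n_original, wide.shape)   # 7 (1, 7)
print(n_single, narrow.shape)   # 1 (1, 1)

# Inside a character class '+' is a literal, and the class matches one character only.
class_matches_plus = re.fullmatch(r'[.,\s+]', '+') is not None        # True
group_matches_plus = re.fullmatch(r'(\.|,|\s+)', '+') is not None     # False
class_matches_two_spaces = re.fullmatch(r'[.,\s+]', '  ') is not None     # False
group_matches_two_spaces = re.fullmatch(r'(\.|,|\s+)', '  ') is not None  # True
```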

Wiktor Stribiżew Over a year ago

@id345678 Group 0 is the match value. Capturing groups start with ID=1. Click Save regex on the left at the top. Then share the link.
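Group numbering can be checked directly with re.search. The input sentence below is made up for illustration; the pattern is the one from the question. Groups are numbered by the position of their opening parenthesis, so (.*) is group 5 and the optional Dresden part is group 6:

```python
import re

pattern = r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?'
text = "Ich frage die Staatsregierung: Wie ist der Stand in Dresden?"  # hypothetical input

m = re.search(pattern, text, flags=re.DOTALL)

whole = m.group(0)   # the entire match, starting at "frage"
first = m.group(1)   # 'frage'
rest = m.group(5)    # everything after "Staatsregierung"
tail = m.group(6)    # None: the greedy (.*) already consumed "Dresden?"

print(repr(whole))
print(repr(rest))
print(tail)
```

In this sketch the greedy (.*) takes the rest of the string, so the optional trailing group never participates in the match.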
