Goal: replace the values in column que_text with the matches of a re.search pattern, or None where there is no match.

Problem: I receive only None values in the que_text_new column, although the regex pattern has been thoroughly tested.

import re

def override(s):
    # Group 5 is the (.*) part that follows "Staatsregierung"
    x = re.search(r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?', str(s), flags=re.DOTALL | re.MULTILINE)
    if x:
        return x.group(5)
    return None

df2['que_text_new'] = df2['que_text'].apply(override)

What am I doing wrong? Removing return None doesn't help. I assume there is some structural error in my function.

  • Will you please post a sample of your dataframe? It's nearly impossible to help without that. Commented Nov 6, 2021 at 14:30
  • Where is s used in the override function? Commented Nov 6, 2021 at 15:16
  • I don't know; I assumed s is supposed to be an arbitrary placeholder, just like in loops. Commented Nov 6, 2021 at 15:17
  • No, you need an input string as the 2nd argument to the re.search method. Before editing the question, you had str(s) (that is why I mention it in my answer). Commented Nov 6, 2021 at 15:23

1 Answer

You can use a pattern with a single capturing group and then simply use Series.str.extract, chaining .fillna(np.nan) to fill the non-matched values with NaN:

pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'
df2['que_text_new'] = df2['que_text'].astype(str).str.extract(pattern).fillna(np.nan)

Not sure you need .astype(str), but there is str(s) in your code, so it might be safer with this part.

Here,

  • Capturing groups with single char alternatives are converted to character classes, e.g. (i|ı) -> [iı]
  • Other capturing groups are converted to non-capturing ones, i.e. ( -> (?:.
  • To make np.nan work do not forget to import numpy as np.
  • (?s) is an in-pattern re.DOTALL option.
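
For illustration, here is a minimal runnable sketch of the approach above. The three sample rows are made up (the real que_text data was not posted), so treat the outputs as a demonstration only. It also passes expand=False, which is not in the answer's code, so that a single-group pattern yields a Series; and it leaves out .fillna(np.nan), since Series.str.extract already gives NaN for rows that do not match.

```python
import pandas as pd

# Hypothetical sample rows; the asker's real data is not shown in the question.
df2 = pd.DataFrame({
    "que_text": [
        "Ich frage die Staatsregierung: Wie viele Schulen gibt es?",
        "Anfrage an die Staatsregierung zum Thema Verkehr",
        "Kein passender Text",
    ]
})

# Same pattern as in the answer: exactly one capturing group, (.*)
pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# expand=False returns a Series when the pattern has a single capturing group
df2["que_text_new"] = (
    df2["que_text"].astype(str).str.extract(pattern, expand=False)
)

print(df2["que_text_new"].tolist())
```

The first two rows keep everything after "Staatsregierung"; the third row has no match and stays missing.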

15 Comments

Thanks, Wiktor. Do you have an idea why I would get ValueError: Wrong number of items passed 7, placement implies 1 on my test dataset? I will add representative data in a sec.
You did not use my solution. Use a regex with just one capturing group.
OK, I see. I assume there is no solution with multiple capturing groups? Then I will give you the actual regex. I updated the question text. I need .group(5) of that regex.
I assume you did not quite understand why I suggested this kind of solution. 1) Series.str.extract in Pandas is handy to extract (parts of) matches into new column(s) and to do that, you need to use capturing groups in the pattern around those parts you want to extract. 2) Those groupings you do not need to extract into new column(s) should be non-capturing ((?:...)). 3) Only "capturing groups with single char alternatives are converted to character classes". If you have (\.|,|\s+) you can't use [.,\s+], it would match a different thing.
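
The points in this comment can be checked with the re module alone. The sketch below counts the capturing groups in both patterns (the original has 7, which is presumably where the "7" in the ValueError comes from, since str.extract produces one column per capturing group) and shows why (\.|,|\s+) and [.,\s+] are not equivalent. The variable names are made up for the illustration.

```python
import re

original = r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?'
rewritten = r'(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# Number of capturing groups in each compiled pattern
n_original = re.compile(original).groups    # 7 capturing groups
n_rewritten = re.compile(rewritten).groups  # 1 capturing group

# Alternation vs. character class: not the same thing
alternation = re.compile(r'(?:\.|,|\s+)')
char_class = re.compile(r'[.,\s+]')

# Inside [...] the + is a literal plus sign, so the class accepts "+"
plus_alt = alternation.fullmatch('+')    # None
plus_cls = char_class.fullmatch('+')     # match object

# \s+ matches a run of whitespace; the class matches exactly one character
spaces_alt = alternation.fullmatch('   ')  # match object
spaces_cls = char_class.fullmatch('   ')   # None

print(n_original, n_rewritten)
```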
@id345678 Group 0 is the match value. Capturing groups start with ID=1. Click Save regex on the left at the top. Then share the link.
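
To make the group numbering concrete, here is a small sketch using the regex from the question on a made-up input string. Group 0 is the whole match and group 5 is the (.*) part. Note that because (.*) is greedy, it consumes any trailing "Dresden" as well, so the optional group 6 does not participate in the match.

```python
import re

pattern = r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?'

# Hypothetical input; not taken from the asker's data
text = "Ich frage die Staatsregierung etwas. Dresden,"

m = re.search(pattern, text, flags=re.DOTALL)

whole = m.group(0)   # the entire matched text
first = m.group(1)   # outermost first group: "an" or "frage..."
fifth = m.group(5)   # the (.*) part after "Staatsregierung"
sixth = m.group(6)   # optional "Dresden" group; greedy (.*) already took it

print(repr(whole), repr(first), repr(fifth), repr(sixth))
```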