I'm using NumPy (np.select) with multiple conditions to assign a category based on a text string in a transaction description.

Part of the code is below:

import numpy as np

conditions = [
    df2['description'].str.contains('AEGON', na=False),                        
    df2['description'].str.contains('IB/PVV', na=False),                   
    df2['description'].str.contains('Picnic', na=False),                   
    df2['description'].str.contains('Jumbo', na=False),  
]    
values = [
    'Hypotheek',                                                  
    'Hypotheek',                                                  
    'Boodschappen',                                               
    'Boodschappen']   

df2['Classificatie'] = np.select(conditions, values, default='unknown')
                                        

I have many conditions, only some of which are shown here. Instead of including every separate condition and value in the code, I want to store them in a table / dataframe. For instance, the following dataframe:

import pandas as pd

Conditions = {
    'Condition': ['AEGON', 'IB/PVV', 'Picnic', 'Jumbo'],
    'Value': ['Hypotheek', 'Hypotheek', 'Boodschappen', 'Boodschappen'],
}

df_conditions = pd.DataFrame(Conditions, columns=['Condition', 'Value'])

How can I adjust the condition so that str.contains looks for each text string listed in df_conditions['Condition'] and applies the matching entry from the Value column to df2['Classificatie']? The values are already a list in the variable explorer, but I can't find a way to have str.contains look for a value from a list / dataframe.

Desired outcome:

    In [3]: iwantthis
    Out[3]:
       Description               Classificatie
    0  groceries Jumbo on date   Boodschappen
    1  mortgage payment Aegon.   Hypotheek
    2  transfer picnic.          Boodschappen

The first column is the input dataframe; the second column is what I'm looking for.

Please note that my current code already allows me to create this column, but I want a more automated approach that uses the df_conditions table.

I'm not yet very familiar with Python and I can't find anything online.

3 Comments
  • Can you please edit your question and add a small sample input dataframe and the expected output? Commented Apr 9, 2022 at 13:54
  • I edited my question, did it clarify? @AndrejKesely I found another similar topic in the meantime: stackoverflow.com/questions/62854013/… I tried this, but it led to strange combinations, even after adjusting the cut-offs. In essence it seems to be the same question, though. Commented May 1, 2022 at 18:16
  • Thanks. I get two errors: 1) InvalidIndexError: Reindexing only valid with uniquely valued Index objects, and 2) if I run it again: raise KeyError(key) from err KeyError: 'Condition'. I had some duplicates in the conditions table, which I removed, but the error keeps occurring. I've added an integrity check (set to both True and False), but this does not solve the issue: df_conditions = df_conditions.set_index('Condition', verify_integrity=False) Commented May 6, 2022 at 12:11

1 Answer

Try:

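import re

# Lowercase the keywords and use them as the lookup index.
df_conditions["Condition"] = df_conditions["Condition"].str.lower()
df_conditions = df_conditions.set_index("Condition")

# Extract the first keyword found in each description (case-insensitive).
tmp = df["Description"].str.extract(
    "(" + "|".join(re.escape(c) for c in df_conditions.index) + ")",
    flags=re.I,
)

# Map the matched keyword to its category; rows without a match become NaN.
df["Classificatie"] = tmp[0].str.lower().map(df_conditions["Value"])
print(df)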

Prints:

               Description Classificatie
0  groceries Jumbo on date  Boodschappen
1  mortgage payment Aegon.     Hypotheek
2         transfer picnic.  Boodschappen
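
If you would rather keep the np.select approach from the question, the conditions can be generated from the table instead of written out by hand. The following is only a sketch under some assumptions: the sample rows (including the extra "something else" row and the duplicate "picnic" keyword) are made up for illustration, and the name lookup is hypothetical. Dropping duplicate keywords after lowercasing is included because repeated keywords are one likely cause of the InvalidIndexError mentioned in the comments.

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the real df and df_conditions.
df = pd.DataFrame({
    "Description": [
        "groceries Jumbo on date",
        "mortgage payment Aegon.",
        "transfer picnic.",
        "something else",
    ]
})
df_conditions = pd.DataFrame({
    "Condition": ["AEGON", "IB/PVV", "Picnic", "Jumbo", "picnic"],
    "Value": ["Hypotheek", "Hypotheek", "Boodschappen", "Boodschappen", "Boodschappen"],
})

# Lowercase the keywords and keep only the first occurrence of each one.
lookup = (
    df_conditions.assign(Condition=df_conditions["Condition"].str.lower())
    .drop_duplicates(subset="Condition")
)

# One boolean mask per keyword; regex=False treats "ib/pvv" as literal text.
conditions = [
    df["Description"].str.contains(keyword, case=False, regex=False, na=False)
    for keyword in lookup["Condition"]
]

# The first matching keyword wins; rows without any match get 'unknown'.
df["Classificatie"] = np.select(conditions, lookup["Value"].tolist(), default="unknown")
print(df)
```

Compared with the extract/map version, this keeps the default='unknown' behaviour from the question, at the cost of one str.contains pass per keyword, which may be slower for a long conditions table.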