Loop through list and find strings in pandas dataframe

Question

I'm a Python/pandas newbie and have the following problem: I have a list called 'cat' containing 13 different strings that represent categories. I also have a dataframe called 'ku_drop' that contains 10 columns of HTML code (as strings), from which I want to extract information. I want to search the dataframe for each string in my 'cat' list and collect every cell containing that string in one column per category. (E.g. all cells containing the string 'Arbeitsatmosphäre' should be saved in column X1, all containing 'Kommunikation' in column X2, etc.) How can I do this? I tried the following, but I only get an empty dataframe ...

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
cat_length = len(cat)
df_appender = []
for i in range(cat_length):
    x = "{}".format(category[i] for category in cat)
    df_cat = ku_drop[ku_drop.apply(lambda col: col.str.contains(x, case=False), axis=1)].stack().to_frame()
    df_cat.columns = ['X[i]']
    df_cat = df_cat.dropna(axis=0)
    df_appender.append(df_cat)
df_appender

I'm aware that my code might have a lot of flaws, please excuse this as I am really not very familiar with pandas so far.
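
A note on the attempt above, offered as a hedged sketch rather than a full fix: formatting a generator expression with "{}".format(...) inserts the generator's repr instead of a category name, so the search pattern cannot match any cell, and 'X[i]' is a literal string that never picks up the value of i. The shortened cat list and the names pattern and column_name below are assumptions made only for illustration.

```python
cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Image']  # shortened list, for illustration only

i = 0

# Formatting a generator expression yields its repr, not a category name.
x = "{}".format(category[i] for category in cat)
print(x.startswith("<generator object"))  # True

# Indexing the list directly gives the intended search string.
pattern = cat[i]

# An f-string interpolates i; the plain literal 'X[i]' does not.
column_name = f"X{i + 1}"
print(pattern, column_name)
```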

Please have a look at "How to make good pandas examples" and edit your question to provide a sample of your input and your expected output so that we can better understand your task.
– G. Anderson, Commented Mar 23, 2022 at 15:39

keramat · Accepted Answer · 2022-03-23 15:59:58Z

Try:

import pandas as pd

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
ku_drop = pd.DataFrame({'c1': ['Arbeitsatmosphäre abc', 'abc', 'Work-Life-Balance abc', 'Arbeitsatmosphäre abc'], 'c2': ['abc', 'abc Vorgesetztenverhalten abc', 'Kommunikation abc abc', 'abc abc Arbeitsatmosphäre']})

# One output column per category; each column is filled from the top as matches are found
df = pd.DataFrame(index=range(len(ku_drop)), columns=cat)
for c in cat:
    used = 0  # number of rows already filled in column c
    for c2 in ku_drop.columns:
        # Cells of column c2 that contain the category name
        temp = ku_drop[ku_drop[c2].str.contains(c)][c2].values
        if len(temp) > 0:
            # .loc slices include both endpoints, hence the -1
            df.loc[used:used + len(temp) - 1, c] = temp
            used += len(temp)

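Output: df has one column per category. For the sample ku_drop, 'Arbeitsatmosphäre' holds the three matching cells, 'Kommunikation', 'Work-Life-Balance' and 'Vorgesetztenverhalten' hold one cell each, and every other cell is NaN.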

10 Comments

Freyana Over a year ago

Thank you, I only saw this now unfortunately! I can make it work for your example, but not for my 10000 observation dataframe... can you explain to me what the last part (if len(temp)>0 onwards) means? I think this would help me a lot! @keramat

keramat Over a year ago

Can you share the error or problem you encountered when running the code on your data? If there is any match, we assign the matching cells to the rows directly below those already filled in that column. The used variable holds the number of rows already filled.
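
The bookkeeping with used can be traced for a single category. The following is a minimal sketch that reuses the sample ku_drop from the answer; the result Series and the trace list are assumptions added only to make the intermediate state visible.

```python
import pandas as pd

ku_drop = pd.DataFrame({
    'c1': ['Arbeitsatmosphäre abc', 'abc', 'Work-Life-Balance abc', 'Arbeitsatmosphäre abc'],
    'c2': ['abc', 'abc Vorgesetztenverhalten abc', 'Kommunikation abc abc', 'abc abc Arbeitsatmosphäre'],
})

c = 'Arbeitsatmosphäre'
# Stand-in for one column of the answer's df: as many rows as ku_drop, all NaN at first.
result = pd.Series(index=range(len(ku_drop)), dtype=object, name=c)

used = 0    # rows of `result` already filled
trace = []  # (source column, number of matches, value of `used` afterwards)
for c2 in ku_drop.columns:
    temp = ku_drop[ku_drop[c2].str.contains(c)][c2].values
    if len(temp) > 0:
        # Labels used .. used+len(temp)-1 are the next free rows (.loc slices are inclusive).
        result.loc[used:used + len(temp) - 1] = temp
        used += len(temp)
    trace.append((c2, len(temp), used))

print(trace)
print(result.tolist())
```

Column c1 contributes two matches (rows 0 and 1), so used becomes 2; the single match from c2 then lands in row 2 and the last row stays NaN.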

Freyana Over a year ago

I don't get an error, I just get a dataframe with only NaNs back... I tried to understand why and replaced ...str.contains(c)... with ...str.contains('Arbeitsatmosphäre')... to check whether the extraction works. The problem seems to be that the respective strings are not saved in column "Arbeitsatmosphäre", but in every column.

keramat Over a year ago

Can you provide a sample of your data?

Freyana Over a year ago

strange, now it works for me too... thanks so much for your time!!
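
For larger frames such as the 10000-row one mentioned in the comments, a variant that does not preallocate rows may be easier to work with, since the frame in the accepted answer has only len(ku_drop) rows available per category. The following is a minimal sketch, not the answerer's method: the function name collect_matches is hypothetical, and it assumes every cell is either a string or missing and that category names should be matched literally rather than as regular expressions.

```python
import pandas as pd


def collect_matches(frame, categories):
    """Return a frame with one column per category holding every matching cell."""
    # Flatten all cells into one Series, column by column (same order as the answer's loop).
    cells = pd.concat([frame[col] for col in frame.columns], ignore_index=True)
    columns = {}
    for c in categories:
        # regex=False matches the name literally; na=False treats missing cells as non-matches.
        mask = cells.str.contains(c, regex=False, na=False)
        columns[c] = cells[mask].reset_index(drop=True)
    # Series of different lengths are aligned on their index and padded with NaN.
    return pd.DataFrame(columns, columns=categories)


# Sample data from the answer, with one missing cell added to exercise na=False.
ku_drop = pd.DataFrame({
    'c1': ['Arbeitsatmosphäre abc', None, 'Work-Life-Balance abc', 'Arbeitsatmosphäre abc'],
    'c2': ['abc', 'abc Vorgesetztenverhalten abc', 'Kommunikation abc abc', 'abc abc Arbeitsatmosphäre'],
})
result = collect_matches(ku_drop, ['Arbeitsatmosphäre', 'Kommunikation', 'Image'])
print(result)
```

The result has as many rows as the category with the most matches, so it can grow beyond the number of rows in ku_drop when several columns match the same category.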
