1

I'm a python/pandas newbie and have the following problem: I have a list called 'cat' containing different 13 strings that represent categories. I further have a dataframe called 'ku_drop' that contains 10 columns with HTML code (string format), from which I want to extract information. Now, I want to search for each string of my 'cat'-list in the dataframe and save each cell containing the specific string in the same column. (E.g. all cells containing the string 'Arbeitsatmosphäre' should be saved in Column X1, all containing 'Kommunikation' in Column X2 etc.) How can I do this? I tried with the following, but I only receive an empty dataframe ...

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
cat_length = len(cat)
df_appender = []
for i in range(cat_length):
    x = "{}".format(category[i] for category in cat)
    df_cat = ku_drop[ku_drop.apply(lambda col: col.str.contains(x, case=False), axis=1)].stack().to_frame()
    df_cat.columns = ['X[i]']
    df_cat = df_cat.dropna(axis=0)
    df_appender.append(df_cat)
df_appender

I'm aware that my code might have a lot of flaws, please excuse this as I am really not very familiar with pandas so far.

1
  • Please have a look at How to make good pandas examples and edit to provide a sample of your input and your expected output so that we can better understand your task Commented Mar 23, 2022 at 15:39

1 Answer 1

1

Try:

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
ku_drop = pd.DataFrame({'c1': ['Arbeitsatmosphäre abc', 'abc', 'Work-Life-Balance abc', 'Arbeitsatmosphäre abc'], 'c2': ['abc', 'abc Vorgesetztenverhalten abc', 'Kommunikation abc abc', 'abc abc Arbeitsatmosphäre']})

df = pd.DataFrame(index= range(len(ku_drop)), columns = cat)
for i, c in enumerate(cat):
    used = 0
    for j, c2 in enumerate(ku_drop.columns):
        temp = ku_drop[ku_drop[c2].str.contains(c)][c2].values
        if len(temp)>0:
            df.loc[used:used+len(temp)-1,c] = temp
            used += len(temp)

Output:

enter image description here

Sign up to request clarification or add additional context in comments.

10 Comments

Thank you, I only saw this now unfortunately! I can make it work for your example, but not for my 10000 observation dataframe... can you explain to me what the last part (if len(temp)>0 onwards) means? I think this would help me a lot! @keramat
Can you provide the error or problem encountered the code on your data? if there is any match we assign the rows to the end of added rows. The used variable holds the number of used rows.
I don't get an error, I just get a dataframe with only NaNs back... I tried to understand why and replaced ...str.contains(c)... with ...str.contains('Arbeitsatmosphäre')... to check whether the extraction works. The problem seems to be, that the respective strings are not saved in column "Arbeitsatmosphäre", but in every column.
Can you provide a sample of your data?
strange, now it works for me too... thanks so much for your time!!
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.