Pandas create list from column values based on condition

Question

I have a DataFrame in which each tag column holds the tag name when that tag applies to the row's text, and NaN otherwise. I want to create a tags column that contains, for each row, a list of the tags that apply, with the NaN values removed.

I can remove NaN from a single list, but I am unsure what the most efficient way is to remove them from every list in the tags column. My DataFrame contains 30,000 rows.

Any help would be greatly appreciated!

import numpy as np
import pandas as pd

df = pd.DataFrame(data = {'text': ['Quinbrook acquires planned 350 MW project', 'Australian rooftop solar to shine bright', 'The US installed 5.7 GW of solar in Q2'],
                          'acquisition': ['acquisition', np.nan, np.nan], 'tender': [np.nan, np.nan, np.nan], 'opinion': [np.nan, 'opinion', np.nan]})

# get names of the tags 
tags = list(df.columns)
tags.remove('text')

# Create tags column
df['tags'] = df[tags].values.tolist()


# Remove NaN values from a single list

[x for x in df['tags'][0] if str(x) != 'nan']

# ['acquisition']

jezrael · Accepted Answer · 2021-10-26 10:12:44Z


A pure pandas solution is possible by reshaping with DataFrame.stack and aggregating each row's values into a list, but it is slow:

df['tags'] = df[tags].stack().groupby(level=0).agg(list).reindex(df.index, fill_value=[])
print(df)
                                        text  acquisition  tender  opinion  \
0  Quinbrook acquires planned 350 MW project  acquisition     NaN      NaN   
1   Australian rooftop solar to shine bright          NaN     NaN  opinion   
2     The US installed 5.7 GW of solar in Q2          NaN     NaN      NaN   

            tags  
0  [acquisition]  
1      [opinion]  
2             []
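To see why the reindex step is needed, the intermediate result can be inspected on its own. This is a minimal sketch on the question's sample data; the explicit dropna() call is an addition that is not in the answer's one-liner, included here on the assumption that stack() may or may not drop NaN by itself depending on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'text': ['Quinbrook acquires planned 350 MW project',
             'Australian rooftop solar to shine bright',
             'The US installed 5.7 GW of solar in Q2'],
    'acquisition': ['acquisition', np.nan, np.nan],
    'tender': [np.nan, np.nan, np.nan],
    'opinion': [np.nan, 'opinion', np.nan],
})
tags = [c for c in df.columns if c != 'text']

# Long format: one entry per (row label, tag column) pair.
# dropna() is added here (assumption) so the result does not depend on
# whether stack() itself discards NaN.
stacked = df[tags].stack().dropna()

# One list per row label that still has at least one tag left.
grouped = stacked.groupby(level=0).agg(list)

# Row 2 has no tags at all, so its label is missing from the result.
print(grouped.index.tolist())
```

Row label 2 does not appear in the grouped result because nothing is left for it after the NaN values are dropped, which is what the reindex with an empty-list fill value compensates for in the one-liner.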

Your approach is faster when it is written as a nested list comprehension:

df['tags'] = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]

Or, testing for missing values with pd.notna instead of a string comparison:

df['tags'] = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]
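The two filters agree on the question's data, but they are not strictly equivalent. The sketch below uses a made-up frame (the literal string 'nan' as a tag is a hypothetical edge case, not something from the question) to show where they could differ: the string comparison also discards a real value whose text happens to be 'nan', while pd.notna only discards missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 contains the literal string 'nan' as a tag.
df = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'acquisition': ['acquisition', np.nan, np.nan],
    'opinion': [np.nan, 'nan', 'opinion'],
})
tags = ['acquisition', 'opinion']

# Filter by string representation: drops NaN and the string 'nan' alike.
by_str = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]

# Filter by missingness: drops NaN only, keeps the string 'nan'.
by_notna = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]

print(by_str)
print(by_notna)
```

If the tag names are known never to be the text 'nan', either variant should give the same lists.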

Performance on the sample data, repeated 10,000 times to give 30k rows:

df = pd.concat([df] * 10000, ignore_index=True)

tags = list(df.columns)
tags.remove('text')


In [129]: %timeit df['tags'] = df[tags].stack().groupby(level=0).agg(list).reindex(df.index, fill_value=[])
1.21 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [130]: %timeit df['tags'] = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]
76.2 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [131]: %timeit df['tags'] = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]
110 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
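For anyone without IPython's %timeit, a comparable measurement can be sketched with the standard timeit module. The function names by_str and by_notna are made up for this illustration, and the numbers printed will depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    'text': ['Quinbrook acquires planned 350 MW project',
             'Australian rooftop solar to shine bright',
             'The US installed 5.7 GW of solar in Q2'],
    'acquisition': ['acquisition', np.nan, np.nan],
    'tender': [np.nan, np.nan, np.nan],
    'opinion': [np.nan, 'opinion', np.nan],
})
# Repeat the 3 sample rows 10,000 times to get 30k rows.
df = pd.concat([base] * 10000, ignore_index=True)
tags = [c for c in df.columns if c != 'text']


def by_str():
    # Nested comprehension filtering on the string representation.
    return [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]


def by_notna():
    # Nested comprehension filtering with pd.notna.
    return [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]


for fn in (by_str, by_notna):
    # Best of 3 single runs, reported in milliseconds.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms')
```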


1 Comment

Cassiopea Over a year ago

I never thought of using a list comprehension inside a list comprehension.
