
I have a DataFrame with 37 million rows (below is just an excerpt), and for each row I need to check whether any element of the list in 'cpc_class_symbol' starts with "Y02". I would like to flag the rows where this is the case by creating a new column that is 1 if at least one element of the list starts with "Y02" and 0 otherwise.

The code below works, but it only manages 270 it/s, which means running it on 37 million rows takes far too long. Does anyone know how I can speed things up?

[Image not available]

1 Answer

Let's say you start with the following mock DataFrame:

In [21]: df = pd.DataFrame(
    ...:     {
    ...:         'appln_id': range(10),
    ...:         'cpc_class_symbol': [
    ...:             ['ABC', 'DEF', 'Y02_foo'],
    ...:             ['Y02_bar', 'ABC', 'DEF', 'XYZ'],
    ...:             ['ABC'],
    ...:             ['XYZ'],
    ...:             [],
    ...:             ['Y02'],
    ...:             ['ABC', 'Y02_foo'],
    ...:             ['ABC', 'XYZ'],
    ...:             ['ABC', 'DEF', 'XYZ'],
    ...:             ['Y02_foo', 'XYZ'],
    ...:         ],
    ...:     },
    ...: )

In [22]: df
Out[22]:
   appln_id          cpc_class_symbol
0         0       [ABC, DEF, Y02_foo]
1         1  [Y02_bar, ABC, DEF, XYZ]
2         2                     [ABC]
3         3                     [XYZ]
4         4                        []
5         5                     [Y02]
6         6            [ABC, Y02_foo]
7         7                [ABC, XYZ]
8         8           [ABC, DEF, XYZ]
9         9            [Y02_foo, XYZ]

If you use df['cpc_class_symbol'].explode() you will end up with a Series where each list item is in a separate row:

In [23]: df['cpc_class_symbol'].explode()
Out[23]:
0        ABC
0        DEF
0    Y02_foo
1    Y02_bar
1        ABC
1        DEF
1        XYZ
2        ABC
3        XYZ
4        NaN
5        Y02
6        ABC
6    Y02_foo
7        ABC
7        XYZ
8        ABC
8        DEF
8        XYZ
9    Y02_foo
9        XYZ
Name: cpc_class_symbol, dtype: object

The index of the Series keeps the original row labels, so every item can still be traced back to its row. Note that the empty list in row 4 becomes NaN. Now you can use the str accessor of the Series to check whether each item starts with a given string:

In [24]: df['cpc_class_symbol'].explode().str.startswith('Y02')
Out[24]:
0    False
0    False
0     True
1     True
1    False
1    False
1    False
2    False
3    False
4      NaN
5     True
6    False
6     True
7    False
7    False
8    False
8    False
8    False
9     True
9    False
Name: cpc_class_symbol, dtype: object

What you want to do is group this Series by its index and check whether any of the items for a given index label is True. You can do that with Series.groupby:

In [25]: df['cpc_class_symbol'].explode().str.startswith('Y02').groupby(level=0).any().astype('int')
Out[25]:
0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    0
8    0
9    1
Name: cpc_class_symbol, dtype: int64

Here, groupby(level=0) refers to the first level of the index (you only have one level, so it simply means "group by the index"). You can assign the result back to the DataFrame with:

df['Y02_bin'] = df['cpc_class_symbol'].explode().str.startswith('Y02').groupby(level=0).any().astype('int')
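For reference, here is a minimal self-contained sketch of the explode approach on a smaller mock DataFrame. It makes two assumptions that go beyond the session above: the DataFrame index is unique (otherwise groupby(level=0) would merge rows that share a label), and na=False is passed to str.startswith so that the NaN produced by an empty list is treated as "no match" explicitly.

```python
import pandas as pd

# Small mock DataFrame (illustrative data, not the real patent table)
df = pd.DataFrame(
    {
        "appln_id": range(5),
        "cpc_class_symbol": [
            ["ABC", "Y02_foo"],
            ["XYZ"],
            [],
            ["Y02"],
            ["ABC", "DEF"],
        ],
    }
)

# One row per list item; the index repeats the original row label
exploded = df["cpc_class_symbol"].explode()

# na=False turns the NaN from the empty list into False instead of NaN
starts_with_y02 = exploded.str.startswith("Y02", na=False)

# Collapse back to one value per original row: 1 if any item matched
df["Y02_bin"] = starts_with_y02.groupby(level=0).any().astype("int")

print(df)
```

The assignment relies on index alignment, so it only lines up correctly if the index labels of df are unique.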

The explode approach might be a little memory intensive, since it creates one row per list item. I think a regular for loop over the Series that collects the results in a list should be pretty efficient for your use case, especially compared to using DataFrame.itertuples and assigning to each row individually.

That would look like this:

In [31]: [any(item.startswith('Y02') for item in row) for row in df['cpc_class_symbol']]
Out[31]: [True, True, False, False, False, True, True, False, False, True]

You can also assign this back to your original DataFrame:

In [34]: df['Y02_bin_loop'] = [int(any(item.startswith('Y02') for item in row)) for row in df['cpc_class_symbol']]

The results will be the same:

In [37]: df
Out[37]:
   appln_id          cpc_class_symbol  Y02_bin_loop  Y02_bin
0         0       [ABC, DEF, Y02_foo]             1        1
1         1  [Y02_bar, ABC, DEF, XYZ]             1        1
2         2                     [ABC]             0        0
3         3                     [XYZ]             0        0
4         4                        []             0        0
5         5                     [Y02]             1        1
6         6            [ABC, Y02_foo]             1        1
7         7                [ABC, XYZ]             0        0
8         8           [ABC, DEF, XYZ]             0        0
9         9            [Y02_foo, XYZ]             1        1
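If you need the same check for more than one prefix, the loop can be wrapped in a small helper. This is only a sketch: has_prefix is a hypothetical name, not a pandas function, and treating rows that are not lists (for example None) as "no match" is an assumption you may want to change.

```python
import pandas as pd


def has_prefix(symbols, prefix="Y02"):
    """Return 1 if any string in `symbols` starts with `prefix`, else 0."""
    # Assumption: rows that are not lists (e.g. None or NaN) count as no match
    if not isinstance(symbols, (list, tuple)):
        return 0
    return int(any(s.startswith(prefix) for s in symbols))


# Illustrative data, including an empty list and a missing value
df = pd.DataFrame(
    {"cpc_class_symbol": [["ABC", "Y02_foo"], [], None, ["Y02"], ["XYZ"]]}
)

# Iterate over the Series once and collect the flags in a plain list
df["Y02_bin_loop"] = [has_prefix(row) for row in df["cpc_class_symbol"]]

print(df)
```

The helper assumes every list element is a string; a non-string element would raise an AttributeError on startswith.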

3 Comments

For 10 million rows, explode took 27 seconds while the for loop took 9 seconds. I think the for loop is the better option here, both in terms of efficiency and code clarity. The usual advice is to avoid for loops with pandas, but in this case you are only iterating over a Series, not a DataFrame, and you are dealing with lists inside a column, which pandas generally does not handle well.
Thanks a lot! I have now solved it essentially as suggested by you, i.e. first exploded the list and then assigned the binary value with the following code: df_224_03['Y02_bin'] = [1 if i[:3] == "Y02" else 0 for i in df_224_03['cpc_class_symbol']] Afterwards I used groupby to get back to the "applnid" level that I want to work with. df_224_03_agg = (df_224_03.groupby(['appln_id']) .agg({'cpc_class_symbol': lambda x: x.tolist(), 'Y02_bin': 'max', 'Y02A_bin': 'max'}) .reset_index())
I am still wondering if there might be an efficient way to get through lists within a pd df column. I would have thought that is something that is quite standard. Thanks a lot for your comment!
