
I am fairly new to Python and coding. I am looking for a way to optimize a nested for loop. The nested for loop I have written works perfectly fine, but it takes a lot of time to run. I have explained the basic idea behind my original code and what I have tried below:

import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']], ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']], ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

df is the DataFrame that I am working on. What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check if the following conditions are met:

  1. If the age_group is equal.
  2. If the gender is the same.
  3. If the 'ID' is in 'matching_ids'.

If these conditions are met, I need to append that row to a separate DataFrame (sample_df). This is the code with the nested for loop that works fine:

df_copy = df.copy()
sample_df = pd.DataFrame()
for i in range(len(df)):
    for j in range(len(df)):
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            sample_df = sample_df.append(df_copy.iloc[[j]])

I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time. Below is the code written using a function:

sample_df_func = pd.DataFrame()

def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            sample_df_func = sample_df_func.append(df_copy.iloc[[k]])

df.apply(func_extract, axis=1)
sample_df_func

I am looking for ways to simplify this and optimize it further. Forgive me if the solution is very simple and I am just not able to figure it out.

Thanks

PS: I just started coding two months ago.

  • Please provide rows for the IDs 'd', 'k', 'u', 'x', 'z'. Commented May 29, 2021 at 7:29

1 Answer


We can form groups over age_group and gender to obtain subsets where the first two conditions hold automatically. For the third condition, we can explode the matching_ids, check with isin whether any of the exploded ids appears in the group's ID column, and keep only those rows within each group via boolean indexing:

out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))

where lastly we reset the index to get rid of the grouping variables in the index,

to get

>>> out

  ID age_group  gender matching_ids
0  b     15-24  female    [a, z, q]
1  q     15-24  female    [u, k, b]
2  r     35-44    male    [z, a, d]
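
If you would rather avoid the per-group apply altogether, an explode plus self-merge is another option. This is only a sketch, not part of the answer above; it reuses the df from the question and assumes each ID appears only once:

# Rough alternative: explode matching_ids and inner-merge against the original
# IDs, so a row survives if one of its matching_ids is an existing ID with the
# same age_group and gender (like the groupby version, the i != j check from
# the question's loop is skipped; no row in the sample lists its own ID).
exploded = df.explode('matching_ids')
pairs = exploded.merge(df[['ID', 'age_group', 'gender']],
                       left_on=['matching_ids', 'age_group', 'gender'],
                       right_on=['ID', 'age_group', 'gender'],
                       suffixes=('', '_matched'))

# Recover each qualifying original row (with its list-valued matching_ids) once
sample_df = df[df['ID'].isin(pairs['ID'])]

On this toy frame it selects the same three rows (b, r, q, in the original df order rather than grouped order); whether it is actually faster than the groupby/apply version would have to be checked on the real data.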

4 Comments

Hm... did I misunderstand the requirement then ... I thought the ID must be within the matching_ids column, but in your example, ID is clearly NOT part of matching_ids ..
@DanailPetrov Yes, if a row's ID is contained in another row's matching_ids, then the latter (not the former) is included in the dataframe. You can run the OP's code and look at the sample_df to see the result, if you wish; that's how I interpreted it.
My understanding is that ID should be checked if existing within the row’s matching_ids. Not on a per-column basis ... I might be wrong though
No actually thinking about it .. that doesn’t make sense.. I think your understanding is correct and I am wrong..
