
I am fairly new to Python and coding. I am looking for a way to optimize a nested for loop. The nested for loop I have written works perfectly fine, but it takes a lot of time to run. I have explained the basic idea behind my original code and what I have tried below:

import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']], ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']], ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

df is the DataFrame that I am working on. What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check if the following conditions are met:

  1. If the age_group is equal.
  2. If the gender is the same.
  3. If the 'ID' is in 'matching_ids'.

If these conditions are met, I need to append that row to a separate DataFrame (sample_df). This is the code with the nested for loop that works fine:

df_copy = df.copy()
sample_df = pd.DataFrame()
for i in range(len(df)):
    for j in range(len(df)):
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            sample_df = sample_df.append(df_copy.iloc[[j]])

I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time. Below is the code written using a function:

sample_df_func = pd.DataFrame()

def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            sample_df_func = sample_df_func.append(df_copy.iloc[[k]])

df.apply(func_extract, axis=1)
sample_df_func

I am looking for ways to simplify this and optimize it further. Forgive me if the solution is very simple and I am just not able to figure it out.

Thanks

PS: I just started coding two months ago.

  • Please provide rows for the IDs 'd', 'k', 'u', 'x', 'z'. Commented May 29, 2021 at 7:29

1 Answer


We can form groups over age_group and gender to obtain subsets where the first two conditions hold automatically. For the third condition, we can explode the matching_ids, check with isin whether any of the exploded ids appears in the group's ID column, and keep only those rows within each group via boolean indexing:

out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))

where lastly we reset the index to get rid of the grouping variables in the index,

to get

>>> out

  ID age_group  gender matching_ids
0  b     15-24  female    [a, z, q]
1  q     15-24  female    [u, k, b]
2  r     35-44    male    [z, a, d]
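
If you would rather avoid the per-group apply altogether, an explode plus self-merge is another option. This is only a sketch, not part of the answer above; it reuses the df from the question and assumes each ID appears only once:

# Rough alternative: explode matching_ids and inner-merge against the original
# IDs, so a row survives if one of its matching_ids is an existing ID with the
# same age_group and gender (like the groupby version, the i != j check from
# the question's loop is skipped; no row in the sample lists its own ID).
exploded = df.explode('matching_ids')
pairs = exploded.merge(df[['ID', 'age_group', 'gender']],
                       left_on=['matching_ids', 'age_group', 'gender'],
                       right_on=['ID', 'age_group', 'gender'],
                       suffixes=('', '_matched'))

# Recover each qualifying original row (with its list-valued matching_ids) once
sample_df = df[df['ID'].isin(pairs['ID'])]

On this toy frame it selects the same three rows (b, r, q, in the original df order rather than grouped order); whether it is actually faster than the groupby/apply version would have to be checked on the real data.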

4 Comments

Hm... did I misunderstand the requirement then ... I thought the ID must be within the matching_ids column, but in your example, ID is clearly NOT part of matching_ids ..
@DanailPetrov Yes, if a row's ID is contained in another row's matching_ids, then the latter (not the former) is included in the dataframe. You can run the OP's code and look at the sample_df to see the result, if you wish; that's how I interpreted it.
My understanding is that ID should be checked if existing within the row’s matching_ids. Not on a per-column basis ... I might be wrong though
No actually thinking about it .. that doesn’t make sense.. I think your understanding is correct and I am wrong..
