
I'm dealing with a pandas DataFrame with 5 columns of data. I need to filter on each column to perform certain calculations.

for mfilter in raw_df['Column1'].unique():
    m_filter = raw_df[raw_df['Column1'] == mfilter]
    for rfilter in m_filter['Column2'].unique():
        r_filter = m_filter[m_filter['Column2'] == rfilter]
        for cfilter in r_filter['Column3'].unique():
            c_filter = r_filter[r_filter['Column3'] == cfilter]
            for cafilter in c_filter['Column4'].unique():
                ca_filter = c_filter[c_filter['Column4'] == cafilter]
                for part in ca_filter['part_no'].unique():
                    part_df = ca_filter[ca_filter['part_no'] == part]

I have another column, 'Values', on which I perform some calculations inside the innermost ('part') loop.

Because the data is very large, the complete run takes around 7-8 hours (around 1 second per part). Is there a better way to reduce the time taken and improve efficiency?

Here's some sample data:

Column1 Column2 Column3 part_no Values
A   J   X   1   1
A   K   Y   2   2
B   K   X   3   3
C   L   Y   4   4
C   L   X   5   5
D   J   X   6   6
D   J   X   6   7
D   J   X   6   8
C   L   Y   4   9
C   L   Y   4   10
C   L   Y   4   11

As the dataset shows, the Values column has several values for each part (within each category). For each part's data I have to perform certain calculations using that part's values. I pass this part_df to another function where the rest of the work happens.
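For reference, the sample table above can be rebuilt as an actual DataFrame like this (Column4 is omitted because it does not appear in the sample rows; the column names match the question):

```python
import pandas as pd

# Sample data from the table above, rebuilt row by row.
raw_df = pd.DataFrame({
    'Column1': ['A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'C', 'C', 'C'],
    'Column2': ['J', 'K', 'K', 'L', 'L', 'J', 'J', 'J', 'L', 'L', 'L'],
    'Column3': ['X', 'Y', 'X', 'Y', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'part_no': [1, 2, 3, 4, 5, 6, 6, 6, 4, 4, 4],
    'Values':  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})
```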

  • Can you provide a small amount of sample data (e.g. as CSV or actual Pandas code that builds an example DataFrame), describe in English what you're trying to accomplish as your end goal, and show the end results for the sample data? Commented Jun 6, 2020 at 6:06
  • Looks like a groupby(['Column1','Column2','Column3', 'Column4', 'part_no']). Commented Jun 6, 2020 at 6:09
  • Two notes: (1) having variables called mfilter and m_filter which mean different things is an absolute nightmare, and (2) you seem to overwrite part_df in every iteration, so why not just skip to the last iteration and generate the final part_df right away? I'm sure that's not what you want, but that's what your posted code seems to do. Commented Jun 6, 2020 at 6:10
  • @JohnZwinck I have updated the post with more insights. Please have a look at it. Commented Jun 6, 2020 at 6:18

1 Answer

You can use something like this (I didn't use Column4 because that's not present in your sample data):

df.groupby(['Column1', 'Column2', 'Column3', 'part_no']).apply(print)

It calls the function specified (print in this case) on each group having the same values for the specified columns. The output is:

  Column1 Column2 Column3  part_no  Values
0       A       J       X        1       1
  Column1 Column2 Column3  part_no  Values
1       A       K       Y        2       2
  Column1 Column2 Column3  part_no  Values
2       B       K       X        3       3
  Column1 Column2 Column3  part_no  Values
4       C       L       X        5       5
   Column1 Column2 Column3  part_no  Values
3        C       L       Y        4       4
8        C       L       Y        4       9
9        C       L       Y        4      10
10       C       L       Y        4      11
  Column1 Column2 Column3  part_no  Values
5       D       J       X        6       6
6       D       J       X        6       7
7       D       J       X        6       8

Now all you need to do is define a function containing whatever you had in your inner loop, for example:

def Pothuri(part_df):
    # whatever other code you didn't show us, using part_df['Values'] etc.
    ...

Then:

df.groupby(['Column1', 'Column2', 'Column3', 'part_no']).apply(Pothuri)
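Putting it together, here is a minimal end-to-end sketch. The real per-part calculation isn't shown in the question, so summing the 'Values' column stands in for it (a placeholder, not the asker's actual logic):

```python
import pandas as pd

# Sample data from the question.
raw_df = pd.DataFrame({
    'Column1': ['A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'C', 'C', 'C'],
    'Column2': ['J', 'K', 'K', 'L', 'L', 'J', 'J', 'J', 'L', 'L', 'L'],
    'Column3': ['X', 'Y', 'X', 'Y', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'part_no': [1, 2, 3, 4, 5, 6, 6, 6, 4, 4, 4],
    'Values':  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

def pothuri(part_df):
    # Placeholder for the real per-part calculation:
    # here we just sum this part's values.
    return part_df['Values'].sum()

# One call per (Column1, Column2, Column3, part_no) group,
# replacing the five nested loops from the question.
result = raw_df.groupby(['Column1', 'Column2', 'Column3', 'part_no']).apply(pothuri)
print(result)
```

`result` is a Series indexed by the group keys, e.g. the (C, L, Y, 4) group yields 4 + 9 + 10 + 11 = 34.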

4 Comments

You're welcome. If you don't mind, I'd appreciate a reply here when you figure out how much time it takes to run now. It will depend a lot on the data and how long your function actually takes to run.
Yea sure. Is it possible to pass another dataframe which I have into the same function along with the grouped data?
Yes, you can do .apply(Pothuri, arg1, arg2) and it will pass arg1 and arg2 as additional arguments to your function every time. Docs here: pandas.pydata.org/pandas-docs/stable/reference/api/…
Initially the code took 15hrs to run; following this method saved 4hrs (currently 11hrs).
