
I am using a function, segmentMatch, to which I pass two dataframes. I loop through one dataframe with iterrows, check a few conditions, and then merge the current row with the other dataframe inside the loop. It gives the correct result, but because both dataframes are large it is far too slow.

Is there any way I can improve the speed?

def segmentMatch(self, df, df_program):

    df_result = []
    for i, rview in df.iterrows():
        # Pick the df_program rows with the same id whose time window
        # overlaps the current row's [start_time, end_time] interval.
        df_tmp = df_program.loc[(df_program.iD == rview['id']) &
                                (rview['end_time'] >= df_program.START_TIME) &
                                (rview['start_time'] <= df_program.END_TIME)]
        # Turn the current row back into a one-row frame and left-merge it.
        df1 = rview.to_frame().transpose()
        tmp = pd.merge(df1, df_tmp, how='left')
        df_result.append(tmp)

    result = pd.concat(df_result, axis=0)
    return result

Please help me. I am using Visual Studio Code and Python 3.6.

Thanks in advance.

  • What is the size of both DataFrames? Commented Feb 27, 2019 at 12:24
  • Can you add a minimal, complete, and verifiable example? E.g. for each DataFrame 5 rows with 3 columns? Commented Feb 27, 2019 at 12:25
  • df has 11 columns and more than 1,000,000 rows, whereas df_program has 7 columns and 40,000 rows. Commented Feb 27, 2019 at 12:59

1 Answer


In general the advice is to never loop through a dataframe if it can be avoided: looping is extremely slow compared to any merge or join.

Conditional joins are not great in pandas, but they are pretty easy in SQL. A small lifehack could be to pip install pandasql and actually use SQL. The example below is not tested.

import pandasql as ps

sqlcode = '''
SELECT *
FROM df
JOIN df_program ON 1=1
    AND df_program.iD = df.id 
    AND df.end_time >= df_program.START_TIME
    AND df.start_time <= df_program.END_TIME
'''

new_df = ps.sqldf(sqlcode, locals())
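
Note that sqldf runs the query through an in-memory SQLite database and hands back an ordinary pandas DataFrame, so the rest of your code can stay unchanged.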

If you prefer not to use pandasql, I would suggest just merging and checking the conditions afterwards. That of course requires a bit more memory, depending on the overlap in IDs. Again, it's a bit tricky without data, but something along the lines of:

full_df = df.merge(df_program, how='left', left_on='id', right_on='iD')
filtered_df = full_df.loc[(full_df.end_time >= full_df.START_TIME) &
                          (full_df.start_time <= full_df.END_TIME)]
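
If you want to keep every row of df and only attach the program data where the time condition holds (rather than dropping the non-matching rows), one way could be to filter first and then left-merge the filtered result back onto df. A rough, untested sketch, reusing the column names from the question:

# Inner merge on the id columns, then keep only overlapping time windows.
matched = df.merge(df_program, left_on='id', right_on='iD')
matched = matched.loc[(matched.end_time >= matched.START_TIME) &
                      (matched.start_time <= matched.END_TIME)]

# Left-merge back onto df on all of df's columns, so rows without a match
# are kept and simply get empty program columns.
result = df.merge(matched, on=list(df.columns), how='left')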

If it doesn't fit in memory, you could try to do the same with a dask dataframe.

import dask.dataframe as dd

# Read your CSVs in like this
df = dd.read_csv('')
df_program = dd.read_csv('')

# Optionally set the id columns as the index first; merging on a sorted
# index is faster in dask.

# Merge and filter like above
full_df = df.merge(df_program, how='left', left_on='id', right_on='iD')
filtered_df = full_df[(full_df.end_time >= full_df.START_TIME) &
                      (full_df.start_time <= full_df.END_TIME)]

# Convert to pandas (if it fits in your memory anyway):
result = filtered_df.compute()
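
If even the filtered result is too big for pandas, dask can also write it straight to disk instead of calling compute(), e.g. filtered_df.to_csv('result-*.csv') writes one CSV file per partition.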

5 Comments

Thanks for your reply. I will try with SQL. Is there any other way to improve my code without using SQL?
Added a suggestion - I think you can just join on the equality and filter as a second step :)
Thanks again. I have tried using join(), but because it filters after the join, it removes all the rows where the condition doesn't match. I want all the data from df plus the matching data from df_program. Could you please help again?
Also, when I try to join all the records it gives me a memory error because the dataframes are too big to join.
Added another option :)
