2

I have a dataframe like this:

df = pd.DataFrame({'market':['ES','UK','DE'],
                   'provider':['A','B','C'],
                   'item':['X','Y','Z']})

Then I have a list with the providers and the following loop:

providers_list = ['A','B','C']
for provider in providers_list:
  a = df.loc[df['provider']==provider]

That loop creates a dataframe for each provider, which later on I put into an excel. I would like to use the function apply for speed purposes. I have transformed the code like this:

providers_list = pd.DataFrame({'provider':['A','B','C']})
def report(provider):
 a = df.loc[df['provider']==provider]
providers_list.apply(report)  

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1190, in wrapper raise ValueError("Can only compare identically-labeled "

ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index provider')

Thanks

3
  • 1
    Do you need df.groupby('provider').apply(func) ? Commented Oct 26, 2018 at 8:57
  • 1
    @roganjosh There are no more packages right now. For testing purposes, I am just making the dataframe creation. Commented Oct 26, 2018 at 9:10
  • 1
    @roganjosh You are absolutely right. That's why before replying you I edited my question to include the full traceback, both with pandas and swifter. Sorry for the inconveniences and thanks. I have edited also the question avoiding swifter and just sticking to pandas apply. Commented Oct 26, 2018 at 9:19

2 Answers 2

2

The apply method is generally inefficient. It's nothing more than a glorified loop with some extra functionality. Instead, you can use GroupBy to cycle through each provider:

for prov, df_prov in df.groupby('provider'):
    df_prov.to_excel(f'{prov}.xlsx', index=False)

If you only want to include a pre-defined list of providers in your output, you can define a GroupBy object and iterate your list instead:

providers_list = ['A', 'B', 'C']
grouper = df.groupby('provider')

for prov in providers_list:
    grouper.get_group(prov).to_excel(f'{prov}.xlsx', index=False)

If you're interested in speed for your process as a whole, I strongly advise you avoid Excel: exporting to csv, csv.gz or pkl will all be much more efficient. For large datasets, it's unlikely filtering your dataframe is your bottleneck when exporting to Excel.

Sign up to request clarification or add additional context in comments.

Comments

1

This worked for me with a milllion entries of each provider in under a second:

import pandas as pd
from tqdm import tqdm

tqdm.pandas(desc="Progress:")


df = pd.DataFrame({'market':['ES','UK','DE']*1000000,
                   'provider':['A','B','C']*1000000,
                   'item':['X','Y','Z']*1000000})

grouped = df.groupby("provider")
providers_list = ['A','B','C']

for prov in tqdm(providers_list):
    frame_name = prov
    globals()[frame_name] = pd.DataFrame(grouped.get_group(prov))

print(A)
print(B)
print(C)

100%|██████████| 3/3 [00:00<00:00,  9.59it/s]

3 Comments

Don't use globals() unnecessarily. It's poor practice.
Okay, do you know how to create the named DataFrames in a correct manner? I would edit my answer accordingly.
Use a dictionary of dataframes (if necessary) instead.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.