Optimize python script

Question

I'm trying to make my script less resource-heavy, or find simpler code that Python can process faster, for the following problem:

Example Table (dataset.xlsx):

no order materials status         Status_id
1  1000  100       available       1 
2  1000  200       not available   3 
3  1001  500       Feb-20          2 
4  1002  400       available       1 
5  1002  300       not available   3 
6  1002  600       available       1 
7  1002  900       available       1 
8  1003  700       available       1 
9  1003  800       available       1

I want to add a new column that repeats the maximum Status_id of each order on every row of that order.

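df = dataset
df.groupby('Status_id').max()  # result is not assigned, so this line does not change df
df['Max'] = df.groupby('order')['Status_id'].transform('max')  # per-order maximum, repeated on each row
df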

and I get:

no order materials status         Status_id   Max
1  1000  100       available       1          3
2  1000  200       not available   3          3
3  1001  500       Feb-20          2          2
4  1002  400       available       1          3
5  1002  300       not available   3          3
6  1002  600       available       1          3
7  1002  900       available       1          3
8  1003  700       available       1          1
9  1003  800       available       1          1
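For reference, the example above can be reproduced without the Excel file. The following is a minimal sketch, assuming the sample table is typed in by hand as a stand-in for dataset.xlsx (the inline DataFrame is an illustration, not the asker's actual loading code):

```python
import pandas as pd

# Sample table from the question, typed in by hand instead of read from dataset.xlsx
df = pd.DataFrame({
    "no": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "order": [1000, 1000, 1001, 1002, 1002, 1002, 1002, 1003, 1003],
    "materials": [100, 200, 500, 400, 300, 600, 900, 700, 800],
    "status": ["available", "not available", "Feb-20", "available",
               "not available", "available", "available", "available", "available"],
    "Status_id": [1, 3, 2, 1, 3, 1, 1, 1, 1],
})

# Per-order maximum of Status_id, broadcast back to every row of that order
df["Max"] = df.groupby("order")["Status_id"].transform("max")
print(df)
```

The Max column should match the output table shown in the question.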

Although it looks simple and works with small data sets, my actual data has 80k+ rows and up to 80 distinct Status_id values, so it takes hours to compute.

Any suggestions?

In my opinion, for large files I prefer using Dask (dask.org). Dask automatically parallelizes your operations. It also provides an API almost identical to the pandas one, so you should feel comfortable with it.
– Guillem, Commented Feb 21, 2020 at 7:47

Mykola Zotko · Accepted Answer · 2020-02-21 07:49:13Z


You can try to sort by 'Status_id' first and then take the last value from each group:

df = df.sort_values('Status_id')
df['Max'] = df.groupby('order')['Status_id'].transform('last')
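One thing to keep in mind is that sort_values changes the row order of df. The following is a small sketch on the question's sample values (only the order and Status_id columns are reproduced, as an assumption for brevity) that checks the sort-then-last approach against transform('max') and restores the original row order with sort_index:

```python
import pandas as pd

# order and Status_id columns from the question's sample table
df = pd.DataFrame({
    "order": [1000, 1000, 1001, 1002, 1002, 1002, 1002, 1003, 1003],
    "Status_id": [1, 3, 2, 1, 3, 1, 1, 1, 1],
})

# Reference result from the question's approach
expected = df.groupby("order")["Status_id"].transform("max")

# Sort so the largest Status_id of each order comes last within its group
sorted_df = df.sort_values("Status_id")
sorted_df["Max"] = sorted_df.groupby("order")["Status_id"].transform("last")

# sort_values reorders the rows; sort_index puts them back in the original order
result = sorted_df.sort_index()

# True when both approaches agree on every row
same = bool((result["Max"] == expected).all())
print(same)
```

Whether this is actually faster than transform('max') is not established here; it would need to be measured on the real data.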


3 Comments

jezrael Over a year ago

Do you have a benchmark? It would be interesting to see if it is faster.

Mykola Zotko Over a year ago

@jezrael No. Maybe you can test it.

jezrael Over a year ago

An idea to improve the answer: feel free to use the sample dataset from this
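As a possible starting point for the test asked about in these comments, here is a rough timing sketch on synthetic data. The data is an assumption: 80,000 rows and Status_id values from 1 to 80 follow the question, while the number of distinct orders (20,000) and the random distribution are invented for illustration. The map-based variant is not from this thread; it is included only as one more candidate to compare.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 80_000
df = pd.DataFrame({
    "order": rng.integers(0, 20_000, size=n_rows),  # assumed number of distinct orders
    "Status_id": rng.integers(1, 81, size=n_rows),  # status ids 1..80, as in the question
})


def with_transform_max(frame):
    # Approach from the question
    return frame.groupby("order")["Status_id"].transform("max")


def with_sort_last(frame):
    # Approach from the answer, with the original row order restored
    s = frame.sort_values("Status_id")
    return s.groupby("order")["Status_id"].transform("last").sort_index()


def with_map(frame):
    # Extra candidate (not from the thread): aggregate once, then map back by order
    return frame["order"].map(frame.groupby("order")["Status_id"].max())


timings = {}
results = {}
for name, func in [("transform max", with_transform_max),
                   ("sort + last", with_sort_last),
                   ("map", with_map)]:
    start = time.perf_counter()
    results[name] = func(df)
    timings[name] = time.perf_counter() - start
    print(f"{name}: {timings[name]:.4f} s")
```

A single run like this is noisy, so repeating the measurement (for example with timeit) and using the real data would give a more reliable comparison.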
