1

I have the following df that comprises of code/product & weeks columns.

code.  Product  .   weeks
123 .  product1 .    1;2
123 .  product1 .    3
321 .  product2 .    4;5;6
321 .  product2 .    7

For those rows that have more than 1 week (eg 1;2 or 4;5;6), I want to repeat these rows. My Desirsed output is as follows:

code.  Product  .   weeks
123 .  product1 .    1
123 .  product1 .    2
123 .  product1 .    3
321 .  product2 .    4
321 .  product2 .    5
321 .  product2 .    6
321 .  product2 .    7

What is the best approach to take using pandas or numpy?

1

2 Answers 2

2

Use:

df = (df.set_index(['code','Product'])['weeks']
       .str.split(';', expand=True)
       .stack()
       .reset_index(level=2, drop=True)
       .reset_index(name='weeks'))
print (df)
   code   Product weeks
0   123  product1     1
1   123  product1     2
2   123  product1     3
3   321  product2     4
4   321  product2     5
5   321  product2     6
6   321  product2     7

Explanation:

  1. First set_index by all repeated columns
  2. Create DataFrame by split
  3. Reshape by stack
  4. Last data cleaning by reset_index

Another solution:

from itertools import chain

weeks = df['weeks'].str.split(';')
lens = weeks.str.len()
df = pd.DataFrame({
    'code' : df['code'].repeat(lens),
    'Product' : df['Product'].repeat(lens),
    'weeks' : list(chain.from_iterable(weeks.values.tolist())), 
})

print (df)
   code   Product weeks
0   123  product1     1
0   123  product1     2
1   123  product1     3
2   321  product2     4
2   321  product2     5
2   321  product2     6
3   321  product2     7

Explanation:

  1. Create lists by split
  2. Get lengths of lsits by len
  3. Last repeat columns and flatten weeks
Sign up to request clarification or add additional context in comments.

3 Comments

The first solution appears to be working correctly except for the fact that in my df, it excludes rows that don't have (' ; ') in the ['weeks'] column value. To relate back to the original df above, data for weeks 3 and week 7 are not appearing. I think it has something to do with the str.split function accidentally excluding the additional data. Your help in finding the last missing piece of code will be most helpful!
@jwlon81 - So first solution failed and second working? If data are not confidental, is possible send me csv to my email in my profile for test it?
Hi Jezrael, I'm not sure what happened but I ran the code again and I now works! Thanks
0
#assume test.xlsx is your data
test = pd.read_excel('test.xlsx')  
test_processed = pd.DataFrame(columns=test.columns)
for index, row in test.iterrows():
   weeks = row['weeks'].split(';')
   for week in weeks:
       test_processed = test_processed.append({'code':row['code'], 'Product':row['Product'],'weeks':week}, ignore_index=True)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.