I need to generate a dataframe based on another one. The output is built from the input df in two steps.

The input df has 4 columns. The output should be built as follows: 1) Take the value from col1 and generate that many rows in the output, where column opt is copied over, new_col equals f"{value_from_col0}_{loop_iterator_with_limit_from_col1}", and column src equals 'src1'. 2) Take the value from col2 and split it with | as the separator. For each split element, find the matching row in the input df (by col0) and generate rows in the same way as in 1), using that row's col0 and col1, with src equal to 'src2'.

import pandas as pd

df = pd.DataFrame([
    ['opt1', 'a', 2, ''],
    ['opt2', 'b', 1, ''],
    ['opt9', 'z', 3, 'a|b'],
    ['opt8', 'y', 3, 'a']],
  columns=['opt', 'col0', 'col1', 'col2'])
new_rows = []
for i, row in df.iterrows():
    # Step 1: one output row per unit of col1, labelled 'src1'
    for j in range(row['col1']):
        new_row = dict()
        new_row['opt'] = row['opt']
        new_row['new_col'] = f"{row['col0']}_{j+1}"
        new_row['src'] = 'src1'
        new_rows.append(new_row)
    # Step 2: for each non-empty element of col2, look up its col1 and expand again
    for s in row['col2'].split('|'):
        if s:
            col1_value = df.loc[df['col0'] == s, 'col1'].values[0]
            for k in range(col1_value):
                new_row = dict()
                new_row['opt'] = row['opt']
                new_row['new_col'] = f"{s}_{k + 1}"
                new_row['src'] = 'src2'
                new_rows.append(new_row)
# DataFrame.append was removed in pandas 2.0, so build the frame from the list directly
out = pd.DataFrame(new_rows, columns=['new_col', 'opt', 'src'])

Below is the expected output. I used iterrows(), which is pretty slow. I believe there is a more efficient, idiomatic pandas way to achieve the same thing. The rows may come out in a different order; that doesn't matter.

   new_col   opt   src
0      a_1  opt1  src1
1      a_2  opt1  src1
2      b_1  opt2  src1
3      z_1  opt9  src1
4      z_2  opt9  src1
5      z_3  opt9  src1
6      a_1  opt9  src2
7      a_2  opt9  src2
8      b_1  opt9  src2
9      y_1  opt8  src1
10     y_2  opt8  src1
11     y_3  opt8  src1
12     a_1  opt8  src2
13     a_2  opt8  src2

1 Answer

Here is one approach that leans more on vectorized pandas functions. It needs pandas 0.25 or later, because it uses DataFrame.explode. It probably still has room for improvement, but it is noticeably faster than iterrows. The steps are:

  1. Split col2 on | and explode it, so that each split element gets its own row;
  2. Rename col2 to col0, merge back with df to pick up the matching col1, and append the result to the original df;
  3. Use pandas or numpy repeat to repeat each row col1 times.
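
To see the two building blocks in isolation, here is a minimal sketch on toy data. The frame `toy` and its column `n` are made up for illustration, and the repeat step uses `Index.repeat` with `.loc` rather than repeating the raw values with numpy, which is a variant that keeps the column dtypes.

```python
import pandas as pd

# Toy data (hypothetical, not the frame from the question)
toy = pd.DataFrame({'opt': ['x', 'y'], 'col2': ['a|b', 'c'], 'n': [1, 2]})

# Step 1: turn the delimited string into a list, then explode to one row per element.
# The original index label is repeated for every element of the same row.
exploded = toy.assign(col2=toy['col2'].str.split('|')).explode('col2')
print(exploded)

# Step 3: repeat each row n times by repeating its index label and selecting with .loc.
repeated = toy.loc[toy.index.repeat(toy['n'])].reset_index(drop=True)
print(repeated)
```

Step 2 (the merge) is then just a lookup that attaches the repeat count to each exploded element.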

Below in code:

import numpy as np
import pandas as pd

df['col2'] = df['col2'].str.split('|')  # split the strings in col2 into lists
df['src'] = 'src1'  # label the original rows as src1

### Explode, change col names, merge and append.
df = pd.concat([
    (df.explode('col2')[['opt', 'col2']]           # one row per col2 element
       .rename(columns={'col2': 'col0'})           # rename to col0
       .merge(df[['col0', 'col1']], on='col0')),   # merge to get the matching col1
    df], axis=0, sort=False).fillna('src2')        # rows with no src yet come from col2

### Expand based on col1 values
new_df = (pd.DataFrame(
              np.repeat(df.values, df['col1'], axis=0),  # repeat each row col1 times
              columns=df.columns)
          .drop(['col1', 'col2'], axis=1)
          .sort_values(['opt', 'col0'])
          .rename(columns={'col0': 'new_col'})
          .reset_index(drop=True))

### Relabel new_col to append the order
new_df['new_col'] = (new_df['new_col'] + '_'
    + (new_df.groupby(['opt', 'new_col']).cumcount() + 1).map(str))


Out[1]:
    opt   new_col   src
0   opt1    a_1     src1
1   opt1    a_2     src1
2   opt2    b_1     src1
3   opt8    a_1     src2
4   opt8    a_2     src2
5   opt8    y_1     src1
6   opt8    y_2     src1
7   opt8    y_3     src1
8   opt9    a_1     src2
9   opt9    a_2     src2
10  opt9    b_1     src2
11  opt9    z_1     src1
12  opt9    z_2     src1
13  opt9    z_3     src1

If we compare the speed against iterrows on this dataframe repeated 100 times, we get the following:

df = pd.concat([df]*100, ignore_index=True)

%timeit generic(df) #using iterrows (your function)
#162 ms ± 722 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit generic1(df) #using the code above
#33 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
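
One caveat when reproducing this comparison: stacking identical copies repeats every col0 value. A merge on col0 then matches each col2 element against all the copies, whereas the loop only reads the first match via `.values[0]`, so the two approaches would not return the same rows on that input. The sketch below, which is an assumption-laden variant rather than the exact functions timed above, builds a larger frame whose col0 keys stay unique and checks that both versions agree before timing them. The names `make_frame`, `with_loop`, `vectorized` and `as_multiset` are hypothetical, and `vectorized` uses `Index.repeat` instead of numpy's repeat.

```python
import timeit

import pandas as pd

SAMPLE = pd.DataFrame(
    [['opt1', 'a', 2, ''], ['opt2', 'b', 1, ''],
     ['opt9', 'z', 3, 'a|b'], ['opt8', 'y', 3, 'a']],
    columns=['opt', 'col0', 'col1', 'col2'])


def make_frame(n_copies):
    """Stack copies of SAMPLE, tagging keys with the copy number so col0 stays unique."""
    parts = []
    for i in range(n_copies):
        tag = str(i)
        part = SAMPLE.copy()
        part['opt'] = part['opt'] + '-' + tag
        part['col0'] = part['col0'] + tag
        # Tag every non-empty reference in col2 the same way as col0
        part['col2'] = part['col2'].map(
            lambda s: '|'.join(x + tag for x in s.split('|') if x))
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


def with_loop(df):
    """Row-by-row reference, following the loop in the question."""
    rows = []
    for _, row in df.iterrows():
        for j in range(row['col1']):
            rows.append((row['opt'], f"{row['col0']}_{j + 1}", 'src1'))
        for s in filter(None, row['col2'].split('|')):
            n = df.loc[df['col0'] == s, 'col1'].values[0]
            for k in range(n):
                rows.append((row['opt'], f"{s}_{k + 1}", 'src2'))
    return pd.DataFrame(rows, columns=['opt', 'new_col', 'src'])


def vectorized(df):
    """explode/merge/repeat variant; relies on col0 values being unique."""
    base = df[['opt', 'col0', 'col1']].assign(src='src1')
    refs = (df.assign(col0=df['col2'].str.split('|'))[['opt', 'col0']]
              .explode('col0')
              .merge(df[['col0', 'col1']], on='col0')  # inner join drops '' elements
              .assign(src='src2'))
    both = pd.concat([base, refs], ignore_index=True)
    # Repeat each row col1 times, keeping the source row id to number the copies
    out = (both.loc[both.index.repeat(both['col1'])]
               .rename_axis('row').reset_index())
    counter = out.groupby('row').cumcount() + 1
    out['new_col'] = out['col0'] + '_' + counter.astype(str)
    return out[['opt', 'new_col', 'src']]


def as_multiset(frame):
    """Order-insensitive view of the rows, for comparing results."""
    return sorted(map(tuple, frame.to_numpy().tolist()))


big = make_frame(100)
assert as_multiset(with_loop(big)) == as_multiset(vectorized(big))

for fn in (with_loop, vectorized):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f'{fn.__name__}: {seconds * 1e3:.1f} ms per call')
```

The absolute numbers will depend on the machine and the pandas version, so treat them as a relative comparison only.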