
Given the following df:

   SequenceNumber | ID | CountNumber | Side | featureA | featureB
0   0             | 0  |  3          | Sell |  4       |  2              
1   0             | 1  |  1          | Buy  |  12      |  45
2   0             | 2  |  1          | Buy  |  1       |  4
3   0             | 3  |  1          | Buy  |  3       |  36
4   1             | 0  |  1          | Sell |  5       |  11
5   1             | 1  |  1          | Sell |  7       |  12
6   1             | 2  |  2          | Buy  |  5       |  35

I want to create a new df such that, for every SequenceNumber value, it takes the rows with CountNumber == 1 and creates new rows where, if Side == 'Buy', their ID goes in a column named To; otherwise their ID goes in a column named From. The empty column out of From and To then takes the ID of the row with CountNumber > 1 (there is only one such row per SequenceNumber value). The rest of the features should be preserved.

NOTE: basically, each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links buyers and sellers, where From is the seller ID and To is the buyer ID.

The output should look like this:

   SequenceNumber | From | To | featureA | featureB
0   0             | 0    |  1 |  12      |  45              
1   0             | 0    |  2 |  1       |  4
2   0             | 0    |  3 |  3       |  36
3   1             | 0    |  2 |  5       |  11
4   1             | 1    |  2 |  7       |  12

I implemented a method that does this; however, it uses for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?

Here is the original df:

import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2], 
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})
  • Can you provide the data in constructor format? (Paste df.to_dict() output.) Commented Feb 7, 2023 at 21:46
  • just added it at the end! Commented Feb 7, 2023 at 21:49
  • For each sequence number, there is always only either (1 sell, multiple buys) or (multiple sells, 1 buy), correct? Commented Feb 7, 2023 at 21:55
  • Does it make sense if I suggest splitting this problem in two parts and concatenating the resulting dataframes? Try to solve this for a dataframe where there is always 1 buyer and multiple sellers. Then do the same for a dataframe where there is always 1 seller and multiple buyers. Then concatenate those results. Combining these problems makes it quite hard, I think. Commented Feb 7, 2023 at 22:00
  • @JarroVGIT yes there is always either (1 sell multiple buys) or (multiple sells 1 buy). I also believe that if you can split it, we can easily generalize it, no? But splitting would also work Commented Feb 7, 2023 at 22:05
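The split-and-concatenate idea from the comments can be sketched as follows. This is a minimal illustration, not code from the question: the helper `link_one_to_many` and the per-group `n_sell` count are made-up names. Each half of the data (1-seller groups, 1-buyer groups) is solved with a plain merge, then the two results are concatenated:

```python
import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2],
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})

def link_one_to_many(d, single_side):
    # Merge the single row's ID onto each row of the opposite side;
    # the features come from the "many" side (the CountNumber == 1 rows).
    one = d.loc[d['Side'] == single_side, ['SequenceNumber', 'ID']]
    many = d.loc[d['Side'] != single_side]
    merged = many.merge(one, on='SequenceNumber', suffixes=('', '_one'))
    names = ({'ID_one': 'From', 'ID': 'To'} if single_side == 'Sell'
             else {'ID_one': 'To', 'ID': 'From'})
    return merged.rename(columns=names)[
        ['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]

# Split the groups by which side occurs once, solve each half, then concatenate.
n_sell = df.groupby('SequenceNumber')['Side'].transform(lambda s: s.eq('Sell').sum())
out = pd.concat([link_one_to_many(df[n_sell == 1], 'Sell'),
                 link_one_to_many(df[n_sell > 1], 'Buy')], ignore_index=True)
```

On the sample data this reproduces the expected five-row output.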

3 Answers


You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:

features = list(df.filter(like='feature'))

out = (
    # repeat the rows with CountNumber > 1
    df.loc[df.index.repeat(df['CountNumber'])]
    # rename Sell/Buy into from/to and de-duplicate the rows per group
    .assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
            n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount())
    # mask the features where CountNumber > 1
    .assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
    .drop(columns='CountNumber')
    # reshape with a pivot
    .pivot(index=['SequenceNumber', 'n'], columns='Side')
)

out = (
    pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()],
              axis=1)
    .reset_index('SequenceNumber')
)

Output:

   SequenceNumber  from  to  featureA  featureB
n                                              
0               0     0   1      12.0      45.0
1               0     0   2       1.0       4.0
2               0     0   3       3.0      36.0
0               1     0   2       5.0      11.0
1               1     1   2       7.0      12.0

Alternative using a merge, as suggested by ifly6:

features = list(df.filter(like='feature'))

df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))

df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))

out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
          .set_index(['SequenceNumber', 'from', 'to'])
          .filter(like='feature')
          .pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
          .reset_index()
      )

Output:

   SequenceNumber  from  to  featureA  featureB
0               0     0   1      12.0      45.0
1               0     0   2       1.0       4.0
2               0     0   3       3.0      36.0
3               1     0   2       5.0      11.0
4               1     1   2       7.0      12.0
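Note that the masking step introduces NaNs, which promote the feature columns to float (hence 12.0 instead of 12 in the outputs above). Once the groupby has collapsed the NaNs away, a plain cast restores the integers; a minimal standalone sketch:

```python
import pandas as pd

# Stand-in for the merged result: floats left over from the NaN masking.
out = pd.DataFrame({'featureA': [12.0, 1.0, 3.0], 'featureB': [45.0, 4.0, 36.0]})
features = ['featureA', 'featureB']
# Cast back to integers; use the nullable 'Int64' dtype instead if NaNs may remain.
out[features] = out[features].astype(int)
```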

1 Comment

Please check the update if you're only interested in the features from the rows where CountNumber == 1.

Initial response. To get halfway to the answer: split the data into sellers and buyers, then merge it against itself on the sequence number:

ndf = df.query('Side == "Sell"').merge(
    df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})

I then drop the side variable.

ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])

This creates a very wide table:

   SequenceNumber  From  CountNumber_sell  featureA_sell  featureB_sell  To  CountNumber_buy  featureA_buy  featureB_buy
0               0     0                 3              4              2   1                1            12            45
1               0     0                 3              4              2   2                1             1             4
2               0     0                 3              4              2   3                1             3            36
3               1     0                 1              5             11   2                2             5            35
4               1     1                 1              7             12   2                2             5            35

This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.

Is it the side with the lower CountNumber? Is it the side where CountNumber == 1? If the latter, then just null out the entries before the merge, do the merge, and then forward fill the appropriate columns to recover the proper values.


Re nulling. If you null the portions of featureA and featureB where CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.

import numpy as np

s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan

ndf = s.merge(
    b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
    .ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
    .ffill(axis=1).iloc[:, -1]

ndf = ndf.drop(
    columns=[i for i in ndf.columns if i.startswith('Side') 
             or i.endswith('_sell') or i.endswith('_buy')])

The final version of ndf then is:

   SequenceNumber  From  To  featureA  featureB
0               0     0   1      12.0      45.0
1               0     0   2       1.0       4.0
2               0     0   3       3.0      36.0
3               1     0   2       5.0      11.0
4               1     1   2       7.0      12.0
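Since exactly one of each `_buy`/`_sell` pair is non-null per row, the two ffill-then-select lines can equivalently use combine_first, which takes values from the caller and fills its nulls from the argument. A minimal standalone sketch with made-up data in the same shape:

```python
import numpy as np
import pandas as pd

ndf = pd.DataFrame({'featureA_sell': [np.nan, np.nan, np.nan, 5.0, 7.0],
                    'featureA_buy': [12.0, 1.0, 3.0, np.nan, np.nan]})
# Take the buy-side value where present, falling back to the sell-side one.
ndf['featureA'] = ndf['featureA_buy'].combine_first(ndf['featureA_sell'])
```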

3 Comments

This is a great answer. Regarding featureA and B, I want to keep the values where CountNumber == 1. So for the first 3 rows that would be featureA/B_buy, and for the last 2 rows featureA/B_sell.
Can you give some details on what you mean by "just null out the entries at the merge stage, do the merge, and then forward fill your appropriate columns"?
@devCharaf edited to include code re forward filling

Here is an alternative approach:

df1 = df.loc[df['CountNumber'] == 1].copy()
df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell', df1['SequenceNumber']
                .map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
                )
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy', df1['SequenceNumber']
                .map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
                )

df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)

   SequenceNumber  From  To  featureA  featureB
0               0     0   1        12        45
1               0     0   2         1         4
2               0     0   3         3        36
3               1     0   2         5        11
4               1     1   2         7        12

