
Given the following df:

   SequenceNumber | ID | CountNumber | Side | featureA | featureB
0   0             | 0  |  3          | Sell |  4       |  2              
1   0             | 1  |  1          | Buy  |  12      |  45
2   0             | 2  |  1          | Buy  |  1       |  4
3   0             | 3  |  1          | Buy  |  3       |  36
4   1             | 0  |  1          | Sell |  5       |  11
5   1             | 1  |  1          | Sell |  7       |  12
6   1             | 2  |  2          | Buy  |  5       |  35

I want to create a new df such that, for every SequenceNumber value, it takes the rows with CountNumber == 1 and creates new rows where, if Side == 'Buy', their ID goes in a column named To; otherwise their ID goes in a column named From. The empty column out of From and To then takes the ID of the row with CountNumber > 1 (there is only one such row per SequenceNumber value). The rest of the features should be preserved.

NOTE: basically, each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links buyers and sellers, where From is the seller ID and To is the buyer ID.

The output should look like this:

   SequenceNumber | From | To | featureA | featureB
0   0             | 0    |  1 |  12      |  45              
1   0             | 0    |  2 |  1       |  4
2   0             | 0    |  3 |  3       |  36
3   1             | 0    |  2 |  5       |  11
4   1             | 1    |  2 |  7       |  12

I implemented a method that does this; however, it uses for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?

Here is the original df:

import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2], 
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})
  • Can you provide the data in constructor format? (Paste df.to_dict() output.) Commented Feb 7, 2023 at 21:46
  • just added it at the end! Commented Feb 7, 2023 at 21:49
  • For each sequence number, there is always only either (1 sell, multiple buys) or (multiple sells, 1 buy), correct? Commented Feb 7, 2023 at 21:55
  • Does it make sense if I suggest splitting this problem in two parts and concatenating the resulting dataframes? Try to solve this for a dataframe where there is always 1 buyer and multiple sellers. Then do the same for a dataframe where there is always 1 seller and multiple buyers. Then concatenate those results. Combining these problems makes it quite hard, I think. Commented Feb 7, 2023 at 22:00
  • @JarroVGIT yes there is always either (1 sell multiple buys) or (multiple sells 1 buy). I also believe that if you can split it, we can easily generalize it, no? But splitting would also work Commented Feb 7, 2023 at 22:05
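The split-and-concatenate idea from the comments can be sketched as follows. This is a minimal illustration, not code from the question: the helper `link_one_to_many` and the per-group `n_sell` count are made-up names. Each half of the data (1-seller groups, 1-buyer groups) is solved with a plain merge, then the two results are concatenated:

```python
import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2],
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})

def link_one_to_many(d, single_side):
    # Merge the single row's ID onto each row of the opposite side;
    # the features come from the "many" side (the CountNumber == 1 rows).
    one = d.loc[d['Side'] == single_side, ['SequenceNumber', 'ID']]
    many = d.loc[d['Side'] != single_side]
    merged = many.merge(one, on='SequenceNumber', suffixes=('', '_one'))
    names = ({'ID_one': 'From', 'ID': 'To'} if single_side == 'Sell'
             else {'ID_one': 'To', 'ID': 'From'})
    return merged.rename(columns=names)[
        ['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]

# Split the groups by which side occurs once, solve each half, then concatenate.
n_sell = df.groupby('SequenceNumber')['Side'].transform(lambda s: s.eq('Sell').sum())
out = pd.concat([link_one_to_many(df[n_sell == 1], 'Sell'),
                 link_one_to_many(df[n_sell > 1], 'Buy')], ignore_index=True)
```

On the sample data this reproduces the expected five-row output.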

3 Answers


You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:

features = list(df.filter(like='feature'))

out = (
    # repeat the rows with CountNumber > 1
    df.loc[df.index.repeat(df['CountNumber'])]
    # rename Sell/Buy into from/to and de-duplicate the rows per group
    .assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
            n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount())
    # mask the features where CountNumber > 1
    .assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
    .drop(columns='CountNumber')
    # reshape with a pivot
    .pivot(index=['SequenceNumber', 'n'], columns='Side')
)

out = (
    pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()],
              axis=1)
    .reset_index('SequenceNumber')
)

Output:

   SequenceNumber  from  to  featureA  featureB
n                                              
0               0     0   1      12.0      45.0
1               0     0   2       1.0       4.0
2               0     0   3       3.0      36.0
0               1     0   2       5.0      11.0
1               1     1   2       7.0      12.0

Alternative using a merge, as suggested by ifly6:

features = list(df.filter(like='feature'))

df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))

df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))

out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
          .set_index(['SequenceNumber', 'from', 'to'])
          .filter(like='feature')
          .pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
          .reset_index()
      )

Output:

   SequenceNumber  from  to  featureA  featureB
0               0     0   1      12.0      45.0
1               0     0   2       1.0       4.0
2               0     0   3       3.0      36.0
3               1     0   2       5.0      11.0
4               1     1   2       7.0      12.0
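Note that the masking step introduces NaNs, which promote the feature columns to float (hence 12.0 instead of 12 in the outputs above). Once the groupby has collapsed the NaNs away, a plain cast restores the integers; a minimal standalone sketch:

```python
import pandas as pd

# Stand-in for the merged result: floats left over from the NaN masking.
out = pd.DataFrame({'featureA': [12.0, 1.0, 3.0], 'featureB': [45.0, 4.0, 36.0]})
features = ['featureA', 'featureB']
# Cast back to integers; use the nullable 'Int64' dtype instead if NaNs may remain.
out[features] = out[features].astype(int)
```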

1 Comment

Please check the update if you're only interested in the features from the rows where CountNumber == 1.

Initial response. To get halfway to the answer: split the data into sellers and buyers, then merge it against itself on the sequence number:

ndf = df.query('Side == "Sell"').merge(
    df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})

I then drop the side variable.

ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])

This creates a very wide table:

   SequenceNumber  From  CountNumber_sell  featureA_sell  featureB_sell  To  CountNumber_buy  featureA_buy  featureB_buy
0               0     0                 3              4              2   1                1            12            45
1               0     0                 3              4              2   2                1             1             4
2               0     0                 3              4              2   3                1             3            36
3               1     0                 1              5             11   2                2             5            35
4               1     1                 1              7             12   2                2             5            35

This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.

Is it the side with the lower CountNumber? Is it the side where CountNumber == 1? If the latter, then just null out the entries before the merge, do the merge, and then forward fill the appropriate columns to recover the proper values.


Re nulling. If you null the portions of featureA and featureB where CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.

import numpy as np

s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan

ndf = s.merge(
    b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
    .ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
    .ffill(axis=1).iloc[:, -1]

ndf = ndf.drop(
    columns=[i for i in ndf.columns if i.startswith('Side') 
             or i.endswith('_sell') or i.endswith('_buy')])

The final version of ndf then is:

   SequenceNumber  From  To  featureA  featureB
0               0     0   1      12.0      45.0
1               0     0   2       1.0       4.0
2               0     0   3       3.0      36.0
3               1     0   2       5.0      11.0
4               1     1   2       7.0      12.0
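Since exactly one of each `_buy`/`_sell` pair is non-null per row, the two ffill-then-select lines can equivalently use combine_first, which takes values from the caller and fills its nulls from the argument. A minimal standalone sketch with made-up data in the same shape:

```python
import numpy as np
import pandas as pd

ndf = pd.DataFrame({'featureA_sell': [np.nan, np.nan, np.nan, 5.0, 7.0],
                    'featureA_buy': [12.0, 1.0, 3.0, np.nan, np.nan]})
# Take the buy-side value where present, falling back to the sell-side one.
ndf['featureA'] = ndf['featureA_buy'].combine_first(ndf['featureA_sell'])
```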

3 Comments

This is a great answer. Regarding featureA and B, I want to keep the values where CountNumber == 1. So for the first 3 rows that would be featureA/B_buy, and for the last 2 rows featureA/B_sell.
Can you give some details on what you mean by "just null out the entries at the merge stage, do the merge, and then forward fill your appropriate columns"?
@devCharaf edited to include code re forward filling

Here is an alternative approach:

df1 = df.loc[df['CountNumber'] == 1].copy()
df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell', df1['SequenceNumber']
                .map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
                )
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy', df1['SequenceNumber']
                .map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
                )

df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)

   SequenceNumber  From  To  featureA  featureB
0               0     0   1        12        45
1               0     0   2         1         4
2               0     0   3         3        36
3               1     0   2         5        11
4               1     1   2         7        12

