This is a follow up to a previous question: Split pandas dataframe rows into multiple rows
Assuming the following dataframe: from itertools import combinations
df = pd.DataFrame(
{
"IDs": [
["A", "B"],
["A", "B", "C"],
["A", "B", "C", "D"],
],
"pos_x": [[1, 2], [1.3, 2.8, 3], [10, 20, 100, 1000]],
"pos_y": [[3, 4], [1, 5, 3], [2, 0, 0, 4]],
},
index=[
pd.to_datetime("2022-01-01 12:00:00"),
pd.to_datetime("2022-01-01 12:00:01"),
pd.to_datetime("2022-01-01 12:00:02"),
],
)
IDs pos_x pos_y
2022-01-01 12:00:00 [A, B] [1, 2] [3, 4]
2022-01-01 12:00:01 [A, B, C] [1.3, 2.8, 3] [1, 5, 3]
2022-01-01 12:00:02 [A, B, C, D] [10, 20, 100, 1000] [2, 0, 0, 4]
So now I want to obtain the following DataFrame:
from itertools import combinations
desired_df = pd.DataFrame()
for col in df.columns:
df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))
for col in df.columns:
desired_df[[col+'_1',col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)
IDs_1 IDs_2 pos_x_1 pos_x_2 pos_y_1 pos_y_2
2022-01-01 12:00:00 A B 1.0 2.0 3 4
2022-01-01 12:00:01 A B 1.3 2.8 1 5
2022-01-01 12:00:01 A C 1.3 3.0 1 3
2022-01-01 12:00:01 B C 2.8 3.0 5 3
2022-01-01 12:00:02 A B 10.0 20.0 2 0
2022-01-01 12:00:02 A C 10.0 100.0 2 0
2022-01-01 12:00:02 A D 10.0 1000.0 2 4
2022-01-01 12:00:02 B C 20.0 100.0 0 0
2022-01-01 12:00:02 B D 20.0 1000.0 0 4
2022-01-01 12:00:02 C D 100.0 1000.0 0 4
Since I would like to use it on fairly big amount of data, I'd like to know if there is a faster way to obtain the same results. Note: The 'IDs' column contains at least 2 elements and in practice I have more than just 3 columns.
Here is a benchmark I have used on a google colab:
from itertools import combinations
import pandas as pd
import random
import time
random.seed(42)
n_rows = 2000000
list_ids = ["A", "B", "C", "D", "E", "F"]
n_ids = [random.randint(2, len(list_ids)) for k in range(n_rows)]
ids = [random.sample(list_ids, n) for n in n_ids]
pos_x =[[random.randint(0, 10)]*n for n in n_ids]
pos_y =[[random.randint(0, 10)]*n for n in n_ids]
df = pd.DataFrame(
{
"IDs": ids,
"pos_x": pos_x,
"pos_y": pos_y,
},
)
start = time.time()
for col in df.columns:
df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))
print('Intermediary time:', time.time()-start)
desired_df = pd.DataFrame()
for col in df.columns:
desired_df[[col+'_1', col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)
print('Final time:', time.time()-start)
Intermediary time: 36.28665089607239
Final time: 55.56900978088379