This is a follow-up to a previous question: Split pandas dataframe rows into multiple rows

Assume the following DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        "IDs": [
            ["A", "B"],
            ["A", "B", "C"],
            ["A", "B", "C", "D"],
        ],
        "pos_x": [[1, 2], [1.3, 2.8, 3], [10, 20, 100, 1000]],
        "pos_y": [[3, 4], [1, 5, 3], [2, 0, 0, 4]],
    },
    index=[
        pd.to_datetime("2022-01-01 12:00:00"),
        pd.to_datetime("2022-01-01 12:00:01"),
        pd.to_datetime("2022-01-01 12:00:02"),
    ],
)
                    IDs             pos_x               pos_y
2022-01-01 12:00:00 [A, B]          [1, 2]              [3, 4]
2022-01-01 12:00:01 [A, B, C]       [1.3, 2.8, 3]       [1, 5, 3]
2022-01-01 12:00:02 [A, B, C, D]    [10, 20, 100, 1000] [2, 0, 0, 4]

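I want to turn each row into one row per pair of IDs, with the matching pairs of positions. This is how I currently do it, followed by the resulting DataFrame: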

from itertools import combinations
desired_df = pd.DataFrame()
for col in df.columns:
    df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))

for col in df.columns:
    desired_df[[col+'_1',col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)
                   IDs_1 IDs_2 pos_x_1 pos_x_2  pos_y_1 pos_y_2
2022-01-01 12:00:00 A    B     1.0     2.0      3       4
2022-01-01 12:00:01 A    B     1.3     2.8      1       5
2022-01-01 12:00:01 A    C     1.3     3.0      1       3
2022-01-01 12:00:01 B    C     2.8     3.0      5       3
2022-01-01 12:00:02 A    B     10.0    20.0     2       0
2022-01-01 12:00:02 A    C     10.0    100.0    2       0
2022-01-01 12:00:02 A    D     10.0    1000.0   2       4
2022-01-01 12:00:02 B    C     20.0    100.0    0       0
2022-01-01 12:00:02 B    D     20.0    1000.0   0       4
2022-01-01 12:00:02 C    D     100.0   1000.0   0       4
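
To make the pairing step explicit, here is a minimal sketch for a single row (the 12:00:01 row above; the variable names are only for illustration). `combinations` emits pairs in positional order, so applying it to each column separately keeps the pairs aligned across columns, which is what lets the multi-column `explode` line them up afterwards.

```python
from itertools import combinations

# Values taken from the 12:00:01 row of the example DataFrame
ids = ["A", "B", "C"]
pos_x = [1.3, 2.8, 3]

# combinations() depends only on positions, not on the values themselves,
# so the k-th pair of ids corresponds to the k-th pair of pos_x
id_pairs = list(combinations(ids, 2))
x_pairs = list(combinations(pos_x, 2))

print(id_pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(x_pairs)   # [(1.3, 2.8), (1.3, 3), (2.8, 3)]
```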

Since I would like to use this on a fairly large amount of data, I'd like to know whether there is a faster way to obtain the same result. Note: each list in the 'IDs' column contains at least 2 elements, and in practice I have more than just 3 columns.
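
As a rough sizing check (a sketch, not part of the code I am timing): a row whose lists have n elements expands to n*(n-1)/2 output rows, so the output grows quadratically with the list length. For the three example rows this gives the 10 rows shown above. `math.comb` requires Python 3.8 or later; `list_lengths` is just an illustrative name.

```python
from math import comb

# List lengths of the three rows in the example DataFrame
list_lengths = [2, 3, 4]

# Number of unordered pairs per input row: C(n, 2) = n * (n - 1) / 2
rows_per_input = [comb(n, 2) for n in list_lengths]
total_rows = sum(rows_per_input)

print(rows_per_input)  # [1, 3, 6]
print(total_rows)      # 10
```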

Here is a benchmark I ran on Google Colab:

from itertools import combinations
import pandas as pd
import random
import time

random.seed(42)
n_rows = 2000000
list_ids = ["A", "B", "C", "D", "E", "F"]
n_ids = [random.randint(2, len(list_ids)) for k in range(n_rows)]
ids = [random.sample(list_ids, n) for n in n_ids]
pos_x = [[random.randint(0, 10)] * n for n in n_ids]
pos_y = [[random.randint(0, 10)] * n for n in n_ids]
df = pd.DataFrame(
    {
        "IDs": ids,
        "pos_x": pos_x,
        "pos_y": pos_y,
    },
)

start = time.time()
for col in df.columns:
    df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))
print('Intermediary time:', time.time()-start)

desired_df = pd.DataFrame()
for col in df.columns:
    desired_df[[col+'_1', col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)
print('Final time:', time.time()-start)
Intermediary time: 36.28665089607239
Final time: 55.56900978088379
