Fastest way to split pandas dataframe rows into multiple rows

This is a follow-up to a previous question: Split pandas dataframe rows into multiple rows

Assume the following DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        "IDs": [
            ["A", "B"],
            ["A", "B", "C"],
            ["A", "B", "C", "D"],
        ],
        "pos_x": [[1, 2], [1.3, 2.8, 3], [10, 20, 100, 1000]],
        "pos_y": [[3, 4], [1, 5, 3], [2, 0, 0, 4]],
    },
    index=[
        pd.to_datetime("2022-01-01 12:00:00"),
        pd.to_datetime("2022-01-01 12:00:01"),
        pd.to_datetime("2022-01-01 12:00:02"),
    ],
)

                    IDs             pos_x               pos_y
2022-01-01 12:00:00 [A, B]          [1, 2]              [3, 4]
2022-01-01 12:00:01 [A, B, C]       [1.3, 2.8, 3]       [1, 5, 3]
2022-01-01 12:00:02 [A, B, C, D]    [10, 20, 100, 1000] [2, 0, 0, 4]

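From this, I want a DataFrame with one row per pair of elements (every 2-combination within each original row). This is how I currently build it, followed by the resulting output: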

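from itertools import combinations

# Replace every list with the list of its 2-element combinations.
for col in df.columns:
    df[col] = [list(combinations(l, 2)) for l in df[col]]
# Explode all columns together (multi-column explode needs pandas >= 1.3).
df = df.explode(list(df.columns))

# Split each column of pairs into two columns, <col>_1 and <col>_2.
desired_df = pd.DataFrame()
for col in df.columns:
    desired_df[[col+'_1', col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)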

                   IDs_1 IDs_2 pos_x_1 pos_x_2  pos_y_1 pos_y_2
2022-01-01 12:00:00 A    B     1.0     2.0      3       4
2022-01-01 12:00:01 A    B     1.3     2.8      1       5
2022-01-01 12:00:01 A    C     1.3     3.0      1       3
2022-01-01 12:00:01 B    C     2.8     3.0      5       3
2022-01-01 12:00:02 A    B     10.0    20.0     2       0
2022-01-01 12:00:02 A    C     10.0    100.0    2       0
2022-01-01 12:00:02 A    D     10.0    1000.0   2       4
2022-01-01 12:00:02 B    C     20.0    100.0    0       0
2022-01-01 12:00:02 B    D     20.0    1000.0   0       4
2022-01-01 12:00:02 C    D     100.0   1000.0   0       4
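
As a quick sanity check on the output size, a row whose lists hold n elements expands into n * (n - 1) / 2 rows. The sketch below only counts rows for the three example rows above; the variable names are illustrative and not part of the original code.

```python
from math import comb

# Lengths of the lists in the three example rows above.
list_lengths = [2, 3, 4]

# A row with n elements yields C(n, 2) = n * (n - 1) / 2 pairs.
rows_per_input = [comb(n, 2) for n in list_lengths]
total_rows = sum(rows_per_input)

print(rows_per_input, total_rows)  # [1, 3, 6] 10
```

The total of 10 matches the number of rows in the desired output shown above.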

Since I would like to use this on a fairly large amount of data, I'd like to know whether there is a faster way to obtain the same result. Note: each list in the 'IDs' column contains at least 2 elements, and in practice I have more than just 3 columns.

Here is the benchmark I ran on Google Colab:

from itertools import combinations
import pandas as pd
import random
import time

random.seed(42)
n_rows = 2000000
list_ids = ["A", "B", "C", "D", "E", "F"]
n_ids = [random.randint(2, len(list_ids)) for k in range(n_rows)]
ids = [random.sample(list_ids, n) for n in n_ids]
pos_x = [[random.randint(0, 10)] * n for n in n_ids]
pos_y = [[random.randint(0, 10)] * n for n in n_ids]
df = pd.DataFrame(
    {
        "IDs": ids,
        "pos_x": pos_x,
        "pos_y": pos_y,
    },
)

start = time.time()
for col in df.columns:
    df[col] = [[pair for pair in combinations(l, 2)] for l in df[col]]
df = df.explode(list(df.columns))
print('Intermediary time:', time.time()-start)

desired_df = pd.DataFrame()
for col in df.columns:
    desired_df[[col+'_1', col+'_2']] = pd.DataFrame(df[col].tolist(), index=df.index)
print('Final time:', time.time()-start)

Intermediary time: 36.28665089607239
Final time: 55.56900978088379
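
One direction that might be worth trying is to avoid building Python tuples per row and instead process all rows of the same list length at once with NumPy index arrays. The following is only a sketch under assumptions: the function name `pairwise_rows` and the helper column `_pos` are made up for illustration, every list column in a given row is assumed to have the same length (at least 2), and it has not been benchmarked against the code above.

```python
import numpy as np
import pandas as pd


def pairwise_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Expand each row into one row per 2-combination of its list elements.

    Hypothetical helper: assumes all list columns in a row share one length.
    """
    lengths = df[df.columns[0]].str.len().to_numpy()
    parts = []
    for n in np.unique(lengths):
        pos = np.flatnonzero(lengths == n)   # positions of rows with n elements
        i, j = np.triu_indices(n, k=1)       # same order as combinations(range(n), 2)
        data = {}
        for col in df.columns:
            arr = np.array(df[col].iloc[pos].tolist())  # shape (len(pos), n)
            data[f"{col}_1"] = arr[:, i].ravel()
            data[f"{col}_2"] = arr[:, j].ravel()
        part = pd.DataFrame(data, index=df.index[pos].repeat(len(i)))
        part["_pos"] = np.repeat(pos, len(i))  # remember the original row order
        parts.append(part)
    out = pd.concat(parts)
    # A stable sort keeps the pair order within each original row.
    return out.sort_values("_pos", kind="stable").drop(columns="_pos")


# Usage on the small example from the question.
example = pd.DataFrame(
    {
        "IDs": [["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]],
        "pos_x": [[1, 2], [1.3, 2.8, 3], [10, 20, 100, 1000]],
        "pos_y": [[3, 4], [1, 5, 3], [2, 0, 0, 4]],
    },
    index=pd.to_datetime(
        ["2022-01-01 12:00:00", "2022-01-01 12:00:01", "2022-01-01 12:00:02"]
    ),
)
out = pairwise_rows(example)
print(out)
```

The design choice here is to loop over the handful of distinct list lengths (2 to 6 in the benchmark) rather than over rows, so the per-row work happens inside NumPy indexing. Whether this actually beats the `explode`-based version would need to be measured on the real data.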

edited Mar 21, 2022 at 13:24

asked Mar 21, 2022 at 13:08

bfgt

317 reputation, 1 silver badge, 8 bronze badges
