
I have a relatively small dataframe like:

    index  ColA  ColB  ColC
    0      A      B     C
    1      D      E     F

and so on.
I am trying to get a list of tuples back that looks like:

[((A,B),C), ((D,E),F)...]

Can anyone offer any assistance?
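For reference, a minimal reproducible version of the example frame (a hypothetical construction; the real data presumably comes from elsewhere):

```python
import pandas as pd

# Small example frame matching the question's sample data.
df = pd.DataFrame({"ColA": ["A", "D"], "ColB": ["B", "E"], "ColC": ["C", "F"]})
print(df)
```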

  • Hi all, benchmarking at an even larger 1000000x size has been posted; you can take a look. The iteritems approach catches up with the zip() approach at larger sizes. Commented Aug 20, 2021 at 22:15

4 Answers


One solution is to use zip() + list-comprehension:

print([((a, b), c) for a, b, c in zip(df.ColA, df.ColB, df.ColC)])

Prints:

[(('A', 'B'), 'C'), (('D', 'E'), 'F')]
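If the number of key columns varies, the same idea generalizes with starred unpacking: all but the last column become the key tuple. This is a sketch, assuming the column list `cols` is supplied by the caller:

```python
import pandas as pd

df = pd.DataFrame({"ColA": ["A", "D"], "ColB": ["B", "E"], "ColC": ["C", "F"]})

# All columns except the last form the key tuple; the last is the value.
cols = ["ColA", "ColB", "ColC"]
pairs = [(tuple(keys), value) for *keys, value in zip(*(df[c] for c in cols))]
print(pairs)  # [(('A', 'B'), 'C'), (('D', 'E'), 'F')]
```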


Seems that this solution runs faster at both the original size and 10000x larger size. +1 (original size: 22.1 µs vs 299 µs; 10000x size: 8.89 ms vs 20.8 ms)

Use itertuples and list comprehension:

[((t.ColA, t.ColB), t.ColC) for t in df.itertuples()]
# [(('A', 'B'), 'C'), (('D', 'E'), 'F')]
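As a variant, passing `index=False, name=None` makes `itertuples` yield plain tuples instead of namedtuples, which skips the Index column and allows positional unpacking (a sketch of the same approach):

```python
import pandas as pd

df = pd.DataFrame({"ColA": ["A", "D"], "ColB": ["B", "E"], "ColC": ["C", "F"]})

# index=False drops the Index field; name=None yields plain tuples,
# avoiding namedtuple construction overhead.
pairs = [((a, b), c) for a, b, c in df.itertuples(index=False, name=None)]
print(pairs)  # [(('A', 'B'), 'C'), (('D', 'E'), 'F')]
```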



Benchmarking results:

1. Original size:

%%timeit
[((t.ColA, t.ColB), t.ColC) for t in df.itertuples()]

299 µs ± 8.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
[((a, b), c) for a, b, c in zip(df.ColA, df.ColB, df.ColC)]

22.1 µs ± 612 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
[((t.ColA, t.ColB), t.ColC) for _, t in df.iterrows()]

145 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
[*df.set_index(['ColA','ColB'])['ColC'].iteritems()]

1.29 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2. 10000x larger size:

df2 = pd.concat([df] * 10000, ignore_index=True)

%%timeit
[((t.ColA, t.ColB), t.ColC) for t in df2.itertuples()]

19.5 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
[((a, b), c) for a, b, c in zip(df2.ColA, df2.ColB, df2.ColC)]

8.39 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
[((t.ColA, t.ColB), t.ColC) for _, t in df2.iterrows()]

1.32 s ± 26.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
[*df2.set_index(['ColA','ColB'])['ColC'].iteritems()]

10.4 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

3. 1000000x larger size:

df3 = pd.concat([df] * 1000000, ignore_index=True)

%%timeit
[((t.ColA, t.ColB), t.ColC) for t in df3.itertuples()]

2.05 s ± 51.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
[((a, b), c) for a, b, c in zip(df3.ColA, df3.ColB, df3.ColC)]

961 ms ± 9.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
[((t.ColA, t.ColB), t.ColC) for _, t in df3.iterrows()]

2min 4s ± 1.63 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
[*df3.set_index(['ColA','ColB'])['ColC'].iteritems()]

1.13 s ± 55.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


@SeaBean It seems .iteritems() is comparable to zip() for large dataframes, while .set_index() consumes a huge portion of the time for small dfs.
@AndrejKesely Yes, it seems so. That's interesting. I'm going to test 1000000x, i.e. 100x more than the large dataset. Stay tuned.
@AndrejKesely The zip() approach is still the fastest, but the gap with .iteritems() is narrowing.

Let us do iteritems after set_index:

[*df.set_index(['ColA','ColB'])['ColC'].iteritems()]
Out[604]: [(('A', 'B'), 'C'), (('D', 'E'), 'F')]
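Note that `Series.iteritems` was deprecated in pandas 1.5 and removed in pandas 2.0; `Series.items` is the drop-in replacement, so on current pandas the same approach would be (a sketch, same result):

```python
import pandas as pd

df = pd.DataFrame({"ColA": ["A", "D"], "ColB": ["B", "E"], "ColC": ["C", "F"]})

# set_index builds a MultiIndex from the two key columns; Series.items
# (the replacement for the removed Series.iteritems) then yields
# (index_tuple, value) pairs.
pairs = [*df.set_index(["ColA", "ColB"])["ColC"].items()]
print(pairs)  # [(('A', 'B'), 'C'), (('D', 'E'), 'F')]
```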
