1

For example, I have a dataframe where two of the columns, "Zeroes" and "Ones", contain only zeroes and only ones, respectively. If I combine them into one column, I get all the zeroes first, then all the ones.

I want to combine them so that the elements alternate, one from each column in turn, rather than all elements from the first column followed by all elements from the second. So I don't want the result to be [0, 0, 0, 1, 1, 1]; I need it to be [0, 1, 0, 1, 0, 1].

I process 100K+ rows of data. What is the fastest or optimal way to achieve this? Thanks in advance!

2
  • Can you provide some code showing what you have already tried? Commented Nov 3, 2021 at 11:19
  • Well, it isn't hard to do it iteratively: loop through the columns, append the element from the first column, then append the element from the second column. But I guess there is a faster, more "pandas" way to do it. Commented Nov 3, 2021 at 11:25
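For reference, the iterative approach described in the comment above might look roughly like this. It is only a sketch: the column names come from the question, and the three-row frame is made-up sample data.

```python
import pandas as pd

# Made-up sample data using the column names from the question
df = pd.DataFrame({"Zeroes": [0, 0, 0], "Ones": [1, 1, 1]})

result = []
# Walk both columns in lockstep and append one element from each in turn
for zero, one in zip(df["Zeroes"], df["Ones"]):
    result.append(zero)
    result.append(one)

print(result)
```

This gives the alternating order, but it runs a Python-level loop over every row, which is what the answers below try to avoid.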

4 Answers

4

Try:

import pandas as pd

df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1], "some_other": list("abc")})
# Select the two columns, convert them to an (n, 2) array, and flatten it row by row
res = df[["zeroes", "ones"]].to_numpy().ravel(order="C")
print(res)

Output

[0 1 0 1 0 1]
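To see why this interleaves the values: to_numpy() returns an array with one row per dataframe row and one column per selected column, and order="C" reads it row by row. A small sketch contrasting it with order="F" (column by column), which reproduces the unwanted result from the question:

```python
import pandas as pd

df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1]})
arr = df[["zeroes", "ones"]].to_numpy()  # 2-D array, one row per dataframe row

row_major = arr.ravel(order="C")  # row by row: alternates between the columns
col_major = arr.ravel(order="F")  # column by column: all zeroes, then all ones

print(arr.shape)
print(row_major)
print(col_major)
```

Note that "C" is the default order for ravel, so passing it explicitly is mainly for readability.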

Micro-Benchmarks

import pandas as pd
from itertools import chain
df = pd.DataFrame({"zeroes": [0] * 10_000, "ones": [1] * 10_000})
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
672 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]
2.57 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(chain.from_iterable(zip(df["zeroes"], df["ones"]))) 
2.11 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
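The %timeit lines above only work in IPython or Jupyter. If you want to rerun the comparison from a plain Python script, something like the following sketch should do. The function names are made up for this example, and the numbers you get will depend on your machine and library versions.

```python
import timeit
from itertools import chain

import pandas as pd

df = pd.DataFrame({"zeroes": [0] * 10_000, "ones": [1] * 10_000})


def with_ravel():
    return df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()


def with_comprehension():
    return [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]


def with_chain():
    return list(chain.from_iterable(zip(df["zeroes"], df["ones"])))


# Sanity check: all three approaches should produce the same interleaved list
expected = with_ravel()
assert with_comprehension() == expected
assert with_chain() == expected

runs = 20
for func in (with_ravel, with_comprehension, with_chain):
    seconds = timeit.timeit(func, number=runs) / runs
    print(f"{func.__name__}: {seconds * 1e3:.3f} ms per call")
```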

1 Comment

good solution ;)
1

As an alternative, you can use numpy.ndarray.flatten(), as below:

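import pandas as pd
# Same example frame as in the answer above, repeated so the snippet runs on its own
df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1]})
df[["zeroes", "ones"]].to_numpy().flatten()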

Benchmark (running on Colab):

df = pd.DataFrame({"zeroes": [0] * 10_000_000, "ones": [1] * 10_000_000})

%timeit df[["zeroes", "ones"]].to_numpy().flatten().tolist()
1 loop, best of 5: 320 ms per loop

%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
1 loop, best of 5: 322 ms per loop
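On the difference between the two: flatten() always returns a copy, while ravel() returns a view when the memory layout allows it and copies otherwise. The sketch below shows this on a plain NumPy array (not the dataframe from the benchmark). Whether to_numpy() returns a layout that lets ravel skip the copy depends on how pandas stores the columns, so take this as an illustration rather than an explanation of the timings above.

```python
import numpy as np

# A small C-contiguous array standing in for the two columns
arr = np.array([[0, 1], [0, 1], [0, 1]])

flat_copy = arr.flatten()  # always a new array
flat_view = arr.ravel()    # a view here, because arr is C-contiguous

print(np.shares_memory(arr, flat_copy))
print(np.shares_memory(arr, flat_view))

# Writing through the view changes the original; writing to the copy does not
flat_copy[0] = 99
flat_view[1] = 7
print(arr)
```

If you plan to modify the result in place, flatten() is the safer choice since it never aliases the source array.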


0

I don't know if this is the optimal solution, but it should solve your case.

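import pandas as pd

df = pd.DataFrame([[0 for x in range(10)], [1 for x in range(10)]]).T
# Pair up the values row by row, then flatten the pairs into one list
pairs = [[x, y] for x, y in zip(df[0], df[1])]
result = [x for pair in pairs for x in pair]
result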


0

This may help you: Alternate elements of different columns using Pandas

pd.concat(
    [df1, df2], axis=1
).stack().reset_index(1, drop=True).to_frame('C').rename(index='CC{}'.format)
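The snippet above uses df1 and df2 from the linked question. Adapted to a single dataframe with the two columns from this question, the same stack() idea might look like the sketch below. The column names and sample data are assumptions carried over from the other answers here.

```python
import pandas as pd

# Sample frame assumed from the other answers; "some_other" is left out of the result
df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1], "some_other": list("abc")})

# stack() moves the column labels into the index, emitting values row by row,
# so the two columns end up alternating in a single Series
stacked = df[["zeroes", "ones"]].stack()

# Replace the (row, column) MultiIndex with a plain 0..n-1 index
interleaved = stacked.reset_index(drop=True)
print(interleaved.tolist())
```

This stays within pandas, but it builds an intermediate MultiIndex, so it is likely slower than the NumPy-based answers on large frames.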
