2

Imagine I have a pandas DataFrame:

Column1 Column2
A       D
B       E
C       F

How can I get the resulting DataFrame in this form?

Column

 A
 D
 B
 E
 C
 F
5
  • Are there empty rows in your starting dataframe? Commented Nov 17, 2020 at 13:55
  • no, it is all filled. Commented Nov 17, 2020 at 13:56
  • df.stack().reset_index(drop=True) Commented Nov 17, 2020 at 13:58
  • Have you tried df.values.flatten() and then reshaping it? It returns a numpy array, but you can turn that back into a DataFrame if you want. Relevant answer here: stackoverflow.com/questions/25440008/… Commented Nov 17, 2020 at 13:59
  • Perfect @MichaelSzczesny, it is working Commented Nov 17, 2020 at 14:07

2 Answers

5

EDIT: see the benchmark below for a slightly faster solution.

You can do this:

# Import pandas library 
import pandas as pd

# The data
data = [["A", "D"], ["B", "E"], ["C", "F"]]

# Create DataFrame
df = pd.DataFrame(data, columns=["Column1", "Column2"])

# Flatten and convert to DataFrame
new_df = pd.DataFrame(df.to_numpy().flatten())

print(new_df)

Output:

   0
0  A
1  D
2  B
3  E
4  C
5  F

new_df will be a pandas.DataFrame.

Note the use of df.to_numpy(), which the pandas documentation recommends over df.values.
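As a minimal sketch, assuming the goal is a single column literally named Column as in the question, the flattened array can be wrapped with an explicit column name. The helper name flatten_rows is made up for illustration and is not part of pandas.

```python
import pandas as pd


def flatten_rows(df: pd.DataFrame, name: str = "Column") -> pd.DataFrame:
    """Return all cells of df, read row by row, as a one-column DataFrame.

    Hypothetical helper for illustration only.
    """
    # to_numpy() gives a 2-D array; flatten() defaults to C (row-major) order
    values = df.to_numpy().flatten()
    return pd.DataFrame({name: values})


df = pd.DataFrame(
    [["A", "D"], ["B", "E"], ["C", "F"]], columns=["Column1", "Column2"]
)
new_df = flatten_rows(df)
print(new_df["Column"].tolist())  # ['A', 'D', 'B', 'E', 'C', 'F']
```

Passing a dict to pd.DataFrame names the column directly, which avoids the default column label 0.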

As suggested by @Michael Szczesny, you can also do:

new_series = df.stack().reset_index(drop=True)

This returns a pandas.Series.
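If a DataFrame is wanted rather than a Series, one possible follow-up (a sketch, with the column name Column taken from the question) is to call to_frame on the result. Be aware that, depending on the pandas version, stack() may drop missing values, so this assumes the frame is fully filled, as the asker confirmed.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "D"], ["B", "E"], ["C", "F"]], columns=["Column1", "Column2"]
)

# stack() moves the column labels into an inner index level, row by row
stacked = df.stack()

# Drop the (row, column) MultiIndex and name the single resulting column
new_df = stacked.reset_index(drop=True).to_frame("Column")
print(new_df["Column"].tolist())  # ['A', 'D', 'B', 'E', 'C', 'F']
```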

Added benchmark:

Based on @Mayank Porwal's answer, I have added these benchmark results. I used timeit.repeat with repeat=7 and number=10000. Sorted from fastest to slowest:

new_df = pd.DataFrame(df.to_numpy().ravel('A')) # 51.0 µs
new_df = pd.DataFrame(df.to_numpy().ravel('K')) # 51.0 µs
new_df = pd.DataFrame(df.to_numpy().ravel('F')) # 51.1 µs
new_df = pd.DataFrame(df.to_numpy().flatten())  # 52.6 µs
new_df = pd.DataFrame(df.to_numpy().ravel('C')) # 53.4 µs
new_series = df.stack().reset_index(drop=True)  # 322.0 µs

Using numpy.ravel is fastest mainly because it returns a view whenever possible, whereas numpy.ndarray.flatten() always returns a copy. For details about numpy.ravel see: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ravel.html

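In short, "A" reads the elements in Fortran-like index order if the array is Fortran-contiguous in memory (and in C-like order otherwise), and "K" reads the elements in the order they occur in memory. Note that the order argument also changes the order of the result: as the output in the other answer shows, ravel('F') returns A, B, C, D, E, F (column by column) rather than the row-by-row A, D, B, E, C, F asked for in the question.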
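The difference between the orders, and between a view and a copy, can be checked on a small array. This is only a sketch: arr is a hand-built stand-in in C (row-major) layout, not the actual array returned by df.to_numpy(), whose memory layout depends on pandas internals.

```python
import numpy as np

# Stand-in for df.to_numpy(): a 3x2 array stored in C (row-major) layout
arr = np.array([["A", "D"], ["B", "E"], ["C", "F"]], dtype=object)

row_major = arr.ravel()     # default order='C': row by row
col_major = arr.ravel("F")  # Fortran order: column by column
print(row_major.tolist())   # ['A', 'D', 'B', 'E', 'C', 'F']
print(col_major.tolist())   # ['A', 'B', 'C', 'D', 'E', 'F']

# ravel returns a view when the requested order matches the memory layout;
# flatten always returns a copy
print(np.shares_memory(arr, arr.ravel()))    # True
print(np.shares_memory(arr, arr.flatten()))  # False

# With a Fortran-contiguous array, 'A' and 'K' both follow the memory layout
f_arr = np.asfortranarray(arr)
print(f_arr.ravel("K").tolist())  # ['A', 'B', 'C', 'D', 'E', 'F']
```

So the fastest variant is only a drop-in replacement when the order it produces is the order you want.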


3

Use df.to_numpy with numpy.ravel:

In [2349]: x = pd.DataFrame(df.to_numpy().ravel('F'))

In [2350]: x
Out[2350]: 
   0
0  A
1  B
2  C
3  D
4  E
5  F

Note: This will be quite performant.

Timing comparisons:

In [2369]: dd = pd.concat([df] * 1000)

# Rivers' answers:

In [2369]: %timeit pd.DataFrame(dd.to_numpy().flatten())
95.6 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [2371]: %timeit dd.stack().reset_index(drop=True)
919 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# My answer:

In [2372]: %timeit pd.DataFrame(dd.to_numpy().ravel('F'))
62 µs ± 577 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
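Outside IPython, a comparable measurement could be scripted with the timeit module. The following is a rough sketch, not the exact procedure used above: best_time and the labels are made up for illustration, the repeat and number values are reduced so the script finishes quickly, and it reports the best run rather than the mean ± std. dev. that %timeit prints. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    [["A", "D"], ["B", "E"], ["C", "F"]], columns=["Column1", "Column2"]
)
dd = pd.concat([df] * 1000)  # 3000 rows, as in the timings above

# Candidate approaches, keyed by an illustrative label
candidates = {
    "flatten": lambda: pd.DataFrame(dd.to_numpy().flatten()),
    "ravel('F')": lambda: pd.DataFrame(dd.to_numpy().ravel("F")),
    "stack": lambda: dd.stack().reset_index(drop=True),
}


def best_time(func, repeat=3, number=100):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


results = {label: best_time(func) for label, func in candidates.items()}

# Print from fastest to slowest
for label, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{label}: {seconds * 1e6:.1f} µs")
```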

2 Comments

@Augustas please check my answer. It has the best performance.
I didn't think speed could matter for this task. Good idea, thanks; I'll edit my answer.
