
I'm trying to create a dataframe populated by repeating rows based on an existing steady sequence.
For example, a sequence increasing in 3s from 6 to 18 can be generated with np.arange(6, 18, 3), giving array([ 6,  9, 12, 15]).

How would I go about generating a dataframe in this way?

How could I get the below if I wanted 7 repeated rows?

     0    1     2     3
0  6.0  9.0  12.0  15.0
1  6.0  9.0  12.0  15.0
2  6.0  9.0  12.0  15.0
3  6.0  9.0  12.0  15.0
4  6.0  9.0  12.0  15.0
5  6.0  9.0  12.0  15.0
6  6.0  9.0  12.0  15.0

The reason for creating this matrix is that I then wish to add a pd.Series row-wise to this matrix.
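For context, a minimal sketch of that follow-up step (the offsets Series here is a made-up example, not from the question):

```python
import numpy as np
import pandas as pd

# Build the repeated-row frame (7 rows of the sequence 6, 9, 12, 15).
df = pd.DataFrame([np.arange(6, 18, 3)] * 7)

# Hypothetical per-column offsets to add row-wise.
offsets = pd.Series([1, 2, 3, 4])

# Adding a Series to a DataFrame aligns on the column labels,
# so every row gains the corresponding offset.
result = df + offsets
print(result.iloc[0].tolist())  # [7, 11, 15, 19]
```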

  • It will depend on whether the sequenced values are a native numpy or immutable type versus a mutable type like a generic object reference. Commented Sep 17, 2022 at 17:42

2 Answers

pd.DataFrame([np.arange(6, 18, 3)]*7)

Alternatively:

pd.DataFrame(np.repeat([np.arange(6, 18, 3)],7, axis=0))
    0   1   2   3
0   6   9   12  15
1   6   9   12  15
2   6   9   12  15
3   6   9   12  15
4   6   9   12  15
5   6   9   12  15
6   6   9   12  15
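A self-contained version of the two constructions above, with imports added (both produce the same frame):

```python
import numpy as np
import pandas as pd

row = np.arange(6, 18, 3)  # array([ 6,  9, 12, 15])

# Option 1: repeat the row via a Python list.
df1 = pd.DataFrame([row] * 7)

# Option 2: repeat the row with np.repeat along axis 0.
df2 = pd.DataFrame(np.repeat([row], 7, axis=0))

assert df1.equals(df2)
print(df1.shape)  # (7, 4)
```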

3 Comments

Very inefficient if the actual data is large.
@JohnZwinck what would be a more efficient solution?
I've posted a more efficient solution. np.repeat() is not terrible but it means the full data set is allocated twice, once by np.repeat() and once by Pandas. My answer shows how to do it without the extra full-size allocation.

Here is a solution using NumPy broadcasting which avoids Python loops, lists, and excessive memory allocation (as done by np.repeat):

pd.DataFrame(np.broadcast_to(np.arange(6, 18, 3), (6, 4))) 

To understand why this is more efficient than other solutions, refer to the np.broadcast_to() docs: https://numpy.org/doc/stable/reference/generated/numpy.broadcast_to.html

more than one element of a broadcasted array may refer to a single memory location.

This means that no matter how many rows you create before passing to Pandas, you're only really allocating a single row, then a 2D array which refers to the data of that row multiple times.

If you assign the above to df, df.values.base is a single row: that row is the only storage required, no matter how many rows appear in the DataFrame.
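A quick check of the memory-sharing claim on the NumPy side (the row count 6 follows the answer's example):

```python
import numpy as np
import pandas as pd

row = np.arange(6, 18, 3)
view = np.broadcast_to(row, (6, 4))

# The broadcast view allocates no new data: its row stride is 0,
# so stepping to the next row moves zero bytes through memory and
# every row reads the same underlying buffer.
print(view.strides[0])  # 0

df = pd.DataFrame(view)
print(df.shape)  # (6, 4)
```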

6 Comments

In this example, np.repeat is 43.6 µs ± 774 ns per loop and np.broadcast is 42.9 µs ± 927 ns per loop. Is that all there's to it, or am I missing something?
In this example the data is trivially small. If you scale it up you can see a big difference. I saw 10x to 100x speedups in tests with a hundred thousand rows.
Interesting solution - but with a caveat. The numpy array and the dataframe will be read-only, so assignment within the array or inplace operations won't work. Not that that's a bad thing.
@JohnZwinck pd.DataFrame(np.broadcast_to(np.arange(6, 18, 3), (6, 4)))[0][0] = 100 gives ValueError: assignment destination is read-only.
@JohnZwinck - and I should have said "some inplace operations" because it is a dark art, and what really happens seems to change over time.
