
I'm trying to create a dataframe populated by repeating rows based on an existing steady sequence.
For example, a sequence increasing in 3s from 6 to 18 can be generated with np.arange(6, 18, 3), giving array([ 6,  9, 12, 15]).

How would I go about generating a dataframe in this way?

How could I get the below if I wanted 7 repeated rows?

     0    1     2     3
0  6.0  9.0  12.0  15.0
1  6.0  9.0  12.0  15.0
2  6.0  9.0  12.0  15.0
3  6.0  9.0  12.0  15.0
4  6.0  9.0  12.0  15.0
5  6.0  9.0  12.0  15.0
6  6.0  9.0  12.0  15.0

The reason for creating this matrix is that I then wish to add a pd.Series row-wise to this matrix.
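For context, a minimal sketch of that follow-up step (the offsets Series here is a made-up example, not from the question):

```python
import numpy as np
import pandas as pd

# Build the repeated-row frame (7 rows of the sequence 6, 9, 12, 15).
df = pd.DataFrame([np.arange(6, 18, 3)] * 7)

# Hypothetical per-column offsets to add row-wise.
offsets = pd.Series([1, 2, 3, 4])

# Adding a Series to a DataFrame aligns on the column labels,
# so every row gains the corresponding offset.
result = df + offsets
print(result.iloc[0].tolist())  # [7, 11, 15, 19]
```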

  • It will depend on whether the sequenced values are a native numpy or immutable type versus a mutable type like a generic object reference. Commented Sep 17, 2022 at 17:42

2 Answers

pd.DataFrame([np.arange(6, 18, 3)]*7)

Alternatively:

pd.DataFrame(np.repeat([np.arange(6, 18, 3)],7, axis=0))
    0   1   2   3
0   6   9   12  15
1   6   9   12  15
2   6   9   12  15
3   6   9   12  15
4   6   9   12  15
5   6   9   12  15
6   6   9   12  15
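A self-contained version of the two constructions above, with imports added (both produce the same frame):

```python
import numpy as np
import pandas as pd

row = np.arange(6, 18, 3)  # array([ 6,  9, 12, 15])

# Option 1: repeat the row via a Python list.
df1 = pd.DataFrame([row] * 7)

# Option 2: repeat the row with np.repeat along axis 0.
df2 = pd.DataFrame(np.repeat([row], 7, axis=0))

assert df1.equals(df2)
print(df1.shape)  # (7, 4)
```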

3 Comments

Very inefficient if the actual data is large.
@JohnZwinck what would be a more efficient solution?
I've posted a more efficient solution. np.repeat() is not terrible but it means the full data set is allocated twice, once by np.repeat() and once by Pandas. My answer shows how to do it without the extra full-size allocation.

Here is a solution using NumPy broadcasting which avoids Python loops, lists, and excessive memory allocation (as done by np.repeat):

pd.DataFrame(np.broadcast_to(np.arange(6, 18, 3), (6, 4))) 

To understand why this is more efficient than other solutions, refer to the np.broadcast_to() docs: https://numpy.org/doc/stable/reference/generated/numpy.broadcast_to.html

more than one element of a broadcasted array may refer to a single memory location.

This means that no matter how many rows you create before passing to Pandas, you're only really allocating a single row, then a 2D array which refers to the data of that row multiple times.

If you assign the above to df, df.values.base is a single row: that row is the only storage required, no matter how many rows appear in the DataFrame.
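A quick check of the memory-sharing claim on the NumPy side (the row count 6 follows the answer's example):

```python
import numpy as np
import pandas as pd

row = np.arange(6, 18, 3)
view = np.broadcast_to(row, (6, 4))

# The broadcast view allocates no new data: its row stride is 0,
# so stepping to the next row moves zero bytes through memory and
# every row reads the same underlying buffer.
print(view.strides[0])  # 0

df = pd.DataFrame(view)
print(df.shape)  # (6, 4)
```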

6 Comments

In this example, np.repeat is 43.6 µs ± 774 ns per loop and np.broadcast is 42.9 µs ± 927 ns per loop. Is that all there's to it, or am I missing something?
In this example the data is trivially small. If you scale it up you can see a big difference. I saw 10x to 100x speedups in tests with a hundred thousand rows.
Interesting solution - but with a caveat. The numpy array and the dataframe will be read-only, so assignment within the array or inplace operations won't work. Not that that's a bad thing.
@JohnZwinck pd.DataFrame(np.broadcast_to(np.arange(6, 18, 3), (6, 4)))[0][0] = 100 gives ValueError: assignment destination is read-only.
@JohnZwinck - and I should have said "some inplace operations" because it is a dark art, and what really happens seems to change over time.
