Convert numpy array with indices to a pandas dataframe

Question

I have a numpy array that I want to plot with python ggplot's tile. For that I need a DataFrame with the columns x, y, value. How can I transform the numpy array into such a DataFrame efficiently? Note that the layout I want resembles a sparse (coordinate) format, but I want a regular DataFrame. I tried using scipy sparse data structures, as in "Convert sparse matrix (csc_matrix) to pandas dataframe", but the conversions were too slow and memory-hungry: they used up all my memory.

To clarify what I want:

I start out with a numpy array like

array([[ 1,  3,  7],
       [ 4,  9,  8]])

and I would like to end up with the DataFrame

     x    y    value
0    0    0    1
1    0    1    3
2    0    2    7
3    1    0    4
4    1    1    9
5    1    2    8

cs95 · Accepted Answer · 2017-08-25 10:07:56Z

2

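import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7],
                [4, 9, 8]])

# np.indices gives the row/column index grids; reshape them to (size, 2)
idx = np.indices(arr.shape).reshape(2, arr.size).T
# append the flattened values as the third column
df = pd.DataFrame(np.hstack((idx, arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
print(df)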

   x  y  value
0  0  0      1
1  0  1      3
2  0  2      7
3  1  0      4
4  1  1      9
5  1  2      8

You might also consider the following function, taken from another answer, as a faster alternative to np.indices in the solution above:

def indices_merged_arr(arr):
    m,n = arr.shape
    I,J = np.ogrid[:m,:n]
    out = np.empty((m,n,3), dtype=arr.dtype)
    out[...,0] = I
    out[...,1] = J
    out[...,2] = arr
    out.shape = (-1,3)
    return out

array = np.array([[ 1,  3,  7],
                  [ 4,  9,  8]])

df = pd.DataFrame(indices_merged_arr(array), columns=['x', 'y', 'value'])
print(df)

   x  y  value
0  0  0      1
1  0  1      3
2  0  2      7
3  1  0      4
4  1  1      9
5  1  2      8
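Both versions above pack the indices and the values into a single array, so all three columns share one dtype; with a float array, x and y come out as floats as well. If integer index columns matter, one option is to pass each column separately. This is only a minimal sketch: the helper name `array_to_long_df` is made up for illustration and does not appear in the answer.

```python
import numpy as np
import pandas as pd

def array_to_long_df(arr):
    """Hypothetical helper: one row per cell of a 2-D array."""
    # np.indices returns two (m, n) integer arrays: row indices and column indices
    rows, cols = np.indices(arr.shape)
    # ravel() flattens in row-major order, so the three columns line up
    return pd.DataFrame({"x": rows.ravel(),
                         "y": cols.ravel(),
                         "value": arr.ravel()})

arr = np.array([[1.5, 3.0, 7.0],
                [4.0, 9.0, 8.5]])
df = array_to_long_df(arr)
print(df.dtypes)  # x and y stay integer, value stays float
```

Because each column is its own 1-D array, pandas keeps the dtypes separate instead of upcasting everything to the widest type.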

Performance

arr = np.random.randn(1000, 1000)

%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
                         arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 15.3 ms per loop

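# Note: this call passes `array` (the 2x3 example above), not the 1000x1000 `arr`,
# so this timing is not directly comparable to the previous one.
%timeit pd.DataFrame(indices_merged_arr(array), columns=['x', 'y', 'value'])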
1000 loops, best of 3: 229 µs per loop
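To time the two approaches on the same input outside IPython, something like the following sketch with the standard `timeit` module could be used. The function names `with_indices` and `with_ogrid` are hypothetical wrappers around the two snippets from this answer, and no timings are shown because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

def with_indices(arr):
    # first approach: np.indices + np.hstack
    idx = np.indices(arr.shape).reshape(2, arr.size).T
    return pd.DataFrame(np.hstack((idx, arr.reshape(-1, 1))),
                        columns=["x", "y", "value"])

def with_ogrid(arr):
    # second approach: fill a preallocated (m, n, 3) array via broadcasting
    m, n = arr.shape
    I, J = np.ogrid[:m, :n]
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[..., 0] = I
    out[..., 1] = J
    out[..., 2] = arr
    return pd.DataFrame(out.reshape(-1, 3), columns=["x", "y", "value"])

arr = np.random.randn(1000, 1000)

# sanity check: both approaches should build the same frame
same = with_indices(arr).equals(with_ogrid(arr))

for fn in (with_indices, with_ogrid):
    # best of 3 repeats, 3 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(arr), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```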

edited Aug 25, 2017 at 10:07

answered Aug 24, 2017 at 8:04

cs95


7 Comments

Make42 Over a year ago

I tried to clarify what I want in the question. I am not sure how your answer is helping.

Make42 Over a year ago

Your first answer works, your second answer throws ValueError: Shape of passed values is (3, 6), indices imply (3, 3).

cs95 Over a year ago

@Make42 Copy paste error. Had to include columns=. It works.

cs95 Over a year ago

@Make42 Interestingly, speed wise they're almost the same. It's a matter of preference as to what you'd want to use.

Make42 Over a year ago

But aren't they doing exactly the same? So what has this to do with preferences?


Hamzah Al-Qadasi · Accepted Answer · 2022-03-11 17:46:10Z

1

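You can try this solution using np.ndenumerate:

import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7],
                [4, 9, 8]])

# each item yielded by np.ndenumerate is ((row, col), value)
df = pd.DataFrame(list(np.ndenumerate(arr)), columns=["coord", "val"])

# split the (row, col) tuples into two separate columns
df[["x", "y"]] = df["coord"].tolist()

# use the keyword form; recent pandas versions reject a positional axis argument
df = df.drop(columns="coord")

df = df[["x", "y", "val"]]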
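The intermediate "coord" column can be avoided by unpacking each item from np.ndenumerate directly into a row. The following is a minimal sketch of that variation, using the column name "value" from the question rather than "val". Note that it still loops over every element in Python, so for large arrays it is likely to be much slower than the vectorized approaches.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7],
                [4, 9, 8]])

# np.ndenumerate yields ((row, col), value) pairs in row-major order;
# unpack each pair into a flat (x, y, value) row
rows = [(i, j, v) for (i, j), v in np.ndenumerate(arr)]

df = pd.DataFrame(rows, columns=["x", "y", "value"])
print(df)
```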


answered Mar 11, 2022 at 17:46

Hamzah Al-Qadasi
