
I have a NumPy array that I want to plot as tiles with Python's ggplot. For that I need a DataFrame with the columns x, y, and value. How can I efficiently transform the NumPy array into such a DataFrame? Note that the layout I want resembles a sparse (coordinate) format, but I want a regular DataFrame. I tried using SciPy sparse data structures as in "Convert sparse matrix (csc_matrix) to pandas dataframe", but the conversions were too slow and memory-hungry: they used up all my memory.

To clarify what I want:

I start out with a numpy array like

array([[ 1,  3,  7],
       [ 4,  9,  8]])

and I would like to end up with the DataFrame

     x    y    value
0    0    0    1
1    0    1    3
2    0    2    7
3    1    0    4
4    1    1    9
5    1    2    8

2 Answers

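import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7],
                [4, 9, 8]])

# Stack the (x, y) index pairs next to the flattened values, one row per cell
df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,
                             arr.reshape(-1, 1))), columns=['x', 'y', 'value'])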
print(df)

   x  y  value
0  0  0      1
1  0  1      3
2  0  2      7
3  1  0      4
4  1  1      9
5  1  2      8

You might also consider the function below (taken from another answer), which builds the index columns with np.ogrid instead of np.indices and may be faster:

def indices_merged_arr(arr):
    m,n = arr.shape
    I,J = np.ogrid[:m,:n]
    out = np.empty((m,n,3), dtype=arr.dtype)
    out[...,0] = I
    out[...,1] = J
    out[...,2] = arr
    out.shape = (-1,3)
    return out

array = np.array([[ 1,  3,  7],
                  [ 4,  9,  8]])

df = pd.DataFrame(indices_merged_arr(array), columns=['x', 'y', 'value'])
print(df)

   x  y  value
0  0  0      1
1  0  1      3
2  0  2      7
3  1  0      4
4  1  1      9
5  1  2      8
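
Both versions above put the indices and the values into one array, so all three columns share a single dtype. With a float array, x and y therefore come out as floats. If integer coordinates matter, one option is to build the DataFrame column by column. This is a minimal sketch, and array_to_long_df is a hypothetical helper name, not part of NumPy or pandas:

```python
import numpy as np
import pandas as pd

def array_to_long_df(arr):
    """Hypothetical helper: one row per cell, with integer x/y columns."""
    # np.indices returns an array of shape (2, m, n):
    # the row index and the column index of every cell
    x, y = np.indices(arr.shape)
    # Building from a dict lets each column keep its own dtype
    return pd.DataFrame({"x": x.ravel(), "y": y.ravel(), "value": arr.ravel()})

arr = np.array([[1.5, 3.0, 7.0],
                [4.0, 9.0, 8.0]])
df = array_to_long_df(arr)
print(df.dtypes)
```

The trade-off is that three separate columns are allocated instead of one block, so this is not claimed to be faster, only to keep the coordinate columns as integers.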

Performance

arr = np.random.randn(1000, 1000)

%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
                         arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 15.3 ms per loop

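%timeit pd.DataFrame(indices_merged_arr(array), columns=['x', 'y', 'value'])
1000 loops, best of 3: 229 µs per loop

Note that this second timing was run on array, the 2x3 example from above, rather than on the 1000x1000 arr, so the two figures are not directly comparable.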
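
A like-for-like comparison needs both approaches to run on the same input. Below is a minimal sketch using the standard timeit module outside IPython. The function names with_hstack and with_ogrid are made up here for illustration, and no timings are quoted because they depend on the machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

def with_hstack(arr):
    # First approach: np.indices + np.hstack
    idx = np.indices(arr.shape).reshape(2, arr.size).T
    return pd.DataFrame(np.hstack((idx, arr.reshape(-1, 1))),
                        columns=["x", "y", "value"])

def with_ogrid(arr):
    # Second approach: fill a preallocated (m, n, 3) array via broadcasting
    m, n = arr.shape
    I, J = np.ogrid[:m, :n]
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[..., 0] = I
    out[..., 1] = J
    out[..., 2] = arr
    return pd.DataFrame(out.reshape(-1, 3), columns=["x", "y", "value"])

arr = np.random.randn(1000, 1000)

for fn in (with_hstack, with_ogrid):
    # Best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(arr), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```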

7 Comments

I tried to clarify what I want in the question. I am not sure how your answer is helping.
Your first answer works, your second answer throws ValueError: Shape of passed values is (3, 6), indices imply (3, 3).
@Make42 Copy paste error. Had to include columns=. It works.
@Make42 Interestingly, speed wise they're almost the same. It's a matter of preference as to what you'd want to use.
But aren't they doing exactly the same? So what has this to do with preferences?

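You can try this solution using np.ndenumerate:

import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7],
                [4, 9, 8]])

# np.ndenumerate yields ((row, col), value) pairs; materialize them as rows
df = pd.DataFrame(list(np.ndenumerate(arr)), columns=["coord", "val"])

# Split the (row, col) tuples into separate x and y columns
df[["x", "y"]] = df["coord"].tolist()

# Pass the column by keyword; the positional axis argument no longer works in recent pandas
df.drop(columns="coord", inplace=True)

# Reorder the columns
df = df[["x", "y", "val"]]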

