Convert 2D numpy.ndarray to pandas.DataFrame

Question

I have a fairly large numpy.ndarray; it's basically an array of arrays. I want to convert it to a pandas.DataFrame. The code below shows what I want to do:

from pandas import DataFrame

cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])

ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.iloc[idx].id1  # positional lookup; .ix was removed in pandas 1.0
    for idx2, val in enumerate(i):
        id2 = cache2.iloc[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())

I am mapping the index of the outer array and the index of the inner array to the index of two DataFrames to get certain IDs. cache1 and cache2 are pandas.DataFrame objects, each with ~100k rows.

This takes a very long time, on the order of a few hours, to complete. Is there some way I can speed it up?

I copied the code as is. cache1['A'] was an internal thing (basically a key to the cache), so it may have been confusing. I have corrected it now.
– y2p, Commented Jun 20, 2014 at 22:53
The last entry in cache2, shouldn't it be {'id2': 38472837} instead of {'id': 38472837}?
– CT Zhu, Commented Jun 20, 2014 at 23:27
@DSM, in that case maybe the MultiIndex will be a suitable approach, let's see what the OP says.
– CT Zhu, Commented Jun 21, 2014 at 1:09

CT Zhu · Accepted Answer · 2014-06-21 00:22:19Z

2

I suspect that your ndarr, if expressed as a 2D np.array, always has shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. Also, the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:

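import numpy as np
import pandas as pd

# Flatten the 2D values in row-major order and label each one with its (id1, id2) pair
df = pd.DataFrame(np.array(ndarr).ravel(),
                  index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],
                                                   names=['idx1', 'idx2']),
                  columns=['val'])

print(df.reset_index())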
       idx1      idx2  val
0   ABC1234   3276827  4.3
1   ABC1234  98567498  5.6
2   ABC1234  38472837  6.7
3  NCMN7838   3276827  3.2
4  NCMN7838  98567498  4.5
5  NCMN7838  38472837  2.1

[6 rows x 3 columns]

Actually, I think that keeping the MultiIndex, rather than resetting it, may be the better idea.
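To illustrate why keeping the MultiIndex can be convenient, here is a minimal sketch that rebuilds the frame from the question's sample data and then looks values up directly by ID. The variable names single, per_id1, and positive are only for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Sample data from the question
cache1 = pd.DataFrame({'id1': ['ABC1234', 'NCMN7838']})
cache2 = pd.DataFrame({'id2': [3276827, 98567498, 38472837]})
ndarr = np.array([[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]])

# from_product iterates the second level fastest, matching ravel()'s row-major order
index = pd.MultiIndex.from_product([cache1['id1'], cache2['id2']],
                                   names=['idx1', 'idx2'])
df = pd.DataFrame({'val': ndarr.ravel()}, index=index)

# Look up one value by its (id1, id2) pair
single = df.loc[('ABC1234', 98567498), 'val']

# Select all values that belong to one id1
per_id1 = df.xs('NCMN7838', level='idx1')

# Apply the question's "val > 0" filter without leaving the MultiIndex
positive = df[df['val'] > 0]
```

With the index kept in place, the ID pair itself is the key, so no extra merge or lookup against cache1 and cache2 is needed afterwards.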


DSM · Accepted Answer · 2014-06-20 23:23:27Z

0

Something like this should work:

import numpy as np
import pandas as pd

ndarr = np.asarray(ndarr)  # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
# Row and column position of every element, flattened in the same order as ravel()
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values

which gives

>>> fast_df
   value       id1       id2
0    4.3   ABC1234   3276827
1    5.6   ABC1234  98567498
2    6.7   ABC1234  38472837
3    3.2  NCMN7838   3276827
4    4.5  NCMN7838  98567498
5    2.1  NCMN7838  38472837

Then, if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0]. To match the val > 0 test in the question exactly, use > 0 instead of != 0.
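If most entries would be filtered out anyway, one variation is to apply the filter before building the frame, so that rows are only created for the values that are kept. The following is a rough sketch of that idea, not something from the original answer; the sample array with a zero and a negative entry is hypothetical, and it assumes that row i of ndarr corresponds to row i of cache1 and column j to row j of cache2.

```python
import numpy as np
import pandas as pd

cache1 = pd.DataFrame({'id1': ['ABC1234', 'NCMN7838']})
cache2 = pd.DataFrame({'id2': [3276827, 98567498, 38472837]})

# Hypothetical data containing values that the "val > 0" test should drop
ndarr = np.array([[4.3, 0.0, 6.7], [3.2, 4.5, -2.1]])

# Row and column positions of the entries that pass the filter (row-major order)
rows, cols = np.nonzero(ndarr > 0)

# Positional lookup of the IDs, plus the matching values
filtered_df = pd.DataFrame({
    'id1': cache1['id1'].to_numpy()[rows],
    'id2': cache2['id2'].to_numpy()[cols],
    'value': ndarr[rows, cols],
})
```

This should produce the same rows as the original nested loop, in the same order, while avoiding any per-element Python work.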

edited Jun 20, 2014 at 23:23

answered Jun 20, 2014 at 23:16
