2

I have a fairly large numpy.ndarray; it's basically an array of arrays. I want to convert it to a pandas.DataFrame. The code below shows what I am trying to do:

from pandas import DataFrame

cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])

ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    # .ix was removed in pandas 1.0; .iloc selects by position
    id1 = cache1['id1'].iloc[idx]
    for idx2, val in enumerate(i):
        id2 = cache2['id2'].iloc[idx2]
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())

I map the indices of the outer and inner arrays to the row positions of two DataFrames to look up the corresponding IDs. cache1 and cache2 are pandas.DataFrame objects with ~100k rows each.

This takes a very long time, on the order of a few hours. Is there a way to speed it up?

4
  • I copied the code as is. cache1['A'] was an internal thing (basically a key to the cache), so maybe was confusing. I corrected it now. Commented Jun 20, 2014 at 22:53
  • The last entry in cache2, shouldn't it be {'id2': 38472837} instead of {'id': 38472837}? Commented Jun 20, 2014 at 23:27
  • @CTZhu: you're almost certainly right. Commented Jun 20, 2014 at 23:28
  • @DSM, in that case maybe the MultiIndex will be a suitable approach; let's see what the OP says. Commented Jun 21, 2014 at 1:09

2 Answers

2

I suspect that your ndarr, expressed as a 2D np.array, always has shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2, and that the last entry in cache2 should be {'id2': 38472837} rather than {'id': 38472837}. If so, the following simple solution may be all that is needed:

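In [30]:

import numpy as np
import pandas as pd

# one row per (id1, id2) pair, in the same row-major order as ravel()
df = pd.DataFrame(np.array(ndarr).ravel(),
                  index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],
                                                   names=['idx1', 'idx2']),
                  columns=['val'])
In [33]:

print(df.reset_index())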
       idx1      idx2  val
0   ABC1234   3276827  4.3
1   ABC1234  98567498  5.6
2   ABC1234  38472837  6.7
3  NCMN7838   3276827  3.2
4  NCMN7838  98567498  4.5
5  NCMN7838  38472837  2.1

[6 rows x 3 columns]

In fact, I think keeping the MultiIndex instead of resetting it may be the better idea.
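To illustrate why keeping the MultiIndex can be convenient, here is a minimal sketch using the sample data from the question. The variable names `single`, `row`, and `wide` are only for illustration, and the sketch assumes the corrected cache2 with three id2 values.

```python
import numpy as np
import pandas as pd

# Sample data from the question (with the corrected cache2).
cache1 = pd.DataFrame({"id1": ["ABC1234", "NCMN7838"]})
cache2 = pd.DataFrame({"id2": [3276827, 98567498, 38472837]})
ndarr = np.array([[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]])

index = pd.MultiIndex.from_product(
    [cache1["id1"].values, cache2["id2"].values], names=["idx1", "idx2"]
)
df = pd.DataFrame({"val": ndarr.ravel()}, index=index)

# Look up a single value by its pair of IDs.
single = df.loc[("ABC1234", 98567498), "val"]

# Select every value belonging to one id1; the result is indexed by idx2.
row = df.xs("NCMN7838", level="idx1")

# Pivot back to an n x m table with id1 as rows and id2 as columns.
wide = df["val"].unstack("idx2")
```

With a flat frame (after reset_index()), the same lookups would need boolean masks on two columns, so the MultiIndex form is likely easier to work with if you query by ID often.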

0

Something like this should work:

import numpy as np
import pandas as pd

ndarr = np.asarray(ndarr)  # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
# row and column position of every element, flattened in the same order as ravel()
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values

which gives

>>> fast_df
   value       id1       id2
0    4.3   ABC1234   3276827
1    5.6   ABC1234  98567498
2    6.7   ABC1234  38472837
3    3.2  NCMN7838   3276827
4    4.5  NCMN7838  98567498
5    2.1  NCMN7838  38472837

If you really want to drop the zero values afterwards, keep only the nonzero ones with fast_df = fast_df[fast_df['value'] != 0]. To mirror the val > 0 test in your loop exactly, use fast_df['value'] > 0 as the mask instead.
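As a variation on the same idea, the filtering can be done before the DataFrame is built, so that rows which will be dropped are never created. The sketch below is one possible way to do this with np.nonzero; `to_long_frame` is a hypothetical helper name, and the sample array is altered to contain a zero and a negative value so the filter has something to remove.

```python
import numpy as np
import pandas as pd


def to_long_frame(ndarr, id1_values, id2_values):
    """Hypothetical helper: flatten an (n, m) array into id1/id2/value rows,
    keeping only the strictly positive values."""
    ndarr = np.asarray(ndarr)
    # Positions of the positive entries, in row-major order
    # (the same order the nested loop in the question visits them).
    rows, cols = np.nonzero(ndarr > 0)
    return pd.DataFrame({
        "id1": np.asarray(id1_values)[rows],
        "id2": np.asarray(id2_values)[cols],
        "value": ndarr[rows, cols],
    })


cache1 = pd.DataFrame({"id1": ["ABC1234", "NCMN7838"]})
cache2 = pd.DataFrame({"id2": [3276827, 98567498, 38472837]})
# Modified sample data: one zero and one negative value to be filtered out.
ndarr = [[4.3, 0.0, 6.7], [3.2, 4.5, -2.1]]

result = to_long_frame(ndarr, cache1["id1"].values, cache2["id2"].values)
print(result)
```

This relies on positional lookups into the plain NumPy arrays of IDs, so it assumes, as the question does, that row i of cache1 and row j of cache2 correspond to ndarr[i][j].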
