I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is shown in the code below:
from pandas import DataFrame
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.loc[idx, 'id1']
    for idx2, val in enumerate(i):
        id2 = cache2.loc[idx2, 'id2']
        # keep only positive values
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())
I am mapping the index of the outer array and the index of the inner array to the indices of two DataFrames to look up certain IDs.
cache1 and cache2 are pandas.DataFrames; each has ~100k rows.
This takes really long, a few hours, to complete. Is there some way I can speed it up?
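For reference, here is a minimal vectorized sketch of the same mapping, assuming ndarr is (or can be converted to) a 2-D numpy array whose outer index lines up with cache1's rows and whose inner index lines up with cache2's rows. np.nonzero finds the positions of all positive values in one pass, so the two per-element .loc lookups disappear:

import numpy as np
from pandas import DataFrame

# same example inputs as the loop version above
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = np.array([[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]])

# row/column positions of every value that passes the filter
rows, cols = np.nonzero(ndarr > 0)

# positional fancy indexing replaces one lookup per element
df = DataFrame({
    'id1': cache1['id1'].to_numpy()[rows],
    'id2': cache2['id2'].to_numpy()[cols],
    'value': ndarr[rows, cols],
})
print(df.head())

This shifts the work from Python-level loops into numpy, which at ~100k x ~100k elements should make a very large difference.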