Efficient way to load file into 2d numpy array

Question

I'm trying to take a list of transactional data and sum it to a 2d numpy array. My data looks like the following:

person, product, date, val
A, x, 1/1/2013, 10
A, x, 1/10/2013, 10
B, x, 1/2/2013, 20
B, y, 1/4/2013, 15
A, y, 1/8/2013, 20
C, z, 2/12/2013, 40

I need to get the output into a 2d array, with each person as a row, and the product as columns. The date will be dropped, and the values are summed.

The output will look like this:

[[20, 20, 0],[20, 15, 0],[0, 0, 40]]

Here's what I have that functions, but it is really slow (I've got 110,000,000 records):

import numpy as np
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd
from scipy import sparse
import os
import assoc


#read in data to a dict object - sums scripts by tuple (doc, drug)
dictObj = {}
rawData = 'subset.txt'
with open(rawData) as infile:
    for line in infile:
        parts = line.split(',')
        key = (parts[0],parts[1])
        val = float(parts[3])
        if key in dictObj:
            dictObj[key] += val
        else:
            dictObj[key] = val

print "stage 1 done"
#get the number of doctors and the number of drugs
keys =  dictObj.keys()
docs = list(set([x[0] for x in keys]))
drugs = sorted(list(set([x[1] for x in keys])))

#read through the dict and build out a 2d numpy array 
docC = 0
mat = np.empty([len(docs),len(drugs)])
for doc in docs:
    drugC = 0
    for drug in drugs:
        key = (doc,drug)
        if key in dictObj:
            mat[(docC,drugC)] = dictObj[(key)]
        else:
            mat[(docC,drugC)] = 0
        drugC += 1
    docC+=1

I posted a similar thread earlier (here - Transformation of transactions to numpy array), and everyone responded that Pandas was the way to go, but I can't for the life of me get the Pandas output into the right format. I can't pass a Pandas DataFrame to the kmeans or apriori algorithms I have, and no matter how I arrange the DataFrame, df.values gets me a MultiIndex Series (which simplifies down to 1 long array!). Any pointers would be greatly appreciated!

DSM · Accepted Answer · 2013-11-25 05:48:30Z

4

I might do something like

>>> df = pd.read_csv("trans.csv", skipinitialspace=True)
>>> w = df.groupby(["person", "product"])["val"].sum().reset_index()
>>> w
  person product  val
0      A       x   20
1      A       y   20
2      B       x   20
3      B       y   15
4      C       z   40
>>> w.pivot(index="person", columns="product").fillna(0)
         val        
product    x   y   z
person              
A         20  20   0
B         20  15   0
C          0   0  40
>>> w.pivot(index="person", columns="product").fillna(0).values
array([[ 20.,  20.,   0.],
       [ 20.,  15.,   0.],
       [  0.,   0.,  40.]])

which IIUC is the 2-D array you're after. Note that you don't have to read the entire file into memory at once: you can use the chunksize parameter (see the docs here) and accumulate your table piece by piece.
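The chunked accumulation could look roughly like the sketch below. It assumes the same column layout as the sample data. The in-memory buffer stands in for the real file, and the tiny chunksize of 2 is only there so the sample actually splits into several chunks; the names partials, totals and table are illustrative.

```python
import io
import pandas as pd

# Sample rows from the question; a real run would pass the file path instead.
raw = io.StringIO(
    "person, product, date, val\n"
    "A, x, 1/1/2013, 10\n"
    "A, x, 1/10/2013, 10\n"
    "B, x, 1/2/2013, 20\n"
    "B, y, 1/4/2013, 15\n"
    "A, y, 1/8/2013, 20\n"
    "C, z, 2/12/2013, 40\n"
)

partials = []
# With chunksize set, read_csv yields DataFrames piece by piece.
for chunk in pd.read_csv(raw, skipinitialspace=True, chunksize=2):
    # Reduce each chunk immediately so only small per-chunk sums are kept.
    partials.append(chunk.groupby(["person", "product"])["val"].sum())

# The same (person, product) pair can show up in several chunks, so sum again.
totals = pd.concat(partials).groupby(level=["person", "product"]).sum()

# Move the product level into columns; pairs that never occur become 0.
table = totals.unstack(fill_value=0)
print(table.to_numpy())
```

For a file with 110,000,000 rows a chunksize in the millions is probably more sensible, and the partial sums could be folded together every so often rather than all at the end.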


1 Comment

flyingmeatball Over a year ago

Thanks! That's exactly what I needed. I couldn't for the life of me get that pivot to work before.

hpaulj · Answer · 2013-11-25 06:08:21Z

recfromcsv (or recfromtxt) will load your data into a record array:

data=np.recfromcsv('stack20179393.txt')

rec.array([('A', ' x', ' 1/1/2013', 10), ('A', ' x', ' 1/10/2013', 10),
       ('B', ' x', ' 1/2/2013', 20), ('B', ' y', ' 1/4/2013', 15),
       ('A', ' y', ' 1/8/2013', 20), ('C', ' z', ' 2/12/2013', 40)], 
      dtype=[('person', 'S1'), ('product', 'S2'), ('date', 'S10'), ('val', '<i4')])

data.person
# chararray(['A', 'A', 'B', 'B', 'A', 'C'], dtype='|S1')

data.val
# array([10, 10, 20, 15, 20, 40])
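As far as I know, recent NumPy releases no longer expose recfromcsv in the top-level namespace, and np.genfromtxt is the suggested substitute. A sketch of an equivalent load, assuming the sample rows from the question (the in-memory buffer stands in for the file, and fields are accessed by name with data["val"] rather than data.val):

```python
import io
import numpy as np

# Sample rows from the question; a real run would pass the file path instead.
raw = io.StringIO(
    "person, product, date, val\n"
    "A, x, 1/1/2013, 10\n"
    "A, x, 1/10/2013, 10\n"
    "B, x, 1/2/2013, 20\n"
    "B, y, 1/4/2013, 15\n"
    "A, y, 1/8/2013, 20\n"
    "C, z, 2/12/2013, 40\n"
)

data = np.genfromtxt(
    raw,
    delimiter=",",
    names=True,        # take field names from the header line
    dtype=None,        # infer a type per column
    encoding="utf-8",  # give str fields instead of bytes
    autostrip=True,    # drop the space after each comma
)

print(data["person"])
print(data["val"])
```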

Since person can occur in any order, and with different frequency (3A, 2B, 1C), you can't readily turn this into a 2D array. So you may still need to iterate through the records, collecting values in something like a dictionary - I'd recommend a collections.defaultdict. itertools.groupby is also a handy tool for collecting values into groups. However, it would require sorting your records.

With a defaultdict:

from collections import defaultdict
dd = defaultdict(list)
for row in data:
    dd[row[0]].append(row[-1])
print dd
# defaultdict(<type 'list'>, {'A': [10, 10, 20], 'C': [40], 'B': [20, 15]})
d = {}
for k,v in dd.items(): d[k] = sum(v)
print d
# {'A': 40, 'B': 35, 'C': 40}

or

dd = defaultdict(float)
for row in data:
    dd[row[0]] += row[-1]  # a float has no append(); add to the running total
print dd
# defaultdict(<type 'float'>, {'A': 40.0, 'C': 40.0, 'B': 35.0})
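The itertools.groupby route mentioned earlier could be sketched like this. It uses plain tuples with the question's sample values instead of the record array, so the rows list is an assumption for illustration:

```python
from itertools import groupby
from operator import itemgetter

# (person, product, val) tuples from the question's sample data.
rows = [
    ("A", "x", 10),
    ("A", "x", 10),
    ("B", "x", 20),
    ("B", "y", 15),
    ("A", "y", 20),
    ("C", "z", 40),
]

# groupby only merges adjacent items, so sort by the grouping key first.
rows_sorted = sorted(rows, key=itemgetter(0))

totals = {
    person: sum(r[-1] for r in group)
    for person, group in groupby(rows_sorted, key=itemgetter(0))
}
print(totals)  # {'A': 40, 'B': 35, 'C': 40}
```

The sort is what makes this less attractive than the defaultdict for a very large file.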

A sparse approach takes advantage of how csr_matrix sums repeated indexes:

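from scipy import sparse
row = np.array([ord(a) for a in data.person]) - 65  # 'A' -> 0, 'B' -> 1, ...
col = np.zeros(row.shape, dtype=int)  # index arrays should be integers
sparse.csr_matrix((data.val, (row, col))).T.toarray()  # explicit form of .A
# array([[40, 35, 40]])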
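The same trick should extend to the full person-by-product table if both columns are turned into integer codes. A sketch, assuming the three columns are already available as arrays (here typed in by hand from the sample data); np.unique with return_inverse=True replaces the ord() shortcut so that labels of any length work:

```python
import numpy as np
from scipy import sparse

# Columns from the question's sample data.
persons = np.array(["A", "A", "B", "B", "A", "C"])
products = np.array(["x", "x", "x", "y", "y", "z"])
vals = np.array([10, 10, 20, 15, 20, 40])

# Sorted unique labels, plus an integer code per record pointing into them.
row_labels, row_idx = np.unique(persons, return_inverse=True)
col_labels, col_idx = np.unique(products, return_inverse=True)

# csr_matrix adds up entries that share the same (row, col) position.
table = sparse.csr_matrix(
    (vals, (row_idx, col_idx)),
    shape=(len(row_labels), len(col_labels)),
).toarray()

print(table.tolist())  # [[20, 20, 0], [20, 15, 0], [0, 0, 40]]
```

row_labels and col_labels record which person and product each row and column stands for. If the table is mostly zeros, it may be worth skipping toarray() and keeping the sparse matrix.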

Ryan Saxe · Answer · 2013-11-25 04:29:08Z

0

Based on the end of your problem, it seems that you just need to get a pandas DataFrame to a numpy array. Here is how you do that:

#df is your DataFrame
data = np.asarray(df)

So now you shouldn't have a problem with using pandas!


1 Comment

flyingmeatball Over a year ago

thanks for the response - this doesn't quite get me there though. In order to sum on the DataFrame, I've had to call the groupby function, which returns a series, not a dataframe. I tried both of the following: grouped = df2.groupby(['person','product']).sum() and df.groupby(['person']).apply(lambda x: pd.Series(x.groupby('product').sum()['amount'])). If I use the asarray function you suggested on that output it's still 1 dimensional.
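One possible way around the one-dimensional groupby result described in this comment is pivot_table, which groups, sums, and spreads the products into columns in a single call. A sketch on the question's sample data (the column is called val as in the question's header rather than amount as in the comment):

```python
import pandas as pd

# Sample data from the question, with the date column already dropped.
df = pd.DataFrame({
    "person": ["A", "A", "B", "B", "A", "C"],
    "product": ["x", "x", "x", "y", "y", "z"],
    "val": [10, 10, 20, 15, 20, 40],
})

# One row per person, one column per product, summed values, 0 where missing.
wide = df.pivot_table(
    index="person",
    columns="product",
    values="val",
    aggfunc="sum",
    fill_value=0,
)

arr = wide.to_numpy()  # plain 2-D array
print(arr.shape)  # (3, 3)
```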

Steve Barnes · Answer · 2013-11-25 05:12:08Z

0

Looking at your code and the size of your data, I would expect it to be very slow: 110,000,000 records, each presumably consisting of a string (doctor), a longer string (drug), a date (dropped), and a float value. Let's say 20 chars for doctor (possibly not enough), 30 for drug (probably not enough), and 4 bytes for the value: that is 54 bytes per record, or about 5.5 GiB before any overheads, and then you are duplicating it into a 2D matrix.

Unless you are running on a mainframe or a cluster, I would strongly suggest restructuring so that you either sum as you read or have stage 1 read the data into a database.
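As a rough illustration of the database route, here is a sketch using the sqlite3 module from the standard library. The table name trans and its column names are made up for the example, and the in-memory database would be a file path for data of this size:

```python
import sqlite3

# Sample rows from the question.
rows = [
    ("A", "x", "1/1/2013", 10),
    ("A", "x", "1/10/2013", 10),
    ("B", "x", "1/2/2013", 20),
    ("B", "y", "1/4/2013", 15),
    ("A", "y", "1/8/2013", 20),
    ("C", "z", "2/12/2013", 40),
]

conn = sqlite3.connect(":memory:")  # use a file path to keep the data on disk
conn.execute(
    "CREATE TABLE trans (person TEXT, product TEXT, date TEXT, val REAL)"
)
conn.executemany("INSERT INTO trans VALUES (?, ?, ?, ?)", rows)

# Let the database do the summing per (person, product) pair.
totals = conn.execute(
    "SELECT person, product, SUM(val) FROM trans "
    "GROUP BY person, product ORDER BY person, product"
).fetchall()
conn.close()

print(totals)
```

Only the summed pairs come back into Python, so the 2D array can be filled from a much smaller result set.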

You could also take a look at the possibility of using pytables if Pandas is not working for you.

