I'm trying to take a list of transactional data and sum it into a 2D numpy array. My data looks like the following:
person, product, date, val
A, x, 1/1/2013, 10
A, x, 1/10/2013, 10
B, x, 1/2/2013, 20
B, y, 1/4/2013, 15
A, y, 1/8/2013, 20
C, z, 2/12/2013, 40
I need to get the output into a 2D array, with one row per person and one column per product. The date is dropped, and the values are summed.
The output will look like this:
[[20, 20, 0],[20, 15, 0],[0, 0, 40]]
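To make the target concrete, here is a minimal sketch on the six sample rows only. It assumes the columns have already been split out and that rows and columns should follow sorted label order (which matches the output above); `np.unique` with `return_inverse=True` and `np.add.at` are just one way to express the summing, not necessarily the right approach for the full data set.

```python
import numpy as np

# Sample rows from above, already split into columns (date dropped).
people = np.array(["A", "A", "B", "B", "A", "C"])
products = np.array(["x", "x", "x", "y", "y", "z"])
vals = np.array([10, 10, 20, 15, 20, 40], dtype=float)

# np.unique returns the sorted labels and, for every input row,
# the integer index of its label within that sorted list.
row_labels, row_idx = np.unique(people, return_inverse=True)
col_labels, col_idx = np.unique(products, return_inverse=True)

# Start from zeros and accumulate; np.add.at adds correctly even when
# the same (row, col) pair appears more than once.
mat = np.zeros((len(row_labels), len(col_labels)))
np.add.at(mat, (row_idx, col_idx), vals)

print(row_labels, col_labels)
print(mat)
```

Here row 0 is person A, row 1 is B and row 2 is C, with columns x, y, z.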
Here's what I have. It works, but it is really slow (I've got 110,000,000 records):
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd
from scipy import sparse
import os
import assoc
# read in data to a dict object - sums scripts by tuple (doc, drug)
dictObj = {}
rawData = 'subset.txt'
with open(rawData) as infile:
    for line in infile:
        parts = line.split(',')
        key = (parts[0], parts[1])
        val = float(parts[3])
        if key in dictObj:
            dictObj[key] += val
        else:
            dictObj[key] = val
# no explicit close needed: the with block closes the file
print("stage 1 done")

# get the number of doctors and the number of drugs
keys = dictObj.keys()
docs = list(set([x[0] for x in keys]))
drugs = sorted(list(set([x[1] for x in keys])))

# read through the dict and build out a 2d numpy array
docC = 0
mat = np.empty([len(docs), len(drugs)])
for doc in docs:
    drugC = 0
    for drug in drugs:
        key = (doc, drug)
        if key in dictObj:
            mat[(docC, drugC)] = dictObj[key]
        else:
            mat[(docC, drugC)] = 0
        drugC += 1
    docC += 1
I posted a similar thread earlier (here - Transformation of transactions to numpy array), and everyone responded that pandas was the way to go, but I can't for the life of me get the pandas output into the right format. I can't pass a pandas DataFrame to the kmeans or apriori algorithms I have, and no matter how I arrange the DataFrame, df.values gives me a MultiIndex series (which simplifies down to one long array!). Any pointers would be greatly appreciated!
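A minimal sketch of the pandas route on the six sample rows, assuming the columns are named person, product and val as in the header above. Summing with `groupby` gives the MultiIndex series described here, and `unstack` is the step that may turn it into a 2D table whose `.values` is a plain numpy array. Whether this scales to 110,000,000 records is untested here.

```python
import pandas as pd

# The six sample rows, with the date column already dropped.
df = pd.DataFrame({
    "person": ["A", "A", "B", "B", "A", "C"],
    "product": ["x", "x", "x", "y", "y", "z"],
    "val": [10, 10, 20, 15, 20, 40],
})

# Sum per (person, product): the result is a Series with a MultiIndex,
# i.e. the "one long array" when .values is taken directly.
summed = df.groupby(["person", "product"])["val"].sum()

# Move the product level into the columns; pairs that never occur become 0.
wide = summed.unstack(fill_value=0)

# Plain 2D numpy array: one row per person, one column per product.
arr = wide.values
print(wide)
print(arr.shape)
```

The row and column labels stay available as `wide.index` and `wide.columns`, so the array can be passed on while keeping track of which row is which person.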