I have a list of daily transactional data in the following format:
person, itemCode, transDate, amount
I would like to sum the amount column by person and itemCode and convert the result to a numpy array, dropping the date field. I have 1.5 GB of data, so the more efficiently I can do this, the better.
Here's a small example of how I would like the algorithm to go:
print input
A, 1, 2013-10-10, .5
A, 1, 2013-10-18, .75
A, 2, 2013-10-20, 2.5
B, 1, 2013-10-09, .25
B, 2, 2014-10-20, .8
myArray = transform(input)
print myArray
[[1.25,2.5],[.25,.8]]
Any thoughts on how to efficiently sum these records would be greatly appreciated!
EDIT: Here's my code so far:
from collections import defaultdict

dictObj = defaultdict(float)  # missing keys start at 0.0
rawData = 'subset.txt'
with open(rawData) as infile:
    for line in infile:
        parts = line.split(',')
        # strip the spaces that follow each comma in the input
        key = (parts[0].strip(), parts[1].strip())
        dictObj[key] += float(parts[3])
print(dict(dictObj))
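One possible way to get from a dictionary of totals like the one above to the 2-D array in the example is sketched below. The helper name totals_to_array is hypothetical, the sample dictionary is hard-coded from the five example rows, and the sketch assumes that a person/itemCode pair with no transactions should be 0. Row and column order come from sorting the keys as strings, so item codes such as '10' would sort before '2' unless they are converted to int first.

```python
import numpy as np

# Totals keyed by (person, itemCode), hard-coded here from the five sample rows.
totals = {('A', '1'): 1.25, ('A', '2'): 2.5, ('B', '1'): 0.25, ('B', '2'): 0.8}


def totals_to_array(totals):
    """Pivot {(person, item): amount} into a people-by-items array."""
    people = sorted({person for person, _ in totals})
    items = sorted({item for _, item in totals})
    # Map each label to its row or column position.
    row = {person: r for r, person in enumerate(people)}
    col = {item: c for c, item in enumerate(items)}
    # Assumption: combinations that never occur are left at 0.
    arr = np.zeros((len(people), len(items)))
    for (person, item), amount in totals.items():
        arr[row[person], col[item]] = amount
    return people, items, arr


people, items, my_array = totals_to_array(totals)
print(my_array)
```

The returned people and items lists record which label each row and column corresponds to, which the bare array does not.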
numpy? I find that pandas tends to be more convenient for this kind of groupby-sum operation.
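A minimal sketch of what the pandas approach might look like, using the five sample rows from the question in place of the real file. The column names are taken from the format given in the question; the file has no header row, so they are supplied by hand. This assumes the data fits in memory once the date column is dropped.

```python
import io

import pandas as pd

# Sample rows from the question, standing in for the real file.
sample = io.StringIO(u"""A, 1, 2013-10-10, .5
A, 1, 2013-10-18, .75
A, 2, 2013-10-20, 2.5
B, 1, 2013-10-09, .25
B, 2, 2014-10-20, .8
""")

df = pd.read_csv(
    sample,
    header=None,
    names=['person', 'itemCode', 'transDate', 'amount'],
    usecols=['person', 'itemCode', 'amount'],  # skip the date column
    skipinitialspace=True,                     # ignore the space after each comma
)

# Sum by (person, itemCode), then pivot itemCode into columns.
# Assumption: missing person/itemCode combinations become 0.
totals = df.groupby(['person', 'itemCode'])['amount'].sum().unstack(fill_value=0)

my_array = totals.to_numpy()
print(my_array)
```

With the real data, a file path would replace the StringIO object. If the file does not fit in memory, read_csv also accepts a chunksize argument, and the per-chunk sums can be added together before the final pivot.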