
I have a list of daily transactional data in the following format:

person, itemCode, transDate, amount

I would like to sum the amount column by person and itemCode and transform my results to a numpy array. I'm dropping the date field. I have 1.5 GB of data, so the more efficiently I can do this the better...

Here's a small example of how I would like the algorithm to go:

 print input
 A, 1, 2013-10-10, .5
 A, 1, 2013-10-18, .75
 A, 2, 2013-10-20, 2.5
 B, 1, 2013-10-09, .25
 B, 2, 2014-10-20, .8

 myArray = transform(input)
 print myArray
 [[1.25,2.5],[.25,.8]]

Any thoughts on how to efficiently sum these records would be greatly appreciated!

EDIT: Here's my code so far:

from collections import defaultdict

dictObj = {}

rawData = 'subset.txt'

with open(rawData) as infile:
    for line in infile:
        parts = line.split(',')
        key = (parts[0], parts[1])
        val = float(parts[3])
        if key in dictObj:
            dictObj[key] += val
        else:
            dictObj[key] = val
print dictObj
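Since the end goal is a numpy array (e.g. for sklearn), the aggregated dict can be densified into a 2-D array by sorting the unique persons and item codes. This is a minimal Python 3 sketch, not part of the original question; it assumes any missing (person, itemCode) pair should be 0:

```python
import numpy as np

# Example of the aggregated dict produced by the loop above:
# (person, itemCode) -> summed amount
dictObj = {('A', '1'): 1.25, ('A', '2'): 2.5,
           ('B', '1'): 0.25, ('B', '2'): 0.8}

# Sorted unique row labels (persons) and column labels (item codes)
persons = sorted({k[0] for k in dictObj})
items = sorted({k[1] for k in dictObj})

# Build a dense 2-D array; absent pairs stay at 0.0
arr = np.zeros((len(persons), len(items)))
for (p, i), val in dictObj.items():
    arr[persons.index(p), items.index(i)] = val

print(arr)
```

The `persons.index(p)` lookup is O(n) per row; for many distinct persons, a `{person: row}` dict would be faster.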
  • Are you wedded to numpy? I find that pandas tends to be more convenient for this kind of groupby-sum operation. Commented Nov 24, 2013 at 18:58
  • Please show what you have tried so far and how it was not efficient enough. Note that if you have a 1.5 GB txt file, it is not a very large amount of data, so even a suboptimal solution will work within a reasonable time. Commented Nov 24, 2013 at 18:59
  • Yes, go for Pandas or throw it into a database and use some good old fashioned SQL Commented Nov 24, 2013 at 19:15
  • I'm not wedded to numpy, it's just what I know :). I've added my code above to show where I am so far - I have a 2d dict that needs transformation into something else. my end goal is to run some sklearn algorithms, which I know is easy from a numpy array. Commented Nov 24, 2013 at 19:38

1 Answer


As @DSM said, this operation looks like a job for pandas:

>>> from StringIO import StringIO
>>> import pandas as pd
>>> data = '''A, 1, 2013-10-10, .5
... A, 1, 2013-10-18, .75
... A, 2, 2013-10-20, 2.5
... B, 1, 2013-10-09, .25
... B, 2, 2014-10-20, .8'''
>>> df = pd.read_csv(StringIO(data), names=['person','itemCode','transDate','amount'], skipinitialspace=True)
>>> df
  person  itemCode    transDate  amount
0      A         1   2013-10-10    0.50
1      A         1   2013-10-18    0.75
2      A         2   2013-10-20    2.50
3      B         1   2013-10-09    0.25
4      B         2   2014-10-20    0.80
>>> res = df.groupby(['person']).apply(lambda x: pd.Series(x.groupby('itemCode').sum()['amount']))
>>> res
itemCode     1    2
person             
A         1.25  2.5
B         0.25  0.8

The result is a pandas.DataFrame; if you want it as a numpy array, use the values attribute:

>>> res.values
array([[ 1.25,  2.5 ],
       [ 0.25,  0.8 ]])
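The nested groupby/apply can also be written more directly. This is an alternative sketch (not from the original answer), using a single groupby on both keys followed by unstack, in Python 3 (where StringIO lives in the io module):

```python
import pandas as pd
from io import StringIO  # Python 3; the answer above used Python 2's StringIO module

data = '''A, 1, 2013-10-10, .5
A, 1, 2013-10-18, .75
A, 2, 2013-10-20, 2.5
B, 1, 2013-10-09, .25
B, 2, 2014-10-20, .8'''

df = pd.read_csv(StringIO(data),
                 names=['person', 'itemCode', 'transDate', 'amount'],
                 skipinitialspace=True)

# Sum amount over each (person, itemCode) pair, then pivot itemCode into columns
res = df.groupby(['person', 'itemCode'])['amount'].sum().unstack()
print(res.values)
```

`pd.pivot_table(df, values='amount', index='person', columns='itemCode', aggfunc='sum')` gives the same result in one call.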

2 Comments

Thanks Roman - That looks much easier than what I was going to try and do, I'll have to spend a little time on the syntax for Pandas, but based on all the comments that seems like the way to go!
@flyingmeatball yes, definitely take a look, data transformations become very fun to do :)
