
I have a list of daily transactional data in the following format:

person, itemCode, transDate, amount

I would like to sum the amount column by person and itemCode and transform my results to a numpy array. I'm dropping the date field. I have 1.5 GB of data, so the more efficiently I can do this the better...

Here's a small example of how I would like the algorithm to go:

 print input
 A, 1, 2013-10-10, .5
 A, 1, 2013-10-18, .75
 A, 2, 2013-10-20, 2.5
 B, 1, 2013-10-09, .25
 B, 2, 2014-10-20, .8

 myArray = transform(input)
 print myArray
 [[1.25,2.5],[.25,.8]]

Any thoughts on how to efficiently sum these records would be greatly appreciated!

EDIT: Here's my code so far:

from collections import defaultdict

dictObj = {}

rawData = 'subset.txt'

with open(rawData) as infile:
    for line in infile:
        parts = line.split(',')
        key = (parts[0], parts[1])
        val = float(parts[3])
        if key in dictObj:
            dictObj[key] += val
        else:
            dictObj[key] = val
print dictObj
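Since the end goal is a numpy array (e.g. for sklearn), the aggregated dict can be densified into a 2-D array by sorting the unique persons and item codes. This is a minimal Python 3 sketch, not part of the original question; it assumes any missing (person, itemCode) pair should be 0:

```python
import numpy as np

# Example of the aggregated dict produced by the loop above:
# (person, itemCode) -> summed amount
dictObj = {('A', '1'): 1.25, ('A', '2'): 2.5,
           ('B', '1'): 0.25, ('B', '2'): 0.8}

# Sorted unique row labels (persons) and column labels (item codes)
persons = sorted({k[0] for k in dictObj})
items = sorted({k[1] for k in dictObj})

# Build a dense 2-D array; absent pairs stay at 0.0
arr = np.zeros((len(persons), len(items)))
for (p, i), val in dictObj.items():
    arr[persons.index(p), items.index(i)] = val

print(arr)
```

The `persons.index(p)` lookup is O(n) per row; for many distinct persons, a `{person: row}` dict would be faster.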
  • Are you wedded to numpy? I find that pandas tends to be more convenient for this kind of groupby-sum operation. Commented Nov 24, 2013 at 18:58
  • Please show what you have tried so far and how it was not efficient enough. Note that if you have a 1.5 GB txt file, it is not a very large amount of data, so even a suboptimal solution will work within a reasonable time. Commented Nov 24, 2013 at 18:59
  • Yes, go for Pandas or throw it into a database and use some good old fashioned SQL Commented Nov 24, 2013 at 19:15
  • I'm not wedded to numpy, it's just what I know :). I've added my code above to show where I am so far - I have a 2d dict that needs transformation into something else. my end goal is to run some sklearn algorithms, which I know is easy from a numpy array. Commented Nov 24, 2013 at 19:38

1 Answer


As @DSM said, this operation looks like a job for pandas:

>>> from StringIO import StringIO
>>> import pandas as pd
>>> data = '''A, 1, 2013-10-10, .5
... A, 1, 2013-10-18, .75
... A, 2, 2013-10-20, 2.5
... B, 1, 2013-10-09, .25
... B, 2, 2014-10-20, .8'''
>>> df = pd.read_csv(StringIO(data), names=['person','itemCode','transDate','amount'], skipinitialspace=True)
>>> df
  person  itemCode    transDate  amount
0      A         1   2013-10-10    0.50
1      A         1   2013-10-18    0.75
2      A         2   2013-10-20    2.50
3      B         1   2013-10-09    0.25
4      B         2   2014-10-20    0.80
>>> res = df.groupby(['person']).apply(lambda x: pd.Series(x.groupby('itemCode').sum()['amount']))
>>> res
itemCode     1    2
person             
A         1.25  2.5
B         0.25  0.8

The result is a pandas.DataFrame; if you want it as a numpy array, use the values attribute:

>>> res.values
array([[ 1.25,  2.5 ],
       [ 0.25,  0.8 ]])
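The nested groupby/apply can also be written more directly. This is an alternative sketch (not from the original answer), using a single groupby on both keys followed by unstack, in Python 3 (where StringIO lives in the io module):

```python
import pandas as pd
from io import StringIO  # Python 3; the answer above used Python 2's StringIO module

data = '''A, 1, 2013-10-10, .5
A, 1, 2013-10-18, .75
A, 2, 2013-10-20, 2.5
B, 1, 2013-10-09, .25
B, 2, 2014-10-20, .8'''

df = pd.read_csv(StringIO(data),
                 names=['person', 'itemCode', 'transDate', 'amount'],
                 skipinitialspace=True)

# Sum amount over each (person, itemCode) pair, then pivot itemCode into columns
res = df.groupby(['person', 'itemCode'])['amount'].sum().unstack()
print(res.values)
```

`pd.pivot_table(df, values='amount', index='person', columns='itemCode', aggfunc='sum')` gives the same result in one call.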

2 Comments

Thanks Roman - That looks much easier than what I was going to try and do, I'll have to spend a little time on the syntax for Pandas, but based on all the comments that seems like the way to go!
@flyingmeatball yes, definitely take a look, data transformations become very fun to do :)
