Problem Statement:

I have a file as below.

name | date | count
John | 201406 | 1
John | 201410 | 2
Mary | 201409 | 180
Mary | 201410 | 154
Mary | 201411 | 157
Mary | 201412 | 153
Mary | 201501 | 223
Mary | 201502 | 166
Mary | 201503 | 163
Mary | 201504 | 169
Mary | 201505 | 157
Tara | 201505 | 2

The file shows count data for three people (John, Mary and Tara) over several months. I would like to analyze this data and come up with a status tag for each person: active, inactive or new.

A person is active if they have entries for 201505 and other previous months - like Mary

A person is inactive if they do not have entries for 201505 - like John

A person is new if they ONLY have 1 entry for 201505 - like Tara.

Furthermore, if a person is active, I would like to get the median of their last 5 counts. For example, for Mary, I would like the median of 157, 169, 163, 166 and 223.

Question:

I would like to understand how I should read this file in Python 2.7 in order to fulfill my requirements. I started with the following but was not sure how I could get previous entries (i.e. previous lines in file) for a particular person.

for line in data:
    col = line.split('\t')
    name = col[0]
    date = col[1]
    count = col[2]
Consider using Pandas; you can then use the .groupby('name') function to look at each person individually. Commented Jun 3, 2015 at 20:39

2 Answers

import pandas as pd
df = pd.read_csv('input_csv.csv') # This assumes you have a csv format file
names = {}
for name, subdf in df.groupby('name'):
    if name not in names:
        names[name] = {}
    if (subdf['date']==201505).any():
        if subdf['count'].count()==1:
            names[name]['status'] = 'new'
        else:
            names[name]['status'] = 'active'
            names[name]['last5median'] = subdf['count'].tail().median()
    else:
        names[name]['status'] = 'inactive'


>>> names
{'John': {'status': 'inactive'},
 'Mary': {'last5median': 166.0, 'status': 'active'},
 'Tara': {'status': 'new'}}

6 Comments

Does this assume that we have a header row in the txt file? What about the case where there is no header row?
Yes, this assumes the file has a header row. If there are no headers and you would like to explicitly provide column names, read in the file as follows: df = pd.read_csv('input_csv.csv', names=['name','date','count'])
Thanks. Where are we telling the program to only get the median of last 5 in names[name]['last5median'] = subdf['count'].tail().median() ? What if I want for last 8?
The .tail() part returns the last 5 entries (5 is the default for tail). If you want 8, for example, you can use .tail(8). The median is then calculated by the .median() part over the last X entries selected by .tail(X).
Glad it helped! Hope this answers your question.
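To make the .tail(X) point from the comments above concrete, here is a minimal sketch. It assumes the data is pipe-delimited exactly as printed in the question (hence the regex separator), and `last_n_median` is a hypothetical helper name, not part of pandas. It also sorts by date first, since .tail() only means "most recent" when the rows are in date order.

```python
import io
import pandas as pd

# Sample data copied from the question: pipe-delimited, padded with spaces.
raw = u"""name | date | count
John | 201406 | 1
John | 201410 | 2
Mary | 201409 | 180
Mary | 201410 | 154
Mary | 201411 | 157
Mary | 201412 | 153
Mary | 201501 | 223
Mary | 201502 | 166
Mary | 201503 | 163
Mary | 201504 | 169
Mary | 201505 | 157
Tara | 201505 | 2
"""

# The separator is a regex so the spaces around each pipe are consumed too;
# regex separators require the python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep=r"\s*\|\s*", engine="python")


def last_n_median(frame, name, n=5):
    """Median of the last n counts for one person, ordered by date."""
    person = frame[frame["name"] == name].sort_values("date")
    return person["count"].tail(n).median()


mary_last5 = last_n_median(df, "Mary")       # same window as plain .tail()
mary_last8 = last_n_median(df, "Mary", n=8)  # wider window via .tail(8)
```

For a real file, the `io.StringIO(raw)` argument would be replaced by the file path, and `sep` adjusted to whatever delimiter the file actually uses.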

I think that you can solve your problem with a dict.

import re

spl = """name | date | count
John | 201406 | 1
John | 201410 | 2
Mary | 201409 | 180
Mary | 201410 | 154
Mary | 201411 | 157
Mary | 201412 | 153
Mary | 201501 | 223
Mary | 201502 | 166
Mary | 201503 | 163
Mary | 201504 | 169
Mary | 201505 | 157
Tara | 201505 | 2"""

dicto = {}

listo = re.split("\\||\n",spl)
listo = [x.strip() for x in listo]
for x in range(3,len(listo),3):
    try:
        dicto[listo[x]].append([listo[x+1],listo[x+2]])
    except KeyError:
        dicto[listo[x]]= []
        dicto[listo[x]].append([listo[x+1],listo[x+2]])

print (dicto.get('John'))

Output:

[['201406', '1'], ['201410', '2']]

So now you have all the data for every user in a dict of lists, and you can do whatever you want with it.
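As one possible next step, the sketch below derives the status tags and the median from a dict of that shape. The helper names (`median`, `summarize`) and the `last_median` key are my own, and the rules are taken straight from the question: no entry for 201505 means inactive, a single 201505 entry means new, anything else is active.

```python
# Same shape as the dict built above: name -> list of [date, count] strings.
dicto = {
    "John": [["201406", "1"], ["201410", "2"]],
    "Mary": [["201409", "180"], ["201410", "154"], ["201411", "157"],
             ["201412", "153"], ["201501", "223"], ["201502", "166"],
             ["201503", "163"], ["201504", "169"], ["201505", "157"]],
    "Tara": [["201505", "2"]],
}


def median(values):
    """Median of a non-empty list of numbers, returned as a float."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def summarize(entries, current="201505", window=5):
    """Return the status for one person, plus a median if they are active."""
    entries = sorted(entries)  # YYYYMM strings sort chronologically
    dates = [date for date, _ in entries]
    if current not in dates:
        return {"status": "inactive"}
    if len(entries) == 1:
        return {"status": "new"}
    counts = [int(count) for _, count in entries[-window:]]
    return {"status": "active", "last_median": median(counts)}


summary = dict((name, summarize(rows)) for name, rows in dicto.items())
```

The counts are converted with int() only at the point of use because the parsing step above keeps every field as a string.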

1 Comment

If my source file is tab delimited text file, how would I read this into the variable spl and how would the re.split function change? Thanks!
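For a tab-delimited file, one option is to skip re.split entirely and let the standard csv module do the splitting. This is only a sketch: `group_by_name` and the file name in the comment are hypothetical, and it assumes the first line is a header row.

```python
import csv
from collections import defaultdict


def group_by_name(lines, delimiter="\t"):
    """Group (date, count) pairs by name from an iterable of text lines."""
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader)  # skip the header row; drop this line if the file has none
    grouped = defaultdict(list)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name, date, count = [field.strip() for field in row]
        grouped[name].append((date, int(count)))
    return grouped


# An open file is an iterable of lines, so with a real file this would be:
#     with open("input.txt") as handle:   # hypothetical file name
#         grouped = group_by_name(handle)
sample = [
    "name\tdate\tcount\n",
    "John\t201406\t1\n",
    "John\t201410\t2\n",
    "Tara\t201505\t2\n",
]
grouped = group_by_name(sample)
```

Each person's earlier entries are then simply the earlier items in `grouped[name]`, which also covers the "previous lines" part of the original question.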
