Python - Read specific lines in a text file based on a condition

Question

Problem Statement:

I have a file as below.

name | date | count
John | 201406 | 1
John | 201410 | 2
Mary | 201409 | 180
Mary | 201410 | 154
Mary | 201411 | 157
Mary | 201412 | 153
Mary | 201501 | 223
Mary | 201502 | 166
Mary | 201503 | 163
Mary | 201504 | 169
Mary | 201505 | 157
Tara | 201505 | 2

The file shows count data for three people (John, Mary and Tara) over several months. I would like to analyze this data and come up with a status tag for each person, i.e. active, inactive or new.

A person is active if they have entries for 201505 and other previous months - like Mary

A person is inactive if they do not have entries for 201505 - like John

A person is new if they ONLY have 1 entry for 201505 - like Tara.

Furthermore, if a person is active, I would like to get the median of their last 5 counts. For example, for Mary, I would like the median of 223, 166, 163, 169 and 157, which is 166.

Question:

I would like to understand how I should read this file in Python 2.7 in order to fulfill my requirements. I started with the following but was not sure how I could get previous entries (i.e. previous lines in file) for a particular person.

for line in data:
    col = line.split('\t')
    name = col[0]
    date = col[1]
    count = col[2]

Consider using pandas; you can then use .groupby('name') to look at each person individually.
– vk1011, Commented Jun 3, 2015 at 20:39

vk1011 · Accepted Answer · 2015-06-04 03:57:43Z

3

import pandas as pd

# Assumes a comma-separated file with a header row (name, date, count);
# pass sep='\t' to read_csv for a tab-delimited file.
df = pd.read_csv('input_csv.csv')
names = {}
for name, subdf in df.groupby('name'):
    names[name] = {}
    if (subdf['date'] == 201505).any():
        if subdf['count'].count() == 1:
            # The only entry is the one for 201505.
            names[name]['status'] = 'new'
        else:
            names[name]['status'] = 'active'
            # tail() returns the last 5 rows by default.
            names[name]['last5median'] = subdf['count'].tail().median()
    else:
        names[name]['status'] = 'inactive'


Output:
{'John': {'status': 'inactive'},
 'Mary': {'last5median': 166.0, 'status': 'active'},
 'Tara': {'status': 'new'}}

edited Jun 4, 2015 at 3:57

answered Jun 3, 2015 at 20:59


6 Comments

activelearner Over a year ago

Does this assume that we have a header row in the txt file? What about the case where there is no header row?

vk1011 Over a year ago

Yes, this assumes the file has a header row. If there are no headers and you would like to explicitly provide column names, read in the file as follows: df = pd.read_csv('input_csv.csv', names=['name','date','count'])
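
A minimal sketch of the header-less case described in this comment, assuming a comma-separated layout. The inline sample text and io.StringIO are illustrative stand-ins for a real file path, not part of the original answer.

```python
import io
import pandas as pd

# Hypothetical header-less sample; a real script would pass a file path instead.
raw = u"John,201406,1\nMary,201505,157\nTara,201505,2\n"

# names= supplies the column labels, so the first line is treated as data.
df = pd.read_csv(io.StringIO(raw), names=['name', 'date', 'count'])

print(df.columns.tolist())
print(len(df))
```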

activelearner Over a year ago

Thanks. Where are we telling the program to only get the median of the last 5 in names[name]['last5median'] = subdf['count'].tail().median() ? What if I want the last 8?

vk1011 Over a year ago

The .tail() part returns the last 5 entries (5 is the default for tail). If you want 8, for example, you can use .tail(8). The median is then calculated by .median() over the last X entries selected by .tail(X).
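
To make the effect of the tail argument concrete, here is a small sketch using Mary's counts from the question. The Series is built by hand purely for illustration; in the answer above the values come from subdf['count'].

```python
import pandas as pd

# Mary's counts from the question, in file order.
counts = pd.Series([180, 154, 157, 153, 223, 166, 163, 169, 157])

# Default tail() keeps the last 5 values: 223, 166, 163, 169, 157.
last5 = counts.tail().median()

# tail(8) keeps the last 8 values instead.
last8 = counts.tail(8).median()

print(last5)
print(last8)
```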

vk1011 Over a year ago

Glad it helped! Hope this answers your question.


Aleksander Monk · Answer · 2015-06-03 20:53:02Z

2

I think you can solve your problem with a dict.

import re

spl = """name | date | count
John | 201406 | 1
John | 201410 | 2
Mary | 201409 | 180
Mary | 201410 | 154
Mary | 201411 | 157
Mary | 201412 | 153
Mary | 201501 | 223
Mary | 201502 | 166
Mary | 201503 | 163
Mary | 201504 | 169
Mary | 201505 | 157
Tara | 201505 | 2"""

# Maps each name to a list of [date, count] pairs.
dicto = {}

# Split on "|" or newline, then strip the padding around each field.
listo = re.split(r"\||\n", spl)
listo = [x.strip() for x in listo]
# Fields come in groups of three; start at index 3 to skip the header row.
for x in range(3, len(listo), 3):
    dicto.setdefault(listo[x], []).append([listo[x+1], listo[x+2]])

print(dicto.get('John'))

Output:

[['201406', '1'], ['201410', '2']]

So now you have all the data for every user in a dict of lists, and you can process it however you like.
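
As one possible next step, the sketch below applies the status rules from the question to a dict shaped like dicto. The helper names median and summarize, the 'median' key, and the current and last_n parameters are hypothetical choices for illustration. The median is computed by hand because Python 2.7 has no statistics module.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def summarize(entries, current='201505', last_n=5):
    """entries: list of [date, count] string pairs in file order."""
    dates = [date for date, _ in entries]
    if current not in dates:
        return {'status': 'inactive'}
    if len(entries) == 1:
        return {'status': 'new'}
    # Active: take the median of the last `last_n` counts.
    counts = [int(count) for _, count in entries[-last_n:]]
    return {'status': 'active', 'median': median(counts)}


# Same shape as dicto above, with the data from the question.
dicto = {
    'John': [['201406', '1'], ['201410', '2']],
    'Mary': [['201409', '180'], ['201410', '154'], ['201411', '157'],
             ['201412', '153'], ['201501', '223'], ['201502', '166'],
             ['201503', '163'], ['201504', '169'], ['201505', '157']],
    'Tara': [['201505', '2']],
}

result = dict((name, summarize(rows)) for name, rows in dicto.items())
print(result['Mary'])
```

Passing last_n=8 to summarize would use the last eight counts instead of five.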

answered Jun 3, 2015 at 20:53


1 Comment

activelearner Over a year ago

If my source file is tab delimited text file, how would I read this into the variable spl and how would the re.split function change? Thanks!
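
One way to handle a tab-delimited file, sketched here with the standard csv module rather than re.split, and assuming the file has a header row. The function name read_tab_file and the temporary sample file are hypothetical and exist only to keep the example self-contained.

```python
import csv
import os
import tempfile
from collections import defaultdict


def read_tab_file(path):
    """Group [date, count] pairs by name from a tab-delimited file with a header row."""
    by_name = defaultdict(list)
    with open(path) as handle:
        reader = csv.reader(handle, delimiter='\t')
        next(reader)  # skip the header row
        for name, date, count in reader:
            by_name[name].append([date, count])
    return dict(by_name)


# Hypothetical sample file, written here only so the sketch runs on its own.
sample = "name\tdate\tcount\nJohn\t201406\t1\nJohn\t201410\t2\nTara\t201505\t2\n"
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write(sample)
tmp.close()

grouped = read_tab_file(tmp.name)
os.remove(tmp.name)
print(grouped.get('John'))
```

To stay with the regex approach from the answer, reading the file into spl with open(path).read().strip() and splitting on "\t|\n" should also work, provided the text contains no blank lines.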
