Python script not working as required

Question

Hello, I am writing a Python script to count monthly and daily visits for web pages. Input file:

ArticleName Date        Hour    Count/Visit
Aa   20130601    10000   1
Aa   20130601    10000   1
Ew   20130601    10000   1
H    20130601    10000   2
H    20130602    10000   1
R    20130601    20000   2
R    20130602    10000   1
Ra   20130601    0   1
Ra   20130601    10000   2
Ra   20130602    10000   1
Ram  20130601    0   2
Ram  20130601    10000   3
Ram  20130602    10000   4
Re   20130601    20000   1
Re   20130602    10000   3
Rz   20130602    10000   1

I need to count the total monthly and daily page views of each page.

Output:

ArticleName Date     DailyView MonthlyView
Aa   20130601 2 2
Ew   20130601 1 1
H    20130601 2 2
H    20130602 1 3
R    20130601 2 2
R    20130602 1 4
Ra   20130601 5 5
Ra   20130602 1 6
Ram  20130601 5 5
Ram  20130602 4 9
Re   20130601 1 1
Re   20130602 3 4
Rz   20130602 1 1

My Script:

#!/usr/bin/python

import sys

last_date = 20130601
last_hour = 0
last_count = 0
last_article = None
monthly_count = 0
daily_count = 0

for line in sys.stdin:
    article, date, hour, count = line.split()
    count = int(count)
    date = int(date)
    hour = int(hour)

    # Articles match and dates match
    if last_article == article and last_date == date:
        daily_count = count+last_count
        monthly_count = count+last_count
        # print '%s\t%s\t%s\t%s' % (article, date, daily_count, monthly_count)
    # Articles match but dates don't match
    if last_article == article and last_date != date:
        monthly_count = count
        daily_count = count
        print '%s\t%s\t%s\t%s' % (article, date, daily_count, monthly_count)

    # Articles don't match
    if last_article != article:
        last_article = article
        last_count = count
        monthly_count = count
        daily_count = count
        last_date = date
        print '%s\t%s\t%s\t%s' % (article, date, daily_count, monthly_count)

I am able to get most of the output, but my output is wrong in two cases:

1. I couldn't find a way to sum up the counts when ArticleName and Date are the same. For example, this script gives the following output for Ra:

Ra 20130601 1 1
Ra 20130601 3 3
Ra 20130602 1 1

So at the end Ra should print 1+3+1=5 as the final total monthly count instead of 1.

2. Since the third if condition prints every article that is not equal to the last article, I get the value of an article with the same name and date twice. For example, Ra 20130601 1 1 should not have been printed.

Does anybody know how to correct this? Let me know if you need any more information.

All data is for June 2013, but the articles are all different. I need to find out how many times each article was visited daily and monthly.
– CtrlV, Commented Sep 15, 2013 at 6:22

falsetru · Accepted Answer · 2013-09-15 06:36:42Z

1

Try the following:

import itertools
import operator
import sys

lines = (line.split() for line in sys.stdin)
prev_name, prev_month = '', '99999999'
month_view = 0
for (name,date), grp in itertools.groupby(lines, key=operator.itemgetter(0,1)):
    view = sum(int(row[-1]) for row in grp)
    if prev_name == name and date.startswith(prev_month):
        month_view += view
    else:
        prev_name = name
        prev_month = date[:6]
        month_view = view
    print '{}\t{}\t{}\t{}'.format(name, date, view, month_view)

Used itertools.groupby, operator.itemgetter.

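The output differs from the expected output in the question for the R and Ra rows, because the counts in the input file sum to the totals shown here: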

Aa      20130601        2       2
Ew      20130601        1       1
H       20130601        2       2
H       20130602        1       3
R       20130601        2       2
R       20130602        1       3
Ra      20130601        3       3
Ra      20130602        1       4
Ram     20130601        5       5
Ram     20130602        4       9
Re      20130601        1       1
Re      20130602        3       4
Rz      20130602        1       1

answered Sep 15, 2013 at 6:36

falsetru


4 Comments

CtrlV Over a year ago

Thanks @falsetru. I have a problem understanding what this code is doing: for (name,date), grp in itertools.groupby(lines, key=operator.itemgetter(0,2)): view = sum(int(row[-1]) for row in grp). I understand that I am getting two values, name and date, from lines, which are the first and third values in a row, and the rest of the line as grp. Am I right? Also, in the second line, is view summed over row[-1] for each row in grp? If so, what does -1 indicate? Do rows start from -1 in Python?

falsetru Over a year ago

@CtrlV, xs[-1] retrieves the last value of xs.

falsetru Over a year ago

@CtrlV, I used groupby(lines, key=operator.itemgetter(0,1)), not 0,2. It groups lines by the first and second fields (ArticleName, Date). In each iteration, name and date refer to ArticleName and Date, and grp refers to an iterator that yields the grouped lines.

falsetru Over a year ago

@CtrlV, For more information about groupby, follow the link I provided in the answer.
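The grouping discussed in these comments can be sketched on a few rows from the question. This is only an illustration in Python 3, not the answer's exact code; the names rows, key and daily_totals are made up for the example, and the rows are assumed to be sorted by name and date already, since groupby only merges consecutive items with equal keys.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows (ArticleName, Date, Hour, Count), already sorted by name and date.
rows = [
    ["Ra", "20130601", "0", "1"],
    ["Ra", "20130601", "10000", "2"],
    ["Ra", "20130602", "10000", "1"],
]

# itemgetter(0, 1) builds a key function that returns (row[0], row[1]),
# i.e. the (ArticleName, Date) pair of a row.
key = itemgetter(0, 1)

daily_totals = []
for (name, date), grp in groupby(rows, key=key):
    # grp is an iterator over the consecutive rows that share this key.
    # row[-1] is the last field of each row, which here is the count.
    daily_totals.append((name, date, sum(int(row[-1]) for row in grp)))

print(daily_totals)
# [('Ra', '20130601', 3), ('Ra', '20130602', 1)]
```

With itemgetter(0, 2) the key would instead be (ArticleName, Hour), which is why the distinction between 0,1 and 0,2 matters.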

Tarik · Accepted Answer · 2013-09-16 05:42:35Z

A better way to achieve what you want is a functional, map-reduce style that combines the built-in map() with groupby() and dropwhile() from itertools: http://docs.python.org/2/howto/functional.html

import itertools
from itertools import groupby
from itertools import dropwhile
import sys
import datetime

# Convert list of words found in one line into
# a tuple consisting of a name, date/time and number of visits
def get_record(w):
    name = w[0]
    date = datetime.datetime.strptime((w[1] + ('%0*d' % (6, int(w[2])))), "%Y%m%d%H%M%S")
    visits = int(w[3])
    return (name, date, visits)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year and month on which the records will
# be grouped.
def get_key_by_month((name, date, visits)):
    return (name, date.year, date.month)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year, month and day on which the records will
# be grouped.
def get_key_by_day((name, date, visits)):
    return (name, date.year, date.month, date.day)

# Build an iterator over the lines, each split into
# a list of words, skipping the first (header) line
words = (line.split() for line in sys.stdin)
words = dropwhile(lambda x: x[0]<1, enumerate(words))
words = map(lambda x: x[1], words)

# Convert to tuples containing name, date/time and count
records = list(get_record(w) for w in words)

# Group by name, month
groups = groupby(records, get_key_by_month)

# Sum visits in each group
print('Visits per month')
for (name, year, month), g in groups:
    visits = sum(map(lambda (name,date,visits): visits, g))
    print name, year, month, visits

# Group by name, day
groups = groupby(records, get_key_by_day)

# Sum visits in each group
print ('\nVisits per day')
for (name, year, month, day), g in groups:
    visits = sum(map(lambda (name,date,visits): visits, g))
    print name, year, month, day, visits

Python 3 version of the above code:

import itertools
from itertools import groupby
from itertools import dropwhile
import sys
import datetime

# Convert list of words found in one line into
# a tuple consisting of a name, date/time and number of visits
def get_record(w):
    name = w[0]
    date = datetime.datetime.strptime((w[1] + ('%0*d' % (6, int(w[2])))), "%Y%m%d%H%M%S")
    visits = int(w[3])
    return (name, date, visits)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year and month on which the records will
# be grouped.
def get_key_by_month(rec):
    return (rec[0], rec[1].year, rec[1].month)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year, month and day on which the records will
# be grouped.
def get_key_by_day(rec):
    return (rec[0], rec[1].year, rec[1].month, rec[1].day)

# Build an iterator over the lines, each split into
# a list of words, skipping the first (header) line
words = (line.split() for line in sys.stdin)
words = dropwhile(lambda x: x[0]<1, enumerate(words))
words = map(lambda x: x[1], words)

# Convert to tuples containing name, date/time and count
records = list(get_record(w) for w in words)

# Group by name, month
groups = groupby(records, get_key_by_month)

# Sum visits in each group
print('Visits per month')
for (name, year, month), g in groups:
    visits = sum(map(lambda rec: rec[2], g))
    print(name, year, month, visits)

# Group by name, day
groups = groupby(records, get_key_by_day)

# Sum visits in each group
print ('\nVisits per day')
for (name, year, month, day), g in groups:
    visits = sum(map(lambda rec: rec[2], g))
    print(name, year, month, day, visits)

This code can be shortened by using lambda expressions and moving the call to groupby() directly into the for loop. I preferred to do things one step at a time for clarity and ease of debugging.
I also improved the code by creating an iterator that skips the header line, instead of creating a list and popping its first element, which caused the whole data set to be stored in memory.
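As a rough sketch of what the shortened form might look like in Python 3 (not the answer's exact code): the function name daily_and_monthly and the sample lines are made up for illustration, and the sketch keys on string slices of the date instead of datetime objects, which is a simplification. It uses itertools.islice to skip the header lazily, and still keeps the split rows in a list because they are grouped twice.

```python
from itertools import groupby, islice

def daily_and_monthly(lines):
    """Return (per_day, per_month) lists of (key, visits) tuples."""
    # islice(lines, 1, None) skips the header line without popping from a list.
    rows = [line.split() for line in islice(lines, 1, None)]
    # Group by (name, YYYYMMDD) and sum the last column.
    per_day = [
        (key, sum(int(r[3]) for r in grp))
        for key, grp in groupby(rows, key=lambda r: (r[0], r[1]))
    ]
    # Group by (name, YYYYMM) and sum the last column.
    per_month = [
        (key, sum(int(r[3]) for r in grp))
        for key, grp in groupby(rows, key=lambda r: (r[0], r[1][:6]))
    ]
    return per_day, per_month

# Hypothetical sample input: header plus three rows taken from the question.
sample = [
    "ArticleName Date Hour Count/Visit",
    "H 20130601 10000 2",
    "H 20130602 10000 1",
    "R 20130601 20000 2",
]
per_day, per_month = daily_and_monthly(sample)
print(per_day)
print(per_month)
```

Like the original, this assumes the input is already sorted by article name and date, because groupby only merges adjacent rows.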

asafpr · Accepted Answer · 2013-09-15 06:36:05Z

0

The easy way to do it, IMHO, would be to build a nested dictionary whose keys are page names and whose values are dictionaries mapping each date to its number of views. Iterate over the list to build the dictionary, then iterate over the dictionary for each page and sum the number of views for each month.
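A minimal sketch of this nested-dictionary idea in Python 3 could look like the following. The function names count_views and report, the header-skipping check, and the sample data are assumptions made for illustration; the monthly column is computed as a running total to match the output format in the question.

```python
from collections import defaultdict

def count_views(lines):
    """Return {page: {date: views}} built from whitespace-separated lines."""
    views = defaultdict(lambda: defaultdict(int))
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or not fields[3].isdigit():
            continue  # skip blank lines and the header row
        name, date, _hour, count = fields
        views[name][date] += int(count)
    return views

def report(views):
    """Yield (page, date, daily_views, running_monthly_views) tuples."""
    for name in sorted(views):
        monthly = defaultdict(int)  # month (YYYYMM) -> running total
        for date in sorted(views[name]):
            month = date[:6]
            monthly[month] += views[name][date]
            yield name, date, views[name][date], monthly[month]

# Hypothetical sample input using a few rows from the question.
sample = """\
ArticleName Date Hour Count/Visit
Ra 20130601 0 1
Ra 20130601 10000 2
Ra 20130602 10000 1
Rz 20130602 10000 1
""".splitlines()

for row in report(count_views(sample)):
    print("\t".join(str(x) for x in row))
```

Unlike a groupby-based solution, this does not require the input to be sorted, at the cost of holding one counter per page and date in memory.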

answered Sep 15, 2013 at 6:36

asafpr

