I am writing a Python script to count monthly and daily visits for web pages. Input file:

ArticleName Date        Hour    Count/Visit
Aa   20130601    10000   1
Aa   20130601    10000   1
Ew   20130601    10000   1
H    20130601    10000   2
H    20130602    10000   1
R    20130601    20000   2
R    20130602    10000   1
Ra   20130601    0   1
Ra   20130601    10000   2
Ra   20130602    10000   1
Ram  20130601    0   2
Ram  20130601    10000   3
Ram  20130602    10000   4
Re   20130601    20000   1
Re   20130602    10000   3
Rz   20130602    10000   1

I need to count the total monthly and daily page views of each page.

Output:

ArticleName Date     DailyView MonthlyView
Aa   20130601 2 2
Ew   20130601 1 1
H    20130601 2 2
H    20130602 1 3
R    20130601 2 2
R    20130602 1 4
Ra   20130601 5 5
Ra   20130602 1 6
Ram  20130601 5 5
Ram  20130602 4 9
Re   20130601 1 1
Re   20130602 3 4
Rz   20130602 1 1

My Script:

#!/usr/bin/python

import sys

last_date = 20130601
last_hour = 0
last_count = 0
last_article = None
monthly_count = 0
daily_count = 0

for line in sys.stdin:
  article, date, hour, count = line.split()
  count = int(count)
  date = int(date)
  hour = int(hour)

  #Articles match and date match
  if last_article == article and last_date == date:
      daily_count = count+last_count
      monthly_count = count+last_count
      # print '%s\t%s\t%s\t%s' % (article, date, daily_count, monthly_count)
  #Article match but date doesn't match 
  if last_article == article and last_date != date:
          monthly_count = count
          daily_count=count
          print '%s\t%s\t%s\t%s' % (article, date, daily_count, monthly_count)


  #Article doesn't match
  if last_article != article:
          last_article = article
          last_count = count
          monthly_count = count
          daily_count=count
          last_date = date
          print '%s\t%s\t%s\t%s' % (article, date, daily_count, monthly_count)

I am able to get most of the output, but it is wrong in two cases:

  1. I couldn't find a way to sum the counts when ArticleName and Date are the same. For example, for Ra this script prints:

     Ra 20130601 1 1
     Ra 20130601 3 3
     Ra 20130602 1 1

     At the end, Ra should print 1+3+1=5 as the final monthly count instead of 1.

  2. Since the third if condition prints every article that differs from the last article, I get a row for the same article name and date twice. For example, Ra 20130601 1 1 should not have been printed.

Does anybody know how to correct this? Let me know if you need any more information.
  • All data is for June 2013, but all articles are different. I need to find out how many times each article was visited daily and monthly. Commented Sep 15, 2013 at 6:22
  • I got it. They are cumulative counts, right? Commented Sep 15, 2013 at 6:23
  • R 20130602 1 4 should be R 20130602 1 3? Commented Sep 15, 2013 at 6:34

3 Answers

Try the following:

import itertools
import operator
import sys

lines = (line.split() for line in sys.stdin)
prev_name, prev_month = '', '99999999'
month_view = 0
for (name,date), grp in itertools.groupby(lines, key=operator.itemgetter(0,1)):
    view = sum(int(row[-1]) for row in grp)
    if prev_name == name and date.startswith(prev_month):
        month_view += view
    else:
        prev_name = name
        prev_month = date[:6]
        month_view = view
    print '{}\t{}\t{}\t{}'.format(name, date, view, month_view)

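This uses itertools.groupby and operator.itemgetter. Note that groupby only merges consecutive rows with the same key, so the input must already be sorted by article name and date (as it is here).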

The output differs from the expected output in the question (rows R 20130602, Ra 20130601 and Ra 20130602):

Aa      20130601        2       2
Ew      20130601        1       1
H       20130601        2       2
H       20130602        1       3
R       20130601        2       2
R       20130602        1       3
Ra      20130601        3       3
Ra      20130602        1       4
Ram     20130601        5       5
Ram     20130602        4       9
Re      20130601        1       1
Re      20130602        3       4
Rz      20130602        1       1
4 Comments

Thanks @falsetru. I have trouble understanding what this code does: for (name,date), grp in itertools.groupby(lines, key=operator.itemgetter(0,2)): view = sum(int(row[-1]) for row in grp). I understand that I get two values, name and date, from lines, which are the first and third values in a row, and the rest of the line as grp. Am I right? Also, in the second line, is view the sum of row[-1] over the rows in grp? If so, what does -1 indicate? Do rows start from -1 in Python?
@CtrlV, xs[-1] retrieves the last value of xs.
@CtrlV, I used groupby(lines, key=operator.itemgetter(0,1)), not 0,2. It groups lines by the first and second fields (ArticleName, Date). In each iteration, name and date refer to ArticleName and Date, and grp refers to an iterator that yields the grouped lines.
@CtrlV, For more information about groupby, follow the link I provided in the answer.
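
To make the grouping and the row[-1] indexing discussed in these comments concrete, here is a minimal sketch in Python 3. The sample rows and the helper name daily_totals are assumptions made up for illustration; they are not part of the answer's code.

```python
import itertools
import operator

# Hypothetical sample rows, already split into fields and sorted by
# (ArticleName, Date), as groupby requires.
rows = [
    ['Ra', '20130601', '0', '1'],
    ['Ra', '20130601', '10000', '2'],
    ['Ra', '20130602', '10000', '1'],
]

def daily_totals(rows):
    """Return (name, date, views) for each consecutive (name, date) run."""
    totals = []
    key = operator.itemgetter(0, 1)  # picks fields 0 and 1 as a tuple
    for (name, date), grp in itertools.groupby(rows, key=key):
        # row[-1] is the last field of a row, i.e. the Count/Visit column.
        totals.append((name, date, sum(int(row[-1]) for row in grp)))
    return totals

print(daily_totals(rows))
# prints [('Ra', '20130601', 3), ('Ra', '20130602', 1)]
```

The two Ra rows for 20130601 collapse into one group whose last fields (1 and 2) sum to 3.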

A better way to achieve what you want is to use functional-style tools: the built-in map together with groupby from itertools. See http://docs.python.org/2/howto/functional.html

import itertools
from itertools import groupby
from itertools import dropwhile
import sys
import datetime

# Convert list of words found in one line into
# a tuple consisting of a name, date/time and number of visits
def get_record(w):
    name = w[0]
    date = datetime.datetime.strptime((w[1] + ('%0*d' % (6, int(w[2])))), "%Y%m%d%H%M%S")
    visits = int(w[3])
    return (name, date, visits)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year and month on which the records will
# be grouped.
def get_key_by_month((name, date, visits)):
    return (name, date.year, date.month)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year, month and day on which the records will
# be grouped.
def get_key_by_day((name, date, visits)):
    return (name, date.year, date.month, date.day)

# Get a list containing lines, each line containing
# a list of words, skipping the first line
words = (line.split() for line in sys.stdin)
words = dropwhile(lambda x: x[0]<1, enumerate(words))
words = map(lambda x: x[1], words)

# Convert to tuples containing name, date/time and count
records = list(get_record(w) for w in words)

# Group by name, month
groups = groupby(records, get_key_by_month)

# Sum visits in each group
print('Visits per month')
for (name, year, month), g in groups:
    visits = sum(map(lambda (name,date,visits): visits, g))
    print name, year, month, visits

# Group by name, day
groups = groupby(records, get_key_by_day)

# Sum visits in each group
print ('\nVisits per day')
for (name, year, month, day), g in groups:
    visits = sum(map(lambda (name,date,visits): visits, g))
    print name, year, month, day, visits

Python 3 version of the above code:

import itertools
from itertools import groupby
from itertools import dropwhile
import sys
import datetime

# Convert list of words found in one line into
# a tuple consisting of a name, date/time and number of visits
def get_record(w):
    name = w[0]
    date = datetime.datetime.strptime((w[1] + ('%0*d' % (6, int(w[2])))), "%Y%m%d%H%M%S")
    visits = int(w[3])
    return (name, date, visits)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year and month on which the records will
# be grouped.
def get_key_by_month(rec):
    return (rec[0], rec[1].year, rec[1].month)

# Takes a tuple representing a single record and returns a tuple
# consisting of a name, year, month and day on which the records will
# be grouped.
def get_key_by_day(rec):
    return (rec[0], rec[1].year, rec[1].month, rec[1].day)

# Get a list containing lines, each line containing
# a list of words, skipping the first line
words = (line.split() for line in sys.stdin)
words = dropwhile(lambda x: x[0]<1, enumerate(words))
words = map(lambda x: x[1], words)

# Convert to tuples containing name, date/time and count
records = list(get_record(w) for w in words)

# Group by name, month
groups = groupby(records, get_key_by_month)

# Sum visits in each group
print('Visits per month')
for (name, year, month), g in groups:
    visits = sum(map(lambda rec: rec[2], g))
    print(name, year, month, visits)

# Group by name, day
groups = groupby(records, get_key_by_day)

# Sum visits in each group
print ('\nVisits per day')
for (name, year, month, day), g in groups:
    visits = sum(map(lambda rec: rec[2], g))
    print(name, year, month, day, visits)

2 Comments

This code can be shortened by using lambda expressions and moving the call to groupby() directly in the for loop. I preferred to do things a step at a time for clarity and ease of debugging.
Improved the code by creating an iterator that skips the header line, instead of creating a list and popping the first element, which would store the whole data set in memory.
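
As a possible alternative to the dropwhile/enumerate pair for skipping the header, itertools.islice can drop the first line lazily. This is only a sketch in Python 3; skip_header and the sample list are hypothetical names and data, not taken from the answer.

```python
import itertools

def skip_header(lines):
    """Lazily yield the split fields of every line after the first one."""
    # islice(lines, 1, None) skips item 0 without building a list.
    return (line.split() for line in itertools.islice(lines, 1, None))

# Hypothetical input; in the answer the lines would come from sys.stdin.
sample = ["ArticleName Date Hour Count/Visit", "Aa 20130601 10000 1"]
print(list(skip_header(sample)))
# prints [['Aa', '20130601', '10000', '1']]
```

Because both islice and the generator expression are lazy, only one line is held in memory at a time until the records are collected.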
The easy way to do it, IMHO, would be to build a nested dictionary with the page name as the key, whose value is a dictionary mapping each date to its number of views. Iterate over the input to build the dictionary, then iterate over the dictionary for each page and sum the views for each month.
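
A minimal sketch of this nested-dictionary idea in Python 3 might look as follows. The function names count_views and report are assumptions, as is the choice to treat the first six characters of the date as the month and to report the monthly figure as a running total, which matches the question's expected layout.

```python
from collections import defaultdict

def count_views(lines):
    """Build {article: {date: daily_views}} from whitespace-separated lines."""
    views = defaultdict(lambda: defaultdict(int))
    for line in lines:
        fields = line.split()
        # Skip the header and any malformed line (assumed 4 fields, numeric count).
        if len(fields) != 4 or not fields[3].isdigit():
            continue
        article, date, _hour, count = fields
        views[article][date] += int(count)
    return views

def report(views):
    """Return (article, date, daily_views, running_monthly_views) rows."""
    rows = []
    for article in sorted(views):
        monthly = {}  # month (YYYYMM) -> running total for this article
        for date in sorted(views[article]):
            month = date[:6]
            monthly[month] = monthly.get(month, 0) + views[article][date]
            rows.append((article, date, views[article][date], monthly[month]))
    return rows

# Hypothetical subset of the question's input.
sample = [
    "ArticleName Date Hour Count/Visit",
    "Ra 20130601 0 1",
    "Ra 20130601 10000 2",
    "Ra 20130602 10000 1",
    "H 20130601 10000 2",
    "H 20130602 10000 1",
]
for row in report(count_views(sample)):
    print('\t'.join(str(field) for field in row))
```

Unlike the groupby-based answers, this approach does not need sorted input, at the cost of holding one counter per article and date in memory.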
