I have a csv file like this:

[email protected], 01-05-2014
[email protected], 01-05-2014
[email protected], 01-05-2014
[email protected], 01-05-2014

I am reading this CSV file and extracting the domain name of each email address, along with the number of addresses per domain and per date. I insert all of this into a MySQL table called domains, and that part works.

Problem statement: I now need to use the same table to report the top 50 domains by count, sorted by percentage growth over the last 30 days compared to the total. I don't understand how to do this.

Below is the code that successfully inserts into the MySQL database. What I can't work out is the reporting step.

#!/usr/bin/python
import csv
import os

import MySQLdb

from collections import defaultdict, Counter

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    # get current working directory path
    currentdirpath = os.getcwd()
    filepath = os.path.join(currentdirpath, filename)
    return filepath
# ===========================================================
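def read_CSV(filepath):

    # open the path that was passed in rather than a hard-coded file name
    with open(filepath) as f:
        reader = csv.reader(f)
        for row in reader:
            # row is [email, date]; strip the space that follows the comma
            domain = row[0].split('@')[1].strip()
            domain_counts[domain][row[1].strip()] += 1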

    db = MySQLdb.connect(host="localhost", # your host, usually localhost
                         user="root", # your username
                         passwd="abcdef1234", # your password
                         db="test") # name of the data base
    cur = db.cursor()

    q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""


    # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    for domain, data in domain_counts.items():
        for email_date, email_count in data.items():
            cur.execute(q, (domain, email_count, email_date))

    db.commit()
    db.close()

# ======================= main program =======================================
path = get_file_path('emails.csv') 
read_CSV(path) # read the input file

What is the right way to do the reporting task using the domains table?

Update:

Here is my domains table:

mysql> describe domains;
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| id             | int(11)     | NO   | PRI | NULL    | auto_increment |
| domain_name    | varchar(20) | NO   |     | NULL    |                |
| cnt            | int(11)     | YES  |     | NULL    |                |
| date_of_entry  | date        | NO   |     | NULL    |                |
+----------------+-------------+------+-----+---------+----------------+

And here is data I have in them:

mysql> select * from domains;
+----+---------------+-----+---------------+
| id | domain_name   | cnt | date_of_entry |
+----+---------------+-----+---------------+
|  1 | wawa.com      |   2 | 2014-04-30    |
|  2 | wawa.com      |   2 | 2014-05-01    |
|  3 | wawa.com      |   3 | 2014-05-31    |
|  4 | uwaterloo.ca  |   4 | 2014-04-30    |
|  5 | uwaterloo.ca  |   3 | 2014-05-01    |
|  6 | uwaterloo.ca  |   1 | 2014-05-31    |
|  7 | anonymous.com |   2 | 2014-04-30    |
|  8 | anonymous.com |   4 | 2014-05-01    |
|  9 | anonymous.com |   8 | 2014-05-31    |
| 10 | hotmail.com   |   4 | 2014-04-30    |
| 11 | hotmail.com   |   1 | 2014-05-01    |
| 12 | hotmail.com   |   3 | 2014-05-31    |
| 13 | gmail.com     |   6 | 2014-04-30    |
| 14 | gmail.com     |   4 | 2014-05-01    |
| 15 | gmail.com     |   8 | 2014-05-31    |
+----+---------------+-----+---------------+
  • Normally, you'd import all the data, and then run queries as required to extract a desired result. You wouldn't create more tables to store derived data; that's redundancy. Commented Oct 18, 2015 at 9:27
  • Yeah, this is a learning exercise to see how MySQL can be used from Python. I understand the MySQL database is redundant here, since the same result could be computed directly from the CSV file, but I want to learn MySQL from Python, so I am doing it this way. Commented Oct 18, 2015 at 9:29
  • The database is not necessarily redundant; just the formation of additional tables - beyond those required for normalisation to 3NF Commented Oct 18, 2015 at 9:39
  • Anyway, assuming we are going to use an RDBMS for this task, if I was you I'd remove all the application level code and just focus on the dataset (the domains table) and the desired result. Commented Oct 18, 2015 at 9:43
  • yeah understood. Just to make question more clear, I added full description and code of what I am trying to do. Commented Oct 18, 2015 at 10:05

2 Answers

The report you need can be produced in SQL on the MySQL side. Python can then run the query, fetch the result set, and print the results.

Consider the following aggregate query, which uses a subquery and a derived table and follows this percentage growth formula:

((domain's total cnt in the last 30 days) - (domain's total cnt before that))
 / (all domains' total cnt before the last 30 days)
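
To make the formula above concrete, here is a minimal Python sketch that applies it to the sample rows from the question. The function name pct_growth and the reference date 2014-06-15 are assumptions chosen for illustration (the date is picked so that the 2014-05-31 rows fall inside the 30-day window); this is not the query itself, only the same arithmetic.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (domain_name, cnt, date_of_entry)
ROWS = [
    ("wawa.com", 2, date(2014, 4, 30)),
    ("wawa.com", 2, date(2014, 5, 1)),
    ("wawa.com", 3, date(2014, 5, 31)),
    ("uwaterloo.ca", 4, date(2014, 4, 30)),
    ("uwaterloo.ca", 3, date(2014, 5, 1)),
    ("uwaterloo.ca", 1, date(2014, 5, 31)),
    ("anonymous.com", 2, date(2014, 4, 30)),
    ("anonymous.com", 4, date(2014, 5, 1)),
    ("anonymous.com", 8, date(2014, 5, 31)),
    ("hotmail.com", 4, date(2014, 4, 30)),
    ("hotmail.com", 1, date(2014, 5, 1)),
    ("hotmail.com", 3, date(2014, 5, 31)),
    ("gmail.com", 6, date(2014, 4, 30)),
    ("gmail.com", 4, date(2014, 5, 1)),
    ("gmail.com", 8, date(2014, 5, 31)),
]


def pct_growth(rows, today, window_days=30):
    """Hypothetical helper: per-domain growth following the formula above."""
    cutoff = today - timedelta(days=window_days)
    recent = defaultdict(int)   # per-domain cnt on or after the cutoff
    earlier = defaultdict(int)  # per-domain cnt before the cutoff
    for domain, cnt, day in rows:
        if day >= cutoff:
            recent[domain] += cnt
        else:
            earlier[domain] += cnt
    earlier_total = sum(earlier.values())  # all domains, before the cutoff
    domains = set(recent) | set(earlier)
    if earlier_total == 0:
        # Nothing to divide by; SQL would yield NULL here.
        return {d: None for d in domains}
    return {d: (recent[d] - earlier[d]) / earlier_total for d in domains}


# Assumed reference date; the real query uses CURRENT_DATE instead.
growth = pct_growth(ROWS, today=date(2014, 6, 15))
for domain in sorted(growth, key=growth.get, reverse=True)[:50]:
    print(domain, growth[domain])
```

With this reference date, only anonymous.com has a positive value, because it is the only domain whose count in the window exceeds its earlier count.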

SQL

SELECT  domain_name, pct_growth
FROM (

SELECT t1.domain_name,  
         # SUM OF SPECIFIC DOMAIN'S CNT BETWEEN TODAY AND 30 DAYS AGO  
        (Sum(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY) 
                  THEN t1.cnt ELSE 0 END)               
         -
         # SUM OF SPECIFIC DOMAIN'S CNT AS OF 30 DAYS AGO
         Sum(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY) 
                  THEN t1.cnt ELSE 0 END) 
        ) /   
        # SUM OF ALL DOMAINS' CNT AS OF 30 DAYS AGO
        (SELECT SUM(t2.cnt) FROM domains t2 
          WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
         As pct_growth   

FROM domains t1
GROUP BY t1.domain_name
) As derivedTable

ORDER BY pct_growth DESC
LIMIT 50;

Python

# db is an open MySQLdb connection, created as in the question's code
cur = db.cursor()
sql = "SELECT * FROM ..."  # SEE ABOVE

cur.execute(sql)

for row in cur.fetchall():
    print(row)

6 Comments

Thanks for the suggestion. When I ran your query on the domains table, pct_growth came out NULL for every domain_name. I'm not sure what's wrong. Do you see any issue?
See edit. Try wrapping cnt with IFNULL() to replace NULL with zero, since any arithmetic expression involving NULL evaluates to NULL in MySQL. It also just occurred to me that the cnt sum for the last 30 days may be much smaller than the cnt sum from before 30 days ago, since the latter has no lower bound. To account for all cnt to date, simply remove the first CASE WHEN expression and just use Sum(t1.cnt).
I see. I just tried your new edit as well and it still gives back NULL in the pct_growth column for every domain_name. In my domains table I don't have any zero or negative counts, and my dates are in this format: 2014-04-30.
Can you post a sample of your domains table data? You only show how the CSV data looks. I know I tested this query. Check that cnt is a numeric column and date_of_entry is a date column. They may look like these types but be stored in string columns.
Just caught the issue. Your data has no cnt between today and 30 days ago, so the first CASE resulted in NULLs. Simply add ELSE 0 to both CASE WHEN expressions. Here is a SQL Fiddle using your data.
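
The NULL behaviour described in the last comment can be reproduced without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; that swap is an assumption made purely so the example runs anywhere. SUM over a CASE with no matching rows and no ELSE branch is NULL in both engines. The cutoff is a fixed, assumed date later than every row, standing in for CURRENT_DATE - INTERVAL 30 DAY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domains (domain_name TEXT, cnt INTEGER, date_of_entry TEXT)"
)
# Three of the question's rows are enough to show the effect.
conn.executemany(
    "INSERT INTO domains VALUES (?, ?, ?)",
    [
        ("wawa.com", 2, "2014-04-30"),
        ("wawa.com", 2, "2014-05-01"),
        ("wawa.com", 3, "2014-05-31"),
    ],
)

# Assumed cutoff, later than every row, so no row is "recent".
# ISO-formatted date strings compare correctly as text.
cutoff = "2015-09-18"

without_else, with_else = conn.execute(
    """
    SELECT SUM(CASE WHEN date_of_entry >= ? THEN cnt END),
           SUM(CASE WHEN date_of_entry >= ? THEN cnt ELSE 0 END)
    FROM domains
    """,
    (cutoff, cutoff),
).fetchone()
conn.close()

# NULL arrives in Python as None; the ELSE 0 variant gives a real number.
print(without_else, with_else)
```

Once one operand is NULL, the subtraction and division around it are NULL too, which is why the whole pct_growth column came back empty.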

If I understand correctly, you just need the ratio of the past thirty days to the total count. You can get this using conditional aggregation. So, assuming that cnt is always greater than 0:

select d.domain_name,
       sum(cnt) as CntTotal,
       sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) as Cnt30Days,
       (sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) / sum(cnt)) as Ratio30Days
from domains d
group by d.domain_name
order by Ratio30Days desc;
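
As a rough check of this conditional aggregation, the sketch below runs an equivalent query against the question's sample rows. Several things here are assumptions rather than part of the answer: sqlite3 is used in place of MySQL so the example is self-contained, the cutoff is passed in as a fixed date (2014-05-15) instead of date_sub(now(), interval 1 month), a LIMIT 50 is added to match the "top 50" requirement, and the 1.0 * factor is needed only because SQLite divides integers as integers where MySQL's / does not. The name top_domains is hypothetical.

```python
import sqlite3

# Sample rows from the question: (domain_name, cnt, date_of_entry)
ROWS = [
    ("wawa.com", 2, "2014-04-30"),
    ("wawa.com", 2, "2014-05-01"),
    ("wawa.com", 3, "2014-05-31"),
    ("uwaterloo.ca", 4, "2014-04-30"),
    ("uwaterloo.ca", 3, "2014-05-01"),
    ("uwaterloo.ca", 1, "2014-05-31"),
    ("anonymous.com", 2, "2014-04-30"),
    ("anonymous.com", 4, "2014-05-01"),
    ("anonymous.com", 8, "2014-05-31"),
    ("hotmail.com", 4, "2014-04-30"),
    ("hotmail.com", 1, "2014-05-01"),
    ("hotmail.com", 3, "2014-05-31"),
    ("gmail.com", 6, "2014-04-30"),
    ("gmail.com", 4, "2014-05-01"),
    ("gmail.com", 8, "2014-05-31"),
]

RATIO_SQL = """
SELECT domain_name,
       SUM(cnt) AS CntTotal,
       SUM(CASE WHEN date_of_entry >= :cutoff THEN cnt ELSE 0 END) AS Cnt30Days,
       1.0 * SUM(CASE WHEN date_of_entry >= :cutoff THEN cnt ELSE 0 END)
           / SUM(cnt) AS Ratio30Days
FROM domains
GROUP BY domain_name
ORDER BY Ratio30Days DESC
LIMIT 50
"""


def top_domains(rows, cutoff):
    """Hypothetical helper: load rows into a throwaway table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE domains "
            "(domain_name TEXT, cnt INTEGER, date_of_entry TEXT)"
        )
        conn.executemany("INSERT INTO domains VALUES (?, ?, ?)", rows)
        return conn.execute(RATIO_SQL, {"cutoff": cutoff}).fetchall()
    finally:
        conn.close()


# Assumed cutoff; only the 2014-05-31 rows count as recent.
report = top_domains(ROWS, cutoff="2014-05-15")
for row in report:
    print(row)
```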

2 Comments

Thanks Gordon. Yes, cnt will always be greater than zero. Also, how do I display the result of this query? What structure should I follow?
That's a rather broad question. There are multiple ways to run queries in Python. Any of them should work for this query.
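
One possible structure, offered only as a sketch: fetch the rows through whatever DB-API cursor you use, then format them as a fixed-width text table. The helper name format_report is hypothetical, and the two rows below are illustrative placeholders for what cur.fetchall() might return, not real query output. Note that MySQLdb typically returns DECIMAL values for a division, which this helper would print via str() rather than with four decimal places.

```python
def format_report(headers, rows):
    """Return the rows as a fixed-width text table with a header rule."""
    cells = [[str(h) for h in headers]]
    for row in rows:
        # Floats get four decimal places; everything else is printed as-is.
        cells.append(
            [f"{v:.4f}" if isinstance(v, float) else str(v) for v in row]
        )
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(line[i]) for line in cells) for i in range(len(headers))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip()
        for line in cells
    ]
    lines.insert(1, "  ".join("-" * w for w in widths))
    return "\n".join(lines)


# Placeholder rows standing in for cur.fetchall().
headers = ("domain_name", "CntTotal", "Cnt30Days", "Ratio30Days")
rows = [
    ("anonymous.com", 14, 8, 8 / 14),
    ("gmail.com", 18, 8, 8 / 18),
]
report = format_report(headers, rows)
print(report)
```

Keeping the formatting in a function that takes plain tuples means the same code works whether the rows come from MySQLdb, another driver, or a test fixture.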
