2

I have a set of dates:

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'}

The same dates appear in a text ('data' from now on). It's a pretty long text. I want to loop over the text and count how many times each date appears, then print the 5 dates with the most occurrences.

What I have now is this:

import heapq

def dates(data, dates1):
    lines = data.split("\n")
    dict_days = {}
    for day in dates1:
        count = 0
        for line in lines:
            if day in line:
                count += 1
        dict_days[day] = count

    newA = heapq.nlargest(5, dict_days, key=dict_days.get)

    print(newA)

I split the text into lines and create a dict; for every date in the set, it checks every line and adds 1 to count each time the date is found.

This works fine, BUT the method takes a long time to run.

So what I am asking is whether someone knows a more efficient way to do exactly the same thing.

Any help will be really appreciated.

Edit

I will try every single answer and let you know, thanks in advance

  • Warning: if day in line: is dangerous, because if day == '1/1/2015' it'll be in a line which is '21/1/2015'. Commented Sep 9, 2015 at 19:50
  • Use regular expressions instead of if day in line and surround the tokens with \b if they would occur as whole words. Commented Sep 9, 2015 at 19:52
  • fantastic catch @DSM Commented Sep 9, 2015 at 19:59
  • Yes, perfect catch. How should I improve this? @DSM Commented Sep 9, 2015 at 20:00
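
To make the warning above concrete, here is a small sketch of the false positive and one possible way around it. The helper name count_exact and the sample lines are made up for illustration, and the sketch uses digit lookarounds instead of \b so the intent (no digit directly before or after the date) is explicit; it assumes a date is never glued to other digits in the text.

```python
import re

def count_exact(day, lines):
    """Count lines containing `day` as a whole date, not as part of a longer one."""
    # (?<!\d) and (?!\d): no digit immediately before or after the date.
    pattern = re.compile(r"(?<!\d)" + re.escape(day) + r"(?!\d)")
    return sum(1 for line in lines if pattern.search(line))

# Hypothetical sample lines, not taken from the real data.
lines = ["meeting on 21/1/2015", "holiday on 1/1/2015", "nothing here"]

# The plain substring test also counts the '21/1/2015' line.
naive = sum(1 for line in lines if "1/1/2015" in line)
exact = count_exact("1/1/2015", lines)
```

Here naive counts two lines while exact counts only the line that really contains 1/1/2015.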

4 Answers

7

Loop over the lines once, extracting any date with a regular expression. If the date is in the set, increment its count in a Counter; at the end, call Counter.most_common to get the 5 most common dates:

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'}


from collections import Counter
import re

def dates(data, dates1):
    lines = data.split("\n")
    dict_days = Counter()
    r = re.compile(r"\d+/\d+/\d+")  # raw string, so \d is passed to re unchanged
    for line in lines:
        match = r.search(line)
        if match:
            dte = match.group()
            if dte in dates1:
                dict_days[dte] += 1
    return dict_days.most_common(5)

This does a single pass over the list of lines, as opposed to one pass for every date in dates1.

For 100k lines with the date string at the end of a string with 200+ chars:

In [9]: from random import choice

In [10]: dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'}

In [11]: dtes = list(dates1)

In [12]: s = "the same dates appear in a text ('data' from now on). It's a pretty long text. I want to loop over the text and get the number of times each date appear in the text, then i print the 5 dates with more occurances. "

In [13]: data = "\n".join([s+ choice(dtes) for _ in range(100000)])

In [14]: timeit dates(data,dates1)
1 loops, best of 3: 662 ms per loop

If more than one date can appear per line you can use findall:

def dates(data, dates1):
    lines = data.split("\n")
    r = re.compile(r"\d+/\d+/\d+")
    dict_days = Counter(dt for line in lines
                        for dt in r.findall(line) if dt in dates1)
    return dict_days.most_common(5)

If data is not actually a file-like object but a single string, just search the string itself:

def dates(data, dates1):
    r = re.compile(r"\d+/\d+/\d+")
    dict_days = Counter((dt for dt in r.findall(data) if dt in dates1))
    return dict_days.most_common(5)

Compiling the dates into a single pattern seems to be the fastest approach on the test data; splitting each line into substrings is pretty close to the search implementation:

def dates_split(data, dates1):
    lines = data.split("\n")
    dict_days = Counter(dt for line in lines
                        for dt in line.split() if dt in dates1)
    return dict_days.most_common(5)

def dates_comp_date1(data, dates1):
    lines = data.split("\n")
    r = re.compile("|".join(dates1))
    dict_days = Counter(dt for line in lines for dt in r.findall(line))
    return dict_days.most_common(5)

Using the functions above:

In [63]: timeit dates(data, dates1)
1 loops, best of 3: 640 ms per loop

In [64]: timeit dates_split(data, dates1)
1 loops, best of 3: 535 ms per loop

In [65]: timeit dates_comp_date1(data, dates1)
1 loops, best of 3: 368 ms per loop
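
One caveat with joining the dates into a single pattern: without boundaries, an alternative such as 4/4/2015 can also match inside 14/4/2015. Below is a hedged variant of the same idea, not the version that was timed above. The name dates_alternation, the parameter n, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def dates_alternation(data, dates1, n=5):
    """Count whole-date matches of any date in `dates1` within `data`."""
    # Longest alternatives first, so a longer date wins where one date
    # could be a prefix of another; re.escape keeps every date literal.
    ordered = sorted(dates1, key=len, reverse=True)
    alternatives = "|".join(re.escape(d) for d in ordered)
    # Digit lookarounds reject matches embedded in a longer number.
    pattern = re.compile(r"(?<!\d)(?:" + alternatives + r")(?!\d)")
    return Counter(pattern.findall(data)).most_common(n)

# Hypothetical sample text: 14/4/2015 must not be counted as 4/4/2015.
sample = "a 21/5/2015\nb 4/4/2015 and 4/4/2015\nc 14/4/2015\n"
result = dates_alternation(sample, {"21/5/2015", "4/4/2015", "1/5/2015"})
```

The lookarounds add some work per match, so the timing may differ from the figures above; I have not measured it.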

7 Comments

It looks great. Let me try this and i will let you know sir.
I'm not used to re, but r = r.search(line)? Wouldn't this prevent all lines but the first one from being scanned?
@MathiasEttinger, we are looping over lines, getting each line at a time, we extract the date substring from each line if it is there and use that. If there can be more than one date substring per line, the OP can use findall and loop over that
I understand the code; my concern is about overwriting the regexp with its result.
@MathiasEttinger, you are referring to the use of r? That was a typo; I changed it to match.
4
from collections import Counter
Counter(word for word in my_text.split() if word in my_dates)

I think this would work quickly... well, O(N) (ish).
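
One thing to watch with whitespace splitting: a token such as '21/5/2015.' keeps its trailing punctuation and will not be found in the set. A minimal sketch, assuming dates are separated by whitespace and at most wrapped in punctuation (count_tokens and text are illustrative names, not from the question):

```python
import string
from collections import Counter

def count_tokens(text, wanted):
    """Count whitespace-separated tokens that equal one of the wanted dates."""
    # Strip punctuation from both ends so '4/4/2015.' still matches '4/4/2015'.
    # Only the ends are stripped, so the slashes inside a date are kept.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return Counter(tok for tok in tokens if tok in wanted)

text = "Today is 21/5/2015. Yesterday was 4/4/2015.\nMy birthday is 4/4/2015"
counts = count_tokens(text, {"21/5/2015", "4/4/2015", "15/6/2015"})
```

counts.most_common(5) then gives the most frequent dates, as in the other answers.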

1

Use a regular expression to extract the data, and a collections.Counter to find the most common:

import re
import collections

def dates(data, dates1):
    # Escape each date so it is matched literally, then OR them together.
    pattern = '|'.join(re.escape(x) for x in dates1)
    found = re.findall(pattern, data)
    counts = collections.Counter(found)
    print(counts.most_common(5))

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015'}
data = 'Today is 21/5/2015. Yesterday is 4/4/2015.\nMy birthday is 4/4/2015'

dates(data, dates1)

0

Why not just do:

dates = {'21/5/2015':0, '4/4/2015':0, '15/6/2015':0, '30/1/2015':0, '19/3/2015':0, '25/2/2015':0, '25/5/2015':0, '8/2/2015':0, '6/6/2015':0, '15/3/2015':0, '15/1/2015':0, '30/5/2015':0}

def processDates(data):
    lines = data.split("\n")
    for line in lines:
        if line in dates:
            dates[line] += 1

Then just sort dates by value
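
For the sorting step, something along these lines should do. The counts here are made up for illustration; heapq.nlargest is shown as an alternative when only the top few entries are needed.

```python
import heapq

# Made-up counts, standing in for the dict filled by the loop above.
counts = {"21/5/2015": 3, "4/4/2015": 7, "15/6/2015": 0, "30/1/2015": 5}

# Full ordering, highest count first.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Only the top two, without sorting the whole dict.
top_two = heapq.nlargest(2, counts.items(), key=lambda item: item[1])
```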
