Performance - searching a string in a text file - Python

Question

I have a set of dates:

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'}

The same dates appear in a text ('data' from now on). It's a pretty long text. I want to loop over the text and count how many times each date appears in it, then print the 5 dates with the most occurrences.

What I have now is this:

import heapq

def dates(data, dates1):
    lines = data.split("\n")
    dict_days = {}
    for day in dates1:
        count = 0
        for line in lines:
            if day in line:
                count += 1
        dict_days[day] = count

    newA = heapq.nlargest(5, dict_days, key=dict_days.get)

    print(newA)

I split the text into lines and create a dict. For every date in the set, it looks for that date in every line and adds 1 to count each time it finds it.

This works fine, BUT the method takes a long time to run.

So what I am asking is whether someone knows a more efficient way to do exactly the same thing.

Any help will be really appreciated.

Edit

I will try every single answer and let you know, thanks in advance

Warning: if day in line: is dangerous, because if day == '1/1/2015' it'll be in a line which is '21/1/2015'.
– DSM, Commented Sep 9, 2015 at 19:50
Use regular expressions instead of if day in line and surround the tokens with \b if they would occur as whole words.
– mpcabd, Commented Sep 9, 2015 at 19:52
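
The two comments above can be illustrated with a minimal sketch. The sample line and the variable names are made up for the illustration; only the `\b` idea comes from the comment itself.

```python
import re

# Hypothetical sample line and date, chosen to trigger the problem DSM describes.
line = "Report filed on 21/1/2015"
day = "1/1/2015"

# Plain substring test: a false positive, because "1/1/2015" is the tail of "21/1/2015".
print(day in line)  # True

# Word boundaries reject the partial match; re.escape keeps the date literal.
pattern = re.compile(r"\b" + re.escape(day) + r"\b")
print(pattern.search(line) is not None)  # False
print(pattern.search("Report filed on 1/1/2015") is not None)  # True
```

Note that `\b` only checks for a word/non-word transition, so it may still need tightening if the text contains other slash-separated numbers.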

Padraic Cunningham · Accepted Answer · 2015-09-09 20:59:03Z

7

Loop over the lines once and extract any date from each line. If the date is in the set, increment its count in a Counter dict. At the end, call Counter.most_common to get the 5 most common dates:

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'}


from collections import Counter
import re

def dates(data, dates1):
    lines = data.split("\n")
    dict_days = Counter()
    r = re.compile(r"\d+/\d+/\d+")  # raw string, so \d is not read as a string escape
    for line in lines:
        match = r.search(line)
        if match:
            dte = match.group()
            if dte in dates1:
                dict_days[dte] += 1
    return dict_days.most_common(5)

This does a single pass over the list of lines, as opposed to one pass for every date in dates1.

For 100k lines with the date string at the end of a string with 200+ chars:

In [9]: from random import choice

In [10]: dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'}

In [11]: dtes = list(dates1)

In [12]: s = "the same dates appear in a text ('data' from now on). It's a pretty long text. I want to loop over the text and get the number of times each date appear in the text, then i print the 5 dates with more occurances. "

In [13]: data = "\n".join([s+ choice(dtes) for _ in range(100000)])

In [14]: timeit dates(data,dates1)
1 loops, best of 3: 662 ms per loop
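
The `timeit` line above is an IPython magic. Outside IPython, a similar measurement can be sketched with the standard `timeit` module. The filler text, the smaller line count, and the reduced date set below are assumptions made to keep the run short, so the figures will not match the ones quoted in this answer.

```python
import re
import timeit
from collections import Counter
from random import choice

# Assumed stand-in data: 10,000 lines, each ending in one of three dates.
dates1 = {'21/5/2015', '4/4/2015', '15/6/2015'}
filler = "some filler text that precedes the date on every line "
data = "\n".join(filler + choice(list(dates1)) for _ in range(10000))

def dates(data, dates1):
    r = re.compile(r"\d+/\d+/\d+")
    return Counter(dt for dt in r.findall(data) if dt in dates1).most_common(5)

# Best of 3 runs, each calling dates() 10 times; the result depends on the machine.
best = min(timeit.repeat(lambda: dates(data, dates1), number=10, repeat=3))
print("best of 3: %.1f ms per call" % (best / 10 * 1000))
```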

If more than one date can appear per line you can use findall:

def dates(data, dates1):
    lines = data.split("\n")
    r = re.compile(r"\d+/\d+/\d+")
    dict_days = Counter(dt for line in lines
                        for dt in r.findall(line) if dt in dates1)
    return dict_days.most_common(5)

If data is not actually a file-like object but a single string, just search the string itself:

def dates(data, dates1):
    r = re.compile(r"\d+/\d+/\d+")
    dict_days = Counter((dt for dt in r.findall(data) if dt in dates1))
    return dict_days.most_common(5)
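
For the opposite case, where the text really does come from a file, a possible sketch is to iterate over the file object directly so the whole text never has to be held as one string. The function name `dates_from_file` is hypothetical, and `io.StringIO` stands in for a real file opened with `open()`.

```python
import io
import re
from collections import Counter

def dates_from_file(fileobj, dates1, n=5):
    # Iterating a file object yields one line at a time.
    r = re.compile(r"\d+/\d+/\d+")
    counts = Counter(dt for line in fileobj
                     for dt in r.findall(line) if dt in dates1)
    return counts.most_common(n)

# io.StringIO behaves like an open text file for the purposes of this example.
fake_file = io.StringIO("x 21/5/2015\ny 4/4/2015 21/5/2015\nz 9/9/2015\n")
result = dates_from_file(fake_file, {"21/5/2015", "4/4/2015"})
print(result)  # [('21/5/2015', 2), ('4/4/2015', 1)]
```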

Compiling the dates themselves into a single pattern seems to be the fastest approach on the test data. Splitting each line into substrings is pretty close to the search implementation:

def dates_split(data, dates1):
    lines = data.split("\n")
    dict_days = Counter(dt for line in lines
                        for dt in line.split() if dt in dates1)
    return dict_days.most_common(5)

def dates_comp_date1(data, dates1):
    lines = data.split("\n")
    r = re.compile("|".join(dates1))
    dict_days = Counter(dt for line in lines for dt in r.findall(line))
    return dict_days.most_common(5)

Using the functions above:

In [63]: timeit dates(data, dates1)
1 loops, best of 3: 640 ms per loop

In [64]: timeit dates_split(data, dates1)
1 loops, best of 3: 535 ms per loop

In [65]: timeit dates_comp_date1(data, dates1)
1 loops, best of 3: 368 ms per loop
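
One caveat with the compiled-alternation approach is the partial-match problem raised in the comments on the question: a pattern for '1/5/2015' also matches inside '21/5/2015'. A hedged sketch of one way around it follows. The name `dates_comp_bounded` and the digit lookarounds are my own additions, not part of the timed functions above, and this variant was not benchmarked.

```python
import re
from collections import Counter

def dates_comp_bounded(data, dates1, n=5):
    # Hypothetical variant of dates_comp_date1: escape each date and wrap the
    # alternation in lookarounds so a match cannot start or end next to a digit.
    alternation = "|".join(re.escape(d) for d in dates1)
    r = re.compile(r"(?<!\d)(?:" + alternation + r")(?!\d)")
    return Counter(r.findall(data)).most_common(n)

data = "a 21/5/2015\nb 1/5/2015\nc 21/5/2015 and 4/4/2015\nd 21/5/20150"
result = dates_comp_bounded(data, {"1/5/2015", "21/5/2015", "4/4/2015"})
print(result)
```

Here '21/5/2015' is counted twice and the other two dates once each; the last line contributes nothing because its date is followed by an extra digit.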

edited Sep 9, 2015 at 20:59

answered Sep 9, 2015 at 19:59

Padraic Cunningham


7 Comments

NachoMiguel Over a year ago

It looks great. Let me try this and i will let you know sir.

301_Moved_Permanently Over a year ago

I'm not used to re, but r = r.search(line)? Wouldn't this prevent all lines but the first one from being scanned?

Padraic Cunningham Over a year ago

@MathiasEttinger, we are looping over lines, getting each line at a time, we extract the date substring from each line if it is there and use that. If there can be more than one date substring per line, the OP can use findall and loop over that

301_Moved_Permanently Over a year ago

I understand the code; my concern is about overwriting the regexp with its result.

Padraic Cunningham Over a year ago

@MathiasEttinger, you are referring to the use of r? That was a typo; I changed it to match.


Joran Beasley · Accepted Answer · 2015-09-09 19:59:37Z

4

Counter(word for word in my_text if word in my_dates)

I think this would work quickly... well, O(N) (ish).
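
A runnable sketch of this one-liner, under the assumption that `my_text` is a plain string and that dates are separated from other words by whitespace. Iterating a string directly yields single characters, so the text is split into words first; the sample text is made up.

```python
from collections import Counter

# Hypothetical sample inputs.
my_dates = {"21/5/2015", "4/4/2015"}
my_text = "seen 21/5/2015 then 4/4/2015\nagain 21/5/2015"

# str.split() with no argument splits on any whitespace, including newlines.
counts = Counter(word for word in my_text.split() if word in my_dates)
print(counts.most_common(5))  # [('21/5/2015', 2), ('4/4/2015', 1)]
```

A date followed directly by punctuation, such as "21/5/2015.", would not be counted with this approach.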

answered Sep 9, 2015 at 19:59

Joran Beasley


Robᵩ · Accepted Answer · 2015-09-09 20:04:07Z

1

Use a regular expression to extract the data, and a collections.Counter to find the most common:

import re
import collections

def dates(data, dates1):
    # Build one alternation pattern from the dates; re.escape keeps them literal.
    pattern = '|'.join(re.escape(x) for x in dates1)
    found = re.findall(pattern, data)
    counts = collections.Counter(found)
    print(counts.most_common(5))

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015'}
data = 'Today is 21/5/2015. Yesterday is 4/4/2015.\nMy birthday is 4/4/2015'

dates(data, dates1)

answered Sep 9, 2015 at 20:04

Robᵩ


Gillespie · Accepted Answer · 2015-09-09 20:03:41Z

0

Why not just do:

dates = {'21/5/2015':0, '4/4/2015':0, '15/6/2015':0, '30/1/2015':0, '19/3/2015':0, '25/2/2015':0, '25/5/2015':0, '8/2/2015':0, '6/6/2015':0, '15/3/2015':0, '15/1/2015':0, '30/5/2015':0}

def processDates(data):
    lines = data.split("\n")
    for line in lines:
        # Counts only lines that consist of exactly one date and nothing else.
        if line in dates:
            dates[line] += 1

Then just sort dates by value
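
The sorting step could look roughly like this. The counts below are made-up values standing in for the dates dict after processDates has run.

```python
import heapq

# Hypothetical counts, standing in for the dates dict after processing.
counts = {"21/5/2015": 3, "4/4/2015": 7, "15/6/2015": 1, "30/1/2015": 5}

# Full sort by value, highest count first.
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(by_count[:2])  # [('4/4/2015', 7), ('30/1/2015', 5)]

# heapq.nlargest avoids sorting everything when only the top few are needed.
top2 = heapq.nlargest(2, counts, key=counts.get)
print(top2)  # ['4/4/2015', '30/1/2015']
```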

answered Sep 9, 2015 at 20:03

Gillespie
