I have the following task: I need to find some data in a big file and append it to another file.
The file I search in has 22 million rows, so I read it in chunks with chunksize.
The other file has a column with 600 user ids, and I need to find the information about every one of those users in the big file.
So first I split the big file into chunks, and then I search for information about every user in each of those chunks.
I use a timer to see how long the search and the write take: on average, finding the information for one user in a 1-million-row chunk and writing it to the file takes about 1.7 seconds. For the whole program that adds up to roughly 6 hours (1.7 sec * 600 ids * 22 chunks ≈ 22,400 sec).
I want to make this faster, but chunksize is the only thing I know; the one rough idea I have is sketched after the code below, and I am not sure about it.
Here is my code:
import time

import dateutil.relativedelta
import pandas as pd

# read the big file in chunks of 1 million rows
el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)

# the small file with the 600 user ids and purchase dates
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']

for chunk in el:
    chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    df = chunk.sort_values(['ID', 'used_at'])
    # for every one of the 600 users, take their rows between the first day
    # of the month before the purchase and 5 days after the purchase
    for id1, date1 in zip(ids1, dates1):
        start = time.time()
        lower = (date1 - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0)
        upper = (date1 + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0)
        df1 = df[(df['ID'] == id1) & (df['used_at'] > lower) & (df['used_at'] < upper)]
        if df1.empty:
            continue
        # append the matching rows to the output file
        with open('3.csv', 'a') as f:
            df1.to_csv(f, header=False)
        end = time.time()
        print(end - start)
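The only rough idea I have come up with (I have not tested whether it is correct or actually faster) is to get rid of the per-user loop: keep only the rows of each chunk whose ID is one of the 600 users with isin, merge that with the purchase table, and apply the date windows to the whole chunk at once. The column names ('ID', 'used_at', 'id', 'date') are the ones from my files; everything else in this sketch is just an assumption on my part:

import pandas as pd

buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# precompute each user's window once: first day of the previous month .. purchase date + 5 days
buys['lower'] = buys['date'].apply(
    lambda d: (d - pd.DateOffset(months=1)).replace(day=1, hour=0, minute=0, second=0))
buys['upper'] = buys['date'].apply(
    lambda d: (d + pd.Timedelta(days=5)).replace(hour=0, minute=0, second=0))

for chunk in pd.read_csv('df2.csv', iterator=True, chunksize=1000000):
    chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    # throw away every row that does not belong to one of the 600 users
    chunk = chunk[chunk['ID'].isin(buys['id'])]
    if chunk.empty:
        continue
    # attach each row's window via a merge and filter the whole chunk in one go
    merged = chunk.merge(buys[['id', 'lower', 'upper']], left_on='ID', right_on='id')
    mask = (merged['used_at'] > merged['lower']) & (merged['used_at'] < merged['upper'])
    merged.loc[mask, chunk.columns].to_csv('3.csv', mode='a', header=False)

Would something like this be the right direction, or is there a better way to speed this up?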