I have the following task: I need to find some data in a big file and append it to another file.
The file I search in has 22 million rows, so I read it in chunks with chunksize.
The other file has a column with 600 user ids, and I need to find the information about every one of those users in the big file.
So first I split the big file into chunks, and then I search for information about every user in each of those chunks.
I use a timer to see how long the search and the write take: on average, finding the information for one user in a 1-million-row chunk and writing it to the file takes about 1.7 seconds. For the whole program that adds up to roughly 6 hours (1.7 sec * 600 ids * 22 chunks ≈ 22,400 sec).
I want to make this faster, but chunksize is the only thing I know; the one rough idea I have is sketched after the code below, and I am not sure about it.
Here is my code:
import time

import dateutil.relativedelta
import pandas as pd

# read the big file in chunks of 1 million rows
el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)

# the small file with the 600 user ids and purchase dates
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']

for chunk in el:
    chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    df = chunk.sort_values(['ID', 'used_at'])
    # for every one of the 600 users, take their rows between the first day
    # of the month before the purchase and 5 days after the purchase
    for id1, date1 in zip(ids1, dates1):
        start = time.time()
        lower = (date1 - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0)
        upper = (date1 + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0)
        df1 = df[(df['ID'] == id1) & (df['used_at'] > lower) & (df['used_at'] < upper)]
        if df1.empty:
            continue
        # append the matching rows to the output file
        with open('3.csv', 'a') as f:
            df1.to_csv(f, header=False)
        end = time.time()
        print(end - start)
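The only rough idea I have come up with (I have not tested whether it is correct or actually faster) is to get rid of the per-user loop: keep only the rows of each chunk whose ID is one of the 600 users with isin, merge that with the purchase table, and apply the date windows to the whole chunk at once. The column names ('ID', 'used_at', 'id', 'date') are the ones from my files; everything else in this sketch is just an assumption on my part:

import pandas as pd

buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# precompute each user's window once: first day of the previous month .. purchase date + 5 days
buys['lower'] = buys['date'].apply(
    lambda d: (d - pd.DateOffset(months=1)).replace(day=1, hour=0, minute=0, second=0))
buys['upper'] = buys['date'].apply(
    lambda d: (d + pd.Timedelta(days=5)).replace(hour=0, minute=0, second=0))

for chunk in pd.read_csv('df2.csv', iterator=True, chunksize=1000000):
    chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    # throw away every row that does not belong to one of the 600 users
    chunk = chunk[chunk['ID'].isin(buys['id'])]
    if chunk.empty:
        continue
    # attach each row's window via a merge and filter the whole chunk in one go
    merged = chunk.merge(buys[['id', 'lower', 'upper']], left_on='ID', right_on='id')
    mask = (merged['used_at'] > merged['lower']) & (merged['used_at'] < merged['upper'])
    merged.loc[mask, chunk.columns].to_csv('3.csv', mode='a', header=False)

Would something like this be the right direction, or is there a better way to speed this up?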