I'm crawling through about 25 GB of bz2 files. Right now I process each compressed file one at a time: open it, extract the sensor data, compute the median, and then, once all files are done, write the results to an Excel file. It takes a full day to process those files, which is not bearable.
I want to make the process faster, so I want to fire off as many threads as I can, but how should I approach that problem? Pseudo code of the idea would be good.
The complication I'm thinking of is that I have a timestamp for each day's compressed file. For example, for day 1 at 20:00 I need to process its file and save the result in a list, while other threads process other days, but the data has to stay in timestamp sequence in the file written to disk.
Basically, I want to speed this up.
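To make the ordering requirement concrete, here is a rough sketch of the kind of thing I have in mind (the glob pattern, the line format and process_one_day are placeholders, not my real code). Since decompressing bz2 and computing medians is CPU-bound, I'm assuming processes rather than threads are what would actually run in parallel in Python:

from concurrent.futures import ProcessPoolExecutor
import bz2
import glob
import statistics

def process_one_day(path):
    # Placeholder: open one day's bz2 file and reduce it to a single median.
    # The line format here is made up; my real parsing is more involved.
    with bz2.open(path, 'rt') as f:
        values = [float(line.split(',')[1]) for line in f]
    return path, statistics.median(values)

if __name__ == '__main__':
    # Files sorted by name == sorted by timestamp in my case.
    paths = sorted(glob.glob(r'c:\ahmed\*.bz2'))
    with ProcessPoolExecutor() as pool:
        # map() returns results in the same order as `paths`, even though the
        # workers finish in a different order, so the output stays in
        # day1, day2, ... sequence and can be written to disk once at the end.
        results = list(pool.map(process_one_day, paths))
    for path, median_value in results:
        print(path, median_value)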
Here is the pseudo code of the file processing, as requested in the answer:
import pandas as pd

def proc_file(directoary_names):
    try:
        # process_data fills the global lists timeStamps, S1..S8 and T1..T6
        for idx in range(len(directoary_names)):
            print(directoary_names[idx])
            process_data(directoary_names[idx], idx, directoary_names)
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)

    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()
Sx and Tx are sensor values.
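For reference, here is a rough sketch of how I think the loop above could be restructured to fit that pattern, assuming process_data can be changed to return the medians for one file instead of appending to the global lists (medians_for_file below is only a stand-in with dummy values):

from concurrent.futures import ProcessPoolExecutor
import glob
import pandas as pd

def medians_for_file(path):
    # Stand-in for process_data: the real version would open the bz2 file at
    # `path` and compute one median per sensor; the 0.0 values are dummies.
    return {'TimeStamp': path,
            'S_strain_HOY': 0.0,
            # ... one entry per sensor column ...
            'S_temp_LUX': 0.0}

if __name__ == '__main__':
    directoary_names = sorted(glob.glob(r'c:\ahmed\*.bz2'))  # timestamp order
    with ProcessPoolExecutor() as pool:
        # One dict per file, returned in the same order as the input list.
        rows = list(pool.map(medians_for_file, directoary_names))
    general_pd = pd.DataFrame(rows)
    with pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx',
                        engine='xlsxwriter') as writer:
        general_pd.to_excel(writer, sheet_name='Sheet1')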