
I'm processing about 25 GB of bz2 files. Right now I open each archive, extract the sensor data, compute the median, and then, after all the files are processed, write the results to an Excel file. It takes a full day to process those files, which is not bearable.

I want to make the process faster by running as many threads as possible, but how should I approach that problem? Pseudocode of the idea would be good.

The problem I'm thinking of is that each day's archive has a timestamp. For example, for day 1 at 20:00 I need to process its file and save the result to a list, while other threads process other days, but I need the data to stay in sequence in the file written to disk.

In short, I want to speed it up.

Here is pseudocode of the file-processing step (referenced by the answer below):

import pandas as pd

# process_data() fills the global lists (timeStamps, S1..S8, T1..T6) used below
def proc_file(directory_names):
    try:
        for i, name in enumerate(directory_names):
            print(name)
            process_data(name, i, directory_names)
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

S1 to S8 and T1 to T6 are sensor values.

  • Why are you duplicating code? Replace everything in the 'except' block with a 'pass'. Opening the compressed files directly with gzip.open() will save you a lot of time (see the bz2.open sketch after these comments). Commented Nov 7, 2018 at 15:15
  • Sorry, I'm a beginner in Python. Commented Nov 7, 2018 at 15:16
  • We all started somewhere... :) Commented Nov 7, 2018 at 15:17
  • In Python, multithreading splits a single process into separate threads, while multiprocessing runs separate processes (which can be on different CPUs). The rule of thumb is to multithread tasks that do a lot of waiting and multiprocess tasks that would benefit from more CPUs. Commented Nov 7, 2018 at 15:19
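
A side note on the bz2 sketch mentioned in the first comment: the question describes .bz2 archives, so the matching standard-library call is bz2.open() rather than gzip.open(). A minimal sketch, where the file name is only a placeholder:

import bz2

# Read a .bz2 compressed file directly, without unpacking it to disk first.
# 'day1.bz2' is a placeholder name; 'rt' gives text mode, use 'rb' for binary data.
with bz2.open('day1.bz2', 'rt') as f:
    for line in f:
        pass  # parse the sensor readings from each line here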

1 Answer


Use multiprocessing; it seems you have a pretty straightforward task.

from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()

def proc_file(file):
    # Process it
    l.append(median)

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

# somehow save l to excel. 
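
One thing worth noting about this pattern (a variant, not part of the original answer): Pool.map() returns its results in the same order as the input list, so you can also have proc_file return its value and collect the ordered results directly instead of appending to a shared manager list. A rough sketch, with the per-file processing left as a placeholder:

from multiprocessing import Pool

def proc_file(file):
    # open the bz2 archive and compute the median(s) for this day
    median = ...  # placeholder for the real processing
    return median

if __name__ == '__main__':
    p = Pool(4)
    # results come back in the same order as your_file_list,
    # so the day-by-day sequence is preserved without extra syncing
    medians = p.map(proc_file, your_file_list)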

Update: Since you want to keep the file names, perhaps as a pandas column, here's how:

import pandas as pd
from multiprocessing import Pool, Manager

manager = Manager()
d = manager.dict()

def proc_file(file):
    # Process it
    d[file] = median  # assuming file is given as a string; if your median (or whatever you want) is a list, this works as well

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(d)
# if your 'median' is a list
# s = pd.DataFrame(d).T
writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save() # to excel format.

If each of your files produces multiple values, you can create a dictionary where each value is a list containing them.
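
For example, a small sketch of that dict-of-lists layout and how it becomes a DataFrame (the file names, numbers, and column names below are dummy values, borrowed from the question only for illustration):

import pandas as pd

# one entry per file; each value is the list of medians computed for that file
d = {'day1.bz2': [1.2, 3.4, 20.1],
     'day2.bz2': [1.3, 3.1, 19.8]}

# one row per file, one column per value
df = pd.DataFrame.from_dict(d, orient='index',
                            columns=['S_strain_HOY', 'S_strain_HMY', 'S_temp_HOY'])
df.to_excel(r'c:\ahmed\median_data_meter_12.xlsx', sheet_name='Sheet1')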


20 Comments

Hold on... you can use a dict instead of a list if that's the case.
Each process should just work by itself, taking one file. Your pool of processes will take care of iterating over the files.
I'm a little confused after reading your snippet... but if I'm getting it correctly, you're creating a bunch of lists; in the multiprocessing case, each of those lists would be a global manager.dict keyed by file name, since you want to keep the order intact. You would then be able to merge those dictionaries on the key and turn the resulting dictionary into a pandas DataFrame.
Now I think a better way would be to put all of the values produced for each file into a list and then put that list into a dictionary, as in my snippet above. So instead of putting the median into each entry, put a list that contains everything you need.
You shouldn't need to declare it as global; it will automatically register as global when you declare it at file scope. Also, I can't access that website for my own reasons; if you have any questions, you probably also want to consult the docs: docs.python.org/2/library/multiprocessing.html
