
I'm processing about 25 GB of bz2 files. Right now I open each archive, extract the sensor data, compute the median, and then, after all the files are processed, write the results to an Excel file. It takes a full day to process those files, which is not bearable.

I want to make the process faster by running as many threads as possible, but how should I approach that problem? Pseudocode of the idea would be good.

The problem I'm thinking of is that each day's archive has a timestamp. For example, for day 1 at 20:00 I need to process its file and save the result to a list, while other threads process other days, but I need the data to stay in sequence in the file written to disk.

In short, I want to speed it up.

Here is pseudocode of the file-processing step (referenced by the answer below):

import pandas as pd

# process_data() fills the global lists (timeStamps, S1..S8, T1..T6) used below
def proc_file(directory_names):
    try:
        for i, name in enumerate(directory_names):
            print(name)
            process_data(name, i, directory_names)
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

S1 to S8 and T1 to T6 are sensor values.

  • Why are you duplicating code? Replace everything in the 'except' block with a 'pass'. Opening the compressed files directly with gzip.open() will save you a lot of time (see the bz2.open sketch after these comments). Commented Nov 7, 2018 at 15:15
  • Sorry, I'm a beginner in Python. Commented Nov 7, 2018 at 15:16
  • We all started somewhere... :) Commented Nov 7, 2018 at 15:17
  • In Python, multithreading splits a single process into separate threads, while multiprocessing runs separate processes (which can be on different CPUs). The rule of thumb is to multithread tasks that do a lot of waiting and multiprocess tasks that would benefit from more CPUs. Commented Nov 7, 2018 at 15:19
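
A side note on the bz2 sketch mentioned in the first comment: the question describes .bz2 archives, so the matching standard-library call is bz2.open() rather than gzip.open(). A minimal sketch, where the file name is only a placeholder:

import bz2

# Read a .bz2 compressed file directly, without unpacking it to disk first.
# 'day1.bz2' is a placeholder name; 'rt' gives text mode, use 'rb' for binary data.
with bz2.open('day1.bz2', 'rt') as f:
    for line in f:
        pass  # parse the sensor readings from each line here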

1 Answer


Use multiprocessing; it seems you have a pretty straightforward task.

from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()

def proc_file(file):
    # Process it
    l.append(median)

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

# somehow save l to excel. 
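
One thing worth noting about this pattern (a variant, not part of the original answer): Pool.map() returns its results in the same order as the input list, so you can also have proc_file return its value and collect the ordered results directly instead of appending to a shared manager list. A rough sketch, with the per-file processing left as a placeholder:

from multiprocessing import Pool

def proc_file(file):
    # open the bz2 archive and compute the median(s) for this day
    median = ...  # placeholder for the real processing
    return median

if __name__ == '__main__':
    p = Pool(4)
    # results come back in the same order as your_file_list,
    # so the day-by-day sequence is preserved without extra syncing
    medians = p.map(proc_file, your_file_list)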

Update: Since you want to keep the file names, perhaps as a pandas column, here's how:

import pandas as pd
from multiprocessing import Pool, Manager

manager = Manager()
d = manager.dict()

def proc_file(file):
    # Process it
    d[file] = median  # assuming file is given as a string; if your median (or whatever you want) is a list, this works as well

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(d)
# if your 'median' is a list
# s = pd.DataFrame(d).T
writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save() # to excel format.

If each of your files produces multiple values, you can create a dictionary where each value is a list containing them.
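
For example, a small sketch of that dict-of-lists layout and how it becomes a DataFrame (the file names, numbers, and column names below are dummy values, borrowed from the question only for illustration):

import pandas as pd

# one entry per file; each value is the list of medians computed for that file
d = {'day1.bz2': [1.2, 3.4, 20.1],
     'day2.bz2': [1.3, 3.1, 19.8]}

# one row per file, one column per value
df = pd.DataFrame.from_dict(d, orient='index',
                            columns=['S_strain_HOY', 'S_strain_HMY', 'S_temp_HOY'])
df.to_excel(r'c:\ahmed\median_data_meter_12.xlsx', sheet_name='Sheet1')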


20 Comments

Hold on... you can use a dict instead of a list if that's the case.
Each process should just work by itself, taking one file. Your pool of processes will take care of iterating over the files.
I'm a little confused after reading your snippet... but if I'm getting it correctly, you're creating a bunch of lists; in the multiprocessing case, each of those lists would be a global manager.dict keyed by file name, since you want to keep the order intact. You would then be able to merge those dictionaries on the key and turn the resulting dictionary into a pandas DataFrame.
Now I think a better way would be to put all of the values produced for each file into a list and then put that list into a dictionary, as in my snippet above. So instead of putting the median into each entry, put a list that contains everything you need.
You shouldn't need to declare it as global; it will automatically register as global when you declare it at file scope. Also, I can't access that website for my own reasons; if you have any questions, you probably also want to consult the docs: docs.python.org/2/library/multiprocessing.html
