I have written Python code that converts raw data (from an STM microscope) into PNG format, and it runs perfectly on my MacBook Pro.

Below is the simplified Python code:

import glob
import os

# path, savepath and data_conv are defined elsewhere in the full script.
for root, dirs, file in os.walk(path):
    for dir in dirs:
        fpath = path + '/' + dir
        os.chdir(fpath)
        spaths = savepath + '/' + dir
        if not os.path.exists(spaths):
            os.mkdir(spaths)

        for files in glob.glob("*.sm4"):
            for file in files:
                data_conv(files, file, spaths)

But it takes 30 to 40 minutes for 100 files.

Now I want to reduce the processing time using multithreading (with the concurrent.futures library). I tried to modify the code, using a YouTube video called “Python Threading Tutorial” as an example.

But I have to pass several arguments, such as root, dirs and file, to the executor.map() method, and I don't know how to proceed.

Below is the simplified multithreading Python code, followed by the error it raises:

def raw_data (root, dirs, file):
    for dir in dirs:
        fpath = path +'/'+ dir
        os.chdir(fpath)
        spaths=savepath +'/'+ dir
        if os.path.exists(spaths)==False:
            os.mkdir(spaths)

        for files in glob.glob("*.sm4"):
            for file in files:
                data_conv(files, file, spaths)

with concurrent.futures.ThreadPoolExecutor() as executor:
     executor.map(raw_data, root, dirs, file)

NameError: name 'root' is not defined

Any suggestions are appreciated. Thank you.

  • If the workload is CPU bound, you should use concurrent.futures.ProcessPoolExecutor instead, since Python threads will not run concurrently due to the GIL. Do you need to wrap your call to executor.map with for root, dirs, file in os.walk(path):? Commented Aug 25, 2021 at 14:18
  • Sorry, I am not an expert here and I don't know what the GIL is. But I need to reduce the processing time using multithreading or multiprocessing. As for whether I need to wrap the call to executor.map with for root, dirs, file in os.walk(path): yes. Commented Aug 25, 2021 at 14:28
  • Unless you are IO bound (lots of network/API calls, writing/reading files), multiprocessing is your best bet. The GIL prevents threads from running concurrently (at the same time). Commented Aug 25, 2021 at 14:29
  • Any example or suggestion would help me understand how to implement the code. Commented Aug 25, 2021 at 14:33

2 Answers

Thanks for the advice, Iain Shelvington and Thenoneman.

pathlib does reduce the clutter I had in my code.

ProcessPoolExecutor worked for my CPU-intensive function:

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(raw_data, os.walk(path))
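
One detail worth spelling out: when executor.map is given a single iterable such as os.walk(path), each worker call receives one item, here the whole (root, dirs, files) tuple, as a single argument. Exceptions raised in a worker are also only re-raised when the results are read. The sketch below illustrates both points. The worker count_sm4 and the wrapper count_all are hypothetical names, and counting files is only a stand-in for the real conversion, which is not shown in the thread.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def count_sm4(walk_entry):
    """Worker that takes ONE argument: a (root, dirs, files) tuple from os.walk."""
    root, dirs, files = walk_entry  # unpack inside the worker
    sm4_files = [name for name in files if name.endswith(".sm4")]
    # A real worker would call the converter here (data_conv in the question).
    return root, len(sm4_files)


def count_all(path, executor_cls=ProcessPoolExecutor):
    """Run count_sm4 once per directory visited by os.walk(path)."""
    with executor_cls() as executor:
        # list() drains the result iterator, so an exception raised in any
        # worker is re-raised here instead of passing unnoticed.
        return list(executor.map(count_sm4, os.walk(path)))
```

On macOS, where the question was asked, Python 3.8 and later start worker processes with the spawn method, so a call such as count_all("/some/path") should sit under an if __name__ == "__main__": guard in the script.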

First of all, as Iain Shelvington pointed out, data_conv seems to be a CPU-intensive function, so you won't notice an improvement with ThreadPoolExecutor; use ProcessPoolExecutor instead. Second, you have to supply the parameters for every function call, i.e. pass one list per argument of raw_data. Assuming root and file are the same for every call and dirs is a list:

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(raw_data, [root] * len(dirs), dirs, [file] * len(dirs))
    for result in results:
        pass  # Collect your results here
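
To make the argument passing concrete: executor.map pairs its iterables up the way zip does, so the i-th call receives the i-th item of each one. A minimal sketch follows, using itertools.repeat for the constant arguments instead of building repeated lists. The function names build_target and map_with_constants are hypothetical, and the string formatting merely stands in for real work.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def build_target(root, name, suffix):
    """Toy worker: one call per name, with root and suffix held constant."""
    return f"{root}/{name}{suffix}"


def map_with_constants(root, names, suffix, executor_cls=ProcessPoolExecutor):
    """Call build_target(root, name, suffix) for every name in names."""
    with executor_cls() as executor:
        # repeat() is endless, but map stops at the shortest iterable (names).
        # Results come back in the same order as the inputs.
        return list(executor.map(build_target, repeat(root), names, repeat(suffix)))
```

functools.partial would be another way to fix the constant arguments; either approach avoids having to keep the helper lists the same length as dirs.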

As a side note, you may find working with the filesystem more pleasant with pathlib, which has been part of the standard library since Python 3.4.
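
As a rough illustration of the pathlib suggestion, the sketch below collects one (source file, output directory) pair per .sm4 file and mirrors the input folder layout under the save directory. It assumes that mirrored layout is what the question intends, and plan_jobs is a hypothetical helper name. Because it never calls os.chdir, which changes the working directory for the whole process, the resulting pairs can be handed to worker threads or processes safely.

```python
from pathlib import Path


def plan_jobs(src_root, dst_root):
    """Return a list of (source_file, output_dir) pairs, one per .sm4 file."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    jobs = []
    for src in sorted(src_root.rglob("*.sm4")):  # searches subfolders too
        # Mirror the file's folder, relative to src_root, under dst_root.
        out_dir = dst_root / src.parent.relative_to(src_root)
        out_dir.mkdir(parents=True, exist_ok=True)  # no error if it exists
        jobs.append((src, out_dir))
    return jobs
```

Each pair could then be passed to the converter through executor.map, giving one task per file rather than one per directory, which tends to spread the work more evenly across workers.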
