
I am trying to build a big dataframe using a function that takes some arguments and returns partial dataframes. I have a long list of documents to parse, extracting the relevant information that will go into the big dataframe, and I am trying to use multiprocessing.Pool to make the process faster.

My code looks like this:

    import pandas as pd
    import settings
    from multiprocessing import Pool, cpu_count
    
    server_url = settings.SERVER_URL_UAT
    username   = settings.USERNAME
    password   = settings.PASSWORD
    
    def wrapped_visit_detail(args):
    
        global server_url
        global username
        global password
    
        # visit_detail (defined elsewhere) returns a dataframe after consuming args
        return visit_detail(server_url, args, username, password)
    
    # Trying to pass a list of arguments to wrapped_visit_detail
    visits_url = [doc1, doc2, doc3, doc4]   # placeholder document URLs
    
    df = pd.DataFrame()
    pool = Pool(cpu_count())
    df = pd.concat([df,
                    pool.map(wrapped_visit_detail,
                             visits_url)],
                   ignore_index=True)

When I run this, I get this error:

    multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f2c88a43208>'. Reason: 'TypeError("can't pickle _thread._local objects",)'

EDIT

To illustrate my problem I created this simple figure:

[figure: the current approach, documents parsed one after another and appended to the big dataframe serially]

This is painfully slow and not scalable at all.

I am looking to make the code not serial, but rather as parallelized as possible:

[figure: the desired approach, documents parsed in parallel and their partial dataframes merged into the big dataframe]

Thank you all for your great comments so far. Yes, I am using shared variables as parameters to the function that pulls the files and extracts the individual dataframes; that does indeed seem to be my issue.

I suspect something is wrong in the way I call pool.map().

Any tip would be really welcome

  • Which OS are you using? I was never able to use pool.map() on Windows. Commented Mar 3, 2022 at 13:08
  • Hi Ssayan, I am using Ubuntu. Commented Mar 3, 2022 at 13:14
  • I can't fully reproduce this as I don't have what you import, but pool.map() returns a list, so I would suggest df = pd.concat(pool.map(wrapped_visit_detail, visits_url), ignore_index=True). When I use what you did I get an error, but not the same one, so I don't think that alone would fully solve your issue. According to this thread, your error may come from the use of global variables. Commented Mar 3, 2022 at 13:29
  • I also use multithreading in the context of processing pandas.DataFrame and never got an error like this. Your error (about pickling) indicates that the data transferred between the two processes is not picklable. See docs.python.org/library/pickle.html to find out which data types are picklable. Maybe there is more than just a DataFrame? Or the cells in the DF contain unpicklable types. Commented Mar 3, 2022 at 13:57
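
For what it's worth, the pd.concat fix suggested in the comments would look roughly like the sketch below. It reuses the question's names (wrapped_visit_detail, visits_url) and does not address the pickling error itself, since wrapped_visit_detail still has to return something picklable:

    import pandas as pd
    from multiprocessing import Pool, cpu_count

    if __name__ == "__main__":
        with Pool(cpu_count()) as pool:
            # pool.map() already returns the list of partial dataframes,
            # so a single concat at the end is enough
            df = pd.concat(pool.map(wrapped_visit_detail, visits_url),
                           ignore_index=True)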

1 Answer


You may have realised on your own that there is actually no "sharing" possible between the Python interpreter and its sub-processes. No sharing, only absolutely independent and "disconnected" replicas of the __main__'s original, state-full copy of the Python interpreter (on Windows a complete, top-down copy), with all its internal variables, modules and whatsoever. Any change made to those copies is not propagated back into the __main__ or elsewhere. Once more: if you try to compose the "big dataframe" from ~100k+ individually pre-produced "partial dataframes", you will get an awfully, if not unacceptably, low performance.
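
A minimal sketch of that independence (the counter and bump names are made up): each worker mutates only its own private copy of a module-level global, and the parent never sees the change.

    from multiprocessing import Pool

    counter = 0                      # lives in the parent process

    def bump(_):
        global counter
        counter += 1                 # touches only this worker's private copy
        return counter

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(pool.map(bump, range(8)))   # small per-worker counts, not 1..8
        print(counter)                        # still 0 in the parent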

You lose the advantage of the partial producers' latency-masking, while headbanging into RAM-allocation costs, and you potentially make memory-I/O many times worse, falling into the trap of 10,000x slower physical/virtual-memory swapping once in-RAM capacity ceases to be able to hold all the data, as may happen upon an ex-post attempt to
pd.concat( _a_HUGE_in_RAM_store_first_LIST_of_ALL_100k_plus_partial_DFs_, ... )
which is a no-go ANTI-pattern.


Q :
" Any tip ... "

A :
No matter how simple this one is, the root cause here is the Python interpreter's process-to-process communication constraints, as the unhandled-exception reporter says:

    multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f2c88a43208>'. Reason: 'TypeError("can't pickle _thread._local objects",)'

So, here we are. The Python interpreter has to use "pickle"-like SER/DES whenever it tries to send or receive a single bit of data from one process to another (typically the __main__ upon sending launch parameters, or the worker processes upon sending their remote result objects back to the __main__).

That's fine whenever the SER/DES serialisation is possible (which is not the case here).
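
One quick way to see this for yourself is to run the worker once, locally, and try to pickle its return value. This is only a sketch, reusing the question's names (visit_detail, server_url, doc1, ...) and approximating the SER/DES step the Pool performs internally:

    import pickle

    result = visit_detail(server_url, doc1, username, password)
    try:
        pickle.dumps(result)          # roughly the SER/DES step Pool performs on results
    except TypeError as exc:
        print("not picklable:", exc)  # e.g. can't pickle _thread._local objects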

Options :

  • Best avoid any and all object-passing (objects are prone to SER/DES failures); see the sketch after this list for one way to keep the problematic object inside each worker.
  • If obsessed with passing them, try a more capable SER/DES encoder; replacing the default pickle with import dill as pickle has saved me many times (yet not in every case, see above).
  • Design a better problem-solving strategy, one that does not rely on an ill-assumed "shared" use of a Pandas dataframe among more (fully independent) sub-processes; they cannot and do not access "the same" dataframe, only their locally-isolated replicas thereof (re-read more about sub-process independence and memory-space separation).
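
A sketch of the first option, assuming the unpicklable _thread._local lives inside some connection- or session-like object that visit_detail needs: create that object (or at least its parameters) once per worker via the Pool initializer, and return only plain, picklable data such as the partial dataframe. The names init_worker, worker_visit_detail and _worker_state are hypothetical; visit_detail, server_url, username, password and visits_url are the question's own.

    from multiprocessing import Pool

    _worker_state = {}                        # hypothetical per-process stash

    def init_worker(url, user, pwd):
        # runs once inside each child process, so any session/connection
        # created here never has to cross a process boundary
        _worker_state.update(url=url, user=user, pwd=pwd)

    def worker_visit_detail(doc):
        # must return plain, picklable data (the partial dataframe),
        # never the session/connection object itself
        return visit_detail(_worker_state["url"], doc,
                            _worker_state["user"], _worker_state["pwd"])

    if __name__ == "__main__":
        with Pool(initializer=init_worker,
                  initargs=(server_url, username, password)) as pool:
            partial_dfs = pool.map(worker_visit_detail, visits_url)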

Nota bene:
School-book SLOC-s are nice in school-book-sized examples, yet they are awfully expensive performance ANTI-patterns in real-world use cases, all the more so in production-grade code execution.

Tip:

If performance is The Target, produce independent, file-based results and join them afterwards. Composing the "big dataframe" from in-RAM "partial dataframes" represents a mix of sins that causes many performance ANTI-pattern problems; the failure to SER/DES-pickle is just one of them, and a small one (visible to the naked eye).
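
A sketch of that file-based approach, under the assumption that each document can be processed independently; the process_one helper, the partials/ directory and the naming scheme are made up for the illustration. Each worker writes its partial result to disk, only a short file path travels back between processes, and the big dataframe is assembled once, at the very end:

    import os
    import pandas as pd
    from multiprocessing import Pool

    def process_one(doc):
        # each worker parses one document and leaves its partial result on disk
        df_part = visit_detail(server_url, doc, username, password)
        out_path = "partials/{}.csv".format(abs(hash(doc)))   # hypothetical naming scheme
        df_part.to_csv(out_path, index=False)
        return out_path               # only a short string is pickled back

    if __name__ == "__main__":
        os.makedirs("partials", exist_ok=True)
        with Pool() as pool:
            paths = pool.map(process_one, visits_url)
        # one concat at the very end, from files, not from 100k+ in-RAM objects
        big_df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)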

Decisions (and the costs associated with making them) are in all cases on you.

You might like some further reads on how to boost performance.


3 Comments

I'm the author of dill. With regard to your second "option": I'd suggest trying multiprocess, a fork of multiprocessing that uses dill, and can utilize dill.settings for serialization variants.
Sure, I know @MikeMcKerns, we've already discussed dill-related issues here in the past, and I keep recommending dill as a smart SER/DES replacement, IIRC since ~2008 (time flies...). Here, the O/P's problem is to re-factor the overall strategy of processing if a performance boost is to be achieved (which is doable). A move from multiprocessing to multiprocess or joblib does not, by itself, achieve a reasonable end-to-end processing performance (HPC-grade processing speeds being far, indeed far, ahead). The O/P struggles with the first visible show-stopper; RAM-swaps are next.
Yeah, I just mentioned it more due to the fact that the OP is attempting to use multiprocessing and you mentioned object-sharing across processes. Were it me, and I had an array (or pandas.Series) to share, I might instead use a shared-memory array from multiprocess -- which can be quite fast. You have to use Process and Lock objects and the like, but it's worth it. I'm sure it's been done with a pandas.DataFrame, but I've never tried it (though I've used a 2D numpy array). There's also the alternative of using a map (or similar) built on top of mpi4py. Ship near-zero data.
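
For completeness, the drop-in swap mentioned in these comments would look roughly like the sketch below; multiprocess mirrors the multiprocessing API but serialises with dill, and whether dill can actually handle the _thread._local in question depends on what that object holds. It reuses the question's wrapped_visit_detail and visits_url.

    # pip install multiprocess dill
    from multiprocess import Pool      # same API as multiprocessing.Pool

    if __name__ == "__main__":
        with Pool() as pool:
            partial_dfs = pool.map(wrapped_visit_detail, visits_url)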
