I am looking at biological features called Modules, which are made up in turn of features called smCOGs.

I would like to convert a nested list of Module/smCOG data (input) to a 2D array (output) in which each row corresponds to one processed list in the nested list. Each list in the input contains the module ID at index 0, followed by the IDs of its component smCOGs, followed by nan padding for the smCOGs that are not present. I would like to convert each internal list to an array with the value N at index N-1, where N is the integer ID of a component smCOG, and 0 at all other indexes (or some other filler value like nan).

As an example of my processing, assuming there are a total of 3 modules (rather than 185000ish) made up of 5 possible smCOG objects (rather than 22000ish):

input (nested list)

[['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan,   np.nan],
 ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan,   np.nan],
 ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan]]

output (nested numpy array, 3 rows x 5 columns)

[[1, 0, 3, 0, 5],
 [1, 2, 0, 4, 0],
 [1, 2, 3, 0, 5]]
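For reference, the conversion described above can be sketched on the toy example as follows. This is only a minimal illustration: `module_to_row` and its `n_smcogs` parameter are hypothetical names introduced here, and the sketch assumes every smCOG entry is a string containing 'SMCOG' followed by its integer ID.

```python
import numpy as np

def module_to_row(module, n_smcogs):
    """Return an int row of length n_smcogs with N at index N-1 for each 'SMCOGN' entry."""
    row = np.zeros(n_smcogs, dtype=int)
    for obj in module[1:]:            # skip the module ID at index 0
        if isinstance(obj, float):    # np.nan padding marks the end of the smCOGs
            break
        n = int(obj[obj.index('SMCOG') + 5:])
        row[n - 1] = n
    return row

modules_list = [
    ['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan, np.nan],
    ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan, np.nan],
    ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan],
]

# Build all rows first, then convert to a 2D array in a single call.
result = np.array([module_to_row(m, 5) for m in modules_list])
print(result)
```

The printed array should match the 3 x 5 output shown above.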

It's currently taking a long time (about two hours on my machine, I believe). Can someone tell me a better way to go about this? I believe the key time sink is the way I am appending arrays to my nested array, as np.stack seems to work faster than np.vstack (described at the bottom of the code comments). This in turn suggests that the preceding steps aren't a big drag.

Thanks! Tim

import time

import numpy as np

count = 0

#initial array to append my other arrays to
nested_array = np.zeros(22754)
time_start = time.time()

#module can be any module in a 185000ish-object list.  
#Each module is made up of SMCOG features ranging from smCOG #1 to 22754.  
for module in modules_list:
    #module = ['Module_ID', smcogA, smcogB, ..., np.nan, np.nan]
    #A and B = ints corresponding to smCOGs included in module.
    #np.nan = one per each smCOG in 22754 possible smCOGs that are not present in the module.  
    #np.nan sequence always comes after sequence of smcogs, the sequences are not mixed together
    
    #JOB - convert module to an array of ints where the smcogs are index-based i.e. smcog A is at array index A-1 instead of module list index 1 (if A is not 1)

    #most of the smcog features are not in a given module, so initialise an individual module to all 0's
    array = np.zeros(22754)
    
    #make array indexes corresponding to an smCOG feature non-zero.
    for obj in module:
        
        #If obj is np.nan then all smCOGs for module are found
        if isinstance(obj, float):
            break
        #if it has smCOG in it then its a module feature and object at corresponding index should be updated
        if 'SMCOG' in obj:
            smcog_number = int(obj[obj.index('SMCOG') + 5 :])
            array[smcog_number-1] = smcog_number
            
    #I suspect this is my time sink?  I need to add my module array as a new row to the nested array.
    #np.append throws errors when array and nested array have different dimensions (i.e. when I've added a row to nested array)
    #If I do np.stack([nested_array, array], axis = 0) instead of nested_array = np.vstack([nested_array, array]), it all takes 10 seconds.
    #I'm guessing this is due to a combination of me not assigning the stack output to a variable (so it takes less memory) and stack potentially being more efficient?
    #If I do nested_array = np.stack([nested_array, array], axis = 0) I get ValueError: all input arrays must have the same shape
    nested_array = np.vstack([nested_array, array])
    
    count += 1
    if count % 1000 == 0:
        print (f'done {count} modules in {round(time.time() - time_start, 2)} seconds') #about 1-200 seconds per 1000 arrays

EDIT

Making a nested list and converting this to an array as suggested by hpaulj was much quicker (about 10 seconds). I also ran into memory issues due to my array size, so also set my array dtype to boolean which eats less memory (this is discussed in point #3 here).

Working code:

count = 0
nested_list = []
time_start = time.time()
  
for module in modules_list:
    array = np.zeros(22754, dtype=bool)#may help with memory issues
    
    #make array indexes corresponding to an smCOG feature non-zero.
    for obj in module:
        
        #If obj is np.nan then all smCOGs for module are found
        if isinstance(obj, float):
            break
        #if it has smCOG in it then its a module feature and object at corresponding index should be updated
        if 'SMCOG' in obj:
            smcog_number = int(obj[obj.index('SMCOG') + 5 :])
            array[smcog_number-1] = 1 #boolean True
            
    #nested_array = np.vstack([nested_array, array])
    nested_list.append(array)

    count += 1
    if count % 1000 == 0:
        print (f'done {count} modules in {round(time.time() - time_start, 2)} seconds') #about 100 seconds per 1000 arrays
        
nested_array = np.array(nested_list)
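As a possible variation on the working code, the full 2D boolean array could be allocated once up front and filled row by row, which avoids holding both a list of row arrays and the final array in memory at the same time. This is a hedged sketch, not a tested replacement at full scale: `modules_to_bool_array` is a hypothetical helper name, and it assumes the same 'SMCOG<number>' string format as above. At the sizes mentioned in the question (roughly 185000 x 22754), a boolean array needs about 4.2 GB.

```python
import numpy as np

def modules_to_bool_array(modules_list, n_smcogs):
    """Return a (n_modules, n_smcogs) bool array; True at column N-1 means smCOG N is present."""
    out = np.zeros((len(modules_list), n_smcogs), dtype=bool)  # allocated once
    for i, module in enumerate(modules_list):
        cols = []
        for obj in module[1:]:            # skip the module ID at index 0
            if isinstance(obj, float):    # np.nan padding: no more smCOGs in this module
                break
            cols.append(int(obj[obj.index('SMCOG') + 5:]) - 1)
        if cols:
            out[i, cols] = True           # set all columns of this row in one assignment
    return out

demo_modules = [
    ['Module1', 'SMCOG1', 'SMCOG3', 'SMCOG5', np.nan, np.nan],
    ['Module2', 'SMCOG1', 'SMCOG2', 'SMCOG4', np.nan, np.nan],
    ['Module3', 'SMCOG1', 'SMCOG2', 'SMCOG3', 'SMCOG5', np.nan],
]
presence = modules_to_bool_array(demo_modules, 5)
print(presence.astype(int))
```

Because the output is written in place, no per-row arrays are created or copied afterwards.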
  • I haven't read your code in detail, but I see you are using vstack repeatedly in a loop. Try not to do this! vstack, stack (and np.append) all use np.concatenate, and are best used with a big list of arrays. Each call makes a new array, with all the required copying. Use the list append method to collect the arrays in a list, and use the appropriate form of concatenate just once, to join them all into one array. Commented Jan 14, 2022 at 7:24
  • Excellent, thanks very much. I just made a nested list, then called np.array(nested_list_of_arrays) on that, and it takes 10 seconds (see edit for anyone with a similar issue). Can you make your comment an answer, @hpaulj? Commented Jan 14, 2022 at 8:13
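The difference between the two patterns discussed in the comments above can be illustrated with a small, self-contained sketch. The row data here is made up purely for demonstration; both approaches give the same array, but the first copies everything accumulated so far on every iteration, so its cost grows roughly quadratically with the number of rows.

```python
import numpy as np

# Made-up demonstration data: five 1D rows of length 4.
rows = [np.arange(4) + i for i in range(5)]

# Pattern to avoid: every vstack call allocates a new array
# and copies all previously stacked rows into it.
grown = rows[0]
for row in rows[1:]:
    grown = np.vstack([grown, row])

# Preferred pattern: collect rows in a Python list (cheap appends),
# then join them into one array with a single call.
collected = []
for row in rows:
    collected.append(row)
joined = np.stack(collected, axis=0)

print(np.array_equal(grown, joined))
```

With a list of equal-length 1D arrays, np.array(collected), np.stack(collected) and np.vstack(collected) all produce the same 2D result.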
