
I have the following problem I need help with. I have 310 records in a CSV file that contain some information about bugs. In another CSV file I have 800 thousand records containing statistics about the bugs (events that possibly led to the bugs).

With the script below, I am trying to

  1. Loop through the bugs and select one.
  2. Loop through the statistics records and check some conditions.
  3. If there is a match, add a column from the bug record to the statistics record.
  4. Save the new file.

My question is whether I could achieve this in a more efficient way using numpy or anything else. The current method is taking forever to run because of the size of the statistics file.

Any help or tips in the right direction will be appreciated. Thanks in advance.

import pandas as pd

dataset = pd.read_csv('310_records.csv')    # 310 bug records
dataset1 = pd.read_csv('800K_records.csv')  # 800k statistics records
cols_error = dataset.iloc[:, [0, 1, 2, 3, 4, 5, 6]]
cols_stats = dataset1.iloc[:, [1, 2, 3, 4, 5, 6, 7, 8, 9]]
cols_stats['Fault'] = ''
cols_stats['Created'] = ''

for i, error in cols_error.iterrows():
    fault_created = error[0]
    fault_ucs = error[1]
    fault_dn = error[2]
    fault_epoch_end = error[3]
    fault_epoch_begin = error[4]
    fault_code = error[6]

    for index, stats in cols_stats.iterrows():
        stats_epoch = stats[0]
        stats_ucs = stats[5]        
        stats_dn = stats[7]
        print("error:", i, " Stats:", index)        

        if(stats_epoch >= fault_epoch_begin and stats_epoch <= fault_epoch_end):
            if(stats_dn == fault_dn):
                if(stats_ucs == fault_ucs):
                    cols_stats.iloc[index, 9] = fault_code
                    cols_stats.iloc[index, 10] = fault_created

        else:
            cols_stats.iloc[index, 9] = 0
            cols_stats.iloc[index, 10] = fault_created

cols_stats.to_csv('datasets/dim_stats_error.csv', sep=',', encoding='utf-8')
  • Would probably be a lot faster without that print statement. Commented Apr 26, 2017 at 11:33
  • @StefanPochmann Noted, I will get rid of that line. Thanks Commented Apr 26, 2017 at 11:38
  • Could you do a search in the 800k_records and isolate all bugs in a 3rd file (keeping a trace of the locations), then do the matching in this 3rd file? Also, doing multiple searches in parallel would help Commented Apr 26, 2017 at 11:41
  • @pwnsauce I'm not sure I understood what you mean by "isolate all bugs in a 3rd file". Won't that mean creating a file for each bug in my bugs table? Commented Apr 26, 2017 at 11:55
  • You should try to use some of pandas' fancy indexing methods instead of computing the cross product Commented Apr 26, 2017 at 12:11

1 Answer


First of all: are you sure that your code does what you want it to do? As I see it, you keep looping over all of your statistics for every bug, so a row matched by bug #1 can later be overwritten by bug #310. It is also unclear what should happen to statistics events that don't have a matching bug event, but currently your else branch stores a fault_created value for these rows somewhat arbitrarily (whichever bug happened to be checked last). Not to mention the extra work done by checking every event for every bug, every time.
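To make the overwrite hazard concrete, here is a minimal sketch (with made-up values) in which two bugs' epoch windows cover the same statistics row; whichever bug is checked last silently wins:

import pandas as pd

stats = pd.DataFrame({'epoch': [10], 'Fault': ['']})
bugs = [('B1', 5, 15), ('B2', 8, 12)]  # (fault_code, epoch_begin, epoch_end)

for code, begin, end in bugs:
    mask = (begin <= stats['epoch']) & (stats['epoch'] <= end)
    stats.loc[mask, 'Fault'] = code  # each matching bug overwrites earlier matches

print(stats['Fault'].iloc[0])  # prints 'B2' -- only the last match survives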

The reason for the slowness is that you're not making use of the power of pandas at all. In both numpy and pandas, part of the performance comes from memory management, and the rest from vectorization. By pushing most of your work from native Python loops to vectorized functions (running compiled code), you start seeing huge speed improvements.

I'm unsure whether there's an advanced way to vectorize all of your work, but since you're looking at 310 vs 800k items, it seems perfectly reasonable to keep the loop over your bugs and vectorize the inner loop. The key is logical indexing, with which you can address all 800k rows at once:

for i, error in cols_error.iterrows():
    created, ucs, dn, epoch_end, epoch_begin, _, code = error

    # boolean Series: True for every statistics row matching this bug
    inds = ( (epoch_begin <= cols_stats['epoch']) &
             (cols_stats['epoch'] <= epoch_end) &
             (cols_stats['dn'] == dn) &
             (cols_stats['ucs'] == ucs) )
    cols_stats.loc[inds, 'Fault'] = code
    cols_stats.loc[inds, 'Created'] = created

cols_stats.to_csv('datasets/dim_stats_error.csv', sep=',', encoding='utf-8')

Note that the above does not set the unmatched rows to anything non-trivial, because I don't think your question contains a reasonable default for them. Whatever defaults you want to set should be independent of the list of bugs, so you should set these values before the whole matching ordeal.
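For instance, a minimal sketch, assuming (based on the else branch in your original loop) that 0 and an empty string are the defaults you want:

cols_stats['Fault'] = 0     # default fault code, assumed from the question's else branch
cols_stats['Created'] = ''  # assumed default; replaces the per-bug fault_created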

Note that I made some facelifts to your code. You can use an unpacking assignment to set all those values from error at once, and dropping the fault_ prefix from those variables makes this clearer. We can dispose of the prefix since we no longer define separate variables for the statistics dataframe.

As you can see, your conditions for finding all the matching statistics items for a given bug can be expressed as a single vectorized logical-indexing operation. The resulting pandas Series called inds has a bool for each row of your statistics dataframe, and passing it to .loc assigns to exactly the matching rows of the columns named 'Fault' and 'Created'. Note that you can (and probably should) index your columns by name; at least I find this much clearer and more convenient.
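Here is a toy example (made-up column names and values) showing the mechanics of such a boolean mask:

import pandas as pd

df = pd.DataFrame({'epoch': [1, 5, 9], 'dn': ['a', 'b', 'a']})
mask = (df['epoch'] >= 2) & (df['dn'] == 'a')  # one bool per row
print(mask.tolist())        # [False, False, True]
df.loc[mask, 'Fault'] = 42  # creates 'Fault', assigns only where mask is True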

Since for each bug your code and created are scalars (probably strings), the vectorized assignments cols_stats.loc[inds, 'Fault'] = code and cols_stats.loc[inds, 'Created'] = created set every indexed item of cols_stats to these scalars. I used .loc rather than chained indexing such as cols_stats['Fault'][inds] = code, which triggers pandas' SettingWithCopyWarning and is not guaranteed to modify the original dataframe.

I believe the result should be the same as before, but much faster, at the cost of increased memory use.

Further simplifications could be made in your initialization, but without an MCVE it's hard to say specifics. At the very least you can use slice notation:

cols_error = dataset.iloc[:, :7]
cols_stats = dataset1.iloc[:, 1:10]

But odds are you're only ignoring a few columns, in which case it's probably clearer to drop those instead. For instance, if dataset has a single extra column called 'junk' that you're ignoring, you can just set

cols_error = dataset.drop('junk', axis=1)
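And if several columns are irrelevant (hypothetical names here), drop accepts a list:

cols_error = dataset.drop(['junk1', 'junk2'], axis=1)  # 'junk1'/'junk2' are placeholder names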

3 Comments

Thank you so much for the detailed explanation. Makes a lot of sense. My reputation is not high enough to accept this answer but it does what I wanted. Thumbs up ;)
@Makten I'm glad it works. But you don't need reputation to accept answers: that's the one thing that any asker can do :) You just have to click the tick mark to the left of the answer.
Aah.. Didn't know that. Thanks
