
I have a Python program that takes a long time to run, most likely because I am using loops, and I am hoping to get some help using pandas or NumPy to speed up one section. It seems like the first for loop could be optimized with pandas or NumPy. That said, I am not familiar enough with the intricacies of either library to achieve what this loop does. Any help is appreciated, and please let me know if there are any questions. Thank you!

import pandas

df = ...  # the sample data shown below
df2 = pandas.DataFrame()

# Repeat each row of df V times in df2 (rows with V == 0 are dropped).
for i in df.index:
    if df.V[i] > 1:
        for f in range(0, df.V[i]):
            df2 = df2.append(df.loc[i], ignore_index=True)
    elif df.V[i] == 1:
        df2 = df2.append(df.loc[i], ignore_index=True)


df2.V = 1
df2['Grouper'] = ""

bv = 10
y = bv
x = len(df2)

# Label each block of bv consecutive rows with the block's starting position.
for d in range(0, x, y):
    z = d + y
    df2['Grouper'][d:z] = d


df3 = df2.groupby('Grouper').agg({'Date_Time':'first','L1':'last','H':'max','L2':'min','O':'first'})
df3 = df3.reset_index(drop=True)
df3 = df3[['Date_Time','O','H','L1','L2']]
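As an aside, the second loop only labels each block of bv consecutive rows with the block's starting position, which could likely be computed without a loop. A minimal sketch with dummy data (not the real expanded frame):

```python
import numpy as np
import pandas as pd

bv = 10  # bucket size, as in the code above
df2 = pd.DataFrame({'x': np.arange(25)})  # dummy stand-in for the expanded frame

# Row position // bv gives the bucket number; multiplying back by bv
# reproduces the labels 0, 10, 20, ... that the slice assignment produces.
df2['Grouper'] = (np.arange(len(df2)) // bv) * bv
```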

This is a sample of the data I am using with this program (df):

                Date_Time   O      H       L1     L2          V
0     2016-10-13 17:00:00  50.39  50.39  50.39  50.39       1
1     2016-10-13 17:00:02  50.39  50.39  50.39  50.39      27
2     2016-10-13 17:00:04  50.38  50.38  50.38  50.38       1
3     2016-10-13 17:00:09  50.38  50.38  50.38  50.38       1
4     2016-10-13 17:00:10  50.38  50.38  50.38  50.38       6
5     2016-10-13 17:00:14  50.38  50.38  50.38  50.38      19
6     2016-10-13 17:00:15  50.38  50.38  50.38  50.38       3
7     2016-10-13 17:00:20  50.37  50.38  50.37  50.38       5
8     2016-10-13 17:00:21  50.38  50.38  50.38  50.38       2
9     2016-10-13 17:00:22  50.38  50.38  50.37  50.37       3
  • Could you include the original and intended data? The given dataframe won't work, because having two columns with the same name (which you should avoid at all costs) causes the groupby to crash. It would also be helpful if you explained what you're trying to achieve in words, because it may be possible to try a different approach. Commented Dec 2, 2016 at 17:39
  • @KenWei I applied the changes you recommended. The objective is to break down the data by the 'V' column, so row 2 in df would have 27 entries of the same data in df2. I am then regrouping the data in df2 to create buckets based on a sum of entries in column 'V'. Essentially I am trying to create constant-volume bars. Commented Dec 2, 2016 at 18:03

1 Answer


The for loop is certainly very slow: for a start, indexing into a dataframe is rather costly in terms of computational time. There is also a small performance loss from the chained indexing in df.V[i]; things would be slightly faster if you did df.loc[i, 'V'] instead. Nevertheless, iterating through a dataframe by its index is very slow and can usually be avoided (and if you absolutely have to, df.iterrows() gives you an iterator which is slightly faster). The other source of slowness in your code is that every call to the .append() method creates a copy of your dataframe, which gets unwieldy for big data sets. For this example, we can avoid pretty much all of this.
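For the row-duplication step specifically, a vectorized sketch using Index.repeat (shown here with a made-up miniature of the data, not the question's full frame):

```python
import pandas as pd

# Made-up miniature of the question's data: repeat each row V times.
df = pd.DataFrame({'O': [50.39, 50.39, 50.38], 'V': [1, 27, 1]})

# Index.repeat duplicates each index label V times; .loc then expands
# the rows, replacing the whole append loop in one shot.
df2 = df.loc[df.index.repeat(df['V'])].reset_index(drop=True)
df2['V'] = 1  # as in the original code
```

This avoids both per-row indexing and the repeated full-frame copies made by .append().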

I'm going to guess that you have some time series data indexed by integers; pandas can handle this sort of problem (resampling) very well when you have data indexed by time, so we will force a time format onto the data and use the resample method.

import pandas as pd

# Assign a time to each datapoint: here, nanoseconds after the start of Unix time.
df['v'] = pd.to_datetime(df.V.cumsum())
r = df.set_index('v').resample('10N')  # the 'N' stands for nanoseconds
df3 = r.agg({'Date_Time': 'first', 'L1': 'last', 'H': 'max', 'L2': 'min', 'O': 'first'})
df3.interpolate(method='zero', inplace=True)
df3.reset_index(drop=True, inplace=True)

# interpolate() does not fill string columns, so copy Date_Time forward manually.
to_fix = df3.index[df3.Date_Time.isnull()]
for i in to_fix:
    df3.loc[i, 'Date_Time'] = df3.loc[i - 1, 'Date_Time']
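As a possible alternative to that trailing loop, a vectorized forward fill should perform the same copy-the-previous-row replacement for the string column. A sketch with placeholder strings, not the answer's exact frame:

```python
import pandas as pd

# Placeholder frame: None marks the rows the loop above would fix.
df3 = pd.DataFrame({'Date_Time': ['17:00:00', None, '17:00:10', None]})

# Forward fill copies the last non-missing value downward, matching the
# loop that sets each missing Date_Time to the preceding row's value.
df3['Date_Time'] = df3['Date_Time'].ffill()
```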

Some remarks:

  1. The interpolate method is needed for when you have a large V that 'crosses' aggregation periods; the argument method='zero' tells pandas to just take the value of the row preceding it. However, it doesn't work for strings, hence you need to replace these manually; the best I can think of is a for loop that hopefully will not iterate too many times.
  2. inplace=True arguments mean the data is modified directly, as opposed to creating a modified copy of the dataframe and replacing the old one with it.
  3. The .agg() method allows you to use more than one function on the column you wish to aggregate, so if you wanted the 'min' and 'last' of L, you can pass the argument {'L': ['min', 'last']} or {'L': {'custom_colname1': 'min', 'custom_colname2': 'last'}}. The upshot is that this produces a nested column index, where the simplest method of flattening is to modify the .columns attribute of the dataframe directly (but be careful about the order of names, as the dictionary passed into the .agg() method has no ordering).
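To illustrate remark 3 with a made-up frame and hypothetical column names, here is one way of passing a list of functions to .agg() and then flattening the resulting nested column index:

```python
import pandas as pd

df = pd.DataFrame({'G': [0, 0, 1], 'L': [1.0, 2.0, 3.0]})

# Two aggregations of the same column produce a nested (MultiIndex) column axis.
out = df.groupby('G').agg({'L': ['min', 'last']})

# Flatten by joining the levels of each column tuple, e.g. ('L', 'min') -> 'L_min'.
out.columns = ['_'.join(c) for c in out.columns]
```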

2 Comments

I might not be using the code correctly; I can't seem to get the results to match. Results I would expect from the code and sample data:

                Date_Time   O      H      L1     L2
  0   2016-10-13 17:00:00  50.39  50.39  50.39  50.39
  1   2016-10-13 17:00:02  50.39  50.39  50.39  50.39
  2   2016-10-13 17:00:02  50.39  50.39  50.38  50.38
  3   2016-10-13 17:00:10  50.38  50.38  50.38  50.38
  4   2016-10-13 17:00:14  50.38  50.38  50.38  50.38
  5   2016-10-13 17:00:14  50.38  50.38  50.37  50.38
  6   2016-10-13 17:00:20  50.37  50.38  50.37  50.37
This link better explains what I am trying to do: help.cqg.com/cqgic/14/default.htm#!Documents/…
