
I have a Python program that takes a long time to run, most likely because I am using loops, and I am hoping to get some help using pandas or NumPy to speed up one section. It seems like the first for loop could be optimized with pandas or NumPy. That said, I am not familiar enough with the intricacies of either library to achieve what this loop does. Any help is appreciated, and please let me know if there are any questions. Thank you!

import pandas

df = ...  # the sample data shown below
df2 = pandas.DataFrame()

# Repeat each row of df V times in df2 (rows with V == 0 are dropped).
for i in df.index:
    if df.V[i] > 1:
        for f in range(0, df.V[i]):
            df2 = df2.append(df.loc[i], ignore_index=True)
    elif df.V[i] == 1:
        df2 = df2.append(df.loc[i], ignore_index=True)


df2.V = 1
df2['Grouper'] = ""

bv = 10
y = bv
x = len(df2)

# Label each block of bv consecutive rows with the block's starting position.
for d in range(0, x, y):
    z = d + y
    df2['Grouper'][d:z] = d


df3 = df2.groupby('Grouper').agg({'Date_Time':'first','L1':'last','H':'max','L2':'min','O':'first'})
df3 = df3.reset_index(drop=True)
df3 = df3[['Date_Time','O','H','L1','L2']]
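As an aside, the second loop only labels each block of bv consecutive rows with the block's starting position, which could likely be computed without a loop. A minimal sketch with dummy data (not the real expanded frame):

```python
import numpy as np
import pandas as pd

bv = 10  # bucket size, as in the code above
df2 = pd.DataFrame({'x': np.arange(25)})  # dummy stand-in for the expanded frame

# Row position // bv gives the bucket number; multiplying back by bv
# reproduces the labels 0, 10, 20, ... that the slice assignment produces.
df2['Grouper'] = (np.arange(len(df2)) // bv) * bv
```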

This is a sample of the data I am using with this program (df):

                Date_Time   O      H       L1     L2          V
0     2016-10-13 17:00:00  50.39  50.39  50.39  50.39       1
1     2016-10-13 17:00:02  50.39  50.39  50.39  50.39      27
2     2016-10-13 17:00:04  50.38  50.38  50.38  50.38       1
3     2016-10-13 17:00:09  50.38  50.38  50.38  50.38       1
4     2016-10-13 17:00:10  50.38  50.38  50.38  50.38       6
5     2016-10-13 17:00:14  50.38  50.38  50.38  50.38      19
6     2016-10-13 17:00:15  50.38  50.38  50.38  50.38       3
7     2016-10-13 17:00:20  50.37  50.38  50.37  50.38       5
8     2016-10-13 17:00:21  50.38  50.38  50.38  50.38       2
9     2016-10-13 17:00:22  50.38  50.38  50.37  50.37       3
  • Could you include the original and intended data? The given dataframe won't work, because having two columns with the same name (which you should avoid at all costs) causes the groupby to crash. It would also be helpful if you explained what you're trying to achieve in words, because it may be possible to try a different approach. Commented Dec 2, 2016 at 17:39
  • @KenWei I applied the changes you recommended. The objective is to break down the data by the 'V' column, so row 2 in df would have 27 entries of the same data in df2. I am then regrouping the data in df2 to create buckets based on a sum of entries in column 'V'. Essentially I am trying to create constant-volume bars. Commented Dec 2, 2016 at 18:03

1 Answer


The for loop is certainly very slow: for a start, indexing into a dataframe is rather costly in terms of computational time. There is also a small performance loss from the chained indexing in df.V[i]; things would be slightly faster if you did df.loc[i, 'V'] instead. Nevertheless, iterating through a dataframe by its index is very slow and can usually be avoided (and if you absolutely have to, df.iterrows() gives you an iterator which is slightly faster). The other source of slowness in your code is that every call to the .append() method creates a copy of your dataframe, which gets unwieldy for big data sets. For this example, we can avoid pretty much all of this.
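For the row-duplication step specifically, a vectorized sketch using Index.repeat (shown here with a made-up miniature of the data, not the question's full frame):

```python
import pandas as pd

# Made-up miniature of the question's data: repeat each row V times.
df = pd.DataFrame({'O': [50.39, 50.39, 50.38], 'V': [1, 27, 1]})

# Index.repeat duplicates each index label V times; .loc then expands
# the rows, replacing the whole append loop in one shot.
df2 = df.loc[df.index.repeat(df['V'])].reset_index(drop=True)
df2['V'] = 1  # as in the original code
```

This avoids both per-row indexing and the repeated full-frame copies made by .append().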

I'm going to guess that you have some time series data indexed by integers; pandas can handle this sort of problem (resampling) very well when you have data indexed by time, so we will force a time format onto the data and use the resample method.

import pandas as pd

# Assign a time to each datapoint: here, nanoseconds after the start of Unix time.
df['v'] = pd.to_datetime(df.V.cumsum())
r = df.set_index('v').resample('10N')  # the 'N' stands for nanoseconds
df3 = r.agg({'Date_Time': 'first', 'L1': 'last', 'H': 'max', 'L2': 'min', 'O': 'first'})
df3.interpolate(method='zero', inplace=True)
df3.reset_index(drop=True, inplace=True)

# interpolate() does not fill string columns, so copy Date_Time forward manually.
to_fix = df3.index[df3.Date_Time.isnull()]
for i in to_fix:
    df3.loc[i, 'Date_Time'] = df3.loc[i - 1, 'Date_Time']
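As a possible alternative to that trailing loop, a vectorized forward fill should perform the same copy-the-previous-row replacement for the string column. A sketch with placeholder strings, not the answer's exact frame:

```python
import pandas as pd

# Placeholder frame: None marks the rows the loop above would fix.
df3 = pd.DataFrame({'Date_Time': ['17:00:00', None, '17:00:10', None]})

# Forward fill copies the last non-missing value downward, matching the
# loop that sets each missing Date_Time to the preceding row's value.
df3['Date_Time'] = df3['Date_Time'].ffill()
```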

Some remarks:

  1. The interpolate method is needed for when you have a large V that 'crosses' aggregation periods; the argument method='zero' tells pandas to just take the value of the row preceding it. However, it doesn't work for strings, hence you need to replace these manually; the best I can think of is a for loop that hopefully will not iterate too many times.
  2. inplace=True arguments mean the data is modified directly, as opposed to creating a modified copy of the dataframe and replacing the old one with it.
  3. The .agg() method allows you to use more than one function on the column you wish to aggregate, so if you wanted the 'min' and 'last' of L, you can pass the argument {'L': ['min', 'last']} or {'L': {'custom_colname1': 'min', 'custom_colname2': 'last'}}. The upshot is that this produces a nested column index, where the simplest method of flattening is to modify the .columns attribute of the dataframe directly (but be careful about the order of names, as the dictionary passed into the .agg() method has no ordering).
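To illustrate remark 3 with a made-up frame and hypothetical column names, here is one way of passing a list of functions to .agg() and then flattening the resulting nested column index:

```python
import pandas as pd

df = pd.DataFrame({'G': [0, 0, 1], 'L': [1.0, 2.0, 3.0]})

# Two aggregations of the same column produce a nested (MultiIndex) column axis.
out = df.groupby('G').agg({'L': ['min', 'last']})

# Flatten by joining the levels of each column tuple, e.g. ('L', 'min') -> 'L_min'.
out.columns = ['_'.join(c) for c in out.columns]
```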

2 Comments

I might not be using the code correctly; I can't seem to get the results to match. Results I would expect from the code and sample data:

                Date_Time   O      H      L1     L2
  0   2016-10-13 17:00:00  50.39  50.39  50.39  50.39
  1   2016-10-13 17:00:02  50.39  50.39  50.39  50.39
  2   2016-10-13 17:00:02  50.39  50.39  50.38  50.38
  3   2016-10-13 17:00:10  50.38  50.38  50.38  50.38
  4   2016-10-13 17:00:14  50.38  50.38  50.38  50.38
  5   2016-10-13 17:00:14  50.38  50.38  50.37  50.38
  6   2016-10-13 17:00:20  50.37  50.38  50.37  50.37
This link better explains what I am trying to do: help.cqg.com/cqgic/14/default.htm#!Documents/…
