
I'm using the pandas Python library to compare two DataFrames, each consisting of a column of dates and two columns of values. One of the DataFrames, call it LongDF, covers more dates than the other, call it ShortDF. Both DataFrames are indexed by date using a pandas.tseries.index.DatetimeIndex. See below (I've shortened both up just to demonstrate).

LongDF

╔════════════╦════════╦════════╗
║ Date       ║ Value1 ║ Value2 ║
╠════════════╬════════╬════════╣
║ 1990-03-17 ║ 6.84   ║ 1.77   ║
║ 1990-03-18 ║ 0.99   ║ 7.00   ║
║ 1990-03-19 ║ 4.90   ║ 8.48   ║
║ 1990-03-20 ║ 2.57   ║ 2.41   ║
║ 1990-03-21 ║ 4.10   ║ 8.33   ║
║ 1990-03-22 ║ 8.86   ║ 1.31   ║
║ 1990-03-23 ║ 6.01   ║ 6.22   ║
║ 1990-03-24 ║ 0.74   ║ 1.69   ║
║ 1990-03-25 ║ 5.56   ║ 7.30   ║
║ 1990-03-26 ║ 8.05   ║ 1.67   ║
║ 1990-03-27 ║ 8.87   ║ 8.22   ║
║ 1990-03-28 ║ 9.00   ║ 6.83   ║
║ 1990-03-29 ║ 1.34   ║ 6.00   ║
║ 1990-03-30 ║ 1.69   ║ 0.40   ║
║ 1990-03-31 ║ 8.71   ║ 3.26   ║
║ 1990-04-01 ║ 4.05   ║ 4.53   ║
║ 1990-04-02 ║ 9.75   ║ 4.79   ║
║ 1990-04-03 ║ 7.74   ║ 0.44   ║
╚════════════╩════════╩════════╝

ShortDF

╔════════════╦════════╦════════╗
║ Date       ║ Value1 ║ Value2 ║
╠════════════╬════════╬════════╣
║ 1990-03-25 ║ 1.98   ║ 3.92   ║
║ 1990-03-26 ║ 3.37   ║ 3.40   ║
║ 1990-03-27 ║ 2.93   ║ 7.93   ║
║ 1990-03-28 ║ 2.35   ║ 5.34   ║
║ 1990-03-29 ║ 1.41   ║ 7.62   ║
║ 1990-03-30 ║ 9.85   ║ 3.17   ║
║ 1990-03-31 ║ 9.95   ║ 0.35   ║
║ 1990-04-01 ║ 4.42   ║ 7.11   ║
║ 1990-04-02 ║ 1.33   ║ 6.47   ║
║ 1990-04-03 ║ 6.63   ║ 1.78   ║
╚════════════╩════════╩════════╝

What I'd like to do is reference the data occurring on the same day in each dataset, feed the data from both sets into one formula and, if the result is greater than some number, copy the date and values into another DataFrame.

I assume I should use something like for row in ShortDF.iterrows(): to iterate through each date in ShortDF, but I can't figure out how to select the corresponding row in LongDF using the DatetimeIndex.

Any help would be appreciated.

  • Are you comparing each row only within the same df, or are you comparing the same date in both dfs? If the latter, are you looking at just dates that exist in both? Commented Apr 24, 2014 at 22:10
  • @EdChum Thanks for the response, I should probably make that a little clearer above. I'm comparing between dfs. In this scenario, I happen to know that all dates in ShortDF exist in LongDF but, to the general point, I am only interested in looking at dates that exist in both sets. Commented Apr 24, 2014 at 22:14
  • In that case merge them and then depending on the complexity of your function either use a lambda or define your function and just apply it row-wise so merged = df.merge(df1, on='Date') then merged.apply(myfunc, axis=1) or merged.apply(lambda row: myfunc(row), axis=1) I'd need to see your function first though before deciding the best approach, also it's getting late here in blighty so I may not answer Commented Apr 24, 2014 at 22:17
  • In fact what I would do is merge and then perform boolean masking on the merged df: merged[merged[['Value1','Value2']].max(axis=1) > my_val] this will return the highest values for each row that are higher than your threshold value. When performing the merge you may get duplicated columns where Value1 from both dfs don't match, by default they will have suffix _x and _y, you can rename or not care seeing as you just want the highest value Commented Apr 24, 2014 at 22:24
  • @EdChum Thanks for the response. I gave it a try and got a huge string of errors. Let me see if I understand it correctly: I want to merge the two dataframes prior to using any function, correct? I would do this using merged=ShortDF.merge(LongDF, on='Date'). Am I understanding that properly? Commented Apr 24, 2014 at 22:32
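The approach sketched in this comment thread can be put together as a short runnable example. The column names follow the question; the threshold of 8 and the small data values are assumptions for illustration:

```python
import pandas as pd

# Two small frames sharing a 'Date' column, as in the question
df_long = pd.DataFrame({'Date': ['1990-03-30', '1990-03-31', '1990-04-01'],
                        'Value1': [1.69, 8.71, 4.05],
                        'Value2': [0.40, 3.26, 4.53]})
df_short = pd.DataFrame({'Date': ['1990-03-31', '1990-04-01', '1990-04-02'],
                         'Value1': [9.95, 4.42, 1.33],
                         'Value2': [0.35, 7.11, 6.47]})

# Inner merge (the default) keeps only dates present in both frames;
# duplicate column names get the default _x/_y suffixes
merged = df_short.merge(df_long, on='Date')

# Boolean mask: keep rows whose largest value exceeds a threshold
result = merged[merged[['Value1_x', 'Value2_x',
                        'Value1_y', 'Value2_y']].max(axis=1) > 8]
```

Here `merged` contains only 1990-03-31 and 1990-04-01 (the dates in both frames), and `result` keeps just 1990-03-31, whose largest value (9.95) clears the threshold.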

2 Answers


OK, I'm awake now; using your data you can do this:

In [425]:
# key here is to tell the merge to use both sides indices
merged = df1.merge(df2,left_index=True, right_index=True)
# the resultant merged dataframe will have duplicate columns, this is fine
merged
Out[425]:
            Value1_x  Value2_x  Value1_y  Value2_y
Date                                              
1990-03-25      5.56      7.30      1.98      3.92
1990-03-26      8.05      1.67      3.37      3.40
1990-03-27      8.87      8.22      2.93      7.93
1990-03-28      9.00      6.83      2.35      5.34
1990-03-29      1.34      6.00      1.41      7.62
1990-03-30      1.69      0.40      9.85      3.17
1990-03-31      8.71      3.26      9.95      0.35
1990-04-01      4.05      4.53      4.42      7.11
1990-04-02      9.75      4.79      1.33      6.47
1990-04-03      7.74      0.44      6.63      1.78

[10 rows x 4 columns]
In [432]:
# now using boolean indexing we want just the rows where there are values larger than 9 and then select the highest value
merged[merged.max(axis=1) > 9].max(axis=1)
Out[432]:
Date
1990-03-30    9.85
1990-03-31    9.95
1990-04-02    9.75
dtype: float64
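If your "formula" combines columns from both frames rather than taking a single maximum, the same merged frame works; any row-wise expression over the suffixed columns can drive the mask. A minimal sketch, assuming the test is Value1 from one frame plus Value1 from the other exceeding 10 (the threshold and data are illustrative):

```python
import pandas as pd

dates = pd.to_datetime(['1990-03-30', '1990-03-31', '1990-04-01'])
df1 = pd.DataFrame({'Value1': [1.69, 8.71, 4.05],
                    'Value2': [0.40, 3.26, 4.53]}, index=dates)
df2 = pd.DataFrame({'Value1': [9.85, 9.95, 4.42],
                    'Value2': [3.17, 0.35, 7.11]}, index=dates)

# merge on both indices, as in the answer above
merged = df1.merge(df2, left_index=True, right_index=True)

# row-wise formula over columns from both original frames
passed = merged[merged['Value1_x'] + merged['Value1_y'] > 10]
```

`passed` keeps the date index and all four value columns, so it is already the "other dataframe" the question asks for.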



OK, so sometimes I like to think of pandas DataFrames as nothing more than dictionaries. Working with dictionaries is so easy that thinking of DataFrames as simple dicts often lets you find a solution without having to get too deep into pandas.

So in your example, say, I would just create a list of common dates whose values pass some test, and then create a new DataFrame using those dates to access the values in the existing DataFrames. In my example the test is whether Value1 in DF1 plus Value1 in DF2 is greater than 10:

import pandas as pd
import random 
random.seed(123)

#Create some data
DF1 = pd.DataFrame({'Date'      :   ['1990-03-17', '1990-03-18', '1990-03-19', 
                                     '1990-03-20', '1990-03-21', '1990-03-22', 
                                     '1990-03-23', '1990-03-24', '1990-03-25', 
                                     '1990-03-26', '1990-03-27', '1990-03-28',
                                     '1990-03-29', '1990-03-30', '1990-03-31', 
                                     '1990-04-01', '1990-04-02', '1990-04-03'],
                    'Value1'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(18)],
                    'Value2'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(18)]
                   })

DF2 = pd.DataFrame({'Date'      :   ['1990-03-25', '1990-03-26', '1990-03-27', 
                                     '1990-03-28', '1990-03-29', '1990-03-30', 
                                     '1990-03-31', '1990-04-01', '1990-04-02',  
                                     '1990-04-03'],
                    'Value1'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(10)],
                    'Value2'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(10)]
                   })

DF1.set_index('Date', inplace = True)
DF2.set_index('Date', inplace = True)

#Create a list of common dates, where DF1.Value1 summed
#with DF2.Value1 is greater than 10
Common_Set = list(DF1.index.intersection(DF2.index))
Common_Dates =  [date for date in Common_Set if 
             DF1.Value1[date] + DF2.Value1[date] > 10]

#And now create the data frame I think you want using the Common_Dates

DF_Output = pd.DataFrame({'LValue1' : [DF1.Value1[date] for date in Common_Dates],
                          'LValue2' : [DF1.Value2[date] for date in Common_Dates],
                          'SValue1' : [DF2.Value1[date] for date in Common_Dates],
                          'SValue2' : [DF2.Value2[date] for date in Common_Dates]
                         }, index = Common_Dates)

This is definitely do-able in pandas as the comments suggest, but to me this is a simple solution. The Common_Dates operations could easily be done in one line, but I split them up for clarity.
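For reference, that one-liner would fold the intersection and the value test into a single comprehension. A sketch with small illustrative frames standing in for DF1/DF2 (the values are made up):

```python
import pandas as pd

DF1 = pd.DataFrame({'Value1': [8.0, 9.0, 6.0]},
                   index=['1990-03-25', '1990-03-26', '1990-03-27'])
DF2 = pd.DataFrame({'Value1': [3.0, 1.0, 9.0]},
                   index=['1990-03-26', '1990-03-27', '1990-03-28'])

# intersection of the two indices and the value test in one pass
Common_Dates = [d for d in DF1.index.intersection(DF2.index)
                if DF1.Value1[d] + DF2.Value1[d] > 10]
```

Only 1990-03-26 survives here: it is in both indices and 9.0 + 3.0 clears the threshold, while 1990-03-27 gives 6.0 + 1.0 = 7.0.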

Of course, it might be a massive pain to write out the DF_Output DataFrame constructor if you have lots of columns in both data frames. If that is the case then you could do this:

DF1_Out = {'L' + col : [DF1[col][date] for date in Common_Dates] 
            for col in DF1.columns}
DF2_Out = {'S' + col : [DF2[col][date] for date in Common_Dates] 
            for col in DF2.columns}

DF_Out = {}
DF_Out.update(DF1_Out)
DF_Out.update(DF2_Out)

DF_Output2 = pd.DataFrame(DF_Out, index = Common_Dates)

Both methods give me this:

            LValue1  LValue2  SValue1  SValue2
1990-03-25     8.67     6.16     3.84     4.37
1990-03-27     4.03     8.54     7.92     7.79
1990-03-29     3.21     4.09     7.16     8.38
1990-03-31     4.93     2.86     7.00     6.92
1990-04-01     1.79     6.48     9.01     2.53
1990-04-02     6.38     5.74     5.38     4.03

This won't satisfy a lot of people, I imagine, but it is the way I would tackle it. P.S. it would be great if you could do the legwork of creating the data frames in subsequent questions.
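For comparison, the same result can be reached without leaving pandas: an inner join aligns the two frames on their shared index, and a boolean mask replaces the list comprehensions. A sketch, assuming the same suffix naming scheme as above and illustrative values:

```python
import pandas as pd

df1 = pd.DataFrame({'Value1': [8.67, 4.03, 1.10],
                    'Value2': [6.16, 8.54, 2.20]},
                   index=['1990-03-25', '1990-03-27', '1990-03-28'])
df2 = pd.DataFrame({'Value1': [3.84, 7.92, 2.00],
                    'Value2': [4.37, 7.79, 3.00]},
                   index=['1990-03-25', '1990-03-27', '1990-03-28'])

# inner join on the index keeps only dates present in both frames
joined = df1.join(df2, how='inner', lsuffix='_L', rsuffix='_S')

# vectorised equivalent of the Value1 + Value1 > 10 test
out = joined[joined['Value1_L'] + joined['Value1_S'] > 10]
```

The mask drops 1990-03-28 (1.10 + 2.00), keeping the two dates whose Value1 columns sum past 10, with no Python-level loop over dates.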

