
I am trying to join two pandas data frames with an inner join.

my_df = pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])

However I am getting the following error:

KeyError: 'myDate' TypeError: an integer is required

I believe joining on dates is valid, but I cannot make this simple join work.

df2 was created using the following:

df2 = idf.groupby(lambda x: (x.year,x.month,x.day)).mean()

Can someone please advise? Thanks a lot.

df1
type    object
id      object
date    object
value   float64 

    type    id          date       value
0   CAR     PSTAT001    15/07/15    42
1   BIKE    PSTAT001    16/07/15    42
2   BIKE    PSTAT001    17/07/15    42
3   BIKE    PSTAT004    18/07/15    42
4   BIKE    PSTAT001    19/07/15    32

df2 
myDate  object
val1    float64
val2    float64
val3    float64

    myDate     val1         val2           val3
0   (2015,7,13) 1074        1871.666667    2800.777778
1   (2015,7,14) 347.958333  809.416667     1308.458333
2   (2015,7,15) 202.625     597.375        1008.666667
3   (2015,7,16) 494.958333  1192           1886.916667

DF1.info()

<class  'pandas.core.frame.DataFrame'>              
Int64Index: 3040    entries,    0   to  3039
Data    columns (total  4   columns):   
type    3040    non-null    object      
id      3040    non-null    object      
date    3040    non-null    object      
value   3040    non-null    float64     
dtypes: float64(1), object(3)           
memory  usage:  118.8+  KB  

DF2.info()

<class  'pandas.core.frame.DataFrame'>              
Int64Index: 16  entries,    0   to  15
Data    columns (total  4   columns):   
myDate  16  non-null    object      
val1    16  non-null    float64     
val2    16  non-null    float64     
val3    16  non-null    float64     
dtypes: float64(3), object(1)           
memory  usage:  640.0+  bytes   
  • Your df2['myDate'] looks like a tuple with ints, can you post the output from df1.info() and df2.info() Commented Jul 13, 2015 at 13:05
  • DF1<class 'pandas.core.frame.DataFrame'> Int64Index: 3040 entries, 0 to 3039 Data columns (total 4 columns): type 3040 non-null object id 3040 non-null object date 3040 non-null object value 3040 non-null float64 dtypes: float64(1), object(3) memory usage: 118.8+ KB Commented Jul 13, 2015 at 13:11
  • DF2: <class 'pandas.core.frame.DataFrame'> Int64Index: 16 entries, 0 to 15 Data columns (total 4 columns): myDate 16 non-null object val1 16 non-null float64 val2 16 non-null float64 val3 16 non-null float64 dtypes: float64(3), object(1) memory usage: 640.0+ bytes Commented Jul 13, 2015 at 13:12
  • Please edit your question, formatting is lost in comments, thanks Commented Jul 13, 2015 at 13:12
  • Apologies for lack of formatting, still learning. Commented Jul 13, 2015 at 13:12

2 Answers

Neither of your date columns has a datetime dtype: df1['date'] holds strings while df2['myDate'] holds tuples. Convert both to datetime first and then the merge will work:

In [75]:
df1['date'] = pd.to_datetime(df1['date'], format='%d/%m/%y')
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
type     5 non-null object
id       5 non-null object
date     5 non-null datetime64[ns]
value    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 200.0+ bytes

In [76]:
import datetime as dt
df2['myDate'] = df2['myDate'].apply(lambda x: dt.datetime(x[0], x[1], x[2]))
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
myDate    4 non-null datetime64[ns]
val1      4 non-null float64
val2      4 non-null float64
val3      4 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 160.0 bytes

In [78]:    
my_df=  pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])
my_df

Out[78]:
   type        id       date  value     myDate        val1      val2  \
0   CAR  PSTAT001 2015-07-15     42 2015-07-15  202.625000   597.375   
1  BIKE  PSTAT001 2015-07-16     42 2015-07-16  494.958333  1192.000   

          val3  
0  1008.666667  
1  1886.916667 
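One thing to watch with the conversion above: strings such as 01/08/15 are ambiguous between day-first and month-first, so it may be safer to state the format explicitly. The following is a minimal sketch under assumptions, not your real data: the rows (including the 01/08/15 entry and its val1 of 100.0) are made up for illustration, and the tuple column is converted by splitting it into year/month/day columns, which pd.to_datetime can assemble.

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question.
df1 = pd.DataFrame({
    "type": ["CAR", "BIKE", "BIKE"],
    "id": ["PSTAT001", "PSTAT001", "PSTAT001"],
    "date": ["15/07/15", "16/07/15", "01/08/15"],
    "value": [42.0, 42.0, 32.0],
})
df2 = pd.DataFrame({
    "myDate": [(2015, 7, 15), (2015, 7, 16), (2015, 8, 1)],
    "val1": [202.625, 494.958333, 100.0],
})

# Spell out day/month/two-digit year so "01/08/15" is read as 1 August 2015.
df1["date"] = pd.to_datetime(df1["date"], format="%d/%m/%y")

# Split the tuples into year/month/day columns and let pandas assemble them.
parts = pd.DataFrame(
    df2["myDate"].tolist(),
    columns=["year", "month", "day"],
    index=df2.index,
)
df2["myDate"] = pd.to_datetime(parts)

my_df = pd.merge(df1, df2, how="inner", left_on="date", right_on="myDate")
```

Passing index=df2.index keeps the assembled dates aligned with df2 even if its index is not the default 0..n-1.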

2 Comments

Ed Chum you are a legend, that worked a treat. Many thanks for your help.
I just upvoted as well. :) That's what I get for getting my coffee before writing my answer.

As the comments say, the lack of matches comes from differing data formats. df1's 'date' column is an object column holding strings, while df2's 'myDate' is an object column holding tuples.
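Since info() cannot tell you what kind of Python objects sit in a column, it can help to look at the element types directly. This is only a sketch on made-up two-row stand-ins for the frames in the question:

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
df1 = pd.DataFrame({"date": ["15/07/15", "16/07/15"], "value": [42.0, 42.0]})
df2 = pd.DataFrame({
    "myDate": [(2015, 7, 15), (2015, 7, 16)],
    "val1": [202.625, 494.958333],
})

# Collect the Python type of every element in each key column.
left_types = set(df1["date"].map(type))
right_types = set(df2["myDate"].map(type))

print(left_types)   # the element types found in df1['date']
print(right_types)  # the element types found in df2['myDate']
```

If the two sets differ, as they do here (str versus tuple), the keys can never compare equal and the merge has nothing to match.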

First let's convert df1 'date' into datetime, as @EdChum suggests.

import pandas as pd

df1 = pd.DataFrame(data=[['CAR', 'PSTAT001', '15/07/15', 42]],
                   columns=['type', 'id', 'date', 'value'])

# Dates are day/month/two-digit year, so state the format explicitly
df1['date'] = pd.to_datetime(df1['date'], format='%d/%m/%y')

Then, again as @EdChum suggests, we convert each tuple into a datetime using the datetime module.

from datetime import datetime

# Build from a plain list; recent NumPy versions reject a ragged np.array
df2 = pd.DataFrame(data=[[(2015, 7, 15), 202.625, 597.375, 1008.666667]],
                   columns=['myDate', 'val1', 'val2', 'val3'])

df2['myDate'] = df2['myDate'].apply(lambda x: datetime(x[0], x[1], x[2]))

And from there our merge works. I used only row 0 of df1 and the matching (2015,7,15) row of df2 to make things simpler in my IDE.

my_df=  pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])

my_df[:1]
Out[21]: 
  type        id       date value     myDate     val1     val2      val3
0  CAR  PSTAT001 2015-07-15    42 2015-07-15  202.625  597.375  1008.667
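It may also be possible to avoid the tuples altogether when df2 is built. The sketch below assumes idf has a DatetimeIndex (the lambda in the question suggests it does, but idf itself is not shown, so the data here is hypothetical). Grouping on the normalized index keeps the keys as datetimes instead of (year, month, day) tuples:

```python
import pandas as pd

# Hypothetical intraday readings; the real idf is not shown in the question.
idf = pd.DataFrame(
    {"val1": [1.0, 3.0, 10.0]},
    index=pd.to_datetime(
        ["2015-07-15 09:00", "2015-07-15 17:00", "2015-07-16 12:00"]
    ),
)

# normalize() floors each timestamp to midnight, so group keys stay datetimes.
df2 = idf.groupby(idf.index.normalize()).mean()

# Name the index and turn it into a regular column that merge can reference.
df2 = df2.rename_axis("myDate").reset_index()
```

With df2 built this way, only df1['date'] still needs converting before the merge.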
