1

I am sorry I didnt really know how to word the title of this question. I do not work with Python too often and I am just starting to work with the pandas and numpy packages.

I am getting unexpected results when trying to concatenate and append a pandas dataframe in a for loop.

I have a data set that I got from sql and put into a pandas dataframe (df):

print(df.head())

   date visitor  visitor_score home  home_score   W   L
0  20160405     BOS              6  CLE           2  94  67
1  20160406     BOS              6  CLE           7  94  67
2  20160408     BOS              8  TOR           7  89  73
3  20160409     BOS              8  TOR           4  89  73
4  20160410     BOS              0  TOR           3  89  73

I have another data set from sql that I also put in a pandas data frame (dfBostonStats):

print(dfBostonStats.head())

  teamID    ab     h   2b  3b   hr    so  sb   ra   er  era  IPouts    HA  \
0    BOS  5670  1598  343  25  208  1160  83  694  640  4.0    4319  1342

   hra   soa   e     fp  bpf  ppf   dp
0  176  1362  75  0.987  108  106  139

I want to concatenate that data frame (dfBostonStats) to each row of the first data frame (df).

I determined I could use pandas.concat and I proved this through concatenating the first row of df:

print(pd.concat([df.iloc[[0]], dfBostonStats], axis=1))

       date visitor  visitor_score home  home_score   W   L teamID    ab  \
0  20160405     BOS              6  CLE           2  94  67    BOS  5670

      h ...   era  IPouts    HA  hra   soa   e     fp  bpf  ppf   dp
0  1598 ...   4.0    4319  1342  176  1362  75  0.987  108  106  139

I then tried to concatenate each row by using a for loop but it gives me an unexpected result. it concatenates one row properly but then prints me a row of just the 2nd dataframe I have listed (dfBostonStats)

for index, element in df.iterrows():
   tempdf = pd.concat([df.iloc[[index]], dfBostonStats], axis=1)
   concatDataFrame = concatDataFrame.append(tempdf, ignore_index=True)


print(concatDataFrame.head())

       date visitor  visitor_score home  home_score     W     L teamID  \
0  20160405     BOS            6.0  CLE         2.0  94.0  67.0    BOS
1       NaN     NaN            NaN  NaN         NaN   NaN   NaN    BOS
2  20160406     BOS            6.0  CLE         7.0  94.0  67.0    NaN
3       NaN     NaN            NaN  NaN         NaN   NaN   NaN    BOS
4  20160408     BOS            8.0  TOR         7.0  89.0  73.0    NaN

       ab       h ...   era  IPouts      HA    hra     soa     e     fp  \
0  5670.0  1598.0 ...   4.0  4319.0  1342.0  176.0  1362.0  75.0  0.987
1  5670.0  1598.0 ...   4.0  4319.0  1342.0  176.0  1362.0  75.0  0.987
2     NaN     NaN ...   NaN     NaN     NaN    NaN     NaN   NaN    NaN
3  5670.0  1598.0 ...   4.0  4319.0  1342.0  176.0  1362.0  75.0  0.987
4     NaN     NaN ...   NaN     NaN     NaN    NaN     NaN   NaN    NaN

     bpf    ppf   dp
0  108.0  106.0  139
1  108.0  106.0  139
2    NaN    NaN  NaN
3  108.0  106.0  139
4    NaN    NaN  NaN

I can not figure out why it is printing that row with only dfBostonStats rather then just printing only concatenated rows?

On a side note, I know inside the for loop there is a copy occuring every time causing a performance hit but I figured I would deal with that once I get the data looking how it should.

4
  • Is necessary loop? Do you need add one column DataFrame dfBostonStats to df ? Commented Feb 9, 2018 at 12:38
  • It is necessary to loop because I have other teams I am going to do this for. I was planning on having a loop then I will have a switch statement and append a different data frame depending on the team Commented Feb 9, 2018 at 12:42
  • Where are the other teams stored? In list, dict, dataframe? Commented Feb 9, 2018 at 13:46
  • the other teams will also be stored in a dataframe, because they are sql queries. Each team will be seperate dataframes if i use the for loop with a switch inside it. On the other hand, they could just be one dataframe if I used a different approach. Commented Feb 9, 2018 at 14:06

1 Answer 1

1

I think if need join first dataframe by column visitor and second by column teamID use merge with left join. No loop is necessary:

print (df)
       date visitor  visitor_score home  home_score   W   L
0  20160405     BOS              6  CLE           2  94  67
1  20160406     BOS              6  CLE           7  94  67
2  20160408     AAA              8  TOR           7  89  73
3  20160409     AAA              8  TOR           4  89  73
4  20160410     AAA              0  TOR           3  89  73

print (dfBostonStats)
  teamID    ab     h   2b  3b   hr    so  sb   ra   er  era  IPouts    HA  \
0    BOS  5670  1598  343  25  208  1160  83  694  640  4.0    4319  1342   
0    AAA     4     5    6   4    5  1160  83  694  640  4.0    4319  1342   

   hra   soa   e     fp  bpf  ppf   dp  
0  176  1362  75  0.987   10  106  139  
0  176  1362  75  0.987   10  106  139  

df2 = df.merge(dfBostonStats, left_on='visitor', right_on='teamID', how='left')
print (df2)
       date visitor  visitor_score home  home_score   W   L teamID    ab  \
0  20160405     BOS              6  CLE           2  94  67    BOS  5670   
1  20160406     BOS              6  CLE           7  94  67    BOS  5670   
2  20160408     AAA              8  TOR           7  89  73    AAA     4   
3  20160409     AAA              8  TOR           4  89  73    AAA     4   
4  20160410     AAA              0  TOR           3  89  73    AAA     4   

      h ...   era  IPouts    HA  hra   soa   e     fp  bpf  ppf   dp  
0  1598 ...   4.0    4319  1342  176  1362  75  0.987   10  106  139  
1  1598 ...   4.0    4319  1342  176  1362  75  0.987   10  106  139  
2     5 ...   4.0    4319  1342  176  1362  75  0.987   10  106  139  
3     5 ...   4.0    4319  1342  176  1362  75  0.987   10  106  139  
4     5 ...   4.0    4319  1342  176  1362  75  0.987   10  106  139  

[5 rows x 27 columns]
Sign up to request clarification or add additional context in comments.

5 Comments

using your first method I see how I would assign it to the dataframe as a whole but how would i assign it to one element at a time in the for loop? This will obviously blow up: for index, element in df.iterrows(): df = element.assign(**dfBostonStats.iloc[0]) I dont know how else to look at it (maybe i have been staring at it for too long!)
I think you need second solution, if need join first dataframe by column visitor and second by column teamID. No loop is necessary.
I am at work now and away from my PC, when I get home I will give it a shot
Sure, no problem.
The merge solution worked! Thank you for your help sir

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.