pandas: merge two files based on multiple columns

Question

I have two files which have 3 common columns - Date, KeywordId, AdGroupId. I want to merge these two files based on these columns such that for each row with a particular date, keywordid, adgroupid in the first file, if there's corresponding row with same date, keywordid, adgroupid in the second file, then append the rest of the values present in the second file and if not just append null or - in the rest of the columns.

The first file (df1 here) has 59,00,000 rows. The second file has around 10,00,000 rows. I used the code below

 import pandas as pd

df1 = pd.read_csv(r"C:\Users\Rakshit Lal\Desktop\QVC Data\psnb_extract_daily\Final\cumulative_adwords_test.csv")
df2 = pd.read_csv(r"C:\Users\Rakshit Lal\Desktop\QVC Data\psnb_extract_daily\Final\Test_psnbfull.csv")

# Merge the two dataframes, using _ID column as key
df3 = pd.merge(df1, df2, on = ['Date', 'KeywordId', 'AdGroupId'])
df3.set_index('Date', inplace = True)

# Write it to a new CSV file
df3.to_csv('CSV3.csv')

# Write it to a new CSV file
df3.to_csv('CSV3.csv')

My final file - csv3 contains only 6,05,277 rows for some reason where it should contain 59,00,000 rows (as in file 1). I believe I'm making an error with the way I'm using the merge function. Can someone help me out on where I'm going wrong and what I should modify?

Perhaps you want to set the how keyword argument to 'outer'? — 9769953
– 9769953, Commented Jul 27, 2020 at 14:13
Is it guaranteed that each row in df2 matches one or more rows in df1? That is how I read your question, but it's not entirely clear. — 9769953
– 9769953, Commented Jul 27, 2020 at 14:13
For more information, have a read through pandas.pydata.org/pandas-docs/stable/user_guide/merging.html . The figures may give you an idea of what to use in which case. — 9769953
– 9769953, Commented Jul 27, 2020 at 14:14
No. For any particular row in df2 there might not be a corresponding row in df1. But all I want is to have all the rows in df1 intact in my final csv and if there's a corresponding entry in df1, then add the values for those rows and if there isn't then just leave those extra columns for those rows blank or null — Rakshit
– Rakshit, Commented Jul 27, 2020 at 14:30
Then 'left' is indeed the better choice compared to 'outer'. — 9769953
– 9769953, Commented Jul 27, 2020 at 16:26

blutab · Accepted Answer · 2020-07-27 14:14:25Z

2

If you don't specify how to merge it takes a inner join - but actually you want to do left join. You can use 'left' merge:

df3 = pd.merge(df1, df2, on = ['Date', 'KeywordId', 'AdGroupId'], how = 'left')

answered Jul 27, 2020 at 14:14

blutab

1819 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Rakshit Over a year ago

I'll try this out and let you know if it works! Thanks.

Rakshit Over a year ago

Any idea on what I should include to have the rows sorted by the date column (descending)?

blutab Over a year ago

This could help you: stackoverflow.com/questions/28161356/…

Collectives™ on Stack Overflow

pandas: merge two files based on multiple columns

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related