I have two files that share three columns: Date, KeywordId, and AdGroupId. I want to merge them on these columns so that every row of the first file is kept. If the second file has a row with the same Date, KeywordId, and AdGroupId, the remaining values from that row should be appended; if not, the extra columns should be filled with null or -.

The first file (df1 here) has 5,900,000 rows. The second file has around 1,000,000 rows. I used the code below:

import pandas as pd

df1 = pd.read_csv(r"C:\Users\Rakshit Lal\Desktop\QVC Data\psnb_extract_daily\Final\cumulative_adwords_test.csv")
df2 = pd.read_csv(r"C:\Users\Rakshit Lal\Desktop\QVC Data\psnb_extract_daily\Final\Test_psnbfull.csv")

# Merge the two dataframes on the Date, KeywordId and AdGroupId columns
df3 = pd.merge(df1, df2, on = ['Date', 'KeywordId', 'AdGroupId'])
df3.set_index('Date', inplace = True)

# Write it to a new CSV file
df3.to_csv('CSV3.csv')

My final file, CSV3.csv, contains only 605,277 rows, whereas it should contain 5,900,000 rows (as in file 1). I believe I'm using the merge function incorrectly. Can someone point out where I'm going wrong and what I should change?

  • Perhaps you want to set the how keyword argument to 'outer'? Commented Jul 27, 2020 at 14:13
  • Is it guaranteed that each row in df2 matches one or more rows in df1? That is how I read your question, but it's not entirely clear. Commented Jul 27, 2020 at 14:13
  • For more information, have a read through pandas.pydata.org/pandas-docs/stable/user_guide/merging.html . The figures may give you an idea of what to use in which case. Commented Jul 27, 2020 at 14:14
  • No. A given row in df2 might not have a corresponding row in df1. All I want is to keep every row of df1 intact in my final CSV: if there's a corresponding entry in df2, add its values to that row, and if there isn't, leave the extra columns for that row blank or null. Commented Jul 27, 2020 at 14:30
  • Then 'left' is indeed the better choice compared to 'outer'. Commented Jul 27, 2020 at 16:26

1 Answer

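If you don't specify how to merge, pandas performs an inner join by default, which keeps only the rows whose keys appear in both dataframes. What you actually want is a left join, which keeps every row of df1. Pass how='left':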

df3 = pd.merge(df1, df2, on = ['Date', 'KeywordId', 'AdGroupId'], how = 'left')
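
To see the difference on something small, here is a minimal sketch with toy data. The Clicks and Revenue columns and all the values are made up for illustration; only the three key column names come from the question. The indicator=True argument adds a _merge column showing which rows found a match in df2.

```python
import pandas as pd

keys = ["Date", "KeywordId", "AdGroupId"]

# Toy stand-ins for the two CSV files (hypothetical values and extra columns)
df1 = pd.DataFrame({
    "Date": ["2020-07-01", "2020-07-01", "2020-07-02"],
    "KeywordId": [1, 2, 1],
    "AdGroupId": [10, 10, 10],
    "Clicks": [5, 7, 3],
})
df2 = pd.DataFrame({
    "Date": ["2020-07-01"],
    "KeywordId": [1],
    "AdGroupId": [10],
    "Revenue": [12.5],
})

# Default merge is an inner join: only keys present in both frames survive
inner = pd.merge(df1, df2, on=keys)

# Left join: every row of df1 is kept, unmatched rows get NaN in df2's columns
left = pd.merge(df1, df2, on=keys, how="left", indicator=True)

print(len(inner), len(left))
print(left)
```

Note that if df2 contains several rows with the same Date, KeywordId, and AdGroupId, a left join will repeat the matching df1 row, so the result can have more rows than df1. Passing validate="many_to_one" to pd.merge should raise an error in that case, which may be a useful check on the real files.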

3 Comments

I'll try this out and let you know if it works! Thanks.
Any idea on what I should include to have the rows sorted by the date column (descending)?
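
For the sorting asked about here, one possible approach is sort_values with ascending=False. This is only a sketch on a toy dataframe: it assumes the Date column holds strings that pd.to_datetime can parse, which the thread does not confirm, and the sample values are invented.

```python
import pandas as pd

# Toy stand-in for the merged dataframe (hypothetical values)
df3 = pd.DataFrame({
    "Date": ["2020-07-01", "2020-07-03", "2020-07-02"],
    "KeywordId": [1, 2, 3],
})

# Parse the dates so the sort is chronological rather than alphabetical
df3["Date"] = pd.to_datetime(df3["Date"])

# Newest rows first
df3 = df3.sort_values("Date", ascending=False)

print(df3)
```

If Date has already been made the index with set_index, df3.sort_index(ascending=False) does the equivalent on the index.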
