1

I have two csv files: test1

A   B 
1   a
2   b
3   c
4   d
5   e

test2

A   C   D   B
3   x   25  101
2   y   0.35    11
1   z   0.45    111
6   k   0.55    1101
7   l   0.65    1010

I want to merge them on = A but i need only the columns test1.A,B and test2.D, B in the final file. And since both these files have a same column name B, need to rename it while joining itself. The output file should be like this:

A   B   D       B1
1   a   0.45    1110
2   b   0.35    1010
3   c   25      1011
4   d       
5   e   0.55    
6       0.65    1000
7               1111

where B1 corresponds to column B in Table2. Both B columns in test1 and test2 are not same.

import pandas

csv1 = pandas.read_csv('test1.csv',dtype='unicode')
csv2 = pandas.read_csv('test2.csv',dtype='unicode')
merged = pandas.merge(csv1[list('AB')],csv2[list('DB')], on='A',how="outer")
merged.to_csv("outputtest.csv", index=False)

This is giving me error:

KeyError: "['B'] not in index"

1 Answer 1

1

You can drop column C in csv2 and then merge with parameter suffixes and last fillna by empty string:

merged = pd.merge(csv1,
                  csv2.drop('C', axis=1), 
                  on='A',
                  how="outer", 
                  suffixes=('','1')).fillna('')
print (merged)
     A  B     D    B1
0  1.0  a  0.45   111
1  2.0  b  0.35    11
2  3.0  c    25   101
3  4.0  d            
4  5.0  e            
5  6.0     0.55  1101
6  7.0     0.65  1010

If in csv is many columns, you can use subset - only columns which need and column for join - in this solution column A:

merged = pd.merge(csv1[['A','B']],
                  csv2[['A','D','B']], 
                  on='A',
                  how="outer", 
                  suffixes=('','1')).fillna('')
print (merged)
     A  B     D    B1
0  1.0  a  0.45   111
1  2.0  b  0.35    11
2  3.0  c    25   101
3  4.0  d            
4  5.0  e            
5  6.0     0.55  1101
6  7.0     0.65  1010

Or:

merged = pd.merge(csv1[list('AB')],
                  csv2[list('ADB')], 
                  on='A',
                  how="outer", 
                  suffixes=('','1')).fillna('')
print (merged)
     A  B     D    B1
0  1.0  a  0.45   111
1  2.0  b  0.35    11
2  3.0  c    25   101
3  4.0  d            
4  5.0  e            
5  6.0     0.55  1101
6  7.0     0.65  1010
Sign up to request clarification or add additional context in comments.

6 Comments

I think this error means you have maybe some space in column names - try check it by print csv1.columns.
Instead of suffix, if we need to rename the column name altogether say make it 'BiNumeral' instead of B, what would be the approach. Should i change the names while reading csv2 or is there a shorter way?
IIUC you can use df.rename(columns={'B':'BiNumeral'}, inplace=True)
yup. I did something like that. import pandas csv1 = pandas.read_csv('test1.csv',dtype='unicode') csv2 = pandas.read_csv('test2.csv',dtype='unicode') csv2 = csv2.rename(columns={'C':'Binomial'}) merged = pandas.merge(csv1,csv2[['A','Binomial','D']], on='A') merged.to_csv("outputtest.csv", index=False)
One last doubt, if i need the output to give only the rows of test1, which join do i need to perform. If i do an inner join it gives only the rows which have same id., i.e it gives me rows where A=1,2,3 but i want rows with A=1,2,3,4,5
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.