Compare two csv files with python pandas

Question

I have two csv files both consist of two columns.

The first one has the product id, and the second has the serial number.

I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.

How would you approach this?

import pandas as pd
    A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it.
    B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here
    print(A-B) #set A - set B gives back everything thats only in A.
    print(B-A) # same here, other way around.

Can you add some sample data and desired output? Because it is a bit unclear what need exactly. — jezrael
– jezrael, Commented Feb 23, 2017 at 14:41

jezrael · Accepted Answer · 2017-02-23 14:32:07Z

9

I think you need merge:

A = pd.DataFrame({'product id':   [1455,5452,3775],
                    'serial number':[44,55,66]})

print (A)

B = pd.DataFrame({'product id':   [7000,2000,1000],
                    'serial number':[44,55,77]})

print (B)

print (pd.merge(A, B, on='serial number'))
   product id_x  serial number  product id_y
0          1455             44          7000
1          5452             55          2000

answered Feb 23, 2017 at 14:32

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

user7609771 Over a year ago

only a small modification needed, how could one give the two filenames as input, in the above snippet, instead of hard coding the values?

greenbug Over a year ago

@user7609711 just use a = pd.read_csv("path-to-file") or a = pd.read_excel("path-to-file") if you have an xlsx file. Though you will need openpyxl to open excel files.

MaxU - stand with Ukraine · Accepted Answer · 2017-02-23 14:34:03Z

4

Try this:

A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
# A - B
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'")
# B - A
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")

answered Feb 23, 2017 at 14:34

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

Comments

Shijo · Accepted Answer · 2017-02-23 14:55:00Z

You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference

ds1 = set([ tuple(values) for values in df1.values.tolist()])
ds2 = set([ tuple(values) for values in df2.values.tolist()])

ds1.symmetric_difference(ds2)
print df1 ,'\n\n'
print df2,'\n\n'

print pd.DataFrame(list(ds1.difference(ds2))),'\n\n'
print pd.DataFrame(list(ds2.difference(ds1))),'\n\n'

df1

id  Name  score isEnrolled               Comment
0  111  Jack   2.17       True  He was late to class
1  112  Nick   1.11      False             Graduated
2  113   Zoe   4.12       True                   NaN

df2

    id  Name  score isEnrolled               Comment
0  111  Jack   2.17       True  He was late to class
1  112  Nick   1.21      False             Graduated
2  113   Zoe   4.12      False           On vacation

Output

     0     1     2      3          4
0  113   Zoe  4.12   True        NaN
1  112  Nick  1.11  False  Graduated 


     0     1     2      3            4
0  113   Zoe  4.12  False  On vacation
1  112  Nick  1.21  False    Graduated

Collectives™ on Stack Overflow

Compare two csv files with python pandas

3 Answers 3

2 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related