Compare columns of a CSV file using Python

Question

I have a CSV file as such

ID  OLD_A  NEW_A  OLD_B     NEW_B     OLD_C  NEW_C
1   0      0      1/1/2017  1/1/2017  ABC    BCD
2   0      0      1/1/2017  2/1/2017  ABC    ABC
3   1      2      1/1/2017  1/1/2017  ABC    BCD

I want to compare the old and new columns for A,B and C and in case there's a difference in the value of OLD and NEW (there's 10k+ rows) , I want return an output as such (from above example):

ID Field_Changed OLD_Value   NEW_Value
1  C             ABC         BCD
2  B             1/1/2017    2/1/2017
3  A             1           2
3  C             ABC         BCD

So far I've used the .loc method of pandas.DataFrame which returns the indexed location of the rows that qualify the boolean indexing, but I need the values as well...

df.loc[(df['OLD_A'] != df['NEW_A'])].index)

I'm new to python scripting and cannot seem to figure the logic out. Can someone help please?

jezrael · Accepted Answer · 2017-02-01 07:35:33Z

2

set_index and create MultiIndex by split column names

stack and rename column

filter with boolean indexing

df = df.set_index('ID')

df.columns = df.columns.str.split('_', expand=True)
df1 = df.stack()[['OLD','NEW']].reset_index().rename(columns={'level_1':'Field_Changed'})
print (df1)
   ID Field_Changed       OLD       NEW
0   1             A         0         0
1   1             B  1/1/2017  1/1/2017
2   1             C       ABC       BCD
3   2             A         0         0
4   2             B  1/1/2017  2/1/2017
5   2             C       ABC       ABC
6   3             A         1         2
7   3             B  1/1/2017  1/1/2017
8   3             C       ABC       BCD

print (df1.columns)
Index(['ID', 'Field_Changed', 'OLD', 'NEW'], dtype='object')

print (df1.index)
RangeIndex(start=0, stop=9, step=1)

print (df1['OLD'] != df1['NEW'])
0    False
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
dtype: bool

df2 = df1[df1['OLD'] != df1['NEW']]
print (df2)
   ID Field_Changed       OLD       NEW
2   1             C       ABC       BCD
4   2             B  1/1/2017  2/1/2017
6   3             A         1         2
8   3             C       ABC       BCD

edited Feb 1, 2017 at 7:35

answered Feb 1, 2017 at 7:02

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

11 Comments

ThatRiddimGuy Over a year ago

Thanks , but I seem to get this error ValueError: cannot join with no level specified and no overlapping names

jezrael Over a year ago

Hmm, df.columns = df.columns.str.split('_', expand=True) this return error? What is print (df.columns) after this code?

ThatRiddimGuy Over a year ago

That line is fine, its this line which returns an error df2 = df1[df1['OLD'] != df1['NEW']]

jezrael Over a year ago

are you sure? Because it seems error is in df1 = df.stack(). Can you test it?

ThatRiddimGuy Over a year ago

yes I'm sure, df2 = df1[df1['OLD'] != df1['NEW']] is where the error seems to pop up. Sorry about this, really appreciate your help. Is there anything else I can do?

|

Collectives™ on Stack Overflow

Compare columns of a CSV file using Python

1 Answer 1

11 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

11 Comments

Your Answer

Sign up or log in

Post as a guest

Related