1

I have a CSV file as such

ID  OLD_A  NEW_A  OLD_B     NEW_B     OLD_C  NEW_C
1   0      0      1/1/2017  1/1/2017  ABC    BCD
2   0      0      1/1/2017  2/1/2017  ABC    ABC
3   1      2      1/1/2017  1/1/2017  ABC    BCD

I want to compare the old and new columns for A,B and C and in case there's a difference in the value of OLD and NEW (there's 10k+ rows) , I want return an output as such (from above example):

ID Field_Changed OLD_Value   NEW_Value
1  C             ABC         BCD
2  B             1/1/2017    2/1/2017
3  A             1           2
3  C             ABC         BCD

So far I've used the .loc method of pandas.DataFrame which returns the indexed location of the rows that qualify the boolean indexing, but I need the values as well...

df.loc[(df['OLD_A'] != df['NEW_A'])].index)

I'm new to python scripting and cannot seem to figure the logic out. Can someone help please?

1 Answer 1

2

set_index and create MultiIndex by split column names

stack and rename column

filter with boolean indexing

df = df.set_index('ID')

df.columns = df.columns.str.split('_', expand=True)
df1 = df.stack()[['OLD','NEW']].reset_index().rename(columns={'level_1':'Field_Changed'})
print (df1)
   ID Field_Changed       OLD       NEW
0   1             A         0         0
1   1             B  1/1/2017  1/1/2017
2   1             C       ABC       BCD
3   2             A         0         0
4   2             B  1/1/2017  2/1/2017
5   2             C       ABC       ABC
6   3             A         1         2
7   3             B  1/1/2017  1/1/2017
8   3             C       ABC       BCD

print (df1.columns)
Index(['ID', 'Field_Changed', 'OLD', 'NEW'], dtype='object')

print (df1.index)
RangeIndex(start=0, stop=9, step=1)
print (df1['OLD'] != df1['NEW'])
0    False
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
dtype: bool

df2 = df1[df1['OLD'] != df1['NEW']]
print (df2)
   ID Field_Changed       OLD       NEW
2   1             C       ABC       BCD
4   2             B  1/1/2017  2/1/2017
6   3             A         1         2
8   3             C       ABC       BCD
Sign up to request clarification or add additional context in comments.

11 Comments

Thanks , but I seem to get this error ValueError: cannot join with no level specified and no overlapping names
Hmm, df.columns = df.columns.str.split('_', expand=True) this return error? What is print (df.columns) after this code?
That line is fine, its this line which returns an error df2 = df1[df1['OLD'] != df1['NEW']]
are you sure? Because it seems error is in df1 = df.stack(). Can you test it?
yes I'm sure, df2 = df1[df1['OLD'] != df1['NEW']] is where the error seems to pop up. Sorry about this, really appreciate your help. Is there anything else I can do?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.