I am exporting HDFS query output to a CSV file using the INSERT OVERWRITE LOCAL DIRECTORY command, which writes the data without a header. I also have a DataFrame built from Oracle query output, which does have a header, and I need to compare it against the HDFS output.

import pandas as pd

df1 = pd.read_csv('/home/User/hdfs_result.csv', header=None)
print(df1)

      0  1                    2
0  XPRN  A  2019-12-16 00:00:00
1  XPRW  I  2019-12-16 00:00:00
2  XPS2  I  2003-09-30 00:00:00


df = pd.read_sql(sqlquery, sqlconn)


   UNIT STATUS                 Date
0  XPRN      A  2019-12-16 00:00:00
1  XPRW      A  2019-12-16 00:00:00
2  XPS2      I  2003-09-30 00:00:00

Since df1 has no header, I can't use merge or join to compare the data, though I can do df - df1.

Please suggest how I can compare the two and print the differences.

  • What is your expected output? Commented Jun 4, 2020 at 21:15

1 Answer

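You can pass the underlying NumPy array for comparison, which ignores df1's missing column labels and compares the values position by position (here df2 is the Oracle DataFrame, called df in the question):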

df2.where(df2==df1.values)

Output (differences are masked as NaN):

   UNIT STATUS                 Date
0  XPRN      A  2019-12-16 00:00:00
1  XPRW    NaN  2019-12-16 00:00:00
2  XPS2      I  2003-09-30 00:00:00
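
To try the masking step without an HDFS export or an Oracle connection, here is a minimal sketch that rebuilds both frames from the sample rows in the question. It assumes both frames have the same shape and row order, and that the Date column holds strings on both sides; if pd.read_sql returns real datetimes for Date, that column would probably need converting first.

```python
import pandas as pd

# Sample rows copied from the question; df1 mimics the headerless HDFS CSV.
df1 = pd.DataFrame([
    ["XPRN", "A", "2019-12-16 00:00:00"],
    ["XPRW", "I", "2019-12-16 00:00:00"],
    ["XPS2", "I", "2003-09-30 00:00:00"],
])

# df2 stands in for the Oracle result (an assumption: built by hand here,
# not fetched with pd.read_sql).
df2 = pd.DataFrame({
    "UNIT": ["XPRN", "XPRW", "XPS2"],
    "STATUS": ["A", "A", "I"],
    "Date": ["2019-12-16 00:00:00", "2019-12-16 00:00:00", "2003-09-30 00:00:00"],
})

# A positional comparison is only meaningful when the shapes agree.
assert df1.shape == df2.shape

# df1.values is a plain NumPy array, so column labels play no part:
# cell (i, j) of df2 is compared with cell (i, j) of df1.
matches = df2 == df1.values

# Keep matching cells and replace the mismatching ones with NaN.
masked = df2.where(matches)
print(masked)
```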

For the non-matching rows:

df2[(df2 != df1.values).any(axis=1)]
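
As another approach, not part of the original answer, df1 can borrow the column names of the Oracle frame, after which merge becomes usable. The sketch below assumes the columns appear in the same order in both sources, and the names df1_named, merged and only_one_side are purely illustrative. It shows the positional filter first, then an outer merge with indicator=True, which lists rows that exist on only one side and does not depend on row order.

```python
import pandas as pd

# Sample rows copied from the question; df1 has no header.
df1 = pd.DataFrame([
    ["XPRN", "A", "2019-12-16 00:00:00"],
    ["XPRW", "I", "2019-12-16 00:00:00"],
    ["XPS2", "I", "2003-09-30 00:00:00"],
])
df2 = pd.DataFrame({
    "UNIT": ["XPRN", "XPRW", "XPS2"],
    "STATUS": ["A", "A", "I"],
    "Date": ["2019-12-16 00:00:00", "2019-12-16 00:00:00", "2003-09-30 00:00:00"],
})

# Positional filter: rows of df2 where at least one cell differs from df1.
diff_rows = df2[(df2 != df1.values).any(axis=1)]
print(diff_rows)

# Alternative: label df1 with df2's column names (assumes identical column order).
df1_named = df1.copy()
df1_named.columns = df2.columns

# Outer merge on all shared columns; the _merge column records where each row came from.
merged = df2.merge(df1_named, how="outer", indicator=True)

# left_only = present only in the Oracle frame, right_only = only in the HDFS frame.
only_one_side = merged[merged["_merge"] != "both"]
print(only_one_side)
```

The merge-based variant reports both versions of a differing row, which can make it easier to see what changed on each side.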

1 Comment

Thanks for the answer. I need to find the non-matching rows. I can do it using cmp = (df1 - df2) followed by diff = cmp.drop_duplicates(keep=False). Please let me know if there is any other approach I can follow.
