
I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.

I want to compare the data consistency between SQL Server and Postgres databases after the data migration.

All I can think of right now is using Python pandas: load the tables from both SQL Server and Postgres into data frames and compare the data frames.

But the data is around 6 GB, so loading the tables into data frames takes a long time, and the databases are hosted on a server that is not local to where I'm running the Python script. Is there any way to efficiently compare data consistency across SQL Server and Postgres?
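Roughly what I have in mind so far (connection strings, the table name, and the primary key column below are made up, and the chunk boundaries only line up if both sides return the same rows in the same order):

import pandas as pd
from sqlalchemy import create_engine

mssql = create_engine("mssql+pyodbc://user:pass@mssql_dsn")        # placeholder
pg = create_engine("postgresql+psycopg2://user:pass@pg_host/db")   # placeholder

query = "SELECT * FROM my_table ORDER BY id"   # assumed table and key names

# stream both tables in fixed-size, PK-ordered chunks instead of 6 GB at once
for i, (a, b) in enumerate(zip(pd.read_sql(query, mssql, chunksize=50_000),
                               pd.read_sql(query, pg, chunksize=50_000))):
    if not a.reset_index(drop=True).equals(b.reset_index(drop=True)):
        print(f"chunk {i} differs")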

  • You don't need Pandas to copy data; use SQL Server's SSIS to copy data from one database to another. It supports a lot of data sources. At the very least you can work with Postgres using the ODBC or OLEDB drivers. Commented Nov 4, 2019 at 11:36
  • SQL Server can also connect to other databases as linked servers. Comparing data, though, has to be done in one of the two databases. Even with a JOIN, the remote server will filter its data and send it to the local server for comparison. Commented Nov 4, 2019 at 11:37
  • BTW what do you mean data consistency? It's not just "identical data", especially when the data types don't match, or the source or target database is missing features. Never mind Unicode and collations. Commented Nov 4, 2019 at 11:41
  • A quick&dirty way would be to bulk export the data from both databases to CSV files per table with the same format and compare them. Simple size and checksum checks are enough to identify tables with differences. After that you can proceed to line-by-line comparisons using shell tools. Commented Nov 4, 2019 at 11:43
  • Another quick check is to calculate row counts and key statistics (at least MIN and MAX) of the PKs for all tables in both databases. If you find tables with different counts, you know where to look for differences. Commented Nov 4, 2019 at 12:04
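A minimal sketch of that last quick check from Python, assuming both servers are reachable from the script; the connection strings and the table/primary-key mapping are placeholders:

import pyodbc
import psycopg2

TABLES = {"my_table": "id"}   # placeholder: {table: primary key column}

mssql = pyodbc.connect("DSN=mssql_dsn;UID=user;PWD=pass")                   # placeholder
pg = psycopg2.connect("host=pg_host dbname=db user=user password=pass")     # placeholder

def stats(conn, table, pk):
    # row count plus MIN/MAX of the primary key for one table
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*), MIN({pk}), MAX({pk}) FROM {table}")
    return tuple(cur.fetchone())

for table, pk in TABLES.items():
    s_mssql, s_pg = stats(mssql, table, pk), stats(pg, table, pk)
    if s_mssql != s_pg:
        print(table, "differs:", s_mssql, "vs", s_pg)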

2 Answers


Yes, you can order the data by primary key and then write the data to a JSON or XML file.

Then you can run diff over the two files.

You can also run this chunked by primary key; that way you don't have to work with a huge file.

Log any diff that doesn't show as equal. If it doesn't matter what the difference is, you could also just run MD5/SHA-1 on the two file chunks: if the hashes match, there is no difference; if they don't, there is.
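Since the question is Python-based, here is a rough sketch of that chunk-and-hash idea (table, key column, connections, and chunk size are assumptions; note that driver-level type or formatting differences can make logically equal data hash differently, so a mismatch only tells you where to look closer):

import hashlib
import pandas as pd

def chunk_hash(engine, table, pk, lo, hi):
    # serialize one PK range the same way on both sides and hash it
    sql = f"SELECT * FROM {table} WHERE {pk} >= {lo} AND {pk} < {hi} ORDER BY {pk}"
    df = pd.read_sql(sql, engine)
    return hashlib.md5(df.to_json(orient="records").encode()).hexdigest()

def compare_table(engine_a, engine_b, table, pk, max_id, step=100_000):
    for lo in range(0, max_id + 1, step):
        if chunk_hash(engine_a, table, pk, lo, lo + step) != \
           chunk_hash(engine_b, table, pk, lo, lo + step):
            print(f"{table}: rows with {pk} in [{lo}, {lo + step}) differ")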

Speaking from experience with NHibernate, what you need to watch out for are:

  • bit fields
  • text, ntext, varchar(MAX), nvarchar(MAX) fields
    (they map to varchar with no length, by the way - encoding UTF8)
  • varbinary, varbinary(MAX), image (bytea vs. LOB)
  • xml
  • that every primary key's serial/sequence generator is reset after you have inserted all the data into PostgreSQL (see the sketch below).

Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
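For the sequence-reset point in the list above, this is roughly what it looks like on the Postgres side, driven from Python; the table and column names are placeholders:

import psycopg2

conn = psycopg2.connect("host=pg_host dbname=db user=user password=pass")  # placeholder
with conn, conn.cursor() as cur:
    # pg_get_serial_sequence finds the sequence behind a serial column;
    # setval moves it past the highest id that was migrated
    cur.execute("""
        SELECT setval(pg_get_serial_sequence('my_table', 'id'),
                      COALESCE((SELECT MAX(id) FROM my_table), 1))
    """)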

Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:

using System.Collections.Generic;
using System.Data;
using System.Linq;

static void Main(string[] args)
{
    // LoadTable1/LoadTable2 stand in for whatever fills the DataTables
    // from SQL Server and Postgres (e.g. via ADO.NET data adapters)
    DataTable dt1 = LoadTable1();
    DataTable dt2 = LoadTable2();
    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();

    // DataRowComparer.Default compares rows column by column;
    // without it, Except would only compare object references
    IEnumerable<DataRow> results = idr1.Except(idr2, DataRowComparer.Default);
}

Then you write all non-matching DataRows to a log file, one directory per table (only where there are differences).

I don't know what Python offers in place of System.Data.DataRowComparer out of the box, but a rough pandas equivalent is sketched below.
Since this would be a one-time task, you could also opt not to do it in Python and use C# instead (see the code sample above).
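That rough pandas sketch (the helper name is made up; it is an anti-join over all shared columns, not a drop-in replacement for DataRowComparer):

import pandas as pd

def rows_only_in_first(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    # keep rows of df1 that have no identical row in df2
    merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# e.g. diff = rows_only_in_first(sqlserver_df, postgres_df)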

Also, if you have very large tables, you could use a DataReader with sequential access to do the comparison. But if the simpler approach above is good enough, it reduces the required work considerably.


6 Comments

A CSV file would probably be better than either JSON or XML, unless each JSON/XML element is serialized as a single row fragment. There's no need for fidelity, just something that quickly detects differences, so there's no need to parse the line. Moved block detection in a diff tool can also handle missing rows (up to a point). Finally, row-per-line allows easy parallelization for very big tables, by splitting the table across lines
@Panagiotis Kanavos: CSV doesn't guarantee all rows are on a single line. If a text-field contains a line-break, then you shift everything by 1 line.
JSON and XML will have trouble with that too, unless the data is encoded, which will slow things down. This can be solved by using an unusual character as field and row separator, e.g. ¦, ¤ or ¶
@Panagiotis Kanavos: In general, the file format doesn't matter. What matters is that your file-format reader can run foreach table, foreach row, foreach column: table1.row[column] == table2.row[column] efficiently and RELIABLY. Actually, it might be fastest and most reliable to do it in raw ADO.NET directly against both servers.
And with minimal parsing. Which is why I mentioned XML/JSON fragments instead of well-formed documents. Instead of parsing an entire million-row JSON/XML document, the diff tool only needs to parse a single line from each source

Have you considered making your SQL Server data visible within your Postgres database with a Foreign Data Wrapper (FDW)?

https://github.com/tds-fdw/tds_fdw

I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively, etc. Depending on your setup, you would have to check if performance is adequate.
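I haven't verified this, but based on the tds_fdw README the setup is roughly along these lines (option names as I recall them; server, credentials, and the foreign table definition are made up), driven from Python here only because that is what the question uses. Once the foreign table exists, a plain EXCEPT shows rows present on the SQL Server side but missing or different after migration:

import psycopg2

conn = psycopg2.connect("host=pg_host dbname=db user=user password=pass")  # placeholder
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS tds_fdw")
    cur.execute("""
        CREATE SERVER mssql_svr FOREIGN DATA WRAPPER tds_fdw
            OPTIONS (servername 'mssql_host', port '1433', database 'SourceDb')
    """)
    cur.execute("""
        CREATE USER MAPPING FOR CURRENT_USER SERVER mssql_svr
            OPTIONS (username 'user', password 'pass')
    """)
    cur.execute("""
        CREATE FOREIGN TABLE my_table_remote (id int, name text)
            SERVER mssql_svr OPTIONS (table_name 'dbo.my_table')
    """)
    # rows in the SQL Server source that are missing (or different) in Postgres
    cur.execute("SELECT * FROM my_table_remote EXCEPT SELECT * FROM my_table")
    for row in cur.fetchall():
        print(row)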

Please report back!
