I'm a software guy, but only in my second week of learning PowerShell.
We have a set of 12 fixed-width format files containing lists of people (records can possibly be duplicated). These files are about 800MB each with a total combined row count of about 14 million. Looking at the first file, it contains 1,201,940 rows.
Additionally, we have a SQL table that should contain all that data (distinct records). I've been tasked to use PowerShell to ensure the data is fully loaded by comparing a few select fields in the source files against the SQL table, and then writing any missing records to a CSV log.
Let's assume my fields of interest are ID, FirstName, and LastName, and that in every situation I limit my objects/queries to only those fields.
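For reference, here is roughly how I picture reading just those three fields out of each fixed-width line. The file path and column offsets below are placeholders, not the real layout:

```powershell
# Placeholder layout -- substitute the real offsets/widths of the fixed-width format.
$layout = @{
    ID        = @{ Start = 0;  Length = 10 }
    FirstName = @{ Start = 10; Length = 25 }
    LastName  = @{ Start = 35; Length = 25 }
}

# Stream the file line by line so the whole 800 MB never sits in memory at once.
$reader = [System.IO.StreamReader]::new('C:\data\people_01.txt')
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        [pscustomobject]@{
            ID        = $line.Substring($layout.ID.Start,        $layout.ID.Length).Trim()
            FirstName = $line.Substring($layout.FirstName.Start, $layout.FirstName.Length).Trim()
            LastName  = $line.Substring($layout.LastName.Start,  $layout.LastName.Length).Trim()
        }
    }
}
finally {
    $reader.Dispose()
}
```

From there the objects can be accumulated into a DataTable or streamed straight out, depending on which of the ideas below wins.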
What would be the best approach in PowerShell for comparing the data? Do you push the data out to SQL Server, let it do the work, and then retrieve the results, or pull all the data into PowerShell and do the comparison there?
I've thought of the following ideas, but have not tested them:
- Create a SQL table variable (`@fileInfo`). Create a `DataTable` from the file (`$dtFile`). Using `$dtFile`, for every X number of rows, load `@fileInfo`. Perform a `LEFT JOIN` between `@fileInfo` and the SQL table and shove the results into a `DataTable` (`$dtResults`). Write `$dtResults` to the log. Empty the contents of `@fileInfo` to prepare for the next iteration of the loop. This seems like my best idea (see the sketch after this list).
- Create a `DataTable` from the file (`$dtFile`). Using `$dtFile`, for every X number of rows, construct a SQL `SELECT` statement with a terrible-looking `WHERE` clause that limits the rows the database returns. Shove that into another `DataTable` (`$dtSQL`). Compare the two and log any entries in `$dtFile` that don't appear in `$dtSQL`. Looks gross, but works.
- Load all 1.2M records from the file into a `DataTable`. Bulk insert them into a SQL temporary table, `LEFT JOIN` against the SQL table, retrieve the results, and write them to the log. I assume I would get bogged down shoving that much data over the network.
- Load all records from the SQL table into a `DataTable`, load all records from the file into a second `DataTable`, compare the two in PowerShell, and write the results to the log. I assume that I would run out of memory...?
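To make the first idea concrete, here is a rough sketch of what I have in mind, using a table-valued parameter to play the role of `@fileInfo`. Everything named here is an assumption on my part: the connection string, the `dbo.People` target table, the `dbo.PersonKeyType` table type (which would have to be created on the server first), the batch size, and the log path.

```powershell
# Assumes a user-defined table type already exists on the server, e.g.:
#   CREATE TYPE dbo.PersonKeyType AS TABLE (ID varchar(10), FirstName varchar(25), LastName varchar(25));
# All object names, the connection string, and the paths are placeholders.

$connectionString = 'Server=MyServer;Database=MyDb;Integrated Security=True'
$batchSize        = 50000
$logPath          = 'C:\logs\missing_records.csv'

# Rows returned by this query exist in the file chunk but not in the SQL table.
$query = @'
SELECT f.ID, f.FirstName, f.LastName
FROM @fileInfo AS f
LEFT JOIN dbo.People AS p
    ON  p.ID        = f.ID
    AND p.FirstName = f.FirstName
    AND p.LastName  = f.LastName
WHERE p.ID IS NULL;
'@

$conn = [System.Data.SqlClient.SqlConnection]::new($connectionString)
$conn.Open()
try {
    # $dtFile is the DataTable built from the fixed-width file (ID, FirstName, LastName columns).
    for ($offset = 0; $offset -lt $dtFile.Rows.Count; $offset += $batchSize) {

        # Copy the next slice of rows into a small DataTable to ship as the parameter.
        $chunk = $dtFile.Clone()
        $upper = [Math]::Min($offset + $batchSize, $dtFile.Rows.Count) - 1
        foreach ($i in $offset..$upper) {
            $chunk.ImportRow($dtFile.Rows[$i])
        }

        $cmd = $conn.CreateCommand()
        $cmd.CommandText = $query
        $param = $cmd.Parameters.AddWithValue('@fileInfo', $chunk)
        $param.SqlDbType = [System.Data.SqlDbType]::Structured
        $param.TypeName  = 'dbo.PersonKeyType'

        # Append this chunk's missing rows to the CSV log.
        $dtResults = [System.Data.DataTable]::new()
        $adapter   = [System.Data.SqlClient.SqlDataAdapter]::new($cmd)
        [void]$adapter.Fill($dtResults)
        $dtResults |
            Select-Object ID, FirstName, LastName |
            Export-Csv -Path $logPath -NoTypeInformation -Append
    }
}
finally {
    $conn.Dispose()
}
```

The appeal of batching it this way is that only the keys from the file ever cross the wire, and the join itself runs on the server where the indexes are.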
I would create scripts for each solution and do a test myself, but I'm under a time crunch and don't have the luxury. Isn't that always the situation?
Edit: I posted a solution that worked for me below.

Using a `StreamWriter` to put my fields of interest into a CSV, then bulk insert into SQL. Really liking that idea! @Mathias R. Jessen, unfortunately the contents of the file are sensitive.
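Roughly what I understood that `StreamWriter` suggestion to look like is sketched below. The paths and column offsets are placeholders, and it assumes the fields never contain commas:

```powershell
# Stream every fixed-width file once, writing only the three key fields to a CSV
# that a SQL BULK INSERT (or bcp) can then load into a staging table.
# Paths and column offsets are placeholders; fields are assumed to contain no commas.
$writer = [System.IO.StreamWriter]::new('C:\staging\people_keys.csv')
try {
    $writer.WriteLine('ID,FirstName,LastName')
    foreach ($file in Get-ChildItem -Path 'C:\data' -Filter '*.txt') {
        $reader = [System.IO.StreamReader]::new($file.FullName)
        try {
            while ($null -ne ($line = $reader.ReadLine())) {
                $id    = $line.Substring(0, 10).Trim()
                $first = $line.Substring(10, 25).Trim()
                $last  = $line.Substring(35, 25).Trim()
                $writer.WriteLine(('{0},{1},{2}' -f $id, $first, $last))
            }
        }
        finally {
            $reader.Dispose()
        }
    }
}
finally {
    $writer.Dispose()
}
```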