
I'm a software guy, but only in my second week of working with PowerShell.

We have a set of 12 fixed-width format files containing lists of people (records can possibly be duplicated). These files are about 800MB each with a total combined row count of about 14 million. Looking at the first file, it contains 1,201,940 rows.

Additionally, we have a SQL table that should contain all that data (distinct records). I've been tasked to use PowerShell to ensure the data is fully loaded by comparing a few select fields in the source files against the SQL table, and then writing any missing records to a CSV log.

Let's assume my fields of interest are ID, FirstName, LastName, and for all situations I am limiting my objects/queries to only consider those fields.

What would be the ideal methodology in PowerShell to compare the data? Do you bundle the data out to SQL, make it do the work, and then retrieve the results, or bring all the data into PowerShell and work on it there?

I've thought of the following ideas, but have not tested them:

  1. Create a SQL table variable (@fileInfo). Create a DataTable from the file ($dtFile). Using $dtFile, for every X number of rows, load @fileInfo. Perform a LEFT JOIN between @fileInfo and the SQL table and shove the results into a DataTable ($dtResults). Write $dtResults to the log. Empty the contents of @fileInfo to prepare for the next iteration of the loop. This seems like my best idea (a rough sketch of the join step appears after this list).
  2. Create a DataTable from the file ($dtFile). Using $dtFile, for every X number of rows, construct a SQL select statement with a terrible-looking WHERE clause that limits the rows the database returns. Shove that into another DataTable ($dtSQL). Compare the two and log any entries in $dtFile that don't appear in $dtSQL. Looks gross, but works.
  3. Load all 1.2M records from the file into a DataTable. Bulk insert them into a SQL temporary table, LEFT JOIN against the SQL table, retrieve the results, and write them to the log. I assume I would get bogged down by shoving a bunch of data over the network.
  4. Load all records from the SQL table into a DataTable, load all records from the file into a second DataTable, compare the two in PowerShell, and write the results to the log. I assume that I would run out of memory...?
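
For reference, the join step in idea 1 might look something like the T-SQL below; the destination table dbo.People and the column types are placeholders, and each batch of rows from $dtFile would be inserted into @fileInfo first.

DECLARE @fileInfo TABLE (ID int, FirstName varchar(50), LastName varchar(50));

-- ...insert the current batch of rows from $dtFile into @fileInfo here...

-- Batch rows with no exact match in the destination table
SELECT f.ID, f.FirstName, f.LastName
FROM @fileInfo AS f
LEFT JOIN dbo.People AS p
    ON  p.ID = f.ID
    AND p.FirstName = f.FirstName
    AND p.LastName = f.LastName
WHERE p.ID IS NULL;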

I would create scripts for each solution and do a test myself, but I'm under a time crunch and don't have the luxury. Isn't that always the situation?

Edit: I posted a solution that worked for me below.

  • Do you have access to another database on the same server or another accessible server? 1.2M rows isn't really that big. You might just do a standard bcp bulk insert of the whole CSV file; max 3 hours on a decent server. Follow the bulk insert guidelines at technet.microsoft.com/en-us/library/… (i.e. table lock, index management, and log management!). Commented Jul 14, 2016 at 22:32
  • stackoverflow.com/questions/2479434/… Commented Jul 14, 2016 at 22:38
  • Could you show a sample/snippet of the fixed-width input file? Commented Jul 14, 2016 at 22:43
  • @Gareth - I can spin up a database, but the source data isn't in CSV, it's fixed width. However, that brings up an interesting idea - I can use a StreamWriter to put my fields of interest into a CSV, then bulk insert into SQL! Really liking that idea! @Mathias R. Jessen, unfortunately the contents of the file are sensitive. Commented Jul 14, 2016 at 22:57

2 Answers


I would offload the comparison entirely onto the database engine:

  1. Bulk load the data into SQL Server, into a new table fileTable, using something like Import-CsvToSql (or bcp); a rough PowerShell sketch is shown below
  2. Compare fileTable to your originalTable using UNION ALL (see below)
  3. Log the results (i.e. the discrepancies) to a file.

Depending on the underlying storage, you may want to copy the original table to a database where you can switch the recovery model to SIMPLE or BULK_LOGGED before importing the dataset from the files.
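
A minimal PowerShell sketch of the bulk load in step 1, assuming $dtFile is a DataTable already filled from the fixed-width files; the connection string and staging table name are placeholders:

# Placeholder connection string - adjust server, database and authentication
$connectionString = "Server=myServer;Database=myDb;Integrated Security=True"

# Bulk copy the in-memory DataTable into the staging table (fileTable)
$bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
$bulkCopy.DestinationTableName = "dbo.fileTable"
$bulkCopy.BatchSize = 50000
$bulkCopy.WriteToServer($dtFile)
$bulkCopy.Close()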


The UNION ALL-based comparison procedure would look something like this:

-- Keep rows that appear in exactly one of the two sources;
-- MIN(TableName) shows which source the orphan row came from.
SELECT MIN(TableName) as TableName, ID, FirstName, LastName
FROM
(
  SELECT 'Database' as TableName, originalTable.ID, originalTable.FirstName, originalTable.LastName
  FROM originalTable
  UNION ALL
  SELECT 'Files' as TableName, fileTable.ID, fileTable.FirstName, fileTable.LastName
  FROM fileTable
) tmp
GROUP BY ID, FirstName, LastName
HAVING COUNT(*) = 1
ORDER BY ID

1 Comment

You could also use the EXCEPT set operator to find any missing results.
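
For example, with the same table and column names as above, something along these lines returns the file rows that have no exact match in the database:

-- File rows missing from the original table (EXCEPT also de-duplicates)
SELECT ID, FirstName, LastName FROM fileTable
EXCEPT
SELECT ID, FirstName, LastName FROM originalTable;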

Sorry, all, for not replying in a timely manner, but I have a solution! There's most likely room for improvement, but the solution is reasonably quick. I didn't have access to run PowerShell directly against my database, so I'm making use of the SQL Import and Export Wizard at the end.

Summary of process:

  1. Create a CSV file of data points, which will be consumed as an object array in PowerShell.
  2. Find the files of interest and store their names in a string array (possibly not necessary, but it seemed speedier).
  3. Cycle through each file in your string array. For each file, open a .NET StreamReader and, for each line, parse it against your object array of data points to create substrings that you write to a single, consolidated, delimited output file. I recommend the pipe character (it looks like this: |, typically above Enter) because it typically isn't found in data, whereas a rogue comma or tab character may trip you up.
  4. Once the script is complete, use the import wizard in SQL to create a table from your output file.

Detail

  1. Create a CSV file that lists your data point names in column A, the starting positions in column B, and the widths in column C. It would look something like the sample below.
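
     A made-up example of datapoints.csv: the header names Name, Position, and ColumnLength are what the script in step 2 reads, the positions are zero-based (they are passed straight to .Substring()), and the widths shown are invented.

    Name,Position,ColumnLength
    ID,0,10
    FirstName,10,25
    LastName,35,25
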
  2. Import your data points into your script as an object array.

    $dataPoints = Import-Csv "c:\temp\datapoints.csv"
    $objDataCols = @()
    foreach($objCol in $dataPoints){
        # Build one object per data point: name, starting position, and column width
        $objColumn = New-Object psobject
        $objColumn | Add-Member -Type NoteProperty -Name Name -Value $objCol.Name
        $objColumn | Add-Member -Type NoteProperty -Name Position -Value ([int] $objCol.Position)
        $objColumn | Add-Member -Type NoteProperty -Name ColumnLength -Value ([int] $objCol.ColumnLength)
        $objDataCols += $objColumn
    }
    
  3. Find the files and assemble their names into an array (optional). I used a regular expression to filter for my files.

    $files = @()
    Get-ChildItem -Path $sourceFilePath | Where-Object { $_.FullName -match $regExpression } | ForEach-Object{
        $files += $_.FullName
    }
    
  4. Loop through each file and parse it into the output file. In production code you would want try/catch blocks, but I left them out of the example.

    $writer = New-Object System.IO.StreamWriter "c:\temp\outputFile.txt"
    ForEach($sourceFileName in $files){
        $reader = [System.IO.File]::OpenText($sourceFileName)
        while($reader.Peek() -gt -1){
            $line = $reader.ReadLine()
    
            # Write each data point in the line, pipe delimited
            for($i = 0; $i -lt $objDataCols.Length; $i++){
                # Write to a pipe-delimited file
                $writer.Write("{0}|", $line.Substring($objDataCols[$i].Position, $objDataCols[$i].ColumnLength))
            }
    
            # Write a new line, along with any additional reference columns not defined in the source file, such as adding in the source file name and line number
            $writer.WriteLine($sourceFileName)
        }    
        $reader.Close()         
        $reader.Dispose()
    }
    $writer.Close()
    $writer.Dispose()
    
  5. Import the pipe-delimited output file into SQL.
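
     If you would rather script this step than use the wizard, a plain T-SQL BULK INSERT along these lines can load the pipe-delimited file. The staging table dbo.fileStaging and its column types are placeholders; it needs one column per data point plus a final column for the source file name the script appends.

    -- Placeholder staging table matching the output file's layout
    CREATE TABLE dbo.fileStaging (
        ID varchar(10),
        FirstName varchar(25),
        LastName varchar(25),
        SourceFileName varchar(260)
    );

    -- The file path must be readable by the SQL Server service account;
    -- ROWTERMINATOR matches the StreamWriter's default \r\n line endings.
    BULK INSERT dbo.fileStaging
    FROM 'c:\temp\outputFile.txt'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\r\n', TABLOCK);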
