
I have two large CSV files to compare. Both CSVs are basically data from the same system, one day apart. Each has around 12k rows and 30 columns.

The aim is to identify which column data has changed for each primary key (ID).

My idea was to loop through the CSVs to identify which rows have changed and dump these into a separate CSV. Once done, I loop through the changed rows again and identify the exact change in each column.

    $NewCSV = Import-Csv -Path ".\Data_A.csv"
    $OldCSV = Import-Csv -Path ".\Data_B.csv"

    foreach ($LineNew in $NewCSV)
    {
        foreach ($LineOld in $OldCSV)
        {
            if ($LineNew -eq $LineOld)
            {
                Write-Host $LineNew, " Match"
            }
            else
            {
                Write-Host $LineNew, " Not Match"
            }
        }
    }

But as soon as I run the loop, it takes forever for 12k rows. I was hoping there must be a more efficient way to compare large files in PowerShell, something quicker.

5 Comments
  • You only care about "what has changed" on $NewCsv from $OldCsv, or a side-by-side comparison? Also, does ID have unique values in both CSVs? Commented Dec 17, 2021 at 13:32
  • I want to know both the old value and new value for changed columns for a particular ID Commented Dec 17, 2021 at 13:34
  • What about Compare-Object (Get-Content Data_A.csv) (Get-Content Data_B.csv)? Commented Dec 17, 2021 at 13:39
  • Why it takes forever: comparing each row in the old CSV against each row of the new CSV makes 12k × 12k comparisons, so about 144 million operations. Commented Dec 17, 2021 at 13:43
  • This is quite a classic issue. Using this Join-Object script/Join-Object Module (see also: In PowerShell, what's the best way to join two tables into one?): Import-Csv .\Data_A.csv |Join (Import-Csv .\Data_B.csv) -on ID -Name 'A.', 'B.' Commented Dec 17, 2021 at 16:16
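As an aside on the Compare-Object suggestion above: comparing raw text lines with Get-Content cannot tell you *which* column changed, but the built-in Compare-Object cmdlet can also diff imported objects property-by-property. A minimal sketch, assuming both files share the same headers and an ID column as in the question:

```powershell
# Import both files as objects rather than raw text lines.
$new = Import-Csv -Path ".\Data_A.csv"
$old = Import-Csv -Path ".\Data_B.csv"

# Compare on every column; take the column list from the first row.
$props = $new[0].PSObject.Properties.Name

# SideIndicator marks where each differing row came from:
# "<=" = only in the reference (old) set, "=>" = only in the difference (new) set.
Compare-Object -ReferenceObject $old -DifferenceObject $new -Property $props |
    Sort-Object ID
```

This only surfaces the changed rows; pinpointing the exact changed column still needs a per-property pass like the answers below show.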

2 Answers


Well, you can give this a try. I'm not claiming it will be fast, for the reasons vonPryz has already pointed out, but it should give you a good side-by-side perspective on what has changed from OldCsv to NewCsv.

Note: cells that have the same value in both CSVs will be ignored.

$NewCSV = Import-Csv -Path ".\Data_A.csv"
# Index the old rows by ID for constant-time lookup; assumes IDs are unique
# (duplicate IDs would make each hashtable value an array instead of a single row)
$OldCSV = Import-Csv -Path ".\Data_B.csv" | Group-Object ID -AsHashTable -AsString

# Column names to compare, taken from the first row of the new CSV
$properties = $NewCSV[0].PSObject.Properties.Name

$result = foreach($line in $NewCSV)
{
    if($ref = $OldCSV[$line.ID])
    {
        foreach($prop in $properties)
        {
            if($line.$prop -ne $ref.$prop)
            {
                [pscustomobject]@{
                    ID = $line.ID
                    Property = $prop
                    OldValue = $ref.$prop
                    NewValue = $line.$prop
                }
            }
        }
        continue
    }

    Write-Warning "ID $($line.ID) could not be found on Old Csv!!"
}
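A small follow-up, not part of the original answer: since $result is a collection of [pscustomobject] rows, it can be written straight back out with the built-in Export-Csv if you want to keep the diff:

```powershell
# Persist the per-column diff for later review (file name is just an example).
$result | Export-Csv -Path ".\Changes.csv" -NoTypeInformation
```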

5 Comments

This solution is much faster than the one I was trying myself. One issue, though: it randomly reads some values as System.Object[] and hence sometimes shows a data mismatch.
@misguided Without looking at the actual CSVs, I don't see how that could be possible.
The only way I could see it returning Object[] is if OldCsv or NewCsv did not have unique values in the ID column. You already mentioned the values were unique; if that was not the case, it should have been clarified.
Awesome... that was the cause. I went back, dug a bit deeper into the data, and saw a few duplicates which, as you say, were causing the issue. I have removed them now and it works fine. Less than a minute to compare two CSV files with 12k rows each.
@misguided good to know, glad it worked :)

As vonPryz hints in the comments, you've written an algorithm with quadratic time complexity (O(n²) in Big-O notation): every time the input size doubles, the number of computations performed increases 4-fold.

To avoid this, I'd suggest using a hashtable or another dictionary type to hold each data set, with the primary key from the input as the dictionary key. This way you get constant-time lookup of corresponding records, and the time complexity of your algorithm becomes linear (O(n)): two passes to build the tables, then one pass with O(1) lookups.

$NewCSV = @{}
Import-Csv -Path ".\Data_A.csv" |ForEach-Object {
  $NewCSV[$_.ID] = $_
}

$OldCSV = @{}
Import-Csv -Path ".\Data_B.csv" |ForEach-Object {
  $OldCSV[$_.ID] = $_
}

Now that we can efficiently resolve each row by its ID, we can inspect both data sets with an independent loop over each:

foreach($entry in $NewCSV.GetEnumerator()){
  if(-not $OldCSV.ContainsKey($entry.Key)){
    # $entry.Value is a new row, not seen in the old data set
    continue
  }

  $newRow = $entry.Value
  $oldRow = $OldCSV[$entry.Key]

  # do the individual comparison of the rows here
}

Run a similar loop with the roles of $NewCSV and $OldCSV swapped to detect deletions (rows present in the old data set but missing from the new one).
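Putting the pieces above together, a minimal sketch of the per-column comparison body and the deletion pass might look like the following (only the ID column name is taken from the question; everything else comes from the CSV headers):

```powershell
# Column names, taken from any one row (all rows share the same properties).
$columns = @($NewCSV.Values)[0].PSObject.Properties.Name

foreach ($entry in $NewCSV.GetEnumerator()) {
  if (-not $OldCSV.ContainsKey($entry.Key)) {
    Write-Host "Added row: $($entry.Key)"
    continue
  }

  $newRow = $entry.Value
  $oldRow = $OldCSV[$entry.Key]

  # Report each column whose value differs between the two snapshots.
  foreach ($col in $columns) {
    if ($newRow.$col -ne $oldRow.$col) {
      Write-Host "ID $($entry.Key): '$col' changed from '$($oldRow.$col)' to '$($newRow.$col)'"
    }
  }
}

# Deletion pass: anything keyed in the old set but not the new one was removed.
foreach ($entry in $OldCSV.GetEnumerator()) {
  if (-not $NewCSV.ContainsKey($entry.Key)) {
    Write-Host "Deleted row: $($entry.Key)"
  }
}
```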

6 Comments

You cannot call a method on a null-valued expression. At C:\Data\FIle.PS1:39 char:19 + foreach($entry in $NewCSV.GetEnumerator()){ + ~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (:) [], RuntimeException + FullyQualifiedErrorId : InvokeMethodOnNull
@misguided Did you run the second block of code without running the first block of code in my answer? :) First line, $NewCSV = @{}, assigns a hashtable to $NewCSV, so if you run them in order, that error won't occur
I did, actually, but I'm still getting the error. The data set has some null values in some columns (not the primary key). Could that be an issue?
No, Import-Csv would produce empty strings rather than null values. Also, the $NewCSV variable still exists regardless of whether the CSV is empty or not.
What does ForEach-Object {$NewCSV[$_.ID] = $_} do? A normal Import-Csv loads the data fine, but when I pipe it into this code, the array is blank. I am assuming ID is the "primary key" column name?