
I have two large CSV files to compare. Both CSVs are basically data from the same system, one day apart. Each has around 12k rows and 30 columns.

The aim is to identify which column data has changed for each primary key (ID).

My idea was to loop through the CSVs to identify which rows have changed and dump these into a separate CSV. Once done, I loop through the changed rows again and identify the exact change in each column.

    $NewCSV = Import-Csv -Path ".\Data_A.csv"
    $OldCSV = Import-Csv -Path ".\Data_B.csv"

    foreach ($LineNew in $NewCSV)
    {
        foreach ($LineOld in $OldCSV)
        {
            if ($LineNew -eq $LineOld)
            {
                Write-Host $LineNew, " Match"
            }
            else
            {
                Write-Host $LineNew, " Not Match"
            }
        }
    }

But as soon as I run the loop, it takes forever for 12k rows. I was hoping there must be a more efficient way to compare large files in PowerShell, something quicker.

5 Comments
  • You only care about "what has changed" on $NewCsv from $OldCsv, or a side-by-side comparison? Also, does ID have unique values in both CSVs? Commented Dec 17, 2021 at 13:32
  • I want to know both the old value and new value for changed columns for a particular ID Commented Dec 17, 2021 at 13:34
  • What about Compare-Object (Get-Content Data_A.csv) (Get-Content Data_B.csv)? Commented Dec 17, 2021 at 13:39
  • Why it takes forever: comparing each row in the old CSV against each row of the new CSV makes 12k × 12k comparisons, so about 144 million operations. Commented Dec 17, 2021 at 13:43
  • This is quite a classic issue. Using this Join-Object script/Join-Object Module (see also: In PowerShell, what's the best way to join two tables into one?): Import-Csv .\Data_A.csv |Join (Import-Csv .\Data_B.csv) -on ID -Name 'A.', 'B.' Commented Dec 17, 2021 at 16:16
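As an aside on the Compare-Object suggestion above: comparing raw text lines with Get-Content cannot tell you *which* column changed, but the built-in Compare-Object cmdlet can also diff imported objects property-by-property. A minimal sketch, assuming both files share the same headers and an ID column as in the question:

```powershell
# Import both files as objects rather than raw text lines.
$new = Import-Csv -Path ".\Data_A.csv"
$old = Import-Csv -Path ".\Data_B.csv"

# Compare on every column; take the column list from the first row.
$props = $new[0].PSObject.Properties.Name

# SideIndicator marks where each differing row came from:
# "<=" = only in the reference (old) set, "=>" = only in the difference (new) set.
Compare-Object -ReferenceObject $old -DifferenceObject $new -Property $props |
    Sort-Object ID
```

This only surfaces the changed rows; pinpointing the exact changed column still needs a per-property pass like the answers below show.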

2 Answers


Well, you can give this a try. I'm not claiming it will be fast, for the reasons vonPryz has already pointed out, but it should give you a good side-by-side perspective on what has changed from OldCsv to NewCsv.

Note: cells that have the same value in both CSVs will be ignored.

$NewCSV = Import-Csv -Path ".\Data_A.csv"
# Index the old rows by ID for constant-time lookup; assumes IDs are unique
# (duplicate IDs would make each hashtable value an array instead of a single row)
$OldCSV = Import-Csv -Path ".\Data_B.csv" | Group-Object ID -AsHashTable -AsString

# Column names to compare, taken from the first row of the new CSV
$properties = $NewCSV[0].PSObject.Properties.Name

$result = foreach($line in $NewCSV)
{
    if($ref = $OldCSV[$line.ID])
    {
        foreach($prop in $properties)
        {
            if($line.$prop -ne $ref.$prop)
            {
                [pscustomobject]@{
                    ID = $line.ID
                    Property = $prop
                    OldValue = $ref.$prop
                    NewValue = $line.$prop
                }
            }
        }
        continue
    }

    Write-Warning "ID $($line.ID) could not be found on Old Csv!!"
}
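A small follow-up, not part of the original answer: since $result is a collection of [pscustomobject] rows, it can be written straight back out with the built-in Export-Csv if you want to keep the diff:

```powershell
# Persist the per-column diff for later review (file name is just an example).
$result | Export-Csv -Path ".\Changes.csv" -NoTypeInformation
```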

5 Comments

This solution is much faster than the one I was trying myself. One issue, though: it randomly reads some values as System.Object[] and hence sometimes shows a data mismatch.
@misguided Without looking at the actual CSVs, I don't see how that could be possible.
The only way I could see it returning Object[] is if OldCsv or NewCsv did not have unique values in the ID column. You already mentioned the values were unique; if that was not the case, it should have been clarified.
Awesome... that was the cause. I went back, dug a bit deeper into the data, and saw a few duplicates which, as you say, were causing the issue. I have removed them now and it works fine. Less than a minute to compare two CSV files with 12k rows each.
@misguided good to know, glad it worked :)

As vonPryz hints in the comments, you've written an algorithm with quadratic time complexity (O(n²) in Big-O notation): every time the input size doubles, the number of computations performed increases 4-fold.

To avoid this, I'd suggest using a hashtable or another dictionary type to hold each data set, with the primary key from the input as the dictionary key. This way you get constant-time lookup of corresponding records, and the time complexity of your algorithm becomes linear (O(n)): two passes to build the tables, then one pass with O(1) lookups.

$NewCSV = @{}
Import-Csv -Path ".\Data_A.csv" |ForEach-Object {
  $NewCSV[$_.ID] = $_
}

$OldCSV = @{}
Import-Csv -Path ".\Data_B.csv" |ForEach-Object {
  $OldCSV[$_.ID] = $_
}

Now that we can efficiently resolve each row by its ID, we can inspect both data sets with an independent loop over each:

foreach($entry in $NewCSV.GetEnumerator()){
  if(-not $OldCSV.ContainsKey($entry.Key)){
    # $entry.Value is a new row, not seen in the old data set
    continue
  }

  $newRow = $entry.Value
  $oldRow = $OldCSV[$entry.Key]

  # do the individual comparison of the rows here
}

Run a similar loop with the roles of $NewCSV and $OldCSV swapped to detect deletions (rows present in the old data set but missing from the new one).
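Putting the pieces above together, a minimal sketch of the per-column comparison body and the deletion pass might look like the following (only the ID column name is taken from the question; everything else comes from the CSV headers):

```powershell
# Column names, taken from any one row (all rows share the same properties).
$columns = @($NewCSV.Values)[0].PSObject.Properties.Name

foreach ($entry in $NewCSV.GetEnumerator()) {
  if (-not $OldCSV.ContainsKey($entry.Key)) {
    Write-Host "Added row: $($entry.Key)"
    continue
  }

  $newRow = $entry.Value
  $oldRow = $OldCSV[$entry.Key]

  # Report each column whose value differs between the two snapshots.
  foreach ($col in $columns) {
    if ($newRow.$col -ne $oldRow.$col) {
      Write-Host "ID $($entry.Key): '$col' changed from '$($oldRow.$col)' to '$($newRow.$col)'"
    }
  }
}

# Deletion pass: anything keyed in the old set but not the new one was removed.
foreach ($entry in $OldCSV.GetEnumerator()) {
  if (-not $NewCSV.ContainsKey($entry.Key)) {
    Write-Host "Deleted row: $($entry.Key)"
  }
}
```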

6 Comments

You cannot call a method on a null-valued expression. At C:\Data\FIle.PS1:39 char:19 + foreach($entry in $NewCSV.GetEnumerator()){ + ~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (:) [], RuntimeException + FullyQualifiedErrorId : InvokeMethodOnNull
@misguided Did you run the second block of code without running the first block of code in my answer? :) First line, $NewCSV = @{}, assigns a hashtable to $NewCSV, so if you run them in order, that error won't occur
I did, actually, but I'm still getting the error. The data set has some null values in some columns (not the primary key). Could that be an issue?
No, Import-Csv would produce empty strings rather than null values. Also, the $NewCSV variable still exists regardless of whether the CSV is empty or not.
What does ForEach-Object {$NewCSV[$_.ID] = $_} do? A normal Import-Csv loads the data fine, but when I pipe it into this code, the array is blank. I am assuming ID is the "primary key" column name?