2

since a week I try to succeed on my topic, but I cant find any acceptable solution. I mean, ... I have a working solution, but comparing only takes about half a day :-S

Precondition: Both csv-files are already copy-pasted into local workbook. they are present and ready to play with them. each file has ~6000 rows and 4 columns. column A: documentname/version column B: subject1 column C: subject2 column D: boolean-artefact both csv files have same structure. column A includes a documentname and its latest version. each row contains a combination of: documentname/version, subj1, subj2, boolean

Examples of CSV_old new including comments/change for csv_new in column_E

Document/Version    Subj1   Subj2   BOOLEAN 
DOC_1/Vers1         FUN     GERMANY FALSE   
DOC_2/Vers3         FUN     GERMANY TRUE    
DOC_2/Vers3         FUN     UK      TRUE    <- to be deleted in CSV_new
DOC_2/Vers3         FUN     FRANCE  TRUE    
DOC_3/Vers7         ACTION  GERMANY FALSE   <- Version Update in CSV_new
DOC_4/Vers4         MOVIE   UK      TRUE    
DOC_6/Vers1         HELP    SPAIN   FALSE   
DOC_7/Vers2         FUN     GERMANY FALSE   <- boolean: true in CSV_new
DOC_8/Vers5         FUN FRANCE  TRUE    <- Subj1: ACTION instead of FUN

CSV_new

Document/Version    Subj1   Subj2   BOOLEAN 
DOC_1/Vers1         FUN     GERMANY FALSE   
DOC_2/Vers3         FUN     GERMANY TRUE    
DOC_2/Vers3         FUN     UK      TRUE    
DOC_2/Vers3         FUN     FRANCE  TRUE    
DOC_3/Vers9         ACTION  GERMANY FALSE   <- Version Updated
DOC_4/Vers4         MOVIE   UK      TRUE    
DOC_5/Vers5         DANGER  UK      FALSE   <- new/added Row in CSV_new
DOC_6/Vers1         HELP    SPAIN   FALSE   
DOC_7/Vers2         FUN     GERMANY FALSE   <- boolean updated to true
DOC_8/Vers5         ACTION  FRANCE  TRUE    <- Subj1: ACTION instead of FUN

Aim: Compare two CSV files (both derived from database). Each file is a derived version from a huge database (extract). I would like to check an older csv file (e.g version 2.0, csv_old) against a newer on (e.g. version 4.1, csv_new).

This way I would like to see the differents between both derived versions (extracts) of the database. There can be new inserted/added lines as well as deleted lines.

So far I got a code which is working, but takes sooooooo much time. I paste a kind of pseudo code to give you an imagination about my approach (it does only contain one step of comparison):

For rowInOldCSV = 3 To Sheets("_ws_oldCSV").Range("A65536").End(xlUp).Row

Set findSameDocumentNumberInColumnA = Sheets(givenActiveWS).Cells.Find(Sheets("_ws_oldCSV").Range("A" & rowInOldCSV & ":D" & rowInOldCSV).Value, LookIn:=xlValues)
Set findSameDocumentNumberInColumnA_withoutVers = Sheets(givenActiveWS).Cells.Find(Left(Sheets("_ws_oldCSV").Cells(rowInOldCSV, 1).Value, Len(Sheets("_ws_oldCSV").Cells(rowInOldCSV, 1).Value) - 5), LookIn:=xlValues)


If Not findSameDocumentNumberInColumnA Is Nothing Then
    'document/version found!

    firstAddress = findSameDocumentNumberInColumnA.Address
    Do
         'if subj1+subj2 are same
        If (Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 2).Value = Sheets("_ws_oldCSV").Cells(rowInOldCSV, 2).Value) And _
           (Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 3).Value = Sheets("_ws_oldCSV").Cells(rowInOldCSV, 3).Value) Then '....and boolean-value the same

            'Sheets("_ws_oldCSV").Range("A" & rowInOldCSV & ":D" & rowInOldCSV).Copy 'takes even longer
            'Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 6).PasteSpecial Paste:=xlPasteValues
            Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 6).Value = Sheets("_ws_oldCSV").Cells(rowInOldCSV, 1).Value
            Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 7).Value = Sheets("_ws_oldCSV").Cells(rowInOldCSV, 2).Value
            Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 8).Value = Sheets("_ws_oldCSV").Cells(rowInOldCSV, 3).Value
            Sheets(givenActiveWS).Cells(findSameDocumentNumberInColumnA.Row, 9).Value = Sheets("_ws_oldCSV").Cells(rowInOldCSV, 4).Value

            'leave loop
            Exit Do
        End If
        Set findSameDocumentNumberInColumnA = Sheets(givenActiveWS).Cells.FindNext(findSameDocumentNumberInColumnA)
    Loop While Not findSameDocumentNumberInColumnA Is Nothing And findSameDocumentNumberInColumnA.Address <> firstAddress

Else
    'document/version not found
    If Not findSameDocumentNumberInColumnA_withoutVers Is Nothing Then
        'document found, looks like new version
        'mark it with yellow to show updated version
    Else
        'unkown document, means  new introduced since csv_old
        'copy it under last item in RowF
        '
    End If
End If
next rowInOldCSV

So far to my approach. I saw two different ones: http://www.ms-office-forum.net/forum/showthread.php?t=279399 and Excel VBA: Range to String Array in 1 step both seems to work quite well and veeeery fast, but unfortunately I am not able to use it for my scenario.

I guess, I have to put the values from column into string-array to start a comparison? I am out of ideas and have no clue how to handle Column-Values into String-Arrays. Sorry, ...

You might help me?

Result of comparison: would be nice to write stuff into CSV_new.

Doc/Vers       Subj1    Subj2    BOOLEAN    Doc    Subj1    Subj1    Boolean
DOC_1/Vers1    FUN      GERMANY  FALSE      -       -       -           -
DOC_2/Vers3    FUN      GERMANY  TRUE       -       -       -           -
DOC_2/Vers3    FUN      UK       TRUE      Deleted  -       -           -
DOC_2/Vers3    FUN      FRANCE   TRUE      -        -       -           -
DOC_3/Vers9    ACTION   GERMANY  FALSE     Updated  -       -           -
DOC_4/Vers4    MOVIE    UK       TRUE      -        -       -           -
DOC_5/Vers5    DANGER   UK       FALSE     New      -       -           -
DOC_6/Vers1    HELP     SPAIN    FALSE     -        -       -           -
DOC_7/Vers2    FUN      GERMANY  TRUE      -        -       -           X
DOC_8/Vers5    ACTION   FRANCE   TRUE      -        X       -           -

Many, many thanks in advance for your effort!!!!! :o)

1 Answer 1

0

This code will generate 2 sets of results: one for Sheet1 (old), the other for Sheet2 (new)

  • Set 1 - Sheet1 will show the records missing from Sheet2
  • Set 2 - Sheet2 will show the records missing from Sheet1
  • Both sets will show updated records

It uses nested dictionaries (detailed bellow)


Option Explicit

Public Sub CompareCSVs()    '1 = Old, 2 = New; UsedRange starts at A1

    Const LC1 = 4           'D - LastCol in Old
    Const LC2 = 4           'D - LastCol in New

    Dim ur1 As Range, arr1 As Variant, dv1 As Object
    Dim ur2 As Range, arr2 As Variant, dv2 As Object

    Set ur1 = Sheet1.UsedRange  'Or: ThisWorkbook.Worksheets("csv_old").UsedRange
    Set ur2 = Sheet2.UsedRange  'Or: ThisWorkbook.Worksheets("csv_new").UsedRange
    arr1 = ur1
    arr2 = ur2
    Set dv1 = CreateObject("Scripting.Dictionary")
    Set dv2 = CreateObject("Scripting.Dictionary")

    Dim urRes1 As Range, urRes2 As Range, arrRes1 As Variant, arrRes2 As Variant

    Set urRes1 = ur1.Offset(1, LC1).Resize(ur1.Rows.Count - 1, LC1 + 1) 'Exclude Headers
    Set urRes2 = ur2.Offset(1, LC2).Resize(ur2.Rows.Count - 1, LC2 + 1) 'Exclude Headers
    urRes1.ClearContents        'Clear results
    urRes2.ClearContents
    arrRes1 = urRes1
    arrRes2 = urRes2

    SetDictionaries dv1, arr1, LC1
    SetDictionaries dv2, arr2, LC2:     'ShowAllItems dv1:   ShowAllItems dv2

    CompareData dv1, dv2, arrRes2
    CompareData dv2, dv1, arrRes1

    urRes1 = arrRes1
    urRes2 = arrRes2
End Sub

Private Sub SetDictionaries(ByRef d As Object, ByRef arr As Variant, ByVal ubC As Long)

    Dim r As Long, c As Long, k As String

    For r = 2 To UBound(arr)
        For c = 1 To ubC
            k = k & arr(r, c) & "|"
            d(Left(k, Len(k) - 1)) = 0
        Next
        k = vbNullString
    Next
End Sub

Private Sub CompareData(ByRef d1 As Variant, ByRef d2 As Variant, ByRef res As Variant)

    Dim r As Long, c As Long, itm As Variant, sp As Variant, k As Variant

    r = 1
    For Each itm In d2
        sp = Split(itm, "|")
        c = UBound(sp) + 1
        If Not d1.Exists(itm) Then
            If Len(res(r, 1)) = 0 Then
                res(r, 1) = IIf(c = 1, "Missing: ", "Updated: ")
                res(r, c + 1) = sp(c - 1)
            Else
                If res(r, 1) = "Updated: " Then res(r, c + 1) = sp(c - 1)
            End If
        End If
        If c = 4 Then r = r + 1
    Next
End Sub

Private Sub ShowAllItems(ByRef d As Object)

    Dim x As Variant

    For Each x In d
        Debug.Print x   'Space$(5), String$(5, "-")
    Next
    Debug.Print
End Sub

Dictionaries, when ready to be compared

Dictionaries

Before

Before

After

After

Note: The sample data you provided is not the same as in the description

CSV_old

Document/Version Subj1  Subj2   BOOLEAN 
DOC_1/Vers1      FUN    GERMANY FALSE                                 <- Correct
DOC_2/Vers3      FUN    GERMANY TRUE                                  <- Correct
DOC_2/Vers3      FUN    UK      TRUE  <- to be deleted in CSV_new     <- Exists in new
DOC_2/Vers3      FUN    FRANCE  TRUE                                  <- Correct
DOC_3/Vers7      ACTION GERMANY FALSE <- Version Update in CSV_new    <- This not in new
DOC_4/Vers4      MOVIE  UK      TRUE                                  <- Correct
DOC_6/Vers1      HELP   SPAIN   FALSE                                 <- Correct  
DOC_7/Vers2      FUN    GERMANY FALSE <- boolean: true in CSV_new     <- FALSE in new
DOC_8/Vers5      FUN    FRANCE  TRUE  <- Subj1: ACTION instead of FUN <- Correct

CSV_new

Document/Version Subj1  Subj2   BOOLEAN 
DOC_1/Vers1      FUN    GERMANY FALSE                                 <- Correct
DOC_2/Vers3      FUN    GERMANY TRUE                                  <- Correct  
DOC_2/Vers3      FUN    UK      TRUE                                  <- Exists in new
DOC_2/Vers3      FUN    FRANCE  TRUE                                  <- Correct   
DOC_3/Vers9      ACTION GERMANY FALSE <- Version Updated              <- New record
DOC_4/Vers4      MOVIE  UK      TRUE                                  <- Correct
DOC_5/Vers5      DANGER UK      FALSE <- new/added Row in CSV_new     <- Correct
DOC_6/Vers1      HELP   SPAIN   FALSE                                 <- Correct
DOC_7/Vers2      FUN    GERMANY FALSE <- boolean updated to true      <- FALSE in new
DOC_8/Vers5      ACTION FRANCE  TRUE  <- Subj1: ACTION instead of FUN <- Correct
Sign up to request clarification or add additional context in comments.

5 Comments

MAAAAANY THANKS, thats so good! :o) And its so fast! Unfortunately this thing about using 'CreateObject("Scripting.Dictionary")' I dont understand at all. Result "Missing:" concludes added and deleted rows in the same way/meaning. Is there a chance to make a difference between new/added rows and deleted ones?
I'm glad it helped ! but I'm not sure what you meant by making a difference between new and deleted rows: Sheet1 will show all missing rows from Sheet2 - this shows all deleted records from CSV_new. Then, Sheet2 will show all missing rows from Sheet1 - this shows all new records from CSV_new. You cannot show new and deleted records in one sheet, because you'd have to insert the missing row, to show that is missing (maybe I'm not understanding correctly)
@stoeven - BTW - can you give me an idea how long it takes now? (how many total rows in old, and how many total rows in new?). I didn't have proper data for testing thousands of rows. Thanks
Oh dear, you are right - got it :o) So, ... i will keep this as a solution. Right now I have approx. 5900 rows in both and it takes about 2 seconds until finish. Brilliant approach working with these Dictionarys. But to be honest, need to Bing/Google that again to understand its structure to go further on my own. Many thanks again for your effort you put into this topic.
You are welcome! If you work with a lot of data you will be able to improve performance significantly with arrays, dictionaries, and collections. The key thing to remember is to keep interactions with ranges at a minimum: read the entire sheet into an array: Dim arr as Variant: arr = Sheet1.UsedRange, make the changes to the array the same way you'd make them to the sheet: Sheet1.Cells(1, 1) = "Test", is the same as arr(1, 1) = "Test", then place the entire array back on the UsedRange: Sheet1.UsedRange = arr

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.