1

I want to compare each row of a CSV file with itself and every other row within a column.
For example, if the column values are like this:

Value_1
Value_2
Value_3

The code should pick Value_1 and compare it with Value_1 (yes, with itself too), Value_2 and then with Value_3. Then it should pick up Value_2 and compare it with Value_1, Value_2, Value_3, and so on.

I've written the following code for this purpose:

import csv

csvfile = r"c:\temp\temp.csv"  # raw string, so "\t" is not read as a tab character
with open(csvfile, newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for compare_row in reader:
            if row == compare_row:
                print(row,'is equal to',compare_row)
            else:
                print(row,'is not equal to',compare_row)

The code gives the following output:

['Value_1'] is not equal to ['Value_2']
['Value_1'] is not equal to ['Value_3']

The code compares Value_1 to Value_2 and Value_3 and then stops. The outer loop never picks up Value_2 or Value_3. In short, the outer loop appears to iterate over only the first row of the CSV file before stopping.

Also, I can't compare Value_1 to itself using this code. Any suggestions for the solution?

1
  • 1
    Your indentation looks weird, but I assume it is not like this in your real code. Could you try to create a new reader inside the first loop for compare_row instead of using the same for both loops? Commented Oct 4, 2015 at 0:57
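
The suggestion in this comment follows from the fact that a csv.reader is a single-pass iterator: when both loops share one reader, the inner loop consumes all remaining rows and leaves nothing for the outer loop. A minimal sketch of that behaviour, assuming an in-memory io.StringIO as a stand-in for the CSV file (the names sample, outer_rows and inner_rows are illustrative and not from the question):

```python
import csv
import io

# In-memory stand-in for the CSV file (assumption, to keep the sketch self-contained).
sample = io.StringIO("Value_1\nValue_2\nValue_3\n", newline='')
reader = csv.reader(sample, delimiter=',')

outer_rows = []  # rows seen by the outer loop
inner_rows = []  # rows seen by the inner loop

for row in reader:
    outer_rows.append(row)
    # The inner loop pulls from the SAME iterator, so it continues
    # from the row after `row` and runs until the data is exhausted.
    for compare_row in reader:
        inner_rows.append(compare_row)

print(outer_rows)  # [['Value_1']]
print(inner_rows)  # [['Value_2'], ['Value_3']]
```

This matches the output reported in the question: the first row is never compared with itself, and the outer loop ends after one pass because the shared reader is already exhausted.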

2 Answers

3

I would have suggested loading the CSV into memory, but that is not an option considering the size.

Instead, think of it like a SQL join: for every row in the left table, you want to compare it against every row in the right table. So you scan through the left table only once, and re-scan the right table from the start for each left row until the left table has reached EOF.

import csv

# csvfile is the path to the CSV file, as defined in the question
with open(csvfile, newline='') as f_left:
    reader_left = csv.reader(f_left, delimiter=',')
    with open(csvfile, newline='') as f_right:
        reader_right = csv.reader(f_right, delimiter=',')
        for row in reader_left:
            for compare_row in reader_right:
                if row == compare_row:
                    print(row,'is equal to',compare_row)
                else:
                    print(row,'is not equal to',compare_row)
            # rewind the right-hand file once the inner loop has finished
            f_right.seek(0)
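
To try the rewind pattern without a file on disk, here is a minimal sketch that uses two in-memory io.StringIO objects in place of the two open file handles. This is an assumption made so the example is self-contained; the helper name compare_all_rows is hypothetical, and it collects the comparisons in a list instead of printing them directly.

```python
import csv
import io


def compare_all_rows(f_left, f_right):
    """Compare every row from f_left with every row from f_right.

    Both arguments are separate, seekable text-file objects over the same data.
    Returns a list of (left_row, right_row, is_equal) tuples.
    """
    results = []
    reader_left = csv.reader(f_left, delimiter=',')
    reader_right = csv.reader(f_right, delimiter=',')
    for row in reader_left:
        for compare_row in reader_right:
            results.append((row, compare_row, row == compare_row))
        # Rewind once per left-hand row, AFTER the inner loop has finished.
        f_right.seek(0)
    return results


data = "Value_1\nValue_2\nValue_3\n"
results = compare_all_rows(io.StringIO(data, newline=''),
                           io.StringIO(data, newline=''))
for left, right, equal in results:
    print(left, 'is equal to' if equal else 'is not equal to', right)
```

With three rows this produces nine comparisons, three of which are a row compared with itself. Note that the seek(0) call sits at the indentation level of the inner for statement, not inside its body.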

2 Comments

Thanks for the reply, but the program goes into an infinite loop as a result of the f_right.seek(0) call. I tried to find a workaround but couldn't. Can you suggest what the issue could be?
I am really sorry, I was putting the seek(0) in the wrong place. Thanks a lot for the answer! Will mark green once I am done with testing. :)
1

Try the built-in itertools module from Python's standard library:

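from itertools import product

with open("abcTest.txt") as inputFile:
    # splitlines() avoids the empty trailing element that split("\n") leaves behind
    aList = inputFile.read().splitlines()
    # product(aList, aList) yields every ordered pair, including (x, x)
    aProduct = product(aList, aList)
    for aElem, bElem in aProduct:
        if aElem == bElem:
            print(aElem, 'is equal to', bElem)
        else:
            print(aElem, 'is not equal to', bElem)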

What you are describing is a Cartesian product: each row of data is compared with itself and with every other row.

If you read from the source multiple times, it will cause a significant performance issue when the file is big. Instead, you can store the data in a list and iterate over it multiple times, but this also carries a considerable overhead.

The itertools module is useful in this case, as it is designed for this kind of problem.
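
The same idea can be sketched for rows parsed by the csv module. This is a minimal sketch under two assumptions: an in-memory io.StringIO stands in for the file, and the column fits in memory, since itertools.product fully consumes its input iterables before it yields the first pair. The names sample, rows and pairs are illustrative.

```python
import csv
import io
from itertools import product

# In-memory stand-in for the CSV file (assumption, to keep the sketch self-contained).
sample = io.StringIO("Value_1\nValue_2\nValue_3\n", newline='')

# Read all rows once; each row is a list of fields, e.g. ['Value_1'].
rows = list(csv.reader(sample, delimiter=','))

# repeat=2 pairs the list with itself, which is equivalent to product(rows, rows).
pairs = list(product(rows, repeat=2))

for row, compare_row in pairs:
    if row == compare_row:
        print(row, 'is equal to', compare_row)
    else:
        print(row, 'is not equal to', compare_row)
```

For n rows this yields n * n pairs, so the number of comparisons still grows quadratically; product only removes the need for the nested loops and the repeated file reads.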
