1

I have two CSV files. I am reading them without the csv reader because the lines are inconsistent: some are wrapped in quotation marks and some are not, and this was throwing off the csv reader. Both files have the same format but different entries, so they look something like this:

a   b   c   d    e   f   g   h   i   j   h  i   j   k
"a  b   c   d    e   f   g   h   i   j   h  i   j   k  j"
"a  b   c   d    e   f   g   h   i   j   h  i   j   k  j"

What I need to do is find all the lines in file 1 and file 2 that have the same value in the third column (c). Note that the rest of the values will be quite different, so I don't think something like difflib will work, unless I've missed something.

At first I tried using a nested for loop - something like this

for line in fileOne:
    entry = line.split()
    print("A")
    for row in fileTwo:
        space = row.split()
        print("B")
        if space[2] == entry[2]:
            outputHandle.write(line)

but I found using print statements that this was outputting

 A
 B
 B
 B
 A
 A

I need the script to check through all the lines of the second file for each line in the first file so it would look like this:

 A
 B
 B
 B
 A
 B
 B
 B....etc

(This is very expensive, I know. But I am just starting out and not sure how to do this more efficiently, sadly.)

I also tried using a function:

def file_check(variableName):
    for row in fileTwo:
        print("B")
        if variableName in row:
            return "found"
    return "not found"

for line in fileOne:
    entry = line.split()
    print("A")
    var = file_check(entry[2])
    print(var)

This outputs: A ('Not found') A ('Not found') A ('Not found')

Since I am using test files, I KNOW that there are matching entries and so this is also not looping through the second file, but rather checking only the first line.

Sorry to ask such a basic question, StackOverflowians, but I'm really stuck this time. ANY advice is welcome and appreciated!!!

NOTE: this question HAS been asked before, but the answers only work for Python 2; the csv module for Python 3 seems to be really different. Here is the previous version of this question: Comparing two CSV Files Based on Specific Data in two Columns

3 Answers

2

I'm not certain whether you mean you want to find how many lines in B have the same value for field 3, as each line in file A does, or match up the lines from both files that share the same value for field 3.... I'm going to assume the latter.

How about sorting each file's lines by the third column before you start?

If you do that, then you can read down file A, and for each time file A's value in field 3 changes, print the records from A with that new value and then switch to handling file B:

Arecord = read file A
Brecord = read file B

while not EOF on file A:
    currentKey = field 3 of Arecord
    print "\n" + Arecord
    Arecord = read file A
    while not EOF on file A and field 3 of Arecord == currentKey:
        print Arecord
        Arecord = read file A

    while not EOF on file B and field 3 of Brecord < currentKey:
        Brecord = read file B
    while not EOF on file B and field 3 of Brecord == currentKey:
        print Brecord
        Brecord = read file B

Because you already sorted both files by field 3, this gets your results in one quick pass.
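
A minimal Python sketch of the sorted-merge idea above, assuming whitespace-separated fields, at least three fields per line, and files small enough to hold in memory. The names `key` and `matched_groups` are made up for illustration, and stripping quotes before comparing is an extra assumption about the data.

```python
def key(line):
    """Return the third whitespace-separated field, with quotes stripped."""
    return line.replace('"', '').split()[2]


def matched_groups(lines_a, lines_b):
    """Yield (key, rows_from_a, rows_from_b) for keys present in both inputs."""
    a = sorted(lines_a, key=key)
    b = sorted(lines_b, key=key)
    i = j = 0
    while i < len(a) and j < len(b):
        ka, kb = key(a[i]), key(b[j])
        if ka < kb:
            i += 1  # this key has no partner in B; advance A
        elif kb < ka:
            j += 1  # this key has no partner in A; advance B
        else:
            # Collect the run of equal keys from each side.
            i_end = i
            while i_end < len(a) and key(a[i_end]) == ka:
                i_end += 1
            j_end = j
            while j_end < len(b) and key(b[j_end]) == ka:
                j_end += 1
            yield ka, a[i:i_end], b[j:j_end]
            i, j = i_end, j_end
```

After the two sorts, each list is walked once, which is the "one quick pass" described above.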

If for some reason you need the lines back in order at the end, add their original record-number as an additional field before you start, sort by that afterwards, and then remove that extra field.

If you add an extra field that says which file each line came from, then you can just put the files together and sort by two keys: field 3 and the "which file I came from" field, and get the results in one shot.
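
The combined-sort variant might look like this in Python. This is only a sketch: `tag_and_sort` is a hypothetical helper, the "A"/"B" tags are arbitrary labels, and fields are again assumed to be whitespace-separated.

```python
def tag_and_sort(lines_a, lines_b):
    """Tag each line with its source file, then sort by (field 3, source)."""
    tagged = [("A", line) for line in lines_a] + [("B", line) for line in lines_b]

    def sort_key(item):
        source, line = item
        field3 = line.replace('"', '').split()[2]
        return (field3, source)

    return sorted(tagged, key=sort_key)
```

Lines whose key appears in only one file still show up in the sorted output, so a final pass would be needed to drop them.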

Caveat: the usual *nix "sort" command (like most/all the other *nix "field"-related commands) can't deal with quoted fields, so you may have to get rid of the quoting first. "sort" also isn't happy with Unicode, so if there are any non-ASCII characters in your data, use "msort" or something similar instead.

Hope that helps.

1 Comment

Thanks, this is a cool way of doing this. I was in fact looking for the latter way - the one you described, so it is definitely helpful.
1

You need to go through each line in each file and split() them into lists so you can compare them. Try something like this:

with open("file1") as file1, open("file2") as file2:
    for row1 in file1:
        row1 = row1.split()
        for row2 in file2:
            row2 = row2.split()
            if row1[2] == row2[2]:
                print("found")

If you also need to take out the quotation marks in the string, you could try something like this:

row1 = row1.split()
for i in range(len(row1)):
    row1[i] = row1[i].replace("\"", "")

This will replace every quotation mark with an empty string.
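
The same quote stripping can be folded into the split step. A small sketch, where `split_unquoted` is a hypothetical helper name and the fields are assumed to be whitespace-separated:

```python
def split_unquoted(line):
    """Split a line on whitespace and drop double quotes from each field."""
    return [field.replace('"', '') for field in line.split()]
```

Calling `split_unquoted(row1)` would then give the cleaned list of fields in one step.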

3 Comments

Yes, this is basically what I have, the nested for loop. But it doesn't actually scan through all the lines in the second file for some reason, as I noted. Try it with some print statements in each loop and you may see what I am talking about.
You should test that part separately, because that seems very unlikely.
The problem, I realized, is that the pointer is at the bottom of the second file by the time the for loop reaches file2 for the second run-through. So if you rewind the cursor before the second for loop, your code runs perfectly! If you edit your answer to include "file2.seek(0)", I'll "accept" it. Thanks for the help!
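
A sketch of the nested loop with the rewind described in the comment above. `io.StringIO` stands in for the two open files so the snippet is self-contained, and `matching_lines` is a hypothetical name. With real files you would pass the handles returned by `open()`. Stripping quotes and stopping at the first match are assumptions that go beyond the original loop.

```python
import io


def matching_lines(file_one, file_two):
    """Return lines of file_one whose third field also appears in file_two."""
    matches = []
    for line in file_one:
        entry = line.replace('"', '').split()
        # Rewind file_two; without this it is exhausted after the first pass.
        file_two.seek(0)
        for row in file_two:
            space = row.replace('"', '').split()
            if space[2] == entry[2]:
                matches.append(line)
                break  # one match is enough for this line
    return matches


file_one = io.StringIO('a b c1 d\n"a b c2 d"\na b c3 d\n')
file_two = io.StringIO('x y c3 z\nx y c1 z\n')
print(matching_lines(file_one, file_two))
```

Each line of the first file triggers a full re-read of the second, so this stays quadratic; it only fixes the exhausted-iterator problem.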
0

I would try something along the lines of:

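import pandas as pd

# f1 and f2 are the paths to the two CSV files
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)
# compare the third column row by row (both files need the same number of rows)
df1['same'] = df1.iloc[:, 2] == df2.iloc[:, 2]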

That should give you a column of True/False values showing where the third-column values at the same row position match or differ.
