Can't compare strings in Python

Question

I have this code that should open and read two text files, and report a match when a word is present in both. A match is signaled by printing "SUCCESS" and by writing the word to a temp.txt file.

dir = open('listac.txt','r')
path = open('paths.txt','r')
paths = path.readlines()
paths_size = len(paths)
matches = open('temp.txt','w')
dirs = dir.readlines()

for pline in range(0,len(paths)):
        for dline in range(0,len(dirs)):
                p = paths[pline].rstrip('\n').split(".")[0].replace(" ", "")
                dd = dirs[dline].rstrip('\n').replace(" ", "")
                #print p.lower()
                #print dd.lower()
                if (p.lower() == dd.lower()):
                        print "SUCCESS\n"
                        matches.write(str(p).lower() + '\n')

listac.txt is formatted as

/teetetet
/eteasdsa
/asdasdfsa
/asdsafads
.
. 
...etc

paths.txt is formatted as

/asdadasd.php/asdadas/asdad/asd
/adadad.html/asdadals/asdsa/asd
.
.
...etc

Hence I use the split function to get the first /asadasda segment (within paths.txt) before the dot. The problem is that the words never seem to match. I have even printed out each pair right before the if statement, and they look equal. Is there something else that Python does before comparing strings?

=======

Thanks everyone for the help. As you suggested, I cleaned up the code, and it ended up like this:

dir = open('listac.txt','r')
path = open('paths.txt','r')
#paths = path.readlines()
#paths_size = len(paths)

for line in path:
        p = line.rstrip().split(".")[0].replace(" ", "")
        for lines in dir:
                d = str(lines.rstrip())
                if p == d:
                        print p + " = " + d

Apparently, having p declared and initialized before entering the second for loop makes a difference in the comparison down the road. When I declared p and d within the second for loop, it wouldn't work. I don't know the reason for that, but if someone does, I am listening :)

Thanks again!
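
A note on the revised loop, offered as a sketch rather than a diagnosis: a file object is an iterator, so the inner for loop over dir can only run to completion once. After the first line of paths.txt has been processed, there is nothing left for the inner loop to read. The example below uses io.StringIO objects as stand-ins for the two files (the sample contents are made up) and is written in Python 3 syntax, unlike the Python 2 code above.

```python
import io

# Made-up stand-ins for listac.txt and paths.txt.
dir_file = io.StringIO("/aaa\n/bbb\n")
path_file = io.StringIO("/aaa.php/x\n/bbb.html/y\n")

compared = []
for line in path_file:
    p = line.rstrip().split(".")[0]
    # The inner loop consumes dir_file; on the second outer pass it is empty.
    for entry in dir_file:
        compared.append((p, entry.rstrip()))

print(compared)  # "/bbb" is never compared against anything

# Reading the directory lines into a list first allows repeated passes.
dir_file.seek(0)
dir_lines = [entry.rstrip() for entry in dir_file]
path_file.seek(0)
matches = []
for line in path_file:
    p = line.rstrip().split(".")[0]
    if p in dir_lines:
        matches.append(p)

print(matches)
```

With the nested file iteration only the first path is ever compared; with the list, both sample paths are found.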

Too complicated. Don't use for pline in range(0,len(paths)), just iterate over the elements with for pline in paths. And why rstrip('\n')? There might be an extra \r. Just use rstrip().
– Matthias, Commented Sep 11, 2012 at 14:49
You can also move p = ... outside the inner for loop since it's doing the same calculation each time.
– mgilson, Commented Sep 11, 2012 at 14:51
Rather than compare every possible pair from two lists, find the intersection of two sets.
– unutbu, Commented Sep 11, 2012 at 14:53
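
To illustrate the point about a stray \r: suppose listac.txt was saved with Windows line endings and read in a way that leaves "\r\n" intact (an assumption, the question does not say; Python 2 on a Unix system would behave this way). Then rstrip('\n') leaves a trailing carriage return on each directory line, while the path lines lose theirs because split(".")[0] discards everything after the first dot. The two strings can look identical when printed yet compare unequal. A minimal sketch with made-up lines, in Python 3 syntax:

```python
# Made-up sample lines, assuming Windows-style "\r\n" endings survive reading.
dir_line = "/asdadasd\r\n"
path_line = "/asdadasd.php/asdadas/asdad/asd\r\n"

p = path_line.rstrip("\n").split(".")[0]   # '/asdadasd'
d_newline_only = dir_line.rstrip("\n")     # '/asdadasd\r'
d_all_whitespace = dir_line.rstrip()       # '/asdadasd'

# repr() makes the hidden carriage return visible.
print(repr(p), repr(d_newline_only), p == d_newline_only)
print(repr(p), repr(d_all_whitespace), p == d_all_whitespace)
```

Printing with repr() instead of a bare print is a quick way to check whether invisible characters are causing a failed comparison.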

mgilson · Accepted Answer · 2012-09-12 13:43:00Z

7

Since we're reading entire data files into memory anyway, why not use sets and take the intersection?

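def format_data(x):
    # Normalise a line: trim whitespace, drop spaces, keep the part before the first dot.
    return x.rstrip().replace(' ','').split('.')[0].lower()

with open('listac.txt') as dirFile:
    dirStuff = set( format_data(dline) for dline in dirFile )

with open('paths.txt') as pathFile:
    intersection = dirStuff.intersection( format_data(pline) for pline in pathFile )

# Open the output file here so that matches is defined.
with open('temp.txt', 'w') as matches:
    for elem in intersection:
        print "SUCCESS\n"
        matches.write(str(elem)+"\n")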

I've used the same format_data function for both datasets, since they look more or less the same, but you can use more than one function if you please. Also note that this solution only holds one of the two files in memory, as a set. The other file is passed to intersection() as a generator, so it should be consumed lazily, one line at a time.
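
For readers who want to try the idea without the original files, here is a self-contained sketch in Python 3 syntax. The io.StringIO objects stand in for listac.txt and paths.txt, and the sample lines and the names dir_stuff and common are invented for illustration.

```python
import io

def format_data(line):
    # Same normalisation as above: trim, drop spaces, keep text before the first dot.
    return line.rstrip().replace(" ", "").split(".")[0].lower()

# Made-up stand-ins for the two input files.
dir_file = io.StringIO("/Alpha\n/beta\n/gamma\n")
path_file = io.StringIO("/alpha.php/a/b\n/delta.html/c/d\n/gamma.php/e\n")

dir_stuff = set(format_data(line) for line in dir_file)

# intersection() accepts any iterable, so path_file is consumed line by line.
common = dir_stuff.intersection(format_data(line) for line in path_file)

for elem in sorted(common):
    print(elem)
```

Sorting is only there to make the printed order predictable, since sets do not keep insertion order.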

As pointed out in the comments, this does not make any attempt to preserve the order. However, if you really need to preserve the order, try this:

<snip>
...
</snip>

with open('paths.txt') as pathFile:
    for line in pathFile:
        if format_data(line) in dirStuff:
            print "SUCCESS\n"
            #...
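
A complete, runnable variant of the order-preserving approach might look like the sketch below. It is written in Python 3 syntax, and the sample data, the in-memory out buffer (a stand-in for temp.txt), and the names dir_stuff and ordered are assumptions made for the example.

```python
import io

def format_data(line):
    # Trim, drop spaces, keep the text before the first dot, lowercase.
    return line.rstrip().replace(" ", "").split(".")[0].lower()

# Made-up stand-ins for listac.txt, paths.txt and temp.txt.
dir_file = io.StringIO("/gamma\n/alpha\n")
path_file = io.StringIO("/gamma.php/e\n/delta.html/c\n/alpha.php/a\n")
out = io.StringIO()

dir_stuff = set(format_data(line) for line in dir_file)

ordered = []
for line in path_file:
    key = format_data(line)
    # Set membership is a hash lookup, so no inner loop is needed.
    if key in dir_stuff:
        ordered.append(key)
        out.write(key + "\n")

print(ordered)
```

Matches come out in the order in which they appear in the paths input, because that input is walked once from top to bottom.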

edited Sep 12, 2012 at 13:43

answered Sep 11, 2012 at 14:55

mgilson


11 Comments

Tim Pietzcker Over a year ago

This will lead to a different output, though, since the order of the lines will be lost. Most probably this won't matter.

Alfe Over a year ago

I prefer a & b before a.intersection(b). "What is in a & b" == "What is in a & in b".

Alfe Over a year ago

Too bad that operators like ∩ are not ASCII and thus not (yet) part of Python ;-)
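
One practical difference between the two spellings is worth keeping in mind: the & operator requires both operands to be sets, whereas the intersection() method accepts any iterable. The answer above passes a generator to intersection(), which is why the method form is used there. A small sketch in Python 3 syntax, with made-up values:

```python
dir_stuff = {"/alpha", "/beta"}
path_keys = ["/alpha", "/delta"]   # any iterable, here a plain list

via_method = dir_stuff.intersection(path_keys)   # accepts any iterable
via_operator = dir_stuff & set(path_keys)        # both operands must be sets

# Using & with a non-set operand raises TypeError.
try:
    dir_stuff & path_keys
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

print(via_method, via_operator, operator_accepts_list)
```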

Alfe Over a year ago

Again, I think that accepting an answer within the first 10 minutes isn't a good idea and should be prohibited. I cringe at seeing the nested-loop answer now displayed more prominently than the set solution. And I don't think user1663160 is well served by it.

Alfe Over a year ago

Using .index() is not a good way to re-establish the order anyway. Better to walk linearly through the input that provides the order and use the in operator on a set built from the other input to determine whether a match is found.


desimusxvii · Accepted Answer · 2012-09-11 14:50:32Z

4

I'd have to see more of your data set to tell why you aren't getting matches. In the meantime, I've refactored some of your code to be more Pythonic.

dirFile = open('listac.txt','r')
pathFile = open('paths.txt','r')
paths = pathFile.readlines()
dirs = dirFile.readlines()

matches = open('temp.txt','w')

for pline in paths:
    p = pline.rstrip('\n').split(".")[0].replace(" ", "")
    for dline in dirs:
        dd = dline.rstrip('\n').replace(" ", "")
        #print p.lower()
        #print dd.lower()
        if p.lower() == dd.lower():
            print "SUCCESS\n"
            matches.write(str(p).lower() + '\n')

answered Sep 11, 2012 at 14:50

desimusxvii


2 Comments

Tim Pietzcker Over a year ago

+1, but you can avoid the horrible nested loop by converting dirs into a set first (dirs = {line.lower() for line in dirFile} and then just check if p.lower() in dirs), and just iterate over the file directly, avoiding the readlines() and all those rstrip()s entirely.

desimusxvii Over a year ago

@TimPietzcker for sure. I would write this completely differently. I think it's more helpful to ease beginners into such things.
