
I have this code that should open and read two text files, and report a match when a word is present in both. A match is reported by printing "SUCCESS" and by writing the word to a temp.txt file.

dir = open('listac.txt','r')
path = open('paths.txt','r')
paths = path.readlines()
paths_size = len(paths)
matches = open('temp.txt','w')
dirs = dir.readlines()

for pline in range(0,len(paths)):
        for dline in range(0,len(dirs)):
                p = paths[pline].rstrip('\n').split(".")[0].replace(" ", "")
                dd = dirs[dline].rstrip('\n').replace(" ", "")
                #print p.lower()
                #print dd.lower()
                if (p.lower() == dd.lower()):
                        print "SUCCESS\n"
                        matches.write(str(p).lower() + '\n')

listac.txt is formatted as

/teetetet
/eteasdsa
/asdasdfsa
/asdsafads
.
. 
...etc

paths.txt is formatted as

/asdadasd.php/asdadas/asdad/asd
/adadad.html/asdadals/asdsa/asd
.
.
...etc

Hence I use the split function to get the first component (such as /asdadasd in paths.txt) before the dot. The problem is that the words never seem to match. I have even printed both strings right before each if statement and they look equal. Is there something else that Python does before comparing strings?

=======

Thanks everyone for the help. As you suggested, I cleaned up the code, so it ended up like this:

dir = open('listac.txt','r')
path = open('paths.txt','r')
#paths = path.readlines()
#paths_size = len(paths)

for line in path:
        p = line.rstrip().split(".")[0].replace(" ", "")
        for lines in dir:
                d = str(lines.rstrip())
                if p == d:
                        print p + " = " + d

Apparently, having p declared and initialized before entering the second for loop makes a difference in the comparison down the road. When I declared p and d within the second for loop, it wouldn't work. I don't know the reason for that, but if someone does, I am listening :)
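A note on the cleaned-up loop above: a file object is an iterator, so the inner `for lines in dir` loop can only run to completion once. From the second line of paths.txt onward, dir is already at end-of-file and nothing is compared. The sketch below illustrates this with `io.StringIO` objects standing in for the two files; the file contents and the names `dir_file`, `path_file`, `comparisons` and `found` are made up for illustration, and the code is written for Python 3 rather than the Python 2 used in the post.

```python
import io

# Hypothetical stand-ins for listac.txt and paths.txt;
# io.StringIO iterates line by line like an open text file.
dir_file = io.StringIO("/aaa\n/bbb\n")
path_file = io.StringIO("/zzz.php/x\n/aaa.html/y\n")

comparisons = []
for line in path_file:
    p = line.rstrip().split(".")[0].replace(" ", "")
    # A file object can be iterated only once. After the first outer pass,
    # dir_file is at end-of-file and this inner loop body never runs again.
    for entry in dir_file:
        comparisons.append((p, entry.rstrip()))

print(comparisons)  # only "/zzz" was ever compared; "/aaa" was never tested

# Reading the directory lines into a list once makes them reusable.
dir_file.seek(0)
path_file.seek(0)
dirs = [entry.rstrip() for entry in dir_file]
found = []
for line in path_file:
    p = line.rstrip().split(".")[0].replace(" ", "")
    if p in dirs:
        found.append(p)

print(found)
```

With real files, either `readlines()` once before the loops or a `seek(0)` before each inner pass avoids the exhausted-iterator problem.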

Thanks again!

  • In your examples, there is no match. Commented Sep 11, 2012 at 14:46
  • Too complicated. Don't use for pline in range(0,len(paths)), just iterate over the elements with for pline in paths. And why rstrip('\n')? There might be an extra \r. Just use rstrip(). Commented Sep 11, 2012 at 14:49
  • You can also move p = ... outside the inner for loop since it's doing the same calculation each time. Commented Sep 11, 2012 at 14:51
  • Rather than compare every possible pair from two lists, find the intersection of two sets. Commented Sep 11, 2012 at 14:53
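The `rstrip('\n')` point raised in the comments can be checked directly. This is a minimal sketch, assuming the files use Windows-style `\r\n` line endings (the question does not say so, so that is a guess) and written for Python 3. The sample lines are hypothetical. It shows how two strings can look identical when printed and still compare unequal.

```python
# Hypothetical lines, assuming Windows-style "\r\n" line endings.
path_line = "/asdadasd.php/asdadas/asdad/asd\r\n"
dir_line = "/asdadasd\r\n"

# split(".")[0] discards everything after the first dot, including the "\r".
p = path_line.rstrip("\n").split(".")[0].replace(" ", "")
# rstrip("\n") removes only the "\n", so a trailing "\r" survives here.
dd = dir_line.rstrip("\n").replace(" ", "")

print(repr(p), repr(dd))  # repr() makes the invisible "\r" visible
print(p == dd)            # False: dd still ends in "\r"

# rstrip() with no argument strips all trailing whitespace, "\r" included.
dd_clean = dir_line.rstrip().replace(" ", "")
print(p == dd_clean)      # True
```

Printing with `repr()` while debugging is a cheap way to rule this kind of mismatch in or out.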

2 Answers


While we're reading the entire data files into memory anyway, why not try to use sets and get the intersection?

def format_data(x):
    return x.rstrip().replace(' ','').split('.')[0].lower()

with open('listac.txt') as dirFile:
     dirStuff = set( format_data(dline) for dline in dirFile )

with open('paths.txt') as pathFile:
     intersection = dirStuff.intersection( format_data(pline) for pline in pathFile )

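# matches was not defined in this snippet; open the output file here
with open('temp.txt','w') as matches:
    for elem in intersection:
        print "SUCCESS\n"
        matches.write(str(elem)+"\n")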

I've used the same format_data function for both datasets, since they look more or less the same, but you can use more than one function if you please. Also note that this solution reads only one of the two files into memory. The other file is passed to intersection as a generator expression, so its lines are consumed one at a time.
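To make the lazy-intersection behaviour concrete, here is a small self-contained sketch written for Python 3. The file contents are hypothetical, and `io.StringIO` objects stand in for listac.txt and paths.txt so the example runs without any files on disk. The names `dir_file`, `path_file`, `dir_stuff` and `common` are illustrative only.

```python
import io

def format_data(x):
    # Same normalization as in the answer: strip trailing whitespace,
    # drop spaces, keep the part before the first dot, lowercase.
    return x.rstrip().replace(' ', '').split('.')[0].lower()

# Hypothetical file contents; io.StringIO iterates like an open text file.
dir_file = io.StringIO("/Alpha\n/beta\n/gamma\n")
path_file = io.StringIO("/alpha.php/a/b\n/delta.html/c\n/GAMMA.php/d\n")

# Only the directory names are materialized as a set.
dir_stuff = set(format_data(line) for line in dir_file)

# set.intersection() accepts any iterable, so the generator expression
# is consumed one line at a time instead of being built into a list first.
common = dir_stuff.intersection(format_data(line) for line in path_file)

print(sorted(common))  # ['/alpha', '/gamma']
```

One design note: the `&` operator requires both operands to be sets, so `dir_stuff & (generator)` raises a `TypeError`. The method form is what allows the second file to be streamed.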

As pointed out in the comments, this does not make any attempt to preserve the order. However, if you really need to preserve the order, try this:

<snip>
...
</snip>

with open('paths.txt') as pathFile:
    for line in pathFile:
        if format_data(line) in dirStuff:
            print "SUCCESS\n"
            #...
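The order-preserving loop above might be filled in roughly as follows. This is only a sketch for Python 3 with hypothetical data: `io.StringIO` objects stand in for the input files and for temp.txt, and the `seen` set that suppresses repeated matches is an addition of this sketch (it mirrors the de-duplication that the set intersection gives for free). It can be dropped if repeats should be written.

```python
import io

def format_data(x):
    # Same normalization as in the answer.
    return x.rstrip().replace(' ', '').split('.')[0].lower()

# Hypothetical contents standing in for listac.txt and paths.txt.
dir_file = io.StringIO("/gamma\n/alpha\n")
path_file = io.StringIO(
    "/alpha.php/a\n/delta.html/b\n/gamma.php/c\n/alpha.txt/d\n"
)
matches = io.StringIO()  # stands in for open('temp.txt', 'w')

dir_stuff = set(format_data(line) for line in dir_file)

seen = set()  # assumption of this sketch: report each word only once
for line in path_file:
    word = format_data(line)
    # Membership tests on a set are constant time on average, so this is
    # a single pass over paths.txt in its original order.
    if word in dir_stuff and word not in seen:
        seen.add(word)
        matches.write(word + "\n")

print(matches.getvalue())
```

The matches come out in the order they appear in the paths data, here /alpha and then /gamma, regardless of their order in the directory list.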

11 Comments

This will lead to a different output, though, since the order of the lines will be lost. Most probably this won't matter.
I prefer a & b before a.intersection(b). "What is in a & b" == "What is in a & in b".
Too bad that operators like ∩ are not ASCII and thus not (yet) part of Python ;-)
Again, I think that accepting an answer within the first 10 minutes isn't a good idea and should be prohibited. I cringe on seeing that the nested loop stuff is now visibly preferred over the set solution. And I also don't think that user1663160 fares well with it.
Using the .index() stuff is not a good idea for reestablishing the order anyway. Better to just walk linearly through the input that provides the order and use in on a set of the other input to determine whether a match is found.

I'd have to see more of your data set to see why you aren't getting matches. I've refactored some of your code to be more Pythonic.

dirFile = open('listac.txt','r')
pathFile = open('paths.txt','r')
paths = pathFile.readlines()
dirs = dirFile.readlines()

matches = open('temp.txt','w')

for pline in paths:
    p = pline.rstrip('\n').split(".")[0].replace(" ", "")
    for dline in dirs:
        dd = dline.rstrip('\n').replace(" ", "")
        #print p.lower()
        #print dd.lower()
        if p.lower() == dd.lower():
            print "SUCCESS\n"
            matches.write(str(p).lower() + '\n')

2 Comments

+1, but you can avoid the horrible nested loop by converting dirs into a set first (dirs = {line.lower() for line in dirFile} and then just check if p.lower() in dirs), and just iterate over the file directly, avoiding the readlines() and all those rstrip()s entirely.
@TimPietzcker for sure. I would write this completely differently. I think it's more helpful to ease beginners into such things.
