
Python noob... please be gentle. In my current program, I have a list of 3 files which may or may not reside in my current directory. If they do reside in my directory, I want to be able to assign them values to be later used in other functions. If the file does not reside in the directory, it should not be assigned values as the file does not exist anyway. The code I have so far is below:

import os, csv

def chkifexists():
    files = ['A.csv', 'B.csv', 'C.csv']
    for fname in files:
        if os.path.isfile(fname):
            if fname == "A.csv":
                hashcolumn = 7
                filepathNum = 5
            elif fname == "B.csv":
                hashcolumn = 15
                filepathNum = 5
            elif fname == "C.csv":
                hashcolumn = 1
                filepathNum = 0
        return fname, hashcolumn, filepathNum


def removedupes(infile, outfile, hashcolumn):
    fname, hashcolumn, filepathNum = chkifexists()
    r1 = file(infile, 'rb')
    r2 = csv.reader(r1)
    w1 = file(outfile, 'wb')
    w2 = csv.writer(w1)
    hashes = set()
    for row in r2:
        if row[hashcolumn] =="": 
            w2.writerow(row)       
            hashes.add(row[hashcolumn])  
        if row[hashcolumn] not in hashes:
            w2.writerow(row)
            hashes.add(row[hashcolumn])
    w1.close()
    r1.close()


def bakcount(origfile1, origfile2):
    '''This function creates a .bak file of the original and does a row count to determine
    the number of rows removed'''
    os.rename(origfile1, origfile1+".bak")
    count1 = len(open(origfile1+".bak").readlines())
    #print count1

    os.rename(origfile2, origfile1)
    count2 = len(open(origfile1).readlines())
    #print count2

    print str(count1 - count2) + " duplicate rows removed from " + str(origfile1) +"!"


def CleanAndPrettify():
    print "Removing duplicate rows from input files..."
    fname, hashcolumn, filepathNum = chkifexists()
    removedupes(fname, os.path.splitext(fname)[0] + "2.csv", hashcolumn)
    bakcount (fname, os.path.splitext(fname)[0] + "2.csv")


CleanAndPrettify()

The problem I am running into is that the code runs through the list and stops at the first valid file it finds.

I'm not sure if I'm completely thinking of it in the wrong way but I thought I was doing it right.

Current output of this program with A.csv, B.csv, and C.csv present in the same directory:

Removing duplicate rows from input files...
2 duplicate rows removed from A.csv!

The desired output is:

Removing duplicate rows from input files...
2 duplicate rows removed from A.csv!
5 duplicate rows removed from B.csv!
8 duplicate rows removed from C.csv!

...and then continue on with the next portion of creating the .bak files. The output of this program without any CSV files in the same directory:

UnboundLocalError: local variable 'hashcolumn' referenced before assignment
4 Comments

  • So now, it's bound to finish executing chkifexists() as soon as it finds the first occurrence. Are you calling chkifexists() multiple times? I am unable to grasp your problem completely. Commented Oct 16, 2011 at 6:54
  • Where are these magical hashcolumn and filepathNum values coming from? What do they mean? Why isn't that information stored in the actual files somehow? Commented Oct 16, 2011 at 9:27
  • Is the order of files significant? Commented Oct 16, 2011 at 9:37
  • @refaim The order of the files is not significant. Commented Oct 16, 2011 at 16:40

3 Answers

2

The checking condition that you are using is not the suggested way to compare two strings in Python. Unless you are explicitly interning the strings, you should not use is for comparison, as there is no guarantee that it will return True; use == instead.
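A small sketch of the difference, written in Python 3 syntax with made-up variable names: == compares the characters of the two strings, while is asks whether both names refer to the very same object, which depends on the interpreter and on how the string was created.

```python
name_from_list = "A.csv"
# Same characters, but built at runtime rather than written as a literal.
name_built = "".join(["A", ".csv"])

# Value comparison: True whenever the characters match.
values_equal = (name_from_list == name_built)

# Identity comparison: the result is implementation-dependent,
# so it should not be used to decide whether two names match.
same_object = (name_from_list is name_built)

print(values_equal, same_object)
```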

Alternatively, you can do the following:

import os

files = ['A.csv', 'B.csv', 'C.csv']
# map each filename to (hashcolumn, filepathNum)
filedict = {}
filedict['A.csv'] = (7, 5)
filedict['B.csv'] = (15, 5)
filedict['C.csv'] = (1, 0)
print [(fname, filedict[fname]) for fname in files if fname in filedict and os.path.isfile(fname)]

2

Of course it stops after the first match, because you return from the function inside the loop. Instead, you should either populate a list in the loop and return it at the end, or create a generator using yield on each iteration and raise StopIteration in case nothing is found. The first approach is simpler and closer to your solution; here it is:

import os, csv

def chkifexists():
    files = ['A.csv', 'B.csv', 'C.csv']
    found = []
    for fname in files:
        if os.path.isfile(fname):
            if fname == "A.csv":
                hashcolumn = 7
                filepathNum = 5
            elif fname == "B.csv":
                hashcolumn = 15
                filepathNum = 5
            elif fname == "C.csv":
                hashcolumn = 1
                filepathNum = 0
            found.append({'fname': fname,
                          'hashcolumn': hashcolumn,
                          'filepathNum': filepathNum})
    return found

found = chkifexists()
if not found:
    print 'No files to scan'
else:
    for f in found:
        print f['fname'], f['hashcolumn'], f['filepathNum']

4 Comments

is for string comparison is bad, unless you are comparing identities and not values.
Yeah, thanks, just copy-pasted that part, didn't even read it. :-) It actually won't even work on pypy (I've faced this issue once in a third-party library and spent some 20 minutes debugging).
Actually the generator solution is simpler. You don't have to explicitly raise StopIteration. Python does that for you when the generator function ends (or return-s). So he would just have to replace return with yield in his code.
@yak can you provide an example with my code using a generator solution? Thanks!
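In response to the request above, here is a rough sketch of the generator variant, written so that it runs on Python 3. It keeps the if/elif chain from the question and swaps return for yield. Note that the yield has to sit inside the os.path.isfile check, otherwise missing files would be yielded too. The directory parameter is a hypothetical addition for this sketch; the question always looks in the current directory.

```python
import os

def chkifexists(directory="."):
    # `directory` is a hypothetical parameter added for this sketch.
    files = ['A.csv', 'B.csv', 'C.csv']
    for fname in files:
        if os.path.isfile(os.path.join(directory, fname)):
            if fname == "A.csv":
                hashcolumn = 7
                filepathNum = 5
            elif fname == "B.csv":
                hashcolumn = 15
                filepathNum = 5
            elif fname == "C.csv":
                hashcolumn = 1
                filepathNum = 0
            # Hand back one result and resume the loop on the next request.
            yield fname, hashcolumn, filepathNum
    # Reaching the end of the function finishes the generator;
    # no explicit StopIteration is needed.

for fname, hashcolumn, filepathNum in chkifexists():
    print("%s %s %s" % (fname, hashcolumn, filepathNum))
```

If none of the files exist, the for loop body simply never runs, so the UnboundLocalError from the question goes away as well.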
1
+50

You have a couple of problems in your code.

First, chkifexists is returning as soon as it finds an existing file, so it never checks any remaining names; also, if no files are found then the hashcolumn and filepathNum are never set -- giving you the UnboundLocalError.

Second, you are calling chkifexists in two places -- from removedupes and from CleanAndPrettify. So removedupes would look up the files again for every existing file it is handed -- not what you want! In fact, since CleanAndPrettify has just verified that the file exists, removedupes should just go with whatever is handed to it.

There are at least three ways to handle the case where no files are found: have chkifexists raise an exception; have a flag in CleanAndPrettify that tracks if files were found; or turn the results of chkifexists into a list which you can then check for emptiness.

In the modified code I moved the files into a dictionary with the name as the key and the value as a tuple of hashcolumn and filepathNum. chkifexists now accepts the filenames to look for as a dictionary, and yields the values when a file is found; if no files were found, a NoFilesFound exception will be raised.

Here's the code:

import os, csv

# store file attributes for easy modifications
# format is 'filename': (hashcolumn, filepathNum)
files = {
        'A.csv': (7, 5),
        'B.csv': (15, 5),
        'C.csv': (1, 0),
        }

class NoFilesFound(Exception):
    "No .csv files were found to clean up"

def chkifexists(somefiles):
    # load all three at once, but only yield them if filename
    # is found
    filesfound = False
    for fname, (hashcolumn, filepathNum) in somefiles.items():
        if os.path.isfile(fname):
            filesfound = True
            yield fname, hashcolumn, filepathNum
    if not filesfound:
        raise NoFilesFound

def removedupes(infile, outfile, hashcolumn, filepathNum):
    # this is now a single-run function
    r1 = file(infile, 'rb')
    r2 = csv.reader(r1)
    w1 = file(outfile, 'wb')
    w2 = csv.writer(w1)
    hashes = set()
    for row in r2:
        if row[hashcolumn] =="": 
            w2.writerow(row)       
            hashes.add(row[hashcolumn])  
        if row[hashcolumn] not in hashes:
            w2.writerow(row)
            hashes.add(row[hashcolumn])
    w1.close()
    r1.close()


def bakcount(origfile1, origfile2):
    '''This function creates a .bak file of the original and does a row count
    to determine the number of rows removed'''
    os.rename(origfile1, origfile1+".bak")
    count1 = len(open(origfile1+".bak").readlines())
    #print count1

    os.rename(origfile2, origfile1)
    count2 = len(open(origfile1).readlines())
    #print count2

    print str(count1 - count2) + " duplicate rows removed from " \
        + str(origfile1) +"!"


def CleanAndPrettify():
    print "Removing duplicate rows from input files..."
    try:
        for fname, hashcolumn, filepathNum in chkifexists(files):
            removedupes(
                   fname,
                   os.path.splitext(fname)[0] + "2.csv",
                   hashcolumn,
                   filepathNum,
                   )
            bakcount (fname, os.path.splitext(fname)[0] + "2.csv")
    except NoFilesFound:
        print "no files to clean up"

CleanAndPrettify()

Unable to test as I don't have the A, B, and C .csv files, but hopefully this will get you pointed in the right direction. As you can see, the raise NoFilesFound option uses the flag method to keep track of files not being found; here is the list method:

def chkifexists(somefiles):
    # load all three at once, but only yield them if filename
    # is found
    for fname, (hashcolumn, filepathNum) in somefiles.items():
        if os.path.isfile(fname):
            yield fname, hashcolumn, filepathNum

def CleanAndPrettify():
    print "Removing duplicate rows from input files..."
    found_files = list(chkifexists(files))
    if not found_files:
        print "no files to clean up"
    else:
        for fname, hashcolumn, filepathNum in found_files:
            removedupes(...)
            bakcount(...)
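The code above targets Python 2, where file() and the print statement exist. For readers on Python 3, a minimal sketch of removedupes might look like the following. It is not part of the original answer: it assumes text-mode files opened with newline='' (as the csv module documentation recommends), and returning the number of dropped rows is an extra added here for convenience. The filtering rule is the same as in the original: rows with an empty hash column are always kept, and other rows are kept only the first time their hash is seen.

```python
import csv

def removedupes(infile, outfile, hashcolumn):
    """Copy infile to outfile, dropping rows whose hash was already seen.

    Returns the number of rows dropped (an addition for this sketch).
    """
    hashes = set()
    removed = 0
    with open(infile, newline='') as src, \
         open(outfile, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            key = row[hashcolumn]
            # Empty hashes are always written; others only on first sight.
            if key == "" or key not in hashes:
                writer.writerow(row)
                hashes.add(key)
            else:
                removed += 1
    return removed
```

Using with blocks closes both files even if a row is shorter than expected and row[hashcolumn] raises an IndexError.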

2 Comments

This is exactly what I was looking for. One last thing before I accept this: if no files are present, how would I go about printing "no files to clean up" in the CleanAndPrettify function? I've tried an if statement: if chkifexists(files) is None: print "no files to clean up" else: (the rest of the function goes here)
Thank you for the help. I tried to award bounty but it said wait an hour. I'll do so once I log back in this afternoon!
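On the is None check tried in the comment above: calling a generator function always returns a generator object, even when it will never yield anything, so the comparison with None cannot succeed. A minimal illustration in Python 3 syntax, using a stand-in generator rather than the real chkifexists:

```python
def chkifexists(names):
    # Stand-in for the answer's generator: yields nothing for an empty input.
    for name in names:
        yield name

gen = chkifexists([])
is_none = gen is None   # a generator object exists even if it yields nothing

# Materialising the results into a list makes emptiness easy to test.
found = list(gen)
if not found:
    print("no files to clean up")
```

This is why the list method shown in the answer collects the results first and then checks the list for emptiness.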
