1

I have few filenames in a list called data. I want to read the contents of the file and check if a given text (example - orange) appears in the file. My filenames are appended to the list in a sequential order i.e if given text "orange", appears in file pi.txt (index 2), it will be present in all the files after index 2 as well and off course i want to get the index or filename where text "orange" appeared first.

I have more than thousand files in a list, therefore i want to use binary search.

data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"

def binary_search(a, x):
    lo = 0
    hi = len(a)

    while lo < hi:
        mid = (lo + hi) // 2

        if not x in open(a[mid]).read():
            lo = mid + 1
        elif x in open(a[mid]).read():
            hi = mid
        elif mid > 0 and x in open(a[mid-1]).read():
            hi = mid
        else:
            return mid

    return -1

print(binary_search(data, target))



$ cat ae.txt
papaya
guava

$ cat ac.txt 
mango
durian
papaya
guava

$ cat pi.txt 
orange
papaya
guava

$ cat ad.txt 
orange
papaya
guava

$ cat mm.txt 
orange
papaya
guava

$ cat ab.txt 
orange
papaya
guava
2
  • What is your question? Commented Jun 26, 2019 at 11:19
  • if you run the code i pasted above, it does not give the expected result. In my case , result should be 2 because "orange" first appears in "pi.txt". Thanks. Commented Jun 26, 2019 at 11:23

3 Answers 3

1

I think there is a bit too many if conditions, you can manage to get the expected result like this :

data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"

def binary_search(a, x):
    lo = 0
    hi = len(a)

    while lo < hi:
        mid = (lo + hi) // 2
        print(mid)
        if not x in open(a[mid]).read():
            lo = mid + 1

        elif x in open(a[mid]).read():
            hi = mid
        if lo == hi:
            return lo

        print("low : {}; high : {}".format(lo,hi))

    return -1
index = binary_search(data, target)
print("The index where we first found the word orange is {}, the file name is {}".format(index,data[index]))

The index where we first found the word orange is 2, the file name is pi.txt
Sign up to request clarification or add additional context in comments.

1 Comment

your solution works as expected. Thank you for your help !
1

Your binary search is not really looking for equality so it could be simplified a bit:

def binary_search(files, string):
    lo,hi  = 0,len(files)-1
    while hi>=lo:
        mid     = (hi+lo)//2
        if string in open(files[mid]).read(): 
            hi = mid-1
        else: 
            lo = mid+1
    return lo

Since there is no equality check, hi and lo will hit the stop condition (hi>=lo) at which point lo will be on the index of the first match or at len(files) if there are no matches.

1 Comment

your solution works as expected. Thank you for your help !
0

Binary search on files only works if your files are already sorted by the search key you're using, meaning that file X[n+1] mustn't have data lexicographically smaller than file X[n]. In this case, you'll have to go through every file manually until you go thorough all of them... or build a dictionary file like this:

'ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt'
durian 1
guava 0-5
mango 1
orange 2-5
papaya 0-5

First line denotes files included and gives them indices via positioning(e.g. 'ae.txt' is at position 0). Other lines denote indices of files that include each word.

From here, you can read the first line for file names, binary search your way through the lines to find the word you're looking for (you should probably find a way to read a specific line in O(1), though - or put them in separate dictionary files for e.g. individual letters if there's too much of them). Then, determine if the file name's index (its location in the first line) is denoted in the word's line.

Making code that writes the initial file and updates it accordingly seems simple enough, but I can help you with that if you want to.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.