
Is there a way to search text files in Python for a phrase without having to use for loops and if statements such as:

for line in file:
    if myphrase in line:
        do_something()

This seems like a very inefficient way to go through the file, as it does not run in parallel (if I understand correctly) but rather iteratively. Is re.search a more efficient way to do it?

  • Have you considered regexes? It'd probably still work linearly (i.e. a single thread scanning over the text as a series) but any loops would be hidden from you. Loops aren't as bad as all that, in fact, they're kind of what computers do best. Alternately, you could force the parallelisation of a series of threads working on different portions of your file, but then you'd have to manage their synchronisation and cross-talk, which might outweigh the benefit of what you're trying to do in the first place. Commented Feb 28, 2020 at 14:15
  • If you have efficiency issues, maybe you should consider pre-processing of some kind; it could help you. Commented Feb 28, 2020 at 14:17
  • @ThomasKimber I have looked at them, but have heard they can also be slow on large files. It just struck me as odd that there is no parallelisation method here. Commented Feb 28, 2020 at 14:21
  • I doubt that regular expressions will save you any time. Some have pointed out that it's faster to read the contents of the file all at once, which is true for reasonably sized files. If it's a large file, however, reading the whole file into memory will also cause a performance hit. I personally would take the simple approach and do exactly what you have done here. It's convenient because it splits by newlines, which presumably are not part of your phrase. If you have to optimize by reading in larger chunks, you have to beware that you might be splitting your phrase at the chunk boundaries Commented Feb 28, 2020 at 14:22

3 Answers


Reading a sequential file (e.g. a text file) is always going to be a sequential process. Unless you can store it in separate chunks or skip ahead somehow it will be hard to do any parallel processing.

What you could do is separate the inherently sequential reading process from the searching process. This requires that the file content be naturally separated into chunks (e.g. lines) across which the search is not intended to find a result.

The general structure would look like this (a rough code sketch follows the list):

  • initiate a list of processing threads with input queues
  • read the file line by line and accumulate chunks of lines up to a given threshold
  • when the threshold or the end of file is reached, add the chunk of lines to the next processing thread's input queue
  • wait for all processing threads to be done
  • merge results from all the search threads.
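A minimal sketch of that structure, assuming the search is a plain substring test and the phrase never spans a line boundary (the phrase, chunk size, worker count, and file name are placeholders, not anything from the question). Note that in CPython the GIL limits how much CPU parallelism plain threads can deliver, so a real speed-up may require multiprocessing instead:

import queue
import threading

PHRASE = "my phrase"    # assumed search term
CHUNK_SIZE = 10_000     # lines per work item (placeholder)
NUM_WORKERS = 4         # number of search threads (placeholder)

def worker(in_q, out_list):
    # Pull chunks of lines off the input queue and collect matching lines.
    while True:
        chunk = in_q.get()
        if chunk is None:        # sentinel: no more work
            break
        out_list.extend(line for line in chunk if PHRASE in line)

in_q = queue.Queue()
per_thread_results = [[] for _ in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(in_q, res))
           for res in per_thread_results]
for t in threads:
    t.start()

# The inherently sequential part: read the file and hand out chunks of lines.
chunk = []
with open("big_file.txt") as f:
    for line in f:
        chunk.append(line)
        if len(chunk) >= CHUNK_SIZE:
            in_q.put(chunk)
            chunk = []
if chunk:
    in_q.put(chunk)

for _ in threads:
    in_q.put(None)               # one sentinel per worker
for t in threads:
    t.join()

# Merge results from all the search threads.
matches = [line for res in per_thread_results for line in res]
print(len(matches), "matching lines")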

In this era of solid state drives and fast memory busses, you would need some pretty compelling constraining factors to justify going to that much trouble.

You can figure out your minimum processing time by measuring how long it takes to read (without processing) all the lines in your largest file. It is unlikely that the search process for each line will add much to that time given that I/O to read the data (even on an SSD) will take much longer than the search operation's CPU time.
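For example, a rough baseline measurement could look like this (the file name is a placeholder):

import time

start = time.perf_counter()
with open("big_file.txt") as f:
    for line in f:
        pass                     # read only, no processing
print("Reading alone took", time.perf_counter() - start, "seconds")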


Let's say you have the file:

Hello World!
I am a file.

Then:

file = open("file.txt", "r")
x = file.read()
# x is now: "Hello World!\nI am a file."
# Just one string means that you can search it faster.
# Remember to close the file when you are done:
file.close()

Edit:

To actually test how long it takes:

import time
start_time = time.time()
# Read File here
end_time = time.time()
print("This method took " + str(end_time - start_time) + " seconds to run!")

Another Edit:

I read some other articles and did the test, and the fastest checking method if you're just trying to find True or False is:

x = file.read() # "Hello World!\nI am a file."
tofind = "Hello"
tofind_in_x = tofind in x
# True

This method was faster than regex in my tests by quite a bit.
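If you want to reproduce that comparison yourself, a rough sketch using timeit could look like this (the phrase, file name, and repetition count are assumptions; results will vary with your data):

import re
import timeit

with open("file.txt") as f:
    text = f.read()

phrase = "Hello"
pattern = re.compile(re.escape(phrase))

t_in = timeit.timeit(lambda: phrase in text, number=10_000)
t_re = timeit.timeit(lambda: pattern.search(text) is not None, number=10_000)

print("substring 'in':", t_in)
print("re.search    :", t_re)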

2 Comments

Hmm, this is a very interesting way of looking at it. But I fear it would become very slow with large files?
Depending on the file size, it will probably be slower no matter what you do. It may be best to test by creating a large file and trying the different methods. You could measure the time they take with the edit to my answer.

The tool you need is called regular expressions (regex).

You can use it as follows:

import re

if re.search(myphrase, myfile.read()):  # re.search scans the whole text; re.match only matches at the start
    do_something()

1 Comment

I have previously read people saying that regex is slow on large files. Is this true, and if so, what is considered a large file?
