
Is there a way to search text files in Python for a phrase without having to use for loops and if statements such as:

for line in file:
    if myphrase in line:
        do_something()

This seems like a very inefficient way to go through the file, as it does not run in parallel (if I understand correctly) but rather iteratively. Is re.search a more efficient way to do it?

  • Have you considered regexes? It'd probably still work linearly (i.e. a single thread scanning over the text as a series) but any loops would be hidden from you. Loops aren't as bad as all that, in fact, they're kind of what computers do best. Alternately, you could force the parallelisation of a series of threads working on different portions of your file, but then you'd have to manage their synchronisation and cross-talk, which might outweigh the benefit of what you're trying to do in the first place. Commented Feb 28, 2020 at 14:15
  • If you have efficiency issues, maybe you should consider pre-processing of some kind; it could help you. Commented Feb 28, 2020 at 14:17
  • @ThomasKimber I have looked at them, but have heard they can also be slow on large files. It just struck me as odd that there is no parallelisation method here. Commented Feb 28, 2020 at 14:21
  • I doubt that regular expressions will save you any time. Some have pointed out that it's faster to read the contents of the file all at once, which is true for reasonably sized files. If it's a large file, however, reading the whole file into memory will also cause a performance hit. I personally would take the simple approach and do exactly what you have done here. It's convenient because it splits by newlines, which presumably are not part of your phrase. If you have to optimize by reading in larger chunks, you have to beware that you might be splitting your phrase at the chunk boundaries Commented Feb 28, 2020 at 14:22

3 Answers


Reading a sequential file (e.g. a text file) is always going to be a sequential process. Unless you can store it in separate chunks or skip ahead somehow it will be hard to do any parallel processing.

What you could do is separate the inherently sequential reading process from the searching process. This requires that the file content be naturally separated into chunks (e.g. lines) across which the search is not intended to find a result.

The general structure would look like this (a rough code sketch follows the list):

  • initiate a list of processing threads with input queues
  • read the file line by line and accumulate chunks of lines up to a given threshold
  • when the threshold or the end of file is reached, add the chunk of lines to the next processing thread's input queue
  • wait for all processing threads to be done
  • merge results from all the search threads.
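A minimal sketch of that structure, assuming the search is a plain substring test and the phrase never spans a line boundary (the phrase, chunk size, worker count, and file name are placeholders, not anything from the question). Note that in CPython the GIL limits how much CPU parallelism plain threads can deliver, so a real speed-up may require multiprocessing instead:

import queue
import threading

PHRASE = "my phrase"    # assumed search term
CHUNK_SIZE = 10_000     # lines per work item (placeholder)
NUM_WORKERS = 4         # number of search threads (placeholder)

def worker(in_q, out_list):
    # Pull chunks of lines off the input queue and collect matching lines.
    while True:
        chunk = in_q.get()
        if chunk is None:        # sentinel: no more work
            break
        out_list.extend(line for line in chunk if PHRASE in line)

in_q = queue.Queue()
per_thread_results = [[] for _ in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(in_q, res))
           for res in per_thread_results]
for t in threads:
    t.start()

# The inherently sequential part: read the file and hand out chunks of lines.
chunk = []
with open("big_file.txt") as f:
    for line in f:
        chunk.append(line)
        if len(chunk) >= CHUNK_SIZE:
            in_q.put(chunk)
            chunk = []
if chunk:
    in_q.put(chunk)

for _ in threads:
    in_q.put(None)               # one sentinel per worker
for t in threads:
    t.join()

# Merge results from all the search threads.
matches = [line for res in per_thread_results for line in res]
print(len(matches), "matching lines")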

In this era of solid state drives and fast memory busses, you would need some pretty compelling constraining factors to justify going to that much trouble.

You can figure out your minimum processing time by measuring how long it takes to read (without processing) all the lines in your largest file. It is unlikely that the search process for each line will add much to that time given that I/O to read the data (even on an SSD) will take much longer than the search operation's CPU time.
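For example, a rough baseline measurement could look like this (the file name is a placeholder):

import time

start = time.perf_counter()
with open("big_file.txt") as f:
    for line in f:
        pass                     # read only, no processing
print("Reading alone took", time.perf_counter() - start, "seconds")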


Let's say you have the file:

Hello World!
I am a file.

Then:

file = open("file.txt", "r")
x = file.read()
# x is now: "Hello World!\nI am a file."
# Just one string means that you can search it faster.
# Remember to close the file when you are done:
file.close()

Edit:

To actually test how long it takes:

import time
start_time = time.time()
# Read File here
end_time = time.time()
print("This method took " + str(end_time - start_time) + " seconds to run!")

Another Edit:

I read some other articles and did the test, and the fastest checking method if you're just trying to find True or False is:

x = file.read() # "Hello World!\nI am a file."
tofind = "Hello"
tofind_in_x = tofind in x
# True

This method was faster than regex in my tests by quite a bit.
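If you want to reproduce that comparison yourself, a rough sketch using timeit could look like this (the phrase, file name, and repetition count are assumptions; results will vary with your data):

import re
import timeit

with open("file.txt") as f:
    text = f.read()

phrase = "Hello"
pattern = re.compile(re.escape(phrase))

t_in = timeit.timeit(lambda: phrase in text, number=10_000)
t_re = timeit.timeit(lambda: pattern.search(text) is not None, number=10_000)

print("substring 'in':", t_in)
print("re.search    :", t_re)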

2 Comments

Hmm, this is a very interesting way of looking at it. But I fear it would become very slow with large files?
Depending on the file size, it will probably be slower no matter what you do. It may be best to test by creating a large file and trying the different methods. You could measure the time they take with the edit to my answer.

The tool you need is called regular expressions (regex).

You can use it as follows:

import re

if re.search(myphrase, myfile.read()):  # re.search scans the whole text; re.match only matches at the start
    do_something()

1 Comment

I have previously read people saying that regex is slow on large files. Is this true, and if so, what is considered a large file?
