Is there a Python file type for accessing random lines without traversing the whole file? I need to search within a large file, and reading the whole thing into memory isn't possible.
Any types or methods would be appreciated.
This seems like just the sort of thing mmap was designed for. An mmap object gives you a bytes-like interface to a file (the session below is Python 3):
>>> from mmap import mmap
>>> f = open("bonnie.txt", "wb")
>>> f.write(b"My Bonnie lies over the ocean.")
30
>>> f.close()
>>> f = open("bonnie.txt", "r+b")
>>> mm = mmap(f.fileno(), 0)
>>> print(mm[3:9])
b'Bonnie'
In case you were wondering, mmap objects can also be assigned to, as long as the replacement has the same length as the slice it overwrites:
>>> print(mm[24:])
b'ocean.'
>>> mm[24:] = b"sea.  "
>>> print(mm[:])
b'My Bonnie lies over the sea.  '
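Since the question is about searching, here is a minimal sketch of how the mmap approach might be used to find a pattern and pull out the line that contains it. The function name `find_line` and the demo file are made up for illustration, and the sketch assumes a non-empty file (mmap refuses to map an empty one) with `\n` line endings.

```python
import mmap
import os
import tempfile


def find_line(path, needle):
    """Return the first line containing `needle` (bytes), or None."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            hit = mm.find(needle)
            if hit == -1:
                return None
            # rfind returns -1 when there is no earlier newline, so +1 gives 0.
            start = mm.rfind(b"\n", 0, hit) + 1
            end = mm.find(b"\n", hit)
            if end == -1:
                end = len(mm)
            return mm[start:end]


# Tiny demonstration on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"My Bonnie lies over the ocean.\nMy Bonnie lies over the sea.\n")
    demo_path = tmp.name

demo_line = find_line(demo_path, b"sea")
print(demo_line)
os.remove(demo_path)
```

The operating system pages the file in on demand, so only the parts that `find` actually touches need to be resident in memory.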
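You can use linecache, though be aware that it reads and caches the whole file in memory on first access, so it may not suit a file that is too large to load: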
import linecache
print(linecache.getline("your_file.txt", randomLineNumber))  # Note: first line is 1, not 0
Since lines can be of arbitrary length, you really can't get at a random line (whether you mean "a line whose number is actually random" or "a line with an arbitrary number, selected by me") without traversing the whole file.
If kinda-sorta-random is enough, you can seek to a random place in the file and then read forward until you hit a line terminator. But that's useless if you want to find (say) line number 1234, and will sample lines non-uniformly if you actually want a randomly chosen line.
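If you do need line number 1234, one way to pay the traversal cost only once is to build an index of line-start offsets and then seek straight to any line afterwards. This is a rough sketch under the assumption that the file does not change after indexing; `build_line_index` and `read_line` are hypothetical helper names, not part of any library.

```python
import os
import tempfile


def build_line_index(path):
    """Return a list where entry i is the byte offset at which line i+1 starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:  # streams the file; only one line is held at a time
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_line(path, offsets, lineno):
    """Return line `lineno` (1-based) as bytes, using a prebuilt index."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno - 1])
        return f.readline()


# Tiny demonstration on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")
    demo_path = tmp.name

demo_index = build_line_index(demo_path)
second_line = read_line(demo_path, demo_index, 2)
print(demo_index, second_line)
os.remove(demo_path)
```

The index costs one integer per line, which is usually far smaller than the file itself. With it, `random.choice(offsets)` also gives a uniformly chosen line.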
Yes, you can easily get a random line. Just seek to a random position in the file, then seek towards the beginning until you hit a \n or the beginning of the file, then read a line.
Code:
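import random
import sys

# Binary mode: Python 3 only allows relative seeks on binary files.
with open(sys.argv[1], "rb") as f:
    f.seek(0, 2)                  # seek to end of file
    size = f.tell()
    pos = random.randrange(size)  # random byte offset (file must be non-empty)
    # Now step backward until the beginning of the file or just past a \n
    while pos > 0:
        f.seek(pos - 1)
        if f.read(1) == b"\n":
            break
        pos -= 1
    f.seek(pos)
    # Now get a line (decoded as UTF-8 by default)
    print(f.readline().decode(), end="")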
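Note that picking a random byte offset favours longer lines, since they cover more offsets. If every line should be equally likely and a single pass over the file is acceptable, reservoir sampling is one alternative. This is a different technique from the seek-based code above, shown only as a sketch, and `uniform_random_line` is a made-up name.

```python
import os
import random
import tempfile


def uniform_random_line(path, rng=random):
    """Pick one line uniformly at random in a single pass (reservoir sampling)."""
    chosen = None
    with open(path, "rb") as f:
        for count, line in enumerate(f, start=1):
            # Replace the current pick with probability 1/count.
            if rng.randrange(count) == 0:
                chosen = line
    return chosen  # None if the file is empty


# Tiny demonstration on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\nbb\nccc\n")
    demo_path = tmp.name

picked = uniform_random_line(demo_path)
print(picked)
os.remove(demo_path)
```

This reads the whole file once but keeps only one line in memory at a time, so it still works when the file is too large to load.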