Python Random Access File

Question

Is there a Python file type for accessing random lines without traversing the whole file? I need to search within a large file, reading the whole thing into memory wouldn't be possible.

Any types or methods would be appreciated.

Can you explain why you want to read a random line, instead of reading the whole file in parts or something. — Alphaneo
– Alphaneo, Commented Feb 15, 2011 at 2:19
To do a binary search within the file, as each line starts with an integer. I want to treat the integers as keys and the rest of the line as the value. — Mantas Vidutis
– Mantas Vidutis, Commented Feb 15, 2011 at 2:21

senderle · Accepted Answer · 2012-08-10 15:29:23Z

21

This seems like just the sort of thing mmap was designed for. A mmap object creates a string-like interface to a file:

>>> f = open("bonnie.txt", "wb")
>>> f.write("My Bonnie lies over the ocean.")
>>> f.close()
>>> f.open("bonnie.txt", "r+b")
>>> mm = mmap(f.fileno(), 0)
>>> print mm[3:9]
Bonnie

In case you were wondering, mmap objects can also be assigned to:

>>> print mm[24:]
ocean.
>>> mm[24:] = "sea.  "
>>> print mm[:]
My Bonnie lies over the sea.

edited Aug 10, 2012 at 15:29

answered Feb 15, 2011 at 3:58

senderle

152k36 gold badges218 silver badges244 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Oleg Over a year ago

But doesn't mmap also loads the entire file into the memory? ` mm = mmap(f.fileno(), 0)` is (as far as I understand) reads the entire file into the memory. Can you elaborate on that please?

senderle Over a year ago

@Oleg my understanding is that it does not. Instead, it makes the hard drive addressable as if it were RAM, via virtual memory mechanisms. I don't know all the details, but it definitely doesn't load the whole file into memory at first, as the Mmap article on wikipedia explains.

David M. · Accepted Answer · 2021-01-09 23:14:59Z

9

You can use linecache:

import linecache
print linecache.getline(your_file.txt, randomLineNumber) # Note: first line is 1, not 0

edited Jan 9, 2021 at 23:14

answered Jul 17, 2014 at 16:50

David M.

4,6383 gold badges22 silver badges28 bronze badges

1 Comment

Krzysztof Przygodzki Over a year ago

This is so cool! And it is a build-in Python module! Thanks!

Gareth McCaughan · Accepted Answer · 2011-02-15 02:22:14Z

6

Since lines can be of arbitrary length, you really can't get at a random line (whether you mean "a line whose number is actually random" or "a line with an arbitrary number, selected by me") without traversing the whole file.

If kinda-sorta-random is enough, you can seek to a random place in the file and then read forward until you hit a line terminator. But that's useless if you want to find (say) line number 1234, and will sample lines non-uniformly if you actually want a randomly chosen line.

answered Feb 15, 2011 at 2:22

Gareth McCaughan

20k1 gold badge43 silver badges64 bronze badges

1 Comment

Gareth McCaughan Over a year ago

I see you're wanting to do a binary search. In that case: yes, seek to a point approximately halfway between your current lower and upper bounds, read a chunk, and look for lines within that chunk. (It's probably obvious, but: Don't try to seek around randomly at the granularity of individual lines, unless they're very long. Read larger pieces. That way you're less dependent on the details of the OS's and Python's buffering, and you avoid some overhead.)

Senthil Kumaran · Accepted Answer · 2011-02-15 02:20:46Z

2

file objects have a seek method which can take a value to particular byte within that file. For traversing through the large files, iterate over it and check for the value in each line. Iterating the file object does not load the whole file content into memory.

answered Feb 15, 2011 at 2:20

Senthil Kumaran

57.3k15 gold badges99 silver badges139 bronze badges

Comments

vy32 · Accepted Answer · 2011-02-15 02:43:07Z

2

Yes, you can easily get a random line. Just seek to a random position in the file, then seek towards the beginning until you hit a \n or the beginning of the file, then read a line.

Code:

import sys,random
with open(sys.argv[1],"r") as f:
    f.seek(0,2)                 # seek to end of file
    bytes = f.tell()
    f.seek(int(bytes*random.random()))

    # Now seek forward until beginning of file or we get a \n
    while True:
        f.seek(-2,1)
        ch = f.read(1)
        if ch=='\n': break
        if f.tell()==1: break

    # Now get a line
    print f.readline()

answered Feb 15, 2011 at 2:43

vy32

30.1k39 gold badges139 silver badges264 bronze badges

2 Comments

Gareth McCaughan Over a year ago

A couple of notes. (1) You should probably open it as a binary file if you're going to go seeking around in it. (2) You'll probably get better performance (because of better interaction with the underlying buffering, and fewer calls to f.seek) if you look forwards from the point you landed at instead of backward. The downside of that is that you'll never get the first line that way, and may hit the end of the file, but you can work around that by saying: if you hit the end of the file, go to the beginning and read a line from there.

vy32 Over a year ago

True. Going forward is slightly higher performance, but then you need the wrap-around logic. I do most of my development on Unix where the distinction between "binary" and "text" files is not meaningful.

Michael Dillon · Accepted Answer · 2011-02-15 02:45:13Z

1

The File object supports seek but make sure that you open them as binary, i.e. "rb".

You may also wish to use the mmap module for random access, particularly if the data is in an internal format already.

answered Feb 15, 2011 at 2:45

Michael Dillon

32.5k7 gold badges76 silver badges107 bronze badges

Comments

John Machin · Accepted Answer · 2011-02-15 04:36:22Z

1

Has fixed-length records? If so, yes, you can implement a binary search algorithm using seeking.

Otherwise, load your file into an SQLlite database. Query that.

answered Feb 15, 2011 at 4:36

John Machin

83.2k12 gold badges147 silver badges193 bronze badges

Collectives™ on Stack Overflow

Python Random Access File

7 Answers 7

2 Comments

1 Comment

1 Comment

Comments

2 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

2 Comments

1 Comment

1 Comment

Comments

2 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related