I need to run a regex match over a file, but I'm faced with an unexpected problem: the file is too big to read() or mmap() in one call, File objects don't support the buffer() interface, and the regex module takes only strings or buffers.

Is there an easy way to do this?

  • Does the regex need to match multiple lines, or can you do the equivalent of grep? Commented Apr 5, 2011 at 23:00
  • Big. It would need to match multiple lines. I'm taking a different approach now (not a life or death situation), but I was wondering, isn't there a simpler way of doing this? Commented Apr 5, 2011 at 23:13
  • "Big" is not an answer to my question. The reason I ask, is that if you're on a 64-bit OS (and you should be if you're dealing with "big" files today), then you will be able to mmap() the file. I've done this with files up to 30 GB, in Python, and it works great. Commented Apr 5, 2011 at 23:15
  • @Greg Oh, look at that. No, the file won't get that big :) I'll mmap() it. Post it as an answer (maybe provide some code in case someone else stumbles upon this) and I'll accept it! Commented Apr 5, 2011 at 23:18

1 Answer

The Python mmap module provides a nice Python-friendly way of memory mapping a file. On a 32-bit operating system, the maximum size of the file will be limited to a gigabyte or maybe two, but on a 64-bit OS you will be able to memory map a file of arbitrary size (until storage sizes exceed 2^64 bytes, of course).

I've done this with files of up to 30 GB (the Wikipedia XML dump file) in Python with excellent results.
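
Since code was requested in the comments, here is a minimal sketch of the approach, assuming Python 3, where the re module accepts an mmap object directly as long as the pattern is a bytes pattern. The function name find_matches and its parameters are hypothetical, chosen only for illustration. It collects all matches into a list, which is fine for a modest number of hits but would need rethinking if the match list itself could get huge.

```python
import mmap
import re


def find_matches(path, pattern, flags=0):
    """Return a list of (byte offset, matched bytes) for every match in the file.

    `pattern` must be a bytes pattern (e.g. rb"..."), because the mapped
    file is searched as raw bytes, not as decoded text.
    """
    regex = re.compile(pattern, flags)
    results = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Note that mmap raises ValueError
        # for an empty file, so callers may want to check the size first.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The OS pages the file in on demand, so the regex engine can
            # scan it without the whole file being read into memory.
            for m in regex.finditer(mm):
                results.append((m.start(), m.group(0)))
    return results
```

For a match that spans multiple lines, pass re.DOTALL so that "." also matches newlines, e.g. find_matches("dump.xml", rb"<title>.*?</title>", re.DOTALL), where the file name and pattern are placeholders.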
