0

I use PHP for a lot of data processing (I realize I'm probably pushing into territory where I should be using other languages and/or techniques).

I'm doing entity extraction with a PHP process that loads into memory an array of the ngrams to look for. That array uses 3GB of memory and takes about 20 seconds to load each time I launch a process. I generate it once locally on the machine, and each process loads it from a .json file. Each process then tokenizes the text it's processing and runs array_intersect on the token array and the ngram array to extract entities.

Is there any way to preload this into memory on the machine that is running all these processes and then share the resource across all the processes?

Since it's probably not possible with PHP: What type of languages/methods should I be researching to do this sort of entity extraction more efficiently?

  • I'd start with in-memory DB solutions. Commented Sep 5, 2014 at 21:20
  • Will a lookup in a MySQL in MEMORY table containing these ngrams to look for be comparable in speed to an array lookup in PHP? This is probably a ... dunno 'til you test it ... situation. Commented Sep 5, 2014 at 21:22

2 Answers

1

If the array never gets modified after it's loaded, then you could use pcntl_fork() and fork off a bunch of copies of the script. With copy-on-write semantics, they'd all be reading from the exact same memory copy of the array.

However, as soon as a process modifies the array, you'll pay a huge penalty, because the memory pages it touches get copied into that forked child's own memory space. This would be especially true if any of the scripts finish their run early: when a PHP process starts its shutdown cleanup, that counts as a write on the array's memory space and causes the copying.
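To make the pattern concrete, here is a minimal sketch of load-once-then-fork, written in Python purely for illustration (in PHP, pcntl_fork() would take the place of os.fork()). The names load_ngrams, extract and run_workers are hypothetical, and the three-entry set stands in for the real 3GB table. It needs a POSIX system.

```python
import os

def load_ngrams():
    # Hypothetical stand-in for the slow load of the real n-gram table.
    return frozenset(["new york", "san francisco", "los angeles"])

def extract(ngrams, text):
    # Build the bigrams of the text and keep those present in the table.
    tokens = text.lower().split()
    bigrams = {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
    return sorted(bigrams & ngrams)

def run_workers(texts):
    ngrams = load_ngrams()  # loaded once, in the parent, before any fork
    children = []
    for text in texts:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: reads the table inherited from the parent, never reloads it.
            os.close(read_fd)
            found = extract(ngrams, text)
            os.write(write_fd, "\n".join(found).encode("utf-8"))
            # Exit immediately, skipping normal interpreter cleanup, so the
            # child does not walk over (and write to) the shared pages on exit.
            os._exit(0)
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, encoding="utf-8") as pipe:
            data = pipe.read()
        os.waitpid(pid, 0)
        results.append(data.split("\n") if data else [])
    return results
```

Calling run_workers with a list of texts returns one list of matched n-grams per text. One caveat with this particular illustration: CPython updates reference counts even on reads, so some shared pages will still be copied over time. The point is only that the children never repeat the load step.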

1

In your case, the best way of sharing might be read-only mmap access.

I don't know if this is possible in PHP. A lot of languages will allow you to mmap a file into memory - and your operating system will be smart enough to realize that read-only maps can be shared. Also, if you don't need all of it, the operating system can reclaim the memory, and load it again from disk as necessary. In fact, it may even allow you to map more memory than you physically have.
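As a rough sketch of what that could look like in a language with mmap support, here is a Python version. The file layout is my own assumption, not something from the question: the n-grams are stored as sorted, NUL-padded records of a fixed width (RECORD), so a process can binary-search the mapped file directly instead of parsing it into an array first. build_index, open_index and contains are hypothetical names.

```python
import mmap

RECORD = 32  # assumed fixed record width in bytes; pick one that fits your n-grams

def build_index(path, ngrams):
    """Write the n-grams once, as sorted, NUL-padded fixed-width records."""
    keys = sorted(g.encode("utf-8") for g in ngrams)
    with open(path, "wb") as f:
        for key in keys:
            if len(key) > RECORD:
                raise ValueError("n-gram longer than RECORD bytes")
            f.write(key.ljust(RECORD, b"\0"))

def open_index(path):
    """Map the file read-only; the OS can share these pages between processes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Mapping an empty file raises ValueError.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def contains(mm, ngram):
    """Binary search over the fixed-width records in the mapping."""
    key = ngram.encode("utf-8").ljust(RECORD, b"\0")
    lo, hi = 0, len(mm) // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        record = mm[mid * RECORD:(mid + 1) * RECORD]
        if record == key:
            return True
        if record < key:
            lo = mid + 1
        else:
            hi = mid
    return False
```

Each worker process would call open_index on the same file and then contains for every candidate n-gram. Startup is close to instant because nothing is parsed up front, and only the pages a lookup actually touches have to be resident.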

mmap is really elegant. But nevertheless, dealing with such mapped data in PHP will likely be a pain, and sloooow. In general PHP is slow. In benchmarks, it is common to see PHP come in at 40-50 times the runtime of a good C program. This is much worse than e.g. Java, where a good Java program is only about twice as slow as highly optimized C; there it may pay off to have the powerful development tools of Java as opposed to having to debug low-level C code. But PHP does not have any key benefit: it is neither elegant to write, nor does it have a superior toolchain, nor is it fast...
