4

I have a PHP script that imports CSV files and goes through tens of thousands of iterations. As the script runs over the course of hours, the memory use goes up and up, and if the file is big enough, the script uses up so much memory that the whole machine grinds to a halt.

Right now the only technique I'm using is to unset() everything I can when I'm done with it. I've tried to isolate the part that's using the most memory, but it seems like every function in my script is just one more straw on the camel's back: each one seems to be using "as little memory as possible" on its own, yet collectively they still add up.

So what can I do?

I've tried looking into benchmarking/profiling tools but I haven't found anything good. I'm on a Windows machine, SSHing into a Linux box.

  • You might want to post some code... Commented Dec 16, 2010 at 20:36
  • @ircmaxell I'm not sure that would be helpful. 1) There are thousands of lines of it, spread across dozens of different files. 2) I'm asking for technique advice, not asking for someone to hand me the answer. Commented Dec 16, 2010 at 20:42
  • @Jason Swett - iterating over all the lines in multiple CSVs isn't the concern in itself; what are you trying to achieve? Commented Dec 16, 2010 at 20:48
  • @ajreal What I'm trying to achieve is to have the data from the CSV files in my database. I'm sure you're looking for a more specific answer, but you'll have to ask me a more specific question because I'm not really sure what you're asking. Commented Dec 16, 2010 at 21:35
  • @Jason Swett - If your intention is to get well-prepared CSV data into a database, you should consider loading these CSV files straight into the database and then using SELECT to pick your desired rows into a final table, something like this: stackoverflow.com/questions/4410495/… Commented Dec 16, 2010 at 21:40

2 Answers

4

Ok, since you're looking for techniques, let me list some...

1. Don't read files, stream them

Rather than calling $data = file_get_contents($file), open it with fopen and only read the data you need at that point in time (fgets or fgetcsv, etc). It'll be a touch slower, but it'll use FAR less memory.
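For example, a minimal sketch of the streaming approach (importRow() here is a hypothetical stand-in for whatever inserts one row into your database):

    $handle = fopen($file, 'r');
    if ($handle === false) {
        die("Could not open $file");
    }
    while (($row = fgetcsv($handle)) !== false) {
        importRow($row);   // hypothetical: save this one row, then forget it
    }
    fclose($handle);

Only the current row is ever held in memory; the next fgetcsv() call simply replaces it.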

2. Upgrade to 5.3.4

If you're still on PHP 5.2.x, memory usage will improve significantly by upgrading to 5.3.x (the latest is 5.3.4). 5.3 includes a garbage collector that can reclaim memory tied up in circular references, which 5.2 never frees during execution.
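On 5.3+ you can also nudge the collector yourself. A rough sketch, reusing the streaming loop from point 1 ($handle and importRow() are the same assumptions as above):

    gc_enable();                         // on by default in 5.3; shown here for clarity
    $i = 0;
    while (($row = fgetcsv($handle)) !== false) {
        importRow($row);                 // hypothetical per-row import
        if (++$i % 10000 === 0) {
            gc_collect_cycles();         // collect circular references now, not "eventually"
        }
    }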

3. Don't use anything in the global scope

Don't store any information in the global scope. It's never cleaned up until the end of execution, so it can be a memory leak in and of itself.
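A sketch of the idea (loadChunk() and processChunk() are hypothetical names for your own code):

    // Bad: $rows sits in the global scope until the script ends
    $rows = loadChunk($file);
    processChunk($rows);

    // Better: wrap the work in a function so the data can be freed when it returns
    function importChunk($file) {
        $rows = loadChunk($file);
        processChunk($rows);
    }   // $rows falls out of scope here and can be reclaimed
    importChunk($file);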

4. Don't pass around references

PHP uses copy-on-write. Passing around references only increases the chances that unset() won't actually free the data (because you forgot to unset one of the references). Instead, just pass around the actual variables.
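A quick illustration of why references hurt here (the exact numbers will vary, but the shape won't):

    $big = range(1, 100000);
    $alias = &$big;                 // a second name pointing at the same data

    unset($big);                    // does NOT free the array; $alias still holds it
    echo memory_get_usage(), "\n";

    unset($alias);                  // only now is the memory released
    echo memory_get_usage(), "\n";

Passing by value is cheap anyway: thanks to copy-on-write, no copy is made unless the callee modifies the array.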

5. Profile the code

Profile your code. Add debug hooks to the start and end of each function call, log the memory usage at the entrance and exit of every function, and take the diff of the two numbers; that tells you how much memory each function is responsible for. Take the biggest offenders (those that are called a lot, or use a lot of memory) and clean them up first (lowest-hanging fruit).
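A hand-rolled version of those hooks might look like this (the log path and importFile() are made-up examples; adapt to your own functions):

    function mem_log($label) {
        static $fh = null;
        if ($fh === null) {
            $fh = fopen('/tmp/memlog.tsv', 'a');        // hypothetical log location
        }
        fwrite($fh, $label . "\t" . memory_get_usage(true)
                           . "\t" . memory_get_peak_usage(true) . "\n");
    }

    function importFile($file) {    // one of your existing functions
        mem_log('importFile:enter');
        // ... existing body ...
        mem_log('importFile:exit');
    }

Diff the enter/exit columns afterwards to see which functions hold onto the most memory.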

6. Use a different language

While you can do this with PHP (I have and do quite often), realize it may not be the best tool for the job. Other languages were designed for this exact problem, so why not use one of them (Python or Perl for example)...

7. Use Scratch Files

If you need to keep track of a lot of data, don't store it all in memory the entire time. Create scratch files (temporary files) to store the data when you're not explicitly using it. Load the file only when you're going to use that specific data, and then re-save it and get rid of the variables.
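A sketch of the scratch-file pattern ($lookupTable is a hypothetical big structure you only need occasionally):

    $scratch = tempnam(sys_get_temp_dir(), 'import_');

    file_put_contents($scratch, serialize($lookupTable));    // park it on disk...
    unset($lookupTable);                                      // ...and free the memory

    // later, when it's actually needed again:
    $lookupTable = unserialize(file_get_contents($scratch));
    // ... use it, re-save it if it changed ...
    unlink($scratch);                                         // clean up when done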

8. Extreme cases only: don't use large arrays!

If you need to keep track of a large number of integers (or other simple data types), don't store them in an array! The zval (PHP's internal value structure) carries a fair bit of overhead per element. Instead, if you REALLY need to store a LARGE number of integers (hundreds of thousands or millions), use a string. For 1-byte ints, ord($numbers[$n]) will get the value at index $n, and $numbers[$n] = chr($value); will set it. For multi-byte ints, you'd compute $n * $b to get the start of the sequence, where $b is the number of bytes per value. I stress that this should only be used in the extreme case where you need to store a TON of data. In reality, this would be better served by a scratch file or an actual database (likely a temporary table), so it may not be a great idea...
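Roughly, assuming 0-255 values for the 1-byte case and 32-bit unsigned ints for the multi-byte case ($count, $n and $value are whatever your data dictates):

    // 1-byte values
    $numbers = str_repeat("\0", $count);                  // preallocate $count slots
    $numbers[$n] = chr($value);                           // set slot $n
    $value = ord($numbers[$n]);                           // get slot $n

    // 4-byte values, stored big-endian
    $b = 4;
    $numbers = str_repeat("\0", $count * $b);
    $numbers = substr_replace($numbers, pack('N', $value), $n * $b, $b);   // set
    $parts = unpack('N', substr($numbers, $n * $b, $b));
    $value = $parts[1];                                                    // get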

Good Luck...


5 Comments

Wow, thanks for the thorough advice. I'm already doing 1, 2 and 8, but not any of the others. What I think I might actually do is a two-step thing: 1) break the files into smaller chunks as a temporary workaround and 2) re-write the import script in, like you suggest, Perl or Python. I didn't realize certain languages were built specifically to deal with the import/ETL problem.
@Jason: Well, it's not that they were written specifically to do imports; they were just built as general-purpose languages. PHP was built to make web pages; then people realized it could do other things, and support was added as a side effect. The primary use case of PHP is not long-running applications (hence the lack of memory tools), but others like Perl or Python have been used much more in that realm. How much memory are we talking about, though? A few KB from a function shouldn't cause any harm. You should be able to profile it pretty well if you're using a ton...
I'm reaching 100% memory use (or close to it) part-way through my imports on a machine with 4GB of RAM.
@Jason: WOW! Are you storing that much information? Or is it memory leaks that are getting you that bad?
No, I'm taking pains to make sure I store as little as possible. Once each row is saved in the database, everything to do with that row can be purged from memory as far as I'm concerned. The memory leaks are just that bad.
0

Could you run the script many times and only process a small number of files each time? If you are accumulating totals or something you could store them in a file or memcached so that you can still keep a running total.
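For instance, a rough sketch of keeping a running total across runs in a file (processFile(), $filesForThisRun and the path are hypothetical placeholders):

    $totalsFile = '/tmp/import_totals.dat';
    $totals = is_file($totalsFile)
        ? unserialize(file_get_contents($totalsFile))
        : array();

    foreach ($filesForThisRun as $file) {                // a small batch per invocation
        $totals[$file] = processFile($file);             // hypothetical per-file import
    }

    file_put_contents($totalsFile, serialize($totals));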

1 Comment

I've considered a similar solution. My script is already only processing one file at a time. It's just that most of these CSV files are around 50,000 lines. I could break each file into multiple files. That "fixes" the problem, but that feels like a lame solution.
