
I've been asked to choose the best of three options in terms of resource optimization.
Suppose I have a big Excel file with thousands of records, and I need to extract that data and insert it into a database. The three options are:

  1. Load everything into a multidimensional array and insert everything with just one complex query;
  2. Load everything into a multidimensional array, then loop over each Excel row and do a simple insert query;
  3. Inside a loop, read each Excel row, put it into an array, and then do a simple insert query on the DB.

This was for an interview test (I labelled it homework, not sure if that's right); I pondered it for a while:

  • Case 1: I could risk an *out_of_memory* error (depending on the machine, of course), but it's the solution that sends the fewest requests to the database. The drawbacks are the huge amounts of memory that have to be allocated, both for the array and on the database side. I know I could convert the Excel file to CSV, but that's not an option here. I'd go for the big array and a bulk insert (roughly the kind of statement sketched after this list), but I fear it would be hard on the database.
  • Case 2: I could risk an *out_of_memory* error when loading everything into the array, but not during the second step. Nonetheless, performing thousands of queries could be a performance hit on the database, and that query is likely to be a candidate for optimization.
  • Case 3: I still have a loop over thousands of records (which I assumed would also take a lot of memory...), and I still have thousands of queries to run (which hits the database).
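
For reference, a rough PHP sketch of what option 1's single complex query might look like; the $rows array, the table name and the columns are invented here, just to make the memory trade-off concrete:

```php
<?php
// Sketch of option 1: the rows are assumed to already sit in a
// multidimensional array $rows; "records(name, amount)" is a made-up table.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$placeholders = [];
$values       = [];
foreach ($rows as $row) {
    $placeholders[] = '(?, ?)';
    $values[]       = $row['name'];
    $values[]       = $row['amount'];
}

// One huge multi-row INSERT: very few round trips, but the whole data set
// lives in $rows *and* $values, and the resulting packet can run into
// server limits such as MySQL's max_allowed_packet.
$sql = 'INSERT INTO records (name, amount) VALUES ' . implode(', ', $placeholders);
$pdo->prepare($sql)->execute($values);
```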

So, I actually chose answer one, and it took me some thinking before doing it.

And it was WRONG. And I actually don't know which of the three was the right one.

Can someone help me with this? Is that answer really so bad? I thought that thousands of insert queries would be "bad", but it seems I'm totally wrong.

EDIT
Clarification: my question is not about the best optimization in absolute terms, but about which of the three options I presented is best; I'm not looking for other alternatives, just an explanation of why I was wrong and which one, with reasoning, is the best answer instead.

4 Comments
  • Playing devil's advocate - Is it not possible for you to save the spreadsheet as a CSV and use MySQL's LOAD DATA LOCAL INFILE to import it? Why use PHP? Commented Jul 11, 2011 at 20:48
  • @Michael I think Damien had to pick from the three options. Commented Jul 11, 2011 at 20:51
  • @Michael It was a short online test for a job interview; I just had to choose among three answers for each question. I hope I'll be able to discuss them during the actual interview and explain my reasoning. Commented Jul 11, 2011 at 20:55
  • @Roland @Damien Yes, I promise to fully read questions in the future. Sorry :( Commented Jul 11, 2011 at 20:58

3 Answers


On the one hand, this seems like a bit of a trick question. The sane answer is to use a bulk import utility like MySQL's mysqlimport or SQL Server's BULK INSERT ... FROM [data_file]. On the other hand, those utilities are essentially doing one of the above three options (albeit in a presumably highly optimized fashion).

Thing is, you have to consider the entirety of the question when answering these. The "best option in terms of resource utilization" is case 3, given that your memory usage will be rather low and that most database platforms are designed to handle a metric crapton of requests per second anyway.
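
To make that concrete, here is a minimal sketch of what case 3 could look like in PHP with a prepared statement; read_excel_rows() is a hypothetical stand-in for whatever row-by-row Excel reader is available, and the table/column names are invented:

```php
<?php
// Sketch of option 3: read one row, insert it, discard it, move on.
// read_excel_rows() is NOT a real library function, just a placeholder
// for an iterator that yields one Excel row at a time.
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO records (name, amount) VALUES (?, ?)');

foreach (read_excel_rows('data.xlsx') as $row) {
    // Only the current row is held in memory; the statement is parsed
    // once and re-executed for every row.
    $stmt->execute([$row['name'], $row['amount']]);
}
```

Memory stays roughly constant no matter how many rows the file contains, which is the point of the "resource utilization" framing.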


2 Comments

Maybe in a real-world application I would have chosen the first one too, with a LOAD DATA INFILE or BULK INSERT and a CSV transform of the Excel file. But in this scenario that option was not available... I too suspected number 3, but I thought that all those loops would have been resource-heavy; it seems I'm totally wrong on that.
Looping over the file is going to have to happen in any case, but with option 3 you are handling the data as you retrieve it, discarding it and moving on. The first two require that you store the data and handle it later (essentially, two loops instead of one).

"Wrong" seems like the wrong answer.

There are a number of tradeoffs, and the "right" answer depends on factors you haven't listed, such as:

  1. Is this a production database?
  2. Is the site online when you insert this data?
  3. Is it OK if row 1 is inserted and visible to the public when row 10,985 isn't?
  4. Are others writing to the table while you are?

Assuming the answer to all of these questions is yes, I'd probably go with the row-at-a-time read and insert. The first two options are going to lock up your table so that no one else will be able to access it. With option 3 you can even meter your rate of inserts.
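
As a sketch of that last point, metering can be as simple as pausing after every batch of single-row inserts; the read_excel_rows() helper, the table name and the batch size are hypothetical here:

```php
<?php
// Row-at-a-time insert with a crude throttle: back off briefly every
// 500 rows so a table that is in active use isn't hammered.
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO records (name, amount) VALUES (?, ?)');

$count = 0;
foreach (read_excel_rows('data.xlsx') as $row) {
    $stmt->execute([$row['name'], $row['amount']]);

    if (++$count % 500 === 0) {
        usleep(100000); // pause for 100 ms after each batch of 500 rows
    }
}
```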

2 Comments

I have no such information; the test was formulated more or less the way I presented it (just in another language ;))
+1 Impossible to answer without knowing if the database is in active use. On one hand, you may risk consistency issues. On the other hand, running one massive query could lock up the DB.

I think the PHP way presupposes case 3, because it minimizes the amount of memory used. It's slow, but it reduces how much memory each operation takes. Loading the whole thing into one big multidimensional array and doing a complex insert takes a lot more resources, and the speedup is not that much better. The question assumes this is a long-running task, so maybe that's what threw you off.

Whoever wrote this doesn't seem to have considered that insert operations are expensive for data loading and are not meant to be used when you have a lot of data to load.

