
I have a Java program that needs to insert a large number of largish rows into a SQL Server database. The number of rows is 800k, and the size of each is around 200 bytes.

Currently they are divided into batches of 50, and each batch is inserted using a single statement. (We've confirmed via jTDS logging that a single sp_exec call is used for each batch.) Tuning the batch size between 25 and 250 does not seem to have any significant effect; 50 is approximately the optimum.
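
For concreteness, a single-statement batch insert might be built roughly like the sketch below. This is only an illustration: the postings(id, body) table, the Posting getters, and the explicit Connection parameter are placeholders I've made up, since the real implementation is only stubbed further down.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

/** Sketch of a single-statement batch insert (placeholder schema and getters). */
private void insertBatchPostings(Connection conn, Collection<Posting> postings)
        throws PostingUpdateException
{
    // Build "INSERT INTO postings (id, body) VALUES (?, ?), (?, ?), ..." covering the whole batch
    StringBuilder sql = new StringBuilder("INSERT INTO postings (id, body) VALUES ");
    for (int i = 0; i < postings.size(); i++) {
        sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
    }

    try (PreparedStatement stmt = conn.prepareStatement(sql.toString())) {
        int param = 1;
        for (Posting p : postings) {
            stmt.setLong(param++, p.getId());
            stmt.setString(param++, p.getBody());
        }
        stmt.executeUpdate();   // one statement, hence one round trip per batch
    } catch (SQLException ex) {
        throw new PostingUpdateException("Batch insert failed: " + ex.getMessage(), ex);
    }
}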

I've experimented with dividing the batches into (say) 5 groups, and processing each group in parallel using threads. This is significantly faster -- more than twice as fast with 5 threads.

My question is about making the thread usage robust. In particular, if any of the batches fails, an exception will be thrown. I want that exception to be caught and passed up to the caller, and I want to be 100% sure that the other threads have finished (either aborted or completed) before we pass it up, because when the program later recovers from the exception, we don't want unexpected rows to continue arriving in the table.

Here's what I've done:

/** Method to insert a single batch. */
private void insertBatchPostings(Collection<Posting> postings) throws PostingUpdateException
{
    // insert the batch using a single INSERT invocation
    // throw a PostingUpdateException if anything goes wrong
}

private static final int insertionThreads = 5;

/** Method to insert a collection of batches in parallel, using the above. */
protected void insertBatchPostingsThreaded(Collection<Collection<Posting>> batches) throws PostingUpdateException
{
    ExecutorService pool = Executors.newFixedThreadPool(insertionThreads);
    Collection<Future<?>> futures = new ArrayList<Future<?>>(batches.size());

    for (final Collection<Posting> batch : batches) {
        Callable<Void> c = new Callable<Void>() {
            public Void call() throws PostingUpdateException {
                insertBatchPostings(batch);
                return null;
            }
        };
        /* So we submit each batch to the pool, and keep a note of its Future so we can check it later. */
        futures.add(pool.submit(c));
    }

    /* Pool is running, indicate that no further work will be submitted to it. */
    pool.shutdown();

    /* Check all the futures for problems. */
    for (Future<?> f : futures) {
        try {
            f.get();
        } catch (InterruptedException ex) {
            throw new PostingUpdateException("Interrupted while processing insert results: " + ex.getMessage(), ex);
        } catch (ExecutionException ex) {
            pool.shutdownNow();
            throw (PostingUpdateException) ex.getCause();
        }
    }
}

By the time this method returns, I want to guarantee that all the worker threads are dormant.

Questions

(I'm trying to clarify what exactly I'm asking.)

  1. Is the above code completely robust, in that no insertions will continue to run after insertBatchPostingsThreaded returns?
  2. Are there better and simpler ways of using the Java concurrency features to achieve this? My code looks ridiculously overcomplicated to me (raising the suspicion of missed edge cases).
  3. What is the best way to get it to fail as soon as any one thread fails?

I'm not a natural Java programmer so I'm hoping to end up with something that doesn't advertise that fact. :)

9 Comments
  • Augh. Can you use generics to make your code more readable? Commented Mar 13, 2012 at 0:03
  • @Edmund disabling table indexes for the batch insert improves speed. You then have to trigger index recalculation. Commented Mar 13, 2012 at 0:24
  • @Louis - I copied it verbatim from the working program to ensure it was accurate; it's a legacy app. But I have attempted to translate it to modern Java. I presume the for loops offended you most, but I've translated the collection types, too. Commented Mar 13, 2012 at 0:29
  • @hidralisk - This question is just one part of a larger investigation into performance. The destination table is large (about one billion rows), won't it take rather long to rebuild all the indexes on that? It's also in use by other processes, which benefit from indexes when selecting existing rows on it. One idea we had though was to insert into a temporary table, and then copy from that to the destination in one statement. Do you think that will result in more efficient index updates? Commented Mar 13, 2012 at 0:33
  • @Edmund updating the index while inserting is not efficient. If you have a big update (10k+ rows) it is usually better to disable the index, insert the data, and rebuild the index. This is databases 101. I have never had a 1 billion row table, so you should do some benchmarks. If you have other processes reading the table while you insert, then disabling the index might not be an option. In the projects I worked on, these bulk inserts were done during off-peak hours, so we suspended all activity on the affected table (you can do it by having a flag in a separate table, which interested processes check first). Commented Mar 13, 2012 at 15:13

1 Answer


Guava's Futures.successfulAsList takes a list of futures as input and returns a future "whose value is a list containing the values of all its successful input futures." You could call get() on that generated Future, and then walk through your original future list to check for any failures.
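
A minimal sketch of how insertBatchPostingsThreaded might look rewritten this way, assuming Guava is on the classpath and reusing Posting, PostingUpdateException, insertBatchPostings and insertionThreads from the question:

import java.util.*;
import java.util.concurrent.*;
import com.google.common.util.concurrent.*;   // ListenableFuture, ListeningExecutorService, Futures, MoreExecutors

protected void insertBatchPostingsThreaded(Collection<Collection<Posting>> batches)
        throws PostingUpdateException
{
    ListeningExecutorService pool =
            MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(insertionThreads));
    List<ListenableFuture<Void>> futures =
            new ArrayList<ListenableFuture<Void>>(batches.size());

    for (final Collection<Posting> batch : batches) {
        futures.add(pool.submit(new Callable<Void>() {
            public Void call() throws PostingUpdateException {
                insertBatchPostings(batch);
                return null;
            }
        }));
    }
    pool.shutdown();

    try {
        // Waits for every task to finish; failed tasks just contribute null to the result list
        Futures.successfulAsList(futures).get();
    } catch (InterruptedException ex) {
        throw new PostingUpdateException("Interrupted while inserting postings", ex);
    } catch (ExecutionException ex) {
        // successfulAsList's future normally succeeds even when inputs fail,
        // but get() still declares this checked exception
        throw new PostingUpdateException("Unexpected failure: " + ex.getMessage(), ex);
    }

    // Every task is done by now, so walk the originals to surface the first real failure
    for (Future<Void> f : futures) {
        try {
            f.get();
        } catch (InterruptedException ex) {
            throw new PostingUpdateException("Interrupted while checking insert results", ex);
        } catch (ExecutionException ex) {
            // same cast as in the question; a RuntimeException cause would need separate handling
            throw (PostingUpdateException) ex.getCause();
        }
    }
}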


4 Comments

My other requirement (which I've added to the question) is that if any batch fails, the remaining tasks in the pool can be cancelled or aborted so that the whole operation fails quickly. Is there anything in Guava that will help with that?
Ah. I did not see that you wanted all the other threads to fail. It wouldn't be too difficult, though, to add a callback to each future to cancel all the other futures, with ListenableFuture...
So would the ListenableFuture call the listener, which would in turn call shutdownNow on the pool? Looking at the Java source, it seems shutdownNow makes an effort to cancel all queued tasks, so it probably already does that in my code, but if I can make the code cleaner using something from Guava then I'm all for that.
I was thinking that you would call shutdownNow on the pool. I'm not quite positive that that represents an improvement, though.
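
To make the cancel-on-first-failure idea from this comment thread concrete, here is a sketch of the submission loop (again assuming Guava, with names reused from the question): it registers a FutureCallback on each submitted future and calls shutdownNow() on the pool as soon as any batch fails. Note that shutdownNow() only interrupts running tasks and discards queued ones; whether a running JDBC insert actually stops when interrupted depends on the driver.

// needs: java.util.*, java.util.concurrent.*, com.google.common.util.concurrent.*
final ListeningExecutorService pool =
        MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(insertionThreads));
List<ListenableFuture<Void>> futures = new ArrayList<ListenableFuture<Void>>(batches.size());

for (final Collection<Posting> batch : batches) {
    ListenableFuture<Void> f = pool.submit(new Callable<Void>() {
        public Void call() throws PostingUpdateException {
            insertBatchPostings(batch);
            return null;
        }
    });
    // Fail fast: the first failed batch shuts the pool down, interrupting
    // running tasks and discarding anything still queued.
    Futures.addCallback(f, new FutureCallback<Void>() {
        public void onSuccess(Void result) { /* nothing to do */ }
        public void onFailure(Throwable t) { pool.shutdownNow(); }
    }, MoreExecutors.directExecutor());
    futures.add(f);
}
pool.shutdown();
// ... then check the futures as before (or via successfulAsList) to surface the failure.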
