Parallel XML Parsing in Java

Question

I'm writing an application which processes a lot of xml files (>1000) with deep node structures. It takes about six seconds with with woodstox (Event API) to parse a file with 22.000 Nodes.

The algorithm is placed in a process with user interaction where only a few seconds response time are acceptable. So I need to improve the strategy how to handle the xml files.

My process analyses the xml files (extracts only a few nodes).
Extracted nodes are processed and the new result is written into a new data stream (resulting in a copy of the document with modified nodes).

Now I'm thinking about a multithreaded solution (which scales better on 16 Core+ hardware). I thought about the following stategies:

Creating multiple parsers and running them in parallel on the xml sources.
Rewriting my parsing algorithm thread-save to use only one instance of the parser (factories, ...)
Split the XML source into chunks and assign the chunks to multiple processing threads (map-reduce xml - serial)
Optimizing my algorithm (better StAX parser than woodstox?) / Using a parser with build-in concurrency

I want to improve both, the performance overall and the "per file" performance.

Do you have experience with such problems? What is the best way to go?

It's not clear what needs to be maximized here... the performance on a SINGLE file, or the total performance on all 1000 files. — Jim Garrison
– Jim Garrison, Commented Nov 17, 2010 at 20:13
One more suggestion: if you can quantify sizes of files, to allow calculation of throughout (megabytes per second processed) it can give an idea of expected performance. I typically get 10 - 40 MB/s for parsing with Woodstox when testing; but my hard drives can only deliver 5 - 10 MB/s sustained speed. — StaxMan
– StaxMan, Commented Dec 21, 2010 at 23:17

Peter Knego · Accepted Answer · 2010-11-17 21:05:10Z

4

This one is obvious: just create several parsers and run them in parallel in multiple threads.
Take a look at Woodstox Performance (down at the moment, try google cache).
This can be done IF structure of your XML is predictable: if it has a lot of same top-level elements. For instance:
```
<element>
    <more>more elements</more>
</element> 
<element>
    <other>other elements</other>
</element>
```
In this case you could create simple splitter that searches <element> and feeds this part to a particular parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find start stop points (<element>) and then create custom FileInputStream that just operates on a part of file.
Take a look at Aalto. The same guys that created Woodstox. This are experts in this area - don't reinvent the wheel.

answered Nov 17, 2010 at 21:05

Peter Knego

80.4k11 gold badges128 silver badges155 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

AlexR · Accepted Answer · 2010-11-17 20:32:38Z

4

I am agree with Jim. I think that if you want to improve performance of overall processing of 1000 files your plan is good except #3 that is irrelevant in this case. If however you want to improve performance of parsing of single file you have a problem. I do not know how it is possible to split XML file without it parsing. Each chunk will be illegal XML and your parser will fail.

I believe that improving overall time is good enough for you. In this case read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html then create thread pool of for example 100 threads and queue that contains XML sources. Each thread will parse only 10 files that will bring serious performance benefit in multi-CPU environment.

answered Nov 17, 2010 at 20:32

AlexR

116k16 gold badges137 silver badges216 bronze badges

1 Comment

Don Roby Over a year ago

+1: Though it may not improve performance much if the parsing is simple enough that the main problem is IO.

StaxMan · Accepted Answer · 2011-01-06 22:24:07Z

In addition to existing good suggestions there is one rather simple thing to do: use cursor API (XMLStreamReader), NOT Event API. Event API adds 30-50% overhead without (just IMO) significantly making processing easire. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of Cursor API without adding significant overhead (at most 5-10% compared to hand-written code).

Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:

Make sure you only create XMLInputFactory and XMLOutputFactory instances once
Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.

The reason I mention this is that while these make no functional difference (code works as expected) they can make big performance difference; although more so when processing smaller files.

Running multiple instances does also make sense; although usually with at most 1 thread per core. However you will only get benefit as long as your storage I/O can support such speeds; if disk is the bottleneck this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.

Collectives™ on Stack Overflow

Parallel XML Parsing in Java

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related