0

How long does sorting a 100MB XML file with Java take ?

The file has items with the following structure and I need to sort them by event

<doc>
    <id>84141123</id>
    <title>kk+ at Hippie Camp</title>
    <description>photo by SFP</description>
    <time>18945840</time>
    <tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
    <geo></geo>
    <event>47409</event>
</doc>

I'm on a Intel Dual Duo Core and 4GB RAM.

Minutes ? Hours ?

thanks

1
  • 10
    Build it. Measure it. Share what you learned. We can't speculate on how show your code is, how slow your computer is, how slow your SAN is and how slow your OS is. You, however, can measure the actual time by generating fake data and writing it. You'll find that parsing the input is slower than creating the output. Commented Mar 29, 2011 at 9:56

4 Answers 4

7

Here are the timings for a similar task executed using Saxon XQuery on a 100Mb input file.

Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816

So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.

Sign up to request clarification or add additional context in comments.

1 Comment

+1. Great answer. Just use this existing tool, don't bother writing code yourself that is already out there in a nice, accessible, packaged solution.
2

i would say minutes - you shud be able to do that completely in-memory, so with a sax parser that would be reading-sorting-writing, should not be a problem for your hardware

5 Comments

If you do it in-memory, then a DOM parser might be more appropriate, as you don't need to build a separate in-memory representation of the data this way.
But a custom in-memory structure could be much more compact than a DOM.
i was actually thinking of a treemap with key of event id and item of the xml piece - plain stupid implementation w/o any xml magic. =)
probably storing it in a dumb precisely-sized array and then applying a dumb in-place quicksort will be as fast and consume less memory.
i agree, but i wouldn't say that the task (100Mb) seems hard enough to justify additional effort =)
0

I think a problem like this would be better sorted using serialisation.

  1. Deserialise the XML file into an ArrayList of 'doc'.

  2. Using straight Java code, apply sort on the event attribute and stored sorted arraylist in another variable.

  3. Serialise out the sorted 'doc' ArrayList to file

2 Comments

Beware of ArrayList — when it expands, it allocates twice as much memory as it has. Imho, it's better to scan the file first and count <id> entries (grep | wc -l does this just fine) and then allocate an array of exact size.
@9000, its only a 100 MB file in a 4 GB machine. 2x expansion shouldn't be a problem. ;)
0

If you do it in memory, you should be able to do this in under 10 seconds. You would be pusshing to do this under 2 seconds because it will spend that much times reading/writing to disk.

This program should use no more than 4-5x times the original file size. about 500 MB in your case.

String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");
Map<Long, String> recordMap = new TreeMap<Long, String>();
for(int i=1;i<records.length;i+=2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1+4);
    long num = Long.parseLong(record.substring(pos1+3, pos2));
    recordMap.put(num, record);
}

StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length-1]);
FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.