
I want to transform XML files using XSLT 2.0 in a huge directory tree with many levels of subdirectories. There are more than 1 million files, each 4 to 10 kB. After a while I always receive java.lang.OutOfMemoryError: Java heap space.

My command is: java -Xmx3072M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M ...

Adding more memory via -Xmx is not a good solution.

Here is my code:

public void pushDocuments(File dir) {
    // walk the directory tree recursively and index every regular file
    for (File file : dir.listFiles()) {
        if (file.isDirectory()) {
            pushDocuments(file);
        } else {
            indexFiles.index(file);
        }
    }
}

public void index(File file) {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    try {
        xslTransformer.xslTransform(outputStream, file);
        outputStream.flush();
        outputStream.close();
    } catch (IOException e) {
        System.err.println(e.toString());
    }
}

The XSLT transformation is done with net.sf.saxon.s9api:

public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) {
    try {
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();

        out.close();
    } catch (SaxonApiException e) {
        System.err.println(e.toString());
    }
}

4 Answers


My usual recommendation with the Saxon s9api interface is to reuse the XsltExecutable object, but to create a new XsltTransformer for each transformation. The XsltTransformer caches documents you have read in case they are needed again, which is not what you want in this case.
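
A minimal sketch of that pattern, assuming a small wrapper class along these lines (the class name IndexTransformer and the constructor taking the stylesheet file are illustrative, not from the question):

import java.io.ByteArrayOutputStream;
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class IndexTransformer {

    private final Processor proc = new Processor(false);
    private final XsltExecutable executable;

    public IndexTransformer(File stylesheet) throws SaxonApiException {
        // compile the stylesheet once; the XsltExecutable can be reused for every file
        executable = proc.newXsltCompiler().compile(new StreamSource(stylesheet));
    }

    public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) throws SaxonApiException {
        // load() creates a fresh XsltTransformer, so nothing cached for one file
        // survives into the next transformation
        XsltTransformer transformer = executable.load();
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();
    }
}

Compiling the stylesheet is the expensive step; creating a new XsltTransformer per file is cheap and lets its per-transformation state be garbage-collected.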

As an alternative, you could call xsltTransformer.getUnderlyingController().clearDocumentPool() after each transformation.
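
If you keep the question's single shared transformer instead, the call would go right after each transformation (a sketch against the question's xslTransform method):

        transformer.transform();
        // discard the documents Saxon has pooled in case they were needed again
        transformer.getUnderlyingController().clearDocumentPool();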

(Please note, you can ask Saxon questions at saxonica.plan.io, which gives a good chance we [Saxonica] will notice them and answer them. You can also ask them here and tag them "saxon", which means we'll probably respond to the question at some point, though not always immediately. If you ask on StackOverflow with no product-specific tags, it's entirely hit-and-miss whether anyone will notice the question.)


1 Comment

I added this line of code to clear the document pool and it works well. Thank you! In fact I followed S9APIExamples.java at saxonica.com/documentation/index.html#!using-xsl/embedding/… to write my program, but I didn't know I should clear the document pool after each transformation.

I would check that you don't have a memory leak. The number of files shouldn't matter, as you are only processing one at a time; as long as you can process the largest file, you should be able to process them all.

I suggest you run jstat -gc {pid} 10s while the program is running to look for a memory leak. What you should watch is the amount of memory in use after each full GC: if this keeps increasing, use the VisualVM memory profiler to work out why, or use jmap -histo:live {pid} | head -20 for a hint.

If the memory is not increasing, then a particular file is triggering the out-of-memory error. This happens because either (a) the file is much bigger than the others or uses much more memory, or (b) it triggers a bug in the library.

1 Comment

This explanation sounds interesting and helpful.

Try this one:

String[] files = dir.list();
for (String fileName : files) {
    File file = new File(fileName);
    if (file.isDirectory()) {
        pushDocuments(file);
    } else {
        indexFiles.index(file);
    }
}

4 Comments

Can you explain how that will help? It appears to do the same thing to me.
@PeterLawrey My thought was that the File objects would take more memory.
Well, iterating over File objects (if there are more than 1 million in total) surely consumes more memory than just looking at the strings and creating File objects on the fly (which can then be GCed again). Whether that actually solves the problem, however, is an assumption which should be explained :)
@ByteCode Thank you. I modified your code to File file = new File(dir + File.separator + fileName); but it doesn't resolve my problem. The right answer is the one by Michael Kay.

I had a similar problem that came from the javax.xml.transform package, which used a ThreadLocal map to cache the XML chunks read during the XSLT. I had to move the XSLT into its own thread so that the ThreadLocal map was cleared when that thread died; this freed the memory. See here: https://www.ahoi-it.de/ahoi/news/java-xslt-memory-leak/1446
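
A minimal sketch of that workaround, reusing the question's xslTransformer field (the wrapper method and its name transformInOwnThread are illustrative, not taken from the linked article):

public void transformInOwnThread(File xmlFile, ByteArrayOutputStream outputStream) throws InterruptedException {
    // run the transformation in a short-lived thread so that any ThreadLocal caches
    // filled during the XSLT become unreachable once the thread terminates
    Thread worker = new Thread(() -> xslTransformer.xslTransform(outputStream, xmlFile));
    worker.start();
    worker.join(); // wait for this file's transform to finish before starting the next one
}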
