
I want to transform XML files using XSLT 2.0 in a huge directory tree with many levels of subdirectories. There are more than 1 million files, each 4 to 10 kB. After a while I always receive java.lang.OutOfMemoryError: Java heap space.

My command is: java -Xmx3072M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M ...

Adding more memory via -Xmx is not a good solution.

Here is my code:

public void pushDocuments(File dir) {
    // walk the directory tree recursively and index every regular file
    for (File file : dir.listFiles()) {
        if (file.isDirectory()) {
            pushDocuments(file);
        } else {
            indexFiles.index(file);
        }
    }
}

public void index(File file) {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    try {
        xslTransformer.xslTransform(outputStream, file);
        outputStream.flush();
        outputStream.close();
    } catch (IOException e) {
        System.err.println(e.toString());
    }
}

The XSLT transformation is done with net.sf.saxon.s9api:

public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) {
    try {
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();

        out.close();
    } catch (SaxonApiException e) {
        System.err.println(e.toString());
    }
}

4 Answers


My usual recommendation with the Saxon s9api interface is to reuse the XsltExecutable object, but to create a new XsltTransformer for each transformation. The XsltTransformer caches documents you have read in case they are needed again, which is not what you want in this case.
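
A minimal sketch of that pattern, assuming a small wrapper class along these lines (the class name IndexTransformer and the constructor taking the stylesheet file are illustrative, not from the question):

import java.io.ByteArrayOutputStream;
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class IndexTransformer {

    private final Processor proc = new Processor(false);
    private final XsltExecutable executable;

    public IndexTransformer(File stylesheet) throws SaxonApiException {
        // compile the stylesheet once; the XsltExecutable can be reused for every file
        executable = proc.newXsltCompiler().compile(new StreamSource(stylesheet));
    }

    public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) throws SaxonApiException {
        // load() creates a fresh XsltTransformer, so nothing cached for one file
        // survives into the next transformation
        XsltTransformer transformer = executable.load();
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();
    }
}

Compiling the stylesheet is the expensive step; creating a new XsltTransformer per file is cheap and lets its per-transformation state be garbage-collected.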

As an alternative, you could call xsltTransformer.getUnderlyingController().clearDocumentPool() after each transformation.
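
If you keep the question's single shared transformer instead, the call would go right after each transformation (a sketch against the question's xslTransform method):

        transformer.transform();
        // discard the documents Saxon has pooled in case they were needed again
        transformer.getUnderlyingController().clearDocumentPool();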

(Please note, you can ask Saxon questions at saxonica.plan.io, which gives a good chance we [Saxonica] will notice them and answer them. You can also ask them here and tag them "saxon", which means we'll probably respond to the question at some point, though not always immediately. If you ask on StackOverflow with no product-specific tags, it's entirely hit-and-miss whether anyone will notice the question.)


1 Comment

I added this line of code to clear the document pool and it works well. Thank you! In fact I followed S9APIExamples.java at saxonica.com/documentation/index.html#!using-xsl/embedding/… to write my program, but I didn't know I should clear the document pool after each transformation.

I would check that you don't have a memory leak. The number of files shouldn't matter, as you are only processing one at a time; as long as you can process the largest file, you should be able to process them all.

I suggest you run jstat -gc {pid} 10s while the program is running to look for a memory leak. What you should watch is the amount of memory in use after each full GC: if this keeps increasing, use the VisualVM memory profiler to work out why, or use jmap -histo:live {pid} | head -20 for a hint.

If the memory is not increasing, then a particular file is triggering the out-of-memory error. This happens because either (a) the file is much bigger than the others or uses much more memory, or (b) it triggers a bug in the library.

1 Comment

This explanation sounds interesting and helpful.

Try this one:

String[] files = dir.list();
for (String fileName : files) {
    File file = new File(fileName);
    if (file.isDirectory()) {
        pushDocuments(file);
    } else {
        indexFiles.index(file);
    }
}

4 Comments

Can you explain how that will help? It appears to do the same thing to me.
@PeterLawrey My thought was that the File objects would take more memory.
Well, iterating over File objects (if there are more than 1 million in total) surely consumes more memory than just looking at the strings and creating File objects on the fly (which can then be GCed again). Whether that actually solves the problem, however, is an assumption which should be explained :)
@ByteCode Thank you. I modified your code to File file = new File(dir + File.separator + fileName); but it doesn't resolve my problem. The right answer is the one by Michael Kay.

I had a similar problem that came from the javax.xml.transform package, which used a ThreadLocal map to cache the XML chunks read during the XSLT. I had to move the XSLT into its own thread so that the ThreadLocal map was cleared when that thread died; this freed the memory. See here: https://www.ahoi-it.de/ahoi/news/java-xslt-memory-leak/1446
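
A minimal sketch of that workaround, reusing the question's xslTransformer field (the wrapper method and its name transformInOwnThread are illustrative, not taken from the linked article):

public void transformInOwnThread(File xmlFile, ByteArrayOutputStream outputStream) throws InterruptedException {
    // run the transformation in a short-lived thread so that any ThreadLocal caches
    // filled during the XSLT become unreachable once the thread terminates
    Thread worker = new Thread(() -> xslTransformer.xslTransform(outputStream, xmlFile));
    worker.start();
    worker.join(); // wait for this file's transform to finish before starting the next one
}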
