1

I'm reading a large XML file (around 4 GB) in Java using JAXB. I have a good system with SSDs, RAM, and multiple CPU cores. I want to read that XML file using multiple threads. I have researched this but have not found a solution yet.

I was thinking that it would help if I could read the XML using multiple threads and send the chunks of bytes to an XML parser, but I am wondering whether a solution with an implementation already exists.

My code snippet is:

public void parseXML() throws Exception{

    try(InputStream is = new BufferedInputStream(new FileInputStream(xmlFile),XML_READ_BUFFER)){
    //try(InputStream is = new ByteArrayInputStream(removeAnd.getBytes(StandardCharsets.UTF_16))){ 
        XMLInputFactory xmlif = XMLInputFactory.newInstance();
        XMLStreamReader sr = xmlif.createXMLStreamReader(is);

        JAXBContext ctx = JAXBContext.newInstance(XwaysImage.class);
        Unmarshaller unmar = ctx.createUnmarshaller();

        int c=0;
        while (sr.hasNext()){

            while(this.pause.get())Thread.sleep(100);
            if(this.cancel.get()) break;

            int eventType = sr.next();
            if(eventType == XMLStreamConstants.START_ELEMENT){
                if("ImageFile".equals(sr.getName().getLocalPart())){
                    XwaysImage xim = unmar.unmarshal(sr,XwaysImage.class).getValue();
                    //TODO code here. 
                }
            }
        }
        sr.close();
        is.close();
    }catch(Exception e){
        log.error("",e);
    }
}

4 Answers

1

Since this is not a DOM-style parser, the low-level reading of the XML file from disk is fast, especially from an SSD, so I don't think multi-threaded reading will help there.

However, multi-threaded processing of the retrieved data could improve overall performance. So instead of trying to 'read the XML using multiple Threads and send the chunks of bytes to parse', read in a single thread but process in parallel.
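
A minimal sketch of that idea, assuming each record can be handed off as a self-contained object once it has been read. The class name `SingleReaderParallelWorkers`, the queue size of 1000, and the use of `getElementText()` as a stand-in for `unmar.unmarshal(sr, XwaysImage.class)` are my own assumptions for illustration, not something taken from the question's code base:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class SingleReaderParallelWorkers {

    // Reads <ImageFile> elements on the calling thread and hands each one to a worker pool.
    static void parse(Reader xml, int workers, Consumer<String> process) throws Exception {
        // Bounded queue: when it is full, CallerRunsPolicy makes the reading thread
        // run the task itself, which throttles reading instead of filling the heap.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                while (sr.hasNext()) {
                    if (sr.next() == XMLStreamConstants.START_ELEMENT
                            && "ImageFile".equals(sr.getLocalName())) {
                        // Stand-in for the JAXB unmarshal call; works here because
                        // the sample elements contain text only.
                        String record = sr.getElementText();
                        pool.execute(() -> process.accept(record));
                    }
                }
            } finally {
                sr.close();
            }
        } finally {
            // Stop accepting work and wait for the queued tasks to finish.
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><ImageFile>a.img</ImageFile><ImageFile>b.img</ImageFile></root>";
        parse(new StringReader(xml), 4,
                record -> System.out.println(Thread.currentThread().getName() + " got " + record));
    }
}
```

The processing callback runs on several threads at once, so whatever it touches has to be thread-safe. The reader itself stays on one thread, which matters because neither `XMLStreamReader` nor `Unmarshaller` is meant to be shared between threads.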


1

There have been projects that attempt to apply parallel processing to XML parsing -- see for example https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_8.7.0/com.ibm.swg.im.iis.ds.stages.xml.core.usage.doc/topics/largescaleparallelparsing.html -- but I don't know whether there are tools that are usable in practice. Intrinsically, it's not a task that is readily parallelisable into independent threads.

How much of the cost is parsing anyway? In many applications 25% might be typical. If that's the case for you, then the best approach might be to have one thread doing the parsing and other threads dealing with the parsed data.
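
One rough way to answer that question is to time a parse-only pass and compare it with the total run time of the real job. The sketch below assumes a plain StAX pass is a fair proxy for the parsing cost (it leaves out the JAXB binding work); the class name `ParseOnlyTiming` and the in-memory fallback document are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class ParseOnlyTiming {

    // Walks the whole document and counts start elements, doing no other work.
    static long countStartElements(InputStream xml) throws Exception {
        XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        long count = 0;
        try {
            while (sr.hasNext()) {
                if (sr.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            sr.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Pass the XML file path as the first argument; without one, a tiny sample is used.
        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(
                        "<root><ImageFile/><ImageFile/></root>".getBytes(StandardCharsets.UTF_8))) {
            long start = System.nanoTime();
            long elements = countStartElements(in);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(elements + " start elements parsed in " + millis + " ms");
        }
    }
}
```

If this pass takes only a small share of the full run, the time is going into processing the parsed data, and that is the part worth spreading across threads.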


1

Maybe you can try the Declarative Stream Mapping (DSM) library. It is very good for processing large or complex XML and JSON documents. You define the mapping between your class and the XML data in a YAML file.

For example, let's say you have the XML file below:

<root>
  <item >
    <id>1</id>
    <name>Item 1</name>
  </item>
  <item >
    <id>2</id>
    <name>Item 2</name>
    <date>13/06/2019</date>
  </item>
  <item >
    <id>3</id>
    <name>Item 3</name>
    <date>11/06/2019</date>
  </item>
  <!-- 
  .........
  -->
</root>

Define your mapping for the data you want to process:

result:
   type: object  # it will only store one item in memory.
   path: /root/item    # path is a regex; it can also be written as "/.+item".
   function: processData   # call the processData function for every item.
   filter: self.index%params.threadCount==params.threadNo  # you can write a script to filter data.
   fields:
     id: long   # id dataType is long
     name:      # default dataType is string
     registerDate:
        path: date
        dataType: date   # data type is date
        dataTypeParams:
           dateFormat: dd/MM/yyyy  # date format

Write a function to process your data. The mapping file above refers to it by name (processData).

FunctionExecutor processData = new FunctionExecutor() {
    @Override
    public void execute(Params params) {
        System.out.println(params.getCurrentNode().getData());
    }
};

// Java 8+
// FunctionExecutor processData = params -> System.out.println(params.getCurrentNode().getData());

Here is the Java code. You can set threadNo for each thread. I assume you will run the code in 10 threads. In this example threadNo is 1, which means this thread processes only the items that match the filter field in the mapping file.

DSMBuilder builder = new DSMBuilder("path/to/mapping.yaml");
builder.registerFunction("processData", processData); // register function under the name used in the mapping file
builder.getParams().put("threadCount", 10);
builder.getParams().put("threadNo", 1);  // run for first thread
DSM dsm = builder.create();
// process XML data
Object object = dsm.toObject("path/to/data.xml");
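
To make the filter expression easier to picture, here is the same modulo partitioning written with plain StAX. This is not DSM code and does not use its API. It assumes a zero-based item index (I have not checked how DSM numbers `self.index`), and the class and method names are made up:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class ModuloPartition {

    // Scans every <item>, but collects the <id> only of items where
    // index % threadCount == threadNo (index is zero-based here).
    static List<String> idsForThread(String xml, int threadCount, int threadNo) throws Exception {
        XMLStreamReader sr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> ids = new ArrayList<>();
        int index = -1;       // index of the current <item>
        boolean mine = false; // does the current <item> belong to this thread?
        try {
            while (sr.hasNext()) {
                if (sr.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                if ("item".equals(sr.getLocalName())) {
                    index++;
                    mine = index % threadCount == threadNo;
                } else if (mine && "id".equals(sr.getLocalName())) {
                    ids.add(sr.getElementText());
                }
            }
        } finally {
            sr.close();
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<item><id>1</id><name>Item 1</name></item>"
                + "<item><id>2</id><name>Item 2</name></item>"
                + "<item><id>3</id><name>Item 3</name></item>"
                + "</root>";
        // In a real run each call would execute on its own thread with its own reader.
        System.out.println("thread 0 -> " + idsForThread(xml, 2, 0)); // thread 0 -> [1, 3]
        System.out.println("thread 1 -> " + idsForThread(xml, 2, 1)); // thread 1 -> [2]
    }
}
```

Note that with this scheme every thread still parses the whole file and only the per-item processing is divided, so it pays off when processing an item costs much more than parsing it.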


0

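I'm not sure I fully understand which part of your code needs concurrency, but if it's the work inside your while loop, you could collect the unmarshalled objects into a list (called images here) and try:

    // XMLStreamReader has no parallelStream(), so stream the collected List<XwaysImage> instead
    images.parallelStream().forEach(xim -> {
        // do something with xim
    });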
