0

I have a huge XML file (3-4 GB, about 360,000 records). I have to read each line and append it to a StringBuilder; once everything is read, it is processed further. However, I cannot hold it all in memory, because the StringBuilder exceeds its buffer size. How can I split the records and reset the buffer before its size is exceeded? Please advise.

    try {
        File file = new File("test.txt");
        FileReader fileReader = new FileReader(file);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        StringBuilder stringBuilder = new StringBuilder();
        String line;
        int count = 0;
        while ((line = bufferedReader.readLine()) != null) {
            // keep only the lines that start a customer record
            if (line.startsWith("<customer>")) {
                stringBuilder.append(line);
            }
            count++;
        }
        fileReader.close();
        System.out.println(stringBuilder.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }

EDIT: Asker tried with StAX

 while (xmlEventReader.hasNext()) {
        XMLEvent xmlEvent = null;
        try {
            xmlEvent = xmlEventReader.nextEvent();
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (xmlEvent.isStartElement()) {
            StartElement elem = (StartElement) xmlEvent;
            if (elem.getName().getLocalPart().equals("<Customer>")) {
                if (customerRecord) {
                    insideChildRecord = true;
                }
                customerRecord = true;
            }
        }
        if (customerRecord) {
            xmlEventWriter.add(xmlEvent);
        }
        if (xmlEvent.isEndElement()) {
            EndElement elem = (EndElement) xmlEvent;
            if (elem.getName().getLocalPart().equals("<Customer>")) {
                if (insideChildRecord) {
                    insideChildRecord = false;
                } else {
                    customerRecord = false;
                    xmlEventWriter.flush();
                    String cmlChunk = stringWriter.toString();

5 Comments

  • To know how to split, you should explain (and understand) how you will process the strings. Otherwise it is not clear what you want. Commented Aug 15, 2018 at 8:18
  • <Customer><first>...</Customer><Customer><second>...</Customer> ... 360,000 customer records. The customer records have to be split into batches of 5,000 or 10,000. Commented Aug 15, 2018 at 8:42
  • What data do you need from each customer? And how do you need to process this further? Commented Aug 15, 2018 at 10:02
  • The XML records are sent to a message queue and then to a DB; after that they are extracted and written to the relevant tables based on the data. When there are few XML records, they can be sent to the queue directly using a StringBuffer/StringBuilder. But the number of XML records is quite high (360,000) and I cannot keep them all in the StringBuffer/StringBuilder. Parsing the XML file using xstreamreader is time-consuming, as it has to go through all the records. I am getting stuck when reading the records line by line with BufferedReader, since we cannot tell when the buffer will reach full capacity. Commented Aug 15, 2018 at 11:11
  • So how is it coming, @JoeInigo? Did you get it working? If the answer helped you, please consider accepting it. Commented Aug 16, 2018 at 8:59

1 Answer

3

It looks like you are parsing an XML file (because I see you checking for "<customer>").

It would be better to use a parsing library for this than low-level streams. Since the file is quite large, I suggest using either SAX or StAX: https://docs.oracle.com/javase/tutorial/jaxp/stax/index.html

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLEventReader xmlEventReader = xmlInputFactory.createXMLEventReader(new FileInputStream(fileName));
    while (xmlEventReader.hasNext()) {
        XMLEvent xmlEvent = xmlEventReader.nextEvent();
        // parse the XML events one by one
    }

You will have to do all the 'further processing' immediately on the XML events, since you cannot store the data in memory.
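
For comparison, here is a minimal sketch of the same process-as-you-go idea with SAX, the other API mentioned above. The class name `SaxCustomerScan`, the method `scan`, and the `onName` callback are hypothetical names chosen for illustration, and the element names `customer` and `name` are assumptions about the file layout. Only the text of the current `<name>` element is held in memory at any time.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: names and element layout are assumptions, not part of the original post.
final class SaxCustomerScan {

    /** Passes the text of every <name> element to onName and returns the number of <customer> elements. */
    static int scan(InputSource source, Consumer<String> onName) throws Exception {
        int[] customers = {0};
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inName;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("customer")) {
                    customers[0]++;
                } else if (qName.equals("name")) {
                    inName = true;
                    text.setLength(0); // reuse the small buffer for each name
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // may be called several times for one text node, so append
                if (inName) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("name")) {
                    inName = false;
                    onName.accept(text.toString()); // process immediately, nothing is kept
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return customers[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<customers><customer><name>John Doe</name></customer></customers>";
        int count = scan(new InputSource(new StringReader(xml)), System.out::println);
        System.out.println(count + " customer(s)");
    }
}
```

For the real file you would pass something like `new InputSource(new FileInputStream("huge-file.xml"))` instead of the in-memory string used in `main`.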

Maybe this will make it clearer how to use StAX:

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLEventReader xmlEventReader = xmlInputFactory.createXMLEventReader(new FileInputStream("huge-file.xml"));

    // this variable is re-used to store the current customer
    Customer customer = null;

    while (xmlEventReader.hasNext()) {

        XMLEvent xmlEvent = xmlEventReader.nextEvent();
        if (xmlEvent.isStartElement()) {

            StartElement startElement = xmlEvent.asStartElement();

            if (startElement.getName().getLocalPart().equalsIgnoreCase("customer")) {
                // start populating a new customer
                customer = new Customer();

                // read an attribute for example <customer number="42">
                Attribute attribute = startElement.getAttributeByName(new QName("number"));
                if (attribute != null) {
                    customer.setNumber(attribute.getValue());
                }
            }

            // read a nested element for example:
            // <customer>
            //    <name>John Doe</name>
            if (customer != null && startElement.getName().getLocalPart().equals("name")) {
                // getElementText() reads the text up to the matching end tag,
                // and also copes with an empty <name/> element
                customer.setName(xmlEventReader.getElementText());
            }
        }

        if (xmlEvent.isEndElement()) {
            EndElement endElement = xmlEvent.asEndElement();
            if(endElement.getName().getLocalPart().equalsIgnoreCase("customer")){
                // all data for the current Customer has been read
                // do something with the customer, like logging it or storing it in a database
                // after this the customer variable will be re-assigned to the next customer
            }
        }
    }
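
Since the comments on the question mention splitting the 360,000 records into groups of 5,000 or 10,000 before sending them to a queue, here is a rough sketch of how that batching could sit on top of the same event loop. `CustomerBatcher`, `forEachBatch`, `recordTag`, and `sink` are hypothetical names, not part of any library. The sketch also assumes, as the code in the question does, that the StAX implementation in use lets an `XMLEventWriter` serialize an element fragment without a surrounding document.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: names are illustrative assumptions, not part of the original post.
final class CustomerBatcher {

    /** Serializes each recordTag element to a string and hands them to sink in groups of batchSize. */
    static int forEachBatch(Reader in, String recordTag, int batchSize,
                            Consumer<List<String>> sink) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        List<String> batch = new ArrayList<>();
        StringWriter buffer = null;
        XMLEventWriter writer = null; // non-null only while inside a record
        int depth = 0;                // element depth inside the current record
        int total = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (depth == 0 && name.equals(recordTag)) {
                    buffer = new StringWriter(); // fresh, small buffer per record
                    writer = outputFactory.createXMLEventWriter(buffer);
                }
                if (writer != null) {
                    depth++;
                }
            }
            if (writer != null) {
                writer.add(event);
            }
            if (event.isEndElement() && writer != null && --depth == 0) {
                writer.flush();
                writer.close();
                batch.add(buffer.toString());
                writer = null;
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);        // e.g. send to the message queue
                    batch = new ArrayList<>(); // start the next batch
                }
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // last, possibly smaller, batch
        }
        reader.close();
        return total;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Customers><Customer><name>A</name></Customer>"
                + "<Customer><name>B</name></Customer></Customers>";
        forEachBatch(new StringReader(xml), "Customer", 1, System.out::println);
    }
}
```

At most `batchSize` serialized records are held in memory at once, so the batch size can be tuned to whatever the queue and the heap tolerate. Note that `getLocalPart()` returns the bare element name, so `recordTag` is `"Customer"`, without angle brackets.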

4 Comments

while (xmlEventReader.hasNext()) { XMLEvent xmlEvent = null; try { xmlEvent = xmlEventReader.nextEvent(); }catch(Exception e){ e.printStackTrace(); } if (xmlEvent.isStartElement()) { StartElement elem = (StartElement) xmlEvent; if (elem.getName().getLocalPart().equals("<Customer>")) { if (customerRecord) { insideChildRecord = true; } customerRecord = true; } } if (customerRecord) { xmlEventWriter.add(xmlEvent); }
if (xmlEvent.isEndElement()) { EndElement elem = (EndElement) xmlEvent; if (elem.getName().getLocalPart().equals("<Customer>")) { if (insideChildRecord) { insideChildRecord = false; } else { customerRecord = false; xmlEventWriter.flush(); String cmlChunk = stringWriter.toString();
I tried XMLEventReader, but parsing each customer record takes time: 1. <Customer>...</Customer> 2. <Customer>...</Customer> ... n records. Each time, the customer is stored in the cmlChunk string above.
Your code-in-comments approach is not really working. Moved it into the question as an edit.
