How to parse multiple, consecutive xml files in one document?

Question

I have a big text file that is a sequence of well-formed XML documents, which looks something like this:

<DOC>
   <TEXT> ... </TEXT>
    ...
</DOC>
<DOC>
    <TEXT> ... </TEXT>
    ...
</DOC>

etc. There is no <?xml version="1.0"?> declaration; each <DOC></DOC> pair delimits a separate XML document. What's the best way to parse this in Java and get the values under <TEXT> in each <DOC>?

If I pass the whole thing to a DocumentBuilder, I get an error saying the document is not well formed. Is there a better solution than simply traversing the file and building a string for each <DOC>?

Chrisji · Accepted Answer · 2016-10-15 15:26:36Z

5

A well-formed XML document must have exactly one root element, which contains all other elements. Have a look at the XML Specification (see point 2).

So, to overcome your issue, you can read all the content of your text file into a String (or StringBuffer/StringBuilder) and put this string between <root> and </root> tags, e.g.:

String origXML = readContentFromTextFile(fileName);
String validXML = "<root>" + origXML + "</root>";
//parse validXML
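A minimal sketch of the "//parse validXML" step, assuming the whole file fits in memory and contains no XML declarations or DOCTYPEs between the documents. The class name WrappedDocParser and the method extractTexts are hypothetical names chosen for illustration; only standard JAXP calls are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
public class WrappedDocParser {

    // Wraps the concatenated <DOC> fragments in a synthetic <root> element
    // and returns the trimmed text of every <TEXT> element, in document order.
    public static List<String> extractTexts(String fragments) throws Exception {
        String wrapped = "<root>" + fragments + "</root>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

        NodeList nodes = doc.getElementsByTagName("TEXT");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<DOC><TEXT> first </TEXT></DOC>\n<DOC><TEXT> second </TEXT></DOC>";
        System.out.println(extractTexts(sample)); // prints [first, second]
    }
}
```

Because this builds a DOM for the entire file, it is probably only suitable when the file is small enough to hold in memory twice (once as a String, once as a tree).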

edited Oct 15, 2016 at 15:26

Chrisji


answered May 10, 2011 at 6:55

Nirmit Shah


openshac · Accepted Answer · 2011-05-10 06:48:30Z

2

The document is not well formed because you don't have a 'root' node:

<ROOT>
    <DOC>
       <TEXT> ... </TEXT>
        ...
    </DOC>
    <DOC>
        <TEXT> ... </TEXT>
        ...
    </DOC>
</ROOT>

answered May 10, 2011 at 6:48

openshac


mbreining · Accepted Answer · 2011-05-10 06:51:43Z

1

You'll have a hard time parsing this with a "standard" XML parser such as Xerces. The missing XML declaration <?xml version="1.0"?> is not the real problem, since the declaration is optional in XML 1.0; the document is not well-formed because it has more than one document root (i.e. the <DOC> elements).

I suggest you give TagSoup a try. It is intended to parse (quote) "poor, nasty and brutish" markup, and was written with real-world HTML in mind. No guarantee, but that's probably your best shot.

answered May 10, 2011 at 6:51

mbreining


1 Comment

smci Over a year ago

Thanks for the tip. The site in that link no longer exists. 'TagSoup' turns up other links but hard to tell what's canonical.

sudmong · Accepted Answer · 2011-05-10 06:57:07Z

0

You can try using XSLT for parsing.

answered May 10, 2011 at 6:57

sudmong


Maurice Perry · Accepted Answer · 2011-05-10 07:11:12Z

0

You could create a subclass of InputStream that adds a prefix and a suffix to the input stream, and pass an instance of that class to any XML parser:

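import java.io.IOException;
import java.io.InputStream;

// Presents prefix + wrapped stream + suffix as one continuous stream.
public class EnclosedInputStream extends InputStream {
    // Which of the three sources is currently being read.
    private enum State {
        PREFIX, STREAM, SUFFIX, EOF
    }

    private final byte[] prefix;
    private final InputStream stream;
    private final byte[] suffix;
    private State state = State.PREFIX;
    private int index; // position within prefix or suffix

    public EnclosedInputStream(byte[] prefix, InputStream stream, byte[] suffix) {
        this.prefix = prefix;
        this.stream = stream;
        this.suffix = suffix;
    }

    @Override
    public int read() throws IOException {
        if (state == State.PREFIX) {
            if (index < prefix.length) {
                // Mask to 0..255 so a negative byte is not mistaken for EOF.
                return prefix[index++] & 0xFF;
            }
            state = State.STREAM;
        }
        if (state == State.STREAM) {
            int r = stream.read();
            if (r >= 0) {
                return r;
            }
            // Wrapped stream is exhausted; continue with the suffix.
            state = State.SUFFIX;
            index = 0;
        }
        if (state == State.SUFFIX) {
            if (index < suffix.length) {
                return suffix[index++] & 0xFF;
            }
            state = State.EOF;
        }
        return -1;
    }

    @Override
    public void close() throws IOException {
        // Release the wrapped stream as well.
        stream.close();
    }
}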
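As an alternative to a custom subclass, the JDK's java.io.SequenceInputStream can concatenate a prefix, the original stream, and a suffix. The following is a sketch under a few assumptions: the input is in an ASCII-compatible encoding such as UTF-8 with no byte order mark, <TEXT> elements are not nested, and the names StreamingDocParser and extractTexts are hypothetical. It uses SAX so the file is never fully loaded into memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original answer.
public class StreamingDocParser {

    // Streams the fragments through a SAX parser, wrapped in a synthetic
    // <root> element, and collects the trimmed text of each <TEXT> element.
    public static List<String> extractTexts(InputStream fragments) throws Exception {
        InputStream prefix = new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8));
        InputStream suffix = new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8));
        InputStream wrapped = new SequenceInputStream(
                Collections.enumeration(Arrays.asList(prefix, fragments, suffix)));

        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside <TEXT>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("TEXT".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("TEXT".equals(qName)) {
                    texts.add(current.toString().trim());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(wrapped, handler);
        return texts;
    }
}
```

A caller would typically pass a FileInputStream (or a stream opened from a URL) as the fragments argument; for very large files, the handler could process each value as it arrives instead of collecting them in a list.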

answered May 10, 2011 at 7:11

Maurice Perry


2 Comments

Nirmit Shah Over a year ago

Why do you need an InputStream as a constructor parameter? You can use super.read() instead of stream.read() (as EnclosedInputStream is a subclass of InputStream).

Maurice Perry Over a year ago

You do not necessarily have access to the code that creates the InputStream with the original content. Suppose you have a URL, for instance.
