I have a big text file that is a sequence of XML-valid documents that looks something like this:

<DOC>
   <TEXT> ... </TEXT>
    ...
</DOC>
<DOC>
    <TEXT> ... </TEXT>
    ...
</DOC>

etc. There is no <?xml version="1.0"?> declaration; each <DOC></DOC> pair delimits a separate XML document. What's the best way to parse this in Java and get the values under <TEXT> in each <DOC>?

If I pass the whole thing to a DocumentBuilder, I get an error saying the document is not well formed. Is there a better solution than simply traversing the file and building a string for each <DOC>?

5 Answers

5

A well-formed XML document must have exactly one root element, which contains all other elements. Have a look at the XML Specification (see section 2).

So, to overcome your issue, you can read the whole content of your text file into a String (or StringBuffer/StringBuilder) and put this string between <root> and </root> tags, e.g.:

String origXML = readContentFromTextFile(fileName);
String validXML = "<root>" + origXML + "</root>";
//parse validXML
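
The `//parse validXML` step could look roughly like the sketch below. It is only an illustration, assuming the file fits in memory, contains no XML declaration or DOCTYPE, and has no nested <DOC> elements. The class name `WrappedDocParser` and the method `extractTexts` are hypothetical names, not part of the answer; the parsing itself uses the standard JAXP `DocumentBuilder`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: wraps the <DOC> sequence in a synthetic root and parses it.
class WrappedDocParser {

    /** Returns the trimmed content of every TEXT element found under each DOC. */
    static List<String> extractTexts(String origXML) {
        // Same trick as in the answer: a single root makes the input well-formed.
        String validXML = "<root>" + origXML + "</root>";
        Document document;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            document = builder.parse(new InputSource(new StringReader(validXML)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse wrapped input", e);
        }

        List<String> texts = new ArrayList<>();
        NodeList docs = document.getDocumentElement().getElementsByTagName("DOC");
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);
            NodeList textNodes = doc.getElementsByTagName("TEXT");
            for (int j = 0; j < textNodes.getLength(); j++) {
                texts.add(textNodes.item(j).getTextContent().trim());
            }
        }
        return texts;
    }
}
```

Calling `WrappedDocParser.extractTexts(origXML)` returns one entry per <TEXT> element. Because the DOM holds the whole tree in memory, this approach may not suit a very large file.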

2

The document is not well formed because you don't have a 'root' node:

<ROOT>
    <DOC>
       <TEXT> ... </TEXT>
        ...
    </DOC>
    <DOC>
        <TEXT> ... </TEXT>
        ...
    </DOC>
</ROOT>

1

You'll have a hard time parsing this with a "standard" XML parser such as Xerces. The missing XML declaration <?xml version="1.0"?> is not the problem, since the declaration is optional in XML 1.0; the input is not well-formed because it has more than one document root (i.e. the <DOC> elements).

I suggest you give TagSoup a try. It is intended to parse (quote) "poor, nasty and brutish" markup, although its target is HTML as found in the wild rather than XML. No guarantee, but that's probably your best shot.

1 Comment

Thanks for the tip. The site in that link no longer exists. 'TagSoup' turns up other links but hard to tell what's canonical.

0

You can try using XSLT for parsing.

0

You could create a subclass of InputStream that adds a prefix and a suffix to the input stream, and pass an instance of that class to any XML parser:

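import java.io.IOException;
import java.io.InputStream;

/** Presents prefix + wrapped stream + suffix as one continuous stream. */
public class EnclosedInputStream extends InputStream {
    private enum State {
        PREFIX, STREAM, SUFFIX, EOF
    }

    private final byte[] prefix;
    private final InputStream stream;
    private final byte[] suffix;
    private State state = State.PREFIX;
    private int index; // position within the prefix or the suffix

    public EnclosedInputStream(byte[] prefix, InputStream stream, byte[] suffix) {
        this.prefix = prefix;
        this.stream = stream;
        this.suffix = suffix;
    }

    @Override
    public int read() throws IOException {
        if (state == State.PREFIX) {
            if (index < prefix.length) {
                // Mask so that bytes >= 0x80 come back as 0..255, never negative.
                return prefix[index++] & 0xFF;
            }
            state = State.STREAM;
        }
        if (state == State.STREAM) {
            int r = stream.read();
            if (r >= 0) {
                return r;
            }
            // Wrapped stream is exhausted: continue with the suffix.
            state = State.SUFFIX;
            index = 0;
        }
        if (state == State.SUFFIX) {
            if (index < suffix.length) {
                return suffix[index++] & 0xFF;
            }
            state = State.EOF;
        }
        return -1;
    }

    @Override
    public void close() throws IOException {
        stream.close(); // also release the wrapped stream
    }
}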
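
For comparison, the same prefix/suffix idea can be sketched with the JDK's own `java.io.SequenceInputStream` instead of a custom subclass, combined with a StAX reader so a big file is never fully loaded. This is a swapped-in variant, not the code above. It assumes an ASCII-compatible encoding such as UTF-8 and that each <TEXT> holds only character data; `StreamingDocReader` and `readTexts` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams <TEXT> values out of a rootless <DOC> sequence.
class StreamingDocReader {

    static List<String> readTexts(InputStream content) {
        InputStream prefix =
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8));
        InputStream suffix =
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8));
        // Concatenate prefix, original content and suffix without copying the content.
        InputStream wrapped = new SequenceInputStream(
                Collections.enumeration(Arrays.asList(prefix, content, suffix)));

        List<String> texts = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(wrapped);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "TEXT".equals(reader.getLocalName())) {
                    // getElementText() fails if TEXT contains child elements.
                    texts.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed once wrapped", e);
        }
        return texts;
    }
}
```

A caller would pass something like a `FileInputStream` for the text file to `StreamingDocReader.readTexts(...)`. The sketch collects values into a list for simplicity; for very large inputs each value could be processed inside the loop instead.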

2 Comments

Why do you need an InputStream as a constructor parameter? You can use super.read() instead of stream.read() (as EnclosedInputStream is a subclass of InputStream).
You do not necessarily have access to the code that creates the InputStream with the original content. Suppose you have a URL, for instance.
