Java REGEX XML parse/cut-down while maintaining structure HowTo

Question

I am writing a RESTful web service in Java. The idea is to "cut down" an XML document and strip away all the unneeded content (~98%) and leave only the tags we're interested in, while maintaining the document's structure, which is as follows (I cannot provide the actual XML content for confidentiality reasons):

<sear:SEGMENTS xmlns="http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib" xmlns:sear="http://www.exlibrisgroup.com/xsd/jaguar/search">
   <sear:JAGROOT>
      <sear:RESULT>
         <sear:DOCSET IS_LOCAL="true" TOTAL_TIME="176" LASTHIT="9" FIRSTHIT="0" TOTALHITS="262" HIT_TIME="11">
            <sear:DOC SEARCH_ENGINE_TYPE="Local Search Engine" SEARCH_ENGINE="Local Search Engine" NO="1" RANK="0.086826384" ID="2347460">
               [
               <PrimoNMBib>
                  <record>
                     <display>
                        <title></title>
                     </display>
                     <sort>
                        <author></author>
                     </sort>
                  </record>
               </PrimoNMBib>
               ]
            </sear:DOC>
         </sear:DOCSET>
      </sear:RESULT>
   </sear:JAGROOT>
</sear:SEGMENTS>

Of course, this is the structure of only the tags we are interested in - there are hundreds more tags, but they are irrelevant.

The square brackets ([]) are not part of the XML; they indicate that the <PrimoNMBib></PrimoNMBib> element is one item in a list of children and occurs more than once - one per match of the search from the RESTful service.

I've been trying to parse the document with regular expressions so as to leave only the segments of the structure shown above, along with the values of <title> and <author>, while removing everything else between the tags, including other tags. However, I can't get it to work for the life of me...

Previously I tried it using XSLT, however for unresolved reasons that didn't work either... I'd already asked a question for the XSLT implementation...

Anyway, I would very much appreciate a tip/hint/solution as to how to solve this problem using regex and Java...

I'm sorry to hear XSLT, which is designed for exactly this, doesn't work for you. Doing it with regular expressions sounds very hard. In fact, doing it any other way than using an XML parsing library sounds hard. Perhaps something like making a SAXParser and building up a stack of the ancestor tags might help?
– Rob I, Commented Apr 27, 2012 at 13:30
Thanks a lot, Rob. Could you suggest how one could tackle this with XSLT, or suggest something for my XSLT implementation? stackoverflow.com/questions/10340023/…
– Piotr, Commented Apr 27, 2012 at 13:39
If each tag is guaranteed to be on a separate line, and removing the unnecessary tags will not break the validity of the XML structure, you could use a script (Perl, Bash, sed, Python, etc.) that uses regex to read the lines and strip those that do not contain the required opening and closing tags.
– Flavio Cysne, Commented Apr 27, 2012 at 13:44
So I did go look at that question. You said you transformed it via a browser - does that mean you've run that same XSLT transformation successfully on the same documents outside your Java code? If so, it sounds like an XSLT "engine" difference. Wild guess - try inserting the <?xml version="1.0"?> at the front of your Java string.
– Rob I, Commented Apr 27, 2012 at 13:46
Hi Rob, yes, I put <?xml-stylesheet type="text/xsl" href="test.xsl"?> at the beginning of my XML, where test.xsl is the stylesheet, and applied it in Google Chrome and it worked just fine... By Java string, do you mean my XML string or XSL string?
– Piotr, Commented Apr 27, 2012 at 14:09

bdoughan · Accepted Answer · 2012-04-27 17:04:07Z


I wouldn't recommend using regex to manipulate XML.

Alternative Approach

You could use a StAX parser that leverages a StreamFilter to cut down the document and still maintain a valid structure.

How a StreamFilter Works

A StreamFilter receives each event from the XMLStreamReader. If you want the event reported, return true; otherwise return false. In the example below the StreamFilter rejects every element in the "http://www.exlibrisgroup.com/xsd/jaguar/search" namespace, and because the decision is only updated at start and end tags, any text that follows a rejected tag is rejected too. You will need to tweak the logic to get it to match the requirements of your use case.

http://docs.oracle.com/javase/6/docs/api/javax/xml/stream/StreamFilter.html

Demo

package forum10351473;

import java.io.FileReader;
import javax.xml.stream.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum10351473/input.xml"));
        xsr = xif.createFilteredReader(xsr, new StreamFilter() {

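            // Remembers the decision made at the most recent start/end tag so that
            // the text events that follow it are treated the same way.
            private boolean reportContent = false;

            @Override
            public boolean accept(XMLStreamReader reader) {
                if(reader.isStartElement() || reader.isEndElement()) {
                    // Report only elements that are outside the jaguar/search namespace.
                    reportContent = !"http://www.exlibrisgroup.com/xsd/jaguar/search".equals(reader.getNamespaceURI());
                }
                return reportContent;
            }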

        });

        // The XMLStreamReader (xsr) will now only report the events you care about.
        // You can process the XMLStreamReader yourself or pass as input to something
        // like JAXB.
        while(xsr.hasNext()) {
            if(xsr.isStartElement()) {
                System.out.println(xsr.getLocalName());
            }
            xsr.next();
        }
    }

}

Output

PrimoNMBib
record
display
title
sort
author
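
Keeping the Parent Elements and the Text (sketch)

To also keep the enclosing sear: elements and the text of title and author, one option is to track the path of the currently open elements and copy only whitelisted events to the output. The following is a minimal sketch, not part of the demo above: it swaps the StreamFilter for the StAX event API (XMLEventReader and XMLEventWriter), it matches on local names only (an assumption that ignores namespaces), and the names CutDownDemo, cutDown and SAMPLE, as well as the sample data, are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class CutDownDemo {

    // Assumed whitelist: local-name paths (prefixes ignored) of the elements whose text is kept.
    private static final String RECORD = "SEGMENTS/JAGROOT/RESULT/DOCSET/DOC/PrimoNMBib/record/";
    private static final List<String> LEAVES =
            Arrays.asList(RECORD + "display/title", RECORD + "sort/author");

    // Made-up sample input with two unwanted elements (control/sourceid and display/type).
    static final String SAMPLE =
            "<sear:SEGMENTS xmlns='http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib'"
            + " xmlns:sear='http://www.exlibrisgroup.com/xsd/jaguar/search'>"
            + "<sear:JAGROOT><sear:RESULT><sear:DOCSET TOTALHITS='1'><sear:DOC ID='1'>"
            + "<PrimoNMBib><record>"
            + "<control><sourceid>x</sourceid></control>"
            + "<display><title>Moby Dick</title><type>book</type></display>"
            + "<sort><author>Melville</author></sort>"
            + "</record></PrimoNMBib>"
            + "</sear:DOC></sear:DOCSET></sear:RESULT></sear:JAGROOT></sear:SEGMENTS>";

    // True if the element at this path is a wanted leaf or one of its ancestors.
    private static boolean keepElement(String path) {
        for (String leaf : LEAVES) {
            if (leaf.equals(path) || leaf.startsWith(path + "/")) {
                return true;
            }
        }
        return false;
    }

    // Returns the cut-down document as a string (without an XML declaration).
    static String cutDown(String xml) {
        try {
            XMLEventReader in = XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);
            Deque<String> open = new ArrayDeque<>(); // local names of the currently open elements

            while (in.hasNext()) {
                XMLEvent event = in.nextEvent();
                if (event.isStartElement()) {
                    open.addLast(event.asStartElement().getName().getLocalPart());
                    if (keepElement(String.join("/", open))) {
                        writer.add(event); // copies the tag with its attributes and namespace declarations
                    }
                } else if (event.isEndElement()) {
                    if (keepElement(String.join("/", open))) {
                        writer.add(event);
                    }
                    open.removeLast();
                } else if (event.isCharacters() && LEAVES.contains(String.join("/", open))) {
                    writer.add(event); // text directly inside title or author
                }
                // All other events (declaration, comments, other text) are dropped.
            }
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not filter the XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cutDown(SAMPLE));
    }
}
```

For the sample input this should print the sear: chain down to PrimoNMBib with only the title and author elements and their text left inside record. Whitespace between elements is dropped as well, so the result comes out on a single line. If the same local names can occur in other namespaces, compare the full QName instead of the local part.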

edited Apr 27, 2012 at 17:04

answered Apr 27, 2012 at 15:53

bdoughan



10 Comments

Piotr Over a year ago

Hi Blaise, thanks very much for the tip! How would I go about including the elements parent to PrimoNMBib also? :)

bdoughan Over a year ago

You want to include sear:DOC as well? This can all be controlled in the accept method. You just need to add the logic of when to accept/reject events.

Piotr Over a year ago

Yes, I want to include all the parent tags including <sear:SEGMENTS ...> along with the text values of <title> and <author>. Can you help? :)

bdoughan Over a year ago

You just need to play around with the logic in the accept method to get the behaviour that you are looking for.

Piotr Over a year ago

How do I do that though? I'm very confused as to how the method works... :(
