Remove empty tags from XML using Java

Question

I'm adding some functionality to a servlet. When it receives the InputStream (which is basically a PDF document parsed into an XML format), I store that data in a String object and then try to delete all the empty tags, but I haven't had any good results so far.

This is the data the servlet is receiving:

    <form1>
        <GenInfo>
            <Section1>
                <EmployeeDet>
                    <Title>999990000</Title>
                    <Firstname>MIKE</Firstname>
                    <Surname>SPENCER</Surname>
                    <CoName/>
                    <EmpAdd>
                        <Address><Add1/><Add2/><Town/><County/><Pcode/></Address>
                    </EmpAdd>
                    <PosHeld>DEVELOPER</PosHeld>
                    <Email/>
                    <ConNo/>
                    <Nationality/>
                    <PPSNo/>
                    <EmpNo/>
                </EmployeeDet>
            </Section1>
        </GenInfo>
    </form1>

The final result should look like this:

    <form1>
        <GenInfo>
            <Section1>
                <EmployeeDet>
                    <Title>999990000</Title>
                    <Firstname>MIKE</Firstname>
                    <Surname>SPENCER</Surname>
                    <PosHeld>DEVELOPER</PosHeld>
                </EmployeeDet>
            </Section1>
        </GenInfo>
    </form1>

My apologies if this is a repeated question, but I did some research on similar posts and none of them gave me the correct approach, which is why I am asking in a separate post.

Thank you in advance.

What API are you using to parse the XML? Parse the XML and go through all elements. Delete elements that have no content, no children and no attributes.
– JP Moresmau, commented Jun 1, 2015 at 15:44

Shar1er80 · Accepted Answer · 2015-06-01 17:09:48Z

10

Here's a regex way of doing what you want. I'm sure there are some edge cases that I'm not thinking of; with regex you often can't tell. Also, a DOM parser is probably the best way to do this.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RemoveEmptyTags {
        public static void main(String[] args) throws Exception {
            // Patterns are applied in order, one pass each, so an element that only
            // becomes empty after its children are removed is caught by a later pattern.
            String[] patterns = new String[] {
                // This will remove empty elements that look like <ElementName/>
                "\\s*<\\w+/>",
                // This will remove empty elements that look like <ElementName></ElementName>
                "\\s*<\\w+></\\w+>",
                // This will remove empty elements that look like
                // <ElementName>
                // </ElementName>
                "\\s*<\\w+>\n*\\s*</\\w+>"
            };

            String xml = "    <form1>\n" +
                         "        <GenInfo>\n" +
                         "            <Section1>\n" +
                         "                <EmployeeDet>\n" +
                         "                    <Title>999990000</Title>\n" +
                         "                    <Firstname>MIKE</Firstname>\n" +
                         "                    <Surname>SPENCER</Surname>\n" +
                         "                    <CoName/>\n" +
                         "                    <EmpAdd>\n" +
                         "                        <Address><Add1/><Add2/><Town/><County/><Pcode/></Address>\n" +
                         "                    </EmpAdd>\n" +
                         "                    <PosHeld>DEVELOPER</PosHeld>\n" +
                         "                    <Email/>\n" +
                         "                    <ConNo/>\n" +
                         "                    <Nationality/>\n" +
                         "                    <PPSNo/>\n" +
                         "                    <EmpNo/>\n" +
                         "                </EmployeeDet>\n" +
                         "            </Section1>\n" +
                         "        </GenInfo>\n" +
                         "    </form1>";

            // Remove every match of each pattern from the XML string.
            for (String pattern : patterns) {
                Matcher matcher = Pattern.compile(pattern).matcher(xml);
                xml = matcher.replaceAll("");
            }

            System.out.println(xml);
        }
    }

Results:

    <form1>
        <GenInfo>
            <Section1>
                <EmployeeDet>
                    <Title>999990000</Title>
                    <Firstname>MIKE</Firstname>
                    <Surname>SPENCER</Surname>
                    <PosHeld>DEVELOPER</PosHeld>
                </EmployeeDet>
            </Section1>
        </GenInfo>
    </form1>


3 Comments

Shar1er80 Over a year ago

@JoseBerciano You're welcome... Kindly click the check mark next to my answer so your question is marked as solved and removed from the unanswered list.

Ramesh Over a year ago

It fails if the tag is <abc:title/>. Can you help me find a regex for it?

ralph Over a year ago

Same problem here. @Ramesh, have you found a solution to it?
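For the prefixed-tag case raised in the comments above, one possible variation of the regex approach is sketched below. It is not from the original answer: the class name `RemoveEmptyPrefixedTags`, the method `removeEmptyElements` and the pattern itself are illustrative assumptions. The pattern accepts an optional namespace prefix, uses a back-reference so the closing tag must match the opening one, and is reapplied until nothing changes so that parents emptied by an earlier pass are removed too. Elements with attributes are left alone, and text inside comments or CDATA sections that happens to look like an empty tag would still be removed, so a DOM parser remains the safer choice.

```java
import java.util.regex.Pattern;

public class RemoveEmptyPrefixedTags {

    // Assumption: names are made of word characters, '.' and '-', with an
    // optional "prefix:" part. Matches <name/>, <name></name> and an open/close
    // pair separated only by whitespace, plus any whitespace before the element.
    private static final Pattern EMPTY_ELEMENT = Pattern.compile(
            "\\s*<([\\w.-]+(?::[\\w.-]+)?)\\s*(?:/>|>\\s*</\\1\\s*>)");

    // Reapplies the pattern until the string stops changing, so elements that
    // become empty after their children are removed are deleted as well.
    public static String removeEmptyElements(String xml) {
        String previous;
        do {
            previous = xml;
            xml = EMPTY_ELEMENT.matcher(xml).replaceAll("");
        } while (!xml.equals(previous));
        return xml;
    }

    public static void main(String[] args) {
        String xml = "<abc:root>\n"
                + "  <abc:title/>\n"
                + "  <abc:name>MIKE</abc:name>\n"
                + "  <abc:addr>\n"
                + "    <abc:town></abc:town>\n"
                + "  </abc:addr>\n"
                + "</abc:root>";
        System.out.println(removeEmptyElements(xml));
    }
}
```

With the sample input in `main`, only `<abc:root>` and `<abc:name>MIKE</abc:name>` should remain. Note that if every element is empty, the root element is removed as well.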

Tavo · Answer · 2015-06-01 16:00Z (edited 2017-05-23 10:29:33Z by Community)

1

What you have to do is iterate recursively over all the nodes. Once you've found a leaf, if it's empty, just remove it.

There is a very good example using a DOM parser here
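The recursive idea described in this answer can be sketched with the JDK's built-in DOM API. This is a minimal illustration, not the example the answer links to: the class name `DomEmptyElementRemover` and its methods are hypothetical. It assumes an element is "empty" when it has no attributes, no child elements and no non-whitespace text, which means an element containing only a comment is also treated as empty. The root element is never removed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomEmptyElementRemover {

    // Depth-first: prune the children first, then report whether this node is
    // itself empty so the caller can remove it.
    public static boolean prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // save before a possible removal
            if (child.getNodeType() == Node.ELEMENT_NODE && prune(child)) {
                node.removeChild(child);
            }
            child = next;
        }
        return isEmpty(node);
    }

    // Assumed definition of "empty": no attributes, no child elements,
    // and no text other than whitespace.
    private static boolean isEmpty(Node node) {
        if (node.hasAttributes()) {
            return false;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                return false;
            }
            if ((type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE)
                    && !c.getNodeValue().trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Parses the string, prunes below the root, and serializes the result.
    public static String removeEmptyElements(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            prune(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<EmployeeDet><Title>999990000</Title><CoName/>"
                + "<EmpAdd><Address><Add1/><Add2/></Address></EmpAdd>"
                + "<PosHeld>DEVELOPER</PosHeld><Email/></EmployeeDet>";
        System.out.println(removeEmptyElements(xml));
    }
}
```

In a servlet, the request's InputStream could be passed straight to `DocumentBuilder.parse` instead of building a String first. Whitespace-only text nodes that sat between removed elements stay in the tree, so indented input may leave blank lines in the output.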
