Use Java Regex to parse xml file

Question

For some reason I cannot use SAX or DOM parsers and need to parse the file with regex.

I want to extract the values as key-value pairs (the key being the content of tag1, the value being the content of tag3). Some of the keys don't have a tag3 value, and I have to ignore those keys.

XML file

<Main Tag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></Main Tag>

The above xml file with indentation:

<Main Tag>
    <element>
        <tag1>Key1</tag1>
        <tag2>Not intrested</tag2>
        <tag3>Value1</tag3>
    </element>
    <element>
        <tag1>Key2</tag1>
        <tag2>Not intrested</tag2>
    </element>
    <element>
        <tag1>Key3</tag1>
        <tag2>Not intrested</tag2>
        <tag3>Value3</tag3>
    </element>
</Main Tag>

So from the above file I need to extract Key1-Value1 and Key3-Value3, ignoring Key2 because it doesn't have a value.

Using the matcher:

final Pattern pattern = Pattern.compile("<tag1>(.+?)</tag1>.*<tag3>(.+?)</tag3>");
final Matcher matcher = pattern.matcher(xmlString); // xmlString holds the XML above
matcher.find();
System.out.println(matcher.group(1)); // gives Key1
System.out.println(matcher.group(2)); // gives Value3 instead of Value1

I tried using SAX and StAX, but they are not parsing the entire file; that's why I've chosen to parse using regex.
– LeDerp, Commented Jun 10, 2015 at 15:49
It'll take you much longer to figure out how to parse XML successfully using regex, and you're also in for a world of hurt. You're better off trying to figure out why using the XML parser isn't working for you.
– Vivin Paliath, Commented Jun 10, 2015 at 15:53
If you check the highlighting in your source you will see that <Main Tag> is not allowed. Valid XML tag names cannot contain whitespace, and attributes always use the long form (unlike HTML). In other words, the example source you posted is not XML.
– ThW, Commented Jun 10, 2015 at 15:57
That file was just an example; I just wrote it. Don't bother about the rules, it's not the exact file. Thank you anyway.
– LeDerp, Commented Jun 10, 2015 at 16:08

Shar1er80 · Accepted Answer · 2015-06-10 16:45:54Z

3

Give this pattern a try:

"<(tag[13])>(.+?)</tag[13]>"

Usage:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) throws Exception {
    String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";

    // Group 1 is the tag name (tag1 or tag3), group 2 is the text between the tags
    Matcher matcher = Pattern.compile("<(tag[13])>(.+?)</tag[13]>").matcher(xmlString);
    while (matcher.find()) {
        System.out.println(matcher.group(1) + " " + matcher.group(2));
    }
}

Results:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3
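
The output above lists the tags one by one, so the keys still have to be paired with their values, and Key2 still has to be dropped. The question's original pattern returns Value3 because its greedy .* runs on to the last tag3 in the string. Below is a minimal sketch of one way to get pairs, assuming every pair sits inside its own <element> block as in the sample: match one block at a time, then look for tag1 and tag3 inside that block only. The names ElementPairs and extractPairs are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ElementPairs {
    // Reluctant match of one <element>...</element> block; DOTALL lets '.' cross line breaks.
    private static final Pattern ELEMENT = Pattern.compile("<element>(.*?)</element>", Pattern.DOTALL);
    private static final Pattern KEY = Pattern.compile("<tag1>(.+?)</tag1>", Pattern.DOTALL);
    private static final Pattern VALUE = Pattern.compile("<tag3>(.+?)</tag3>", Pattern.DOTALL);

    static Map<String, String> extractPairs(String xml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher element = ELEMENT.matcher(xml);
        while (element.find()) {
            String body = element.group(1);
            Matcher key = KEY.matcher(body);
            Matcher value = VALUE.matcher(body);
            // Keep the pair only when this block contains both tags.
            if (key.find() && value.find()) {
                pairs.put(key.group(1), value.group(1));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
        System.out.println(extractPairs(xmlString)); // {Key1=Value1, Key3=Value3}
    }
}
```

This only holds while <element> blocks are not nested and the tags carry no attributes; anything less regular is where a real parser becomes the safer choice.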

NON REGEX

Or you could use Document from the org.w3c.dom package together with DocumentBuilderFactory from javax.xml.parsers.

Something like:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public static void main(String[] args) throws Exception {
    String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
    Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new ByteArrayInputStream(xmlString.getBytes("utf-8"))));

    Node rootNode = xmlDocument.getFirstChild();
    if (rootNode.hasChildNodes()) {
        // Get each element child node
        NodeList elementsList = rootNode.getChildNodes();
        for (int i = 0; i < elementsList.getLength(); i++) {
            if (elementsList.item(i).hasChildNodes()) {
                // Get each tag child node of the element node
                NodeList tagsList = elementsList.item(i).getChildNodes();
                for (int i2 = 0; i2 < tagsList.getLength(); i2++) {
                    Node tagNode = tagsList.item(i2);
                    if (tagNode.getNodeName().matches("tag1|tag3")) {
                        System.out.println(tagNode.getNodeName() + " " + tagNode.getTextContent());
                    }
                }
            }
        }
    }
}

Results:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3

answered Jun 10, 2015 at 16:45


3 Comments

LeDerp Over a year ago

Can you give me a pattern for when the tags I need to extract are different, like <apple>key1</apple> and <orange>value 1</orange> instead of tag1 and tag3? Which groups should I search for?

Shar1er80 Over a year ago

@LeDerp Can tag1 and tag3 names be completely random in your data?

LeDerp Over a year ago

No, it's not random; the tags just aren't named tag1 and tag3, hence I can't use tag[13]. Anyway, I solved it using <(tag1)>(.+?)</tag1>|<(tag3)>(.+?)</tag3>
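
On the "which groups" question: with an alternation like the one in the comment above, the capturing groups are numbered left to right across both branches, so a match of the first branch fills groups 1 and 2 and leaves 3 and 4 null, and a match of the second branch does the opposite. A small sketch, using the apple/orange names from the earlier comment; the class name AlternationGroups and the <pear> filler tag are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationGroups {
    // Groups 1-2 belong to the <apple> branch, groups 3-4 to the <orange> branch.
    private static final Pattern TAGS =
            Pattern.compile("<(apple)>(.+?)</apple>|<(orange)>(.+?)</orange>");

    static List<String> tagsAndText(String xml) {
        List<String> found = new ArrayList<>();
        Matcher m = TAGS.matcher(xml);
        while (m.find()) {
            if (m.group(1) != null) {
                // The first branch matched; groups 3 and 4 are null here.
                found.add(m.group(1) + " " + m.group(2));
            } else {
                // The second branch matched; groups 1 and 2 are null here.
                found.add(m.group(3) + " " + m.group(4));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<element><apple>Key1</apple><pear>skip</pear><orange>Value 1</orange></element>"
                + "<element><apple>Key2</apple><pear>skip</pear></element>";
        System.out.println(tagsAndText(xml)); // [apple Key1, orange Value 1, apple Key2]
    }
}
```

A variant with a single alternation and a backreference, written in Java source as "<(apple|orange)>(.+?)</\\1>", keeps the tag name in group 1 and the text in group 2 regardless of which tag matched.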

BobMcGee · Accepted Answer · 2016-05-18 21:23:42Z

2

The tool you want to be using is XPath -- it was specifically designed for exactly what you're doing.
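
As a rough illustration of the XPath route, here is a sketch using the javax.xml.xpath API that ships with the JDK. It assumes the well-formed <MainTag> variant of the sample and a document small enough to load as a DOM tree; the class and method names (XPathPairs, extractPairs) are invented for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathPairs {
    static Map<String, String> extractPairs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // The predicate keeps only <element> nodes that have both a tag1 and a tag3 child.
        NodeList elements = (NodeList) xpath.evaluate(
                "//element[tag1 and tag3]", doc, XPathConstants.NODESET);

        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < elements.getLength(); i++) {
            Node element = elements.item(i);
            // Relative expressions are evaluated against the current <element>.
            pairs.put(xpath.evaluate("tag1", element), xpath.evaluate("tag3", element));
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
        System.out.println(extractPairs(xmlString)); // {Key1=Value1, Key3=Value3}
    }
}
```

The predicate does the "ignore keys without a value" filtering, so no extra checks are needed in the loop. Because this builds the whole tree in memory, it is only a fit for documents of moderate size.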

If you can't parse an XML document with a standard tool, there's a reason, and it's usually easier to fix that than to write a regex.

Do you see an error if you enable more verbose parsing, and if so, what kind? (It may be helpful to use a command-line XML parser rather than Java libraries in this case, for better output.)

The three most common problems I've seen in XML parsing:

1. Misconfigured namespaces: you'll get errors in validation/extraction.
2. A subtly malformed XML document (for example, illegal characters such as 0x02). Sometimes these are unprintable, so you will not even see them.
3. Too big to parse in memory: you run out of memory during parsing (generally a DOM problem, not a SAX one).

Some parsers are more or less strict about such things; you might want to try a couple of tools, or enable less-strict modes.

JTidy or TagSoup may be able to fix some issues with improper XML, if it originates from HTML.

edited May 18, 2016 at 21:23

answered Jun 10, 2015 at 15:56


2 Comments

LeDerp Over a year ago

I'm trying to parse a very big XML file (~500 MB). The problem with the SAX/StAX parser is that it is not extracting the entire text in the tag, just a part of it; I don't know why. Usually the text between the tags is not indented, is not in a proper format, and contains all kinds of characters, including !@#$%^&(); & , ™ etc. are also used.

BobMcGee Over a year ago

DOM will probably run out of memory, but there are (limited) SAX-based XPath implementations. What part of the output are they trimming out? Does the document validate as correct XML? If so, a SAX XPath implementation will handle special characters correctly, but your XPath may not be correct.
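
One possible explanation for the truncated text, which nothing in this thread confirms, is that SAX characters() callbacks and StAX CHARACTERS events are allowed to deliver one element's text in several chunks, so code that keeps only one chunk sees partial text. The sketch below streams with StAX and reads each tag with getElementText(), which returns the full text of a text-only element. It assumes tag1 and tag3 contain no child elements; StaxPairs and extractPairs are hypothetical names.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxPairs {
    static Map<String, String> extractPairs(Reader source) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to merge adjacent character data into a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(source);

        Map<String, String> pairs = new LinkedHashMap<>();
        String key = null;
        String value = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("element")) {
                        key = null;   // start of a new block: forget the previous pair
                        value = null;
                    } else if (name.equals("tag1")) {
                        key = reader.getElementText();   // whole text up to </tag1>
                    } else if (name.equals("tag3")) {
                        value = reader.getElementText(); // whole text up to </tag3>
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("element")) {
                    // Keep the pair only if this block had both a key and a value.
                    if (key != null && value != null) {
                        pairs.put(key, value);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return pairs;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
        System.out.println(extractPairs(new StringReader(xmlString))); // {Key1=Value1, Key3=Value3}
    }
}
```

For a large file the StringReader would be replaced by a reader or stream over the file, and each pair could be handled as soon as its </element> is reached instead of being collected into a map. If the file is not well-formed XML, the parser will still stop with an exception, which at least reports where the problem is.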
