For some reason I cannot use SAX or DOM parsers, so I need to parse the XML with regex.

I want to extract key-value pairs (the key being the content of tag1, the value being the content of tag3), but some elements have a key and no value (no tag3). I have to ignore those keys.

XML file

<Main Tag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></Main Tag>

The above xml file with indentation:

<Main Tag>
    <element>
        <tag1>Key1</tag1>
        <tag2>Not intrested</tag2>
        <tag3>Value1</tag3>
    </element>
    <element>
        <tag1>Key2</tag1>
        <tag2>Not intrested</tag2>
    </element>
    <element>
        <tag1>Key3</tag1>
        <tag2>Not intrested</tag2>
        <tag3>Value3</tag3>
    </element>
</Main Tag> 

So from the file above I need to extract Key1-Value1 and Key3-Value3, ignoring Key2 because it doesn't have a value.

Using the matcher:

final Pattern pattern = Pattern.compile("<tag1>(.+?)</tag1>.*<tag3>(.+?)</tag3>");
final Matcher matcher = pattern.matcher(xmlString); // xmlString holds the XML shown above
matcher.find();
System.out.println(matcher.group(1)); // gives Key1
System.out.println(matcher.group(2)); // gives Value3 instead of Value1
  • You'd probably be better off using an actual XML parser. Commented Jun 10, 2015 at 15:48
  • I tried using SAX and StAX, but they don't parse the entire file; that's why I chose to parse it using regex. Commented Jun 10, 2015 at 15:49
  • It'll take you much longer to figure out how to parse XML successfully using regex, and you're also in for a world of hurt. You're better off trying to figure out why the XML parser isn't working for you. Commented Jun 10, 2015 at 15:53
  • If you check the highlighting in your source you will see that <Main Tag> is not allowed. Valid XML tag names cannot contain whitespace, and attributes always use the long form (unlike HTML). In other words, the example source you posted is not XML. Commented Jun 10, 2015 at 15:57
  • That file was just an example I wrote. Don't worry about the rules; it's not the exact file. Thank you anyway. Commented Jun 10, 2015 at 16:08

2 Answers

3

Give this pattern a try:

"<(tag[13])>(.+?)</tag[13]>"

Usage:

public static void main(String[] args) throws Exception {
    String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";

    Matcher matcher = Pattern.compile("<(tag[13])>(.+?)</tag[13]>").matcher(xmlString);
    while (matcher.find()) {
        System.out.println(matcher.group(1) + " " + matcher.group(2));
    }
}

Results:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3
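
Note that the output above still lists Key2, which has no value. One possible way to pair each key with its value and skip keys without one is to match each <element> block first and then look for tag1 and tag3 inside it. The following is only a sketch under that assumption; the class name TagPairExtractor and the method extractPairs are made up for illustration, and it assumes <element> blocks are never nested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TagPairExtractor {
    // Reluctant (.*?) stops at the first </element>; DOTALL lets '.' span line breaks.
    private static final Pattern ELEMENT = Pattern.compile("<element>(.*?)</element>", Pattern.DOTALL);
    private static final Pattern KEY = Pattern.compile("<tag1>(.+?)</tag1>", Pattern.DOTALL);
    private static final Pattern VALUE = Pattern.compile("<tag3>(.+?)</tag3>", Pattern.DOTALL);

    // Returns key -> value for every element that contains both tag1 and tag3.
    static Map<String, String> extractPairs(String xml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher element = ELEMENT.matcher(xml);
        while (element.find()) {
            String body = element.group(1);
            Matcher key = KEY.matcher(body);
            Matcher value = VALUE.matcher(body);
            // Elements without a tag3 (such as Key2) are skipped here.
            if (key.find() && value.find()) {
                pairs.put(key.group(1), value.group(1));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
        // Prints "Key1-Value1" and then "Key3-Value3"
        extractPairs(xmlString).forEach((k, v) -> System.out.println(k + "-" + v));
    }
}
```

Restricting each search to one element body is what keeps a match from running across element boundaries, which is the reason the greedy .* in the question's pattern ends up at Value3.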

NON REGEX

Or you could use Document (from the org.w3c.dom package) together with DocumentBuilderFactory (from the javax.xml.parsers package).

Something like:

public static void main(String[] args) throws Exception {
    String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
    Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new ByteArrayInputStream(xmlString.getBytes("utf-8"))));

    Node rootNode = xmlDocument.getFirstChild();
    if (rootNode.hasChildNodes()) {
        // Get each element child node
        NodeList elementsList = rootNode.getChildNodes();
        for (int i = 0; i < elementsList.getLength(); i++) {
            if (elementsList.item(i).hasChildNodes()) {
                // Get each tag child node to element node
                NodeList tagsList = elementsList.item(i).getChildNodes();
                for (int i2 = 0; i2 < tagsList.getLength(); i2++) {
                    Node tagNode = tagsList.item(i2);
                    if (tagNode.getNodeName().matches("tag1|tag3")) {
                        System.out.println(tagNode.getNodeName() + " " + tagNode.getTextContent());
                    }
                }
            }
        }
    }
}

Results:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3

3 Comments

Can you give me a pattern for when the tags I need to extract are different, like <apple>key1</apple> and <orange>value 1</orange> instead of tag1 and tag3? Which groups should I search for?
@LeDerp Can tag1 and tag3 names be completely random in your data?
No, they're not random, but they aren't really named tag1 and tag3, so I can't use tag[13]. Anyway, I solved it using <(tag1)>(.+?)</tag1>|<(tag3)>(.+?)</tag3>
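
On the question of which groups to read: with the alternation from the comment above, a tag1 match fills groups 1 and 2 and leaves groups 3 and 4 null, while a tag3 match does the opposite. A possible alternative, sketched below, lists the tag names in one group and uses a backreference for the closing tag, so the name is always group 1 and the content always group 2. The tag names apple and orange come from the comment; the class NamedTagMatcher and the method findTags are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class NamedTagMatcher {
    // Group 1 = tag name, group 2 = tag content.
    // The backreference \1 requires the closing tag to repeat the opening name.
    private static final Pattern TAGS = Pattern.compile("<(apple|orange)>(.+?)</\\1>");

    static List<String> findTags(String xml) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TAGS.matcher(xml);
        while (matcher.find()) {
            found.add(matcher.group(1) + " " + matcher.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<element><apple>key1</apple><pear>skip</pear><orange>value 1</orange></element>";
        // Prints "apple key1" and then "orange value 1"
        findTags(xml).forEach(System.out::println);
    }
}
```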
2

The tool you want to be using is XPath -- it was specifically designed for exactly what you're doing.
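
As a rough illustration of the XPath approach, the sketch below uses the standard javax.xml.xpath API on the sample document from the question (with the root renamed to MainTag so that it is well-formed). The predicate in //element[tag3] selects only elements that have a tag3 child, so Key2 is skipped. The class name XPathPairs and the method extractPairs are made up for this example, and because it builds a DOM tree it assumes the document fits in memory.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class XPathPairs {
    static Map<String, String> extractPairs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select only <element> nodes that have a <tag3> child.
        NodeList elements = (NodeList) xpath.evaluate("//element[tag3]", doc, XPathConstants.NODESET);

        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < elements.getLength(); i++) {
            // Relative expressions are evaluated against the current element.
            String key = xpath.evaluate("tag1", elements.item(i));
            String value = xpath.evaluate("tag3", elements.item(i));
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
        extractPairs(xmlString).forEach((k, v) -> System.out.println(k + "-" + v));
    }
}
```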

If you can't parse an XML document with a standard tool, there's a reason, and it's usually easier to fix that than to write a regex.

Do you see an error if you enable more verbose parsing, and if so, what kind? (In this case it may be helpful to use a command-line XML parser rather than Java libraries, for better output.)

The three most common problems I've seen in XML parsing:

  • Misconfigured namespaces: you'll get errors in validation/extraction
  • A subtly malformed XML document (for example, illegal characters such as 0x02). Sometimes these are unprintable, so you will not even see them.
  • Too big to parse in memory -- run out of memory during parsing (DOM problem generally, not SAX)
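
For the second item, one way to check for characters that XML 1.0 does not allow is to scan the text against the character ranges from the XML 1.0 specification. This is only a sketch: the class XmlCharScanner and its methods are hypothetical, and for a very large file the same check would have to run over a stream rather than a String.

```java
// Hypothetical helper for illustration only.
class XmlCharScanner {
    // Character ranges allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index of the first illegal character, or -1 if there is none.
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!isLegalXmlChar(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        String suspicious = "<tag1>Key" + (char) 0x02 + "</tag1>";
        System.out.println("First illegal character at index " + firstIllegalIndex(suspicious));
    }
}
```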

Some parsers are more or less strict about such things; you might want to try a couple of tools or enable less-strict modes.

JTidy or TagSoup may be able to fix some issues with improper XML, if it originates from HTML.

2 Comments

I'm trying to parse a very big XML file (~500 MB). The problem with the SAX/StAX parser is that it does not extract the entire text in a tag, only a part of it, and I don't know why. The text between the tags is usually not indented and not in a proper format, and it contains all kinds of characters, including !@#$%^&(), as well as entities such as &amp; and &trade;.
DOM will probably run out of memory, but there are (limited) SAX-based XPath implementations -- what part of the output are they trimming out? Does the document validate as correct XML? If so, a SAX XPath implementation will handle special characters correctly, but your XPath may not be correct.
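
One possible cause of the partial text, offered here as an assumption rather than a diagnosis: a SAX parser may deliver the text of a single element through several characters() calls (for example around entity references such as &amp; or at internal buffer boundaries), so a handler that keeps only one call's worth of data sees just a fragment. The sketch below accumulates the chunks in a StringBuilder until the end tag arrives. The class name Tag1TextCollector and the method collect are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only: collects the full text of every <tag1>.
class Tag1TextCollector extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean insideTag1;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("tag1".equals(qName)) {
            insideTag1 = true;
            buffer.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of overwriting.
        if (insideTag1) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("tag1".equals(qName)) {
            texts.add(buffer.toString()); // the text is complete only at the end tag
            insideTag1 = false;
        }
    }

    static List<String> collect(String xml) throws Exception {
        Tag1TextCollector handler = new Tag1TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        collect("<root><tag1>Fish &amp; chips</tag1><tag1>Key2</tag1></root>")
                .forEach(System.out::println);
    }
}
```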
