3

I am trying to write a regular expression which extracts the data from a string like

<B Att="text">Test</B><C>Test1</C>

The extracted output needs to be Test and Test1. This is what I have done till now:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelloWorld {
    public static void main(String[] args)
    {
        String s = "<B>Test</B>";
        String reg = "<.*?>(.*)<\\/.*?>";
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(s);
        while(m.find())
        {
            String s1 = m.group();
            System.out.println(s1);
        }
    }
}

But this is producing the result <B>Test</B>. Can anybody point out what I am doing wrong?

2
  • Why don't you use an XML parser? Commented Sep 15, 2010 at 16:01
  • I don't have a complex XML file. These are nodes without any child nodes (i.e. a flat structure), so I thought a regex would be good enough. Commented Sep 15, 2010 at 16:06

4 Answers

7

Three problems:

  • Your test string is incorrect.
  • You need a non-greedy modifier in the group.
  • You need to specify which group you want (group 1).

Try this:

String s = "<B Att=\"text\">Test</B><C>Test1</C>"; // <-- Fix 1
String reg = "<.*?>(.*?)</.*?>";                   // <-- Fix 2
// ...
String s1 = m.group(1);                            // <-- Fix 3

You also don't need to escape a forward slash, so I removed that.
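
Putting the three fixes together, a minimal self-contained sketch might look like the following. The class name TagTextExtractor and the helper method extract are hypothetical names chosen for this illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class TagTextExtractor {
    // Lazy quantifiers (.*?) keep each match confined to a single element.
    private static final Pattern TAG = Pattern.compile("<.*?>(.*?)</.*?>");

    // Collects the text between each opening and closing tag.
    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // group 1 is the captured text, not the whole match
        }
        return values;
    }

    public static void main(String[] args) {
        String s = "<B Att=\"text\">Test</B><C>Test1</C>";
        for (String value : extract(s)) {
            System.out.println(value); // prints Test, then Test1
        }
    }
}
```

Calling m.group() with no argument is the same as m.group(0), which returns the entire match including the tags.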

See it running on ideone.

(Also, don't use regular expressions to parse HTML - use an HTML parser.)


4 Comments

Thanks, but this produces the output <B Att="text">Test</B> on the first iteration and <C>Test1</C> on the second. I want only Test and Test1 as output.
@Asha: String s1 = m.group(1);
Working fine now. I had tried that before but gave the index as 0. I didn't realize the group numbering starts from 1.
@Asha: Group 0 means the entire match.
2

If you are using Eclipse, there is a nice plugin that lets you check your regular expression without writing a class to test it. Here is the link: http://regex-util.sourceforge.net/update/ To show the view, choose Window -> Show View -> Other, and then Regex Util.

I hope it helps you in your fight with regular expressions.

1

It almost looks like you're trying to use regex on XML and/or HTML. I'd suggest not using regex and instead creating a parser or lexer to handle this type of arrangement.
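
As a rough sketch of the parser route, the JDK's built-in DOM parser can read the flat structure from the question. The sample string has two top-level elements, so this example wraps it in an artificial root element before parsing; that wrapper, the class name FlatXmlReader, and the method textOfChildren are assumptions made for this illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class FlatXmlReader {
    // Returns the text content of each top-level element in the fragment.
    static List<String> textOfChildren(String fragment) {
        // Assumption: the fragment has no single root, so add one to make it well-formed.
        String wrapped = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> values = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    values.add(child.getTextContent());
                }
            }
            return values;
        } catch (Exception e) {
            // Malformed input or parser setup failure.
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOfChildren("<B Att=\"text\">Test</B><C>Test1</C>"));
    }
}
```

Unlike the regex, this rejects malformed input instead of silently matching part of it.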

1

I think the best way to get the values of XML nodes is simply to treat the input as XML.

If you really want to stick to regex try:

<B[^>]*>(.+?)</B\s*>

keeping in mind that this will only ever give you the value of the B tag.

Or, if you want the value of any tag, you can use something like:

<.*?>(.*?)</.*?>
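
To see the difference between the two patterns, here is a small sketch that runs both against the sample string from the question. The class name PatternComparison and the helper capture are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class PatternComparison {
    // Returns capture group 1 of every match of regex in input.
    static List<String> capture(String regex, String input) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "<B Att=\"text\">Test</B><C>Test1</C>";
        // Tag-specific pattern: only the B element matches.
        System.out.println(capture("<B[^>]*>(.+?)</B\\s*>", input)); // [Test]
        // Generic pattern: every element matches.
        System.out.println(capture("<.*?>(.*?)</.*?>", input));      // [Test, Test1]
    }
}
```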
