Java regular expression for extracting the data between tags

Question

I am trying to write a regular expression that extracts the data from a string like

<B Att="text">Test</B><C>Test1</C>

The extracted output needs to be Test and Test1. This is what I have so far:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelloWorld {
    public static void main(String[] args)
    {
        String s = "<B>Test</B>";
        String reg = "<.*?>(.*)<\\/.*?>";
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(s);
        while(m.find())
        {
            String s1 = m.group();
            System.out.println(s1);
        }
    }
}

But this is producing the result <B>Test</B>. Can anybody point out what I am doing wrong?

I don't have a complex XML file. These are nodes without any child nodes (i.e. a flat structure), so I thought a regex would be good enough.
– Asha, Commented Sep 15, 2010 at 16:06

Community · Accepted Answer · 2020-06-20 09:12:55Z

7

Three problems:

1. Your test string is incorrect: it is not the input from your question.
2. You need a non-greedy modifier in the group: (.*?) instead of (.*).
3. You need to specify which group you want (group 1).

Try this:

String s = "<B Att=\"text\">Test</B><C>Test1</C>"; // <-- Fix 1
String reg = "<.*?>(.*?)</.*?>";                   // <-- Fix 2
// ...
String s1 = m.group(1);                            // <-- Fix 3

You also don't need to escape a forward slash, so I removed that.

See it running on ideone.
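For reference, here is a minimal, self-contained sketch that puts the three fixes together. The class name TagTextExtractor and the helper method extract are made up for illustration; the pattern and the input string are the ones shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex and the input come from the answer.
class TagTextExtractor {
    // Lazy quantifiers (.*?) keep each match inside a single element.
    private static final Pattern TAG_TEXT = Pattern.compile("<.*?>(.*?)</.*?>");

    static List<String> extract(String input) {
        List<String> texts = new ArrayList<>();
        Matcher m = TAG_TEXT.matcher(input);
        while (m.find()) {
            // Group 1 is the text captured between the opening and closing tag.
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String s = "<B Att=\"text\">Test</B><C>Test1</C>";
        for (String text : extract(s)) {
            System.out.println(text);
        }
    }
}
```

This only holds up for flat input like the sample; nested elements or a literal > inside an attribute value would confuse the pattern.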

(Also, don't use regular expressions to parse HTML - use an HTML parser.)

edited Jun 20, 2020 at 9:12

CommunityBot


answered Sep 15, 2010 at 15:59

Mark Byers


4 Comments

Asha Over a year ago

Thanks, but this produces the output <B Att="text">Test</B> in the first iteration and <C>Test1</C> in the second. I want only Test and Test1 as output.

Mark Byers Over a year ago

@Asha: String s1 = m.group(1);

Asha Over a year ago

Working fine now. I had tried it before but gave the index as 0. I didn't realize the groups start from 1.

Mark Byers Over a year ago

@Asha: Group 0 means the entire match.
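To make the difference concrete, here is a small sketch that returns both groups for the first match. The class and method names are invented for this illustration; it relies only on the standard behaviour that group(0) (the same as group() with no argument) is the whole match, while numbered capturing groups start at 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class GroupIndexDemo {
    // Returns {whole match, first capturing group}, or an empty array if nothing matches.
    static String[] firstMatchGroups(String input) {
        Matcher m = Pattern.compile("<.*?>(.*?)</.*?>").matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("<B>Test</B>");
        System.out.println("group(0): " + groups[0]); // whole match, tags included
        System.out.println("group(1): " + groups[1]); // text between the tags
    }
}
```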

Marek · Accepted Answer · 2010-09-15 16:11:49Z

2

If you are using Eclipse, there is a nice plugin that lets you check your regular expression without writing a class to test it. Here is the link: http://regex-util.sourceforge.net/update/ To show the view, choose Window -> Show View -> Other, and then Regex Util.

I hope it helps you in your fight with regular expressions.

answered Sep 15, 2010 at 16:11

Marek


wheaties · Accepted Answer · 2010-09-15 16:00:34Z

1

It almost looks like you're trying to use regex on XML and/or HTML. I'd suggest not using regex and instead creating a parser or lexer to handle this type of arrangement.
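As a rough sketch of the parser route, the JDK's built-in DOM parser can handle the flat input from the question. Two assumptions are mine, not the asker's: the fragment is wrapped in a made-up <root> element (an XML document needs a single root), and the input is well-formed XML. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration.
class FlatXmlText {
    static List<String> childTexts(String fragment) {
        // Assumption: wrap the fragment in an invented <root> so it parses as one document.
        String xml = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> texts = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Only element nodes are of interest; skip stray text between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    texts.add(child.getTextContent());
                }
            }
            return texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        for (String text : childTexts("<B Att=\"text\">Test</B><C>Test1</C>")) {
            System.out.println(text);
        }
    }
}
```

Unlike the regex, this rejects malformed input with an exception instead of silently returning partial matches.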

answered Sep 15, 2010 at 16:00

wheaties


Garis M Suero · Accepted Answer · 2010-09-15 16:01:34Z

1

I think the best way to handle XML nodes and get their values is to treat the input as XML.

If you really want to stick to regex try:

<B[^>]*>(.+?)</B\s*>

keeping in mind that this will only ever give you the value of the B tag.

Or, if you want the value of any tag, use something like:

<.*?>(.*?)</.*?>
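A short sketch comparing the two patterns on the input from the question may help. The class name and the extract helper are hypothetical; the two regexes are the ones given above, written as Java string literals (so \s becomes \\s).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class for illustration.
class TagPatternComparison {
    static final String B_ONLY = "<B[^>]*>(.+?)</B\\s*>";
    static final String ANY_TAG = "<.*?>(.*?)</.*?>";

    // Collects capturing group 1 of every match of regex in input.
    static List<String> extract(String regex, String input) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String s = "<B Att=\"text\">Test</B><C>Test1</C>";
        System.out.println("B only:  " + extract(B_ONLY, s));
        System.out.println("Any tag: " + extract(ANY_TAG, s));
    }
}
```

With the sample input, the B-specific pattern yields only Test, while the generic one yields both Test and Test1.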

answered Sep 15, 2010 at 16:01

Garis M Suero
