91

I have a file with some custom tags and I'd like to write a regular expression to extract the string between the tags. For example if my tag is:

[customtag]String I want to extract[/customtag]

How would I write a regular expression to extract only the string between the tags. This code seems like a step in the right direction:

Pattern p = Pattern.compile("[customtag](.+?)[/customtag]");
Matcher m = p.matcher("[customtag]String I want to extract[/customtag]");

Not sure what to do next. Any ideas? Thanks.

1
  • 1
    For starters, you need to escape the [] square brackets which are metacharacters in a regex. Commented Jul 3, 2011 at 2:28

8 Answers 8

199

You're on the right track. Now you just need to extract the desired group, as follows:

final Pattern pattern = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher("<tag>String I want to extract</tag>");
matcher.find();
System.out.println(matcher.group(1)); // Prints String I want to extract

If you want to extract multiple hits, try this:

public static void main(String[] args) {
    final String str = "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag>";
    System.out.println(Arrays.toString(getTagValues(str).toArray())); // Prints [apple, orange, pear]
}

private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

private static List<String> getTagValues(final String str) {
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}

However, I agree that regular expressions are not the best answer here. I'd use XPath to find elements I'm interested in. See The Java XPath API for more info.

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks so much, that is just what I needed. I will look into XPaths, but for now I think this solution will work. My applications is very simple and will probably stay that way. Thanks again!
What about this string "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear"? How can we get pear without close tag?
To generalize: private String extractDataFromTags(String tag) { Pattern pattern = Pattern.compile("<.+?>(.+?)</.+?>"); Matcher matcher = pattern.matcher(tag); matcher.find(); return (matcher.group(1)); // Prints String I want to extract or throws exception }
17

To be quite honest, regular expressions are not the best idea for this type of parsing. The regular expression you posted will probably work great for simple cases, but if things get more complex you are going to have huge problems (same reason why you cant reliably parse HTML with regular expressions). I know you probably don't want to hear this, I know I didn't when I asked the same type of questions, but string parsing became WAY more reliable for me after I stopped trying to use regular expressions for everything.

jTopas is an AWESOME tokenizer that makes it quite easy to write parsers by hand (I STRONGLY suggest jtopas over the standard java scanner/etc.. libraries). If you want to see jtopas in action, here are some parsers I wrote using jTopas to parse this type of file

If you are parsing XML files, you should be using an xml parser library. Dont do it youself unless you are just doing it for fun, there are plently of proven options out there

1 Comment

Thanks for the suggestion. I've bookmarked them and I will certainly look into using this in future projects. For now the regex method is probably the one I will go with since the file I'm parsing is very small/simple.
8

A generic,simpler and a bit primitive approach to find tag, attribute and value

    Pattern pattern = Pattern.compile("<(\\w+)( +.+)*>((.*))</\\1>");
    System.out.println(pattern.matcher("<asd> TEST</asd>").find());
    System.out.println(pattern.matcher("<asd TEST</asd>").find());
    System.out.println(pattern.matcher("<asd attr='3'> TEST</asd>").find());
    System.out.println(pattern.matcher("<asd> <x>TEST<x>asd>").find());
    System.out.println("-------");
    Matcher matcher = pattern.matcher("<as x> TEST</as>");
    if (matcher.find()) {
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(i + ":" + matcher.group(i));
        }
    }

2 Comments

What would be the pattern If there is a sequence of different tags or nested tags like <h2>Mac</h2><h1>loves it</h1> or <h2>Mac<h1>liked your answer</h1></h2>?
please edit i < matcher.groupCount(); to i <= matcher.groupCount(); to include first matching substring ie. at 0th index
5
    String s = "<B><G>Test</G></B><C>Test1</C>";

    String pattern ="\\<(.+)\\>([^\\<\\>]+)\\<\\/\\1\\>";

       int count = 0;

        Pattern p = Pattern.compile(pattern);
        Matcher m =  p.matcher(s);
        while(m.find())
        {
            System.out.println(m.group(2));
            count++;
        }

Comments

4

Try this:

Pattern p = Pattern.compile(?<=\\<(any_tag)\\>)(\\s*.*\\s*)(?=\\<\\/(any_tag)\\>);
Matcher m = p.matcher(anyString);

For example:

String str = "<TR> <TD>1Q Ene</TD> <TD>3.08%</TD> </TR>";
Pattern p = Pattern.compile("(?<=\\<TD\\>)(\\s*.*\\s*)(?=\\<\\/TD\\>)");
Matcher m = p.matcher(str);
while(m.find()){
   Log.e("Regex"," Regex result: " + m.group())       
}

Output:

10 Ene

3.08%

Comments

2
    final Pattern pattern = Pattern.compile("tag\\](.+?)\\[/tag");
    final Matcher matcher = pattern.matcher("[tag]String I want to extract[/tag]");
    matcher.find();
    System.out.println(matcher.group(1));

1 Comment

how about prefix for tag (if prefix is dynamic)
1

I prefix this reply with "you shouldn't use a regular expression to parse XML -- it's only going to result in edge cases that don't work right, and a forever-increasing-in-complexity regex while you try to fix it."

That being said, you need to proceed by matching the string and grabbing the group you want:

if (m.matches())
{
   String result = m.group(1);
   // do something with result
}

Comments

0

This works for me, use in your main method below Scanner input. Works for Hackerrank "Tag Content Extractor" also.

  boolean matchFound = false;
        Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
        Matcher m = r.matcher(line);

        while (m.find()) {
            System.out.println(m.group(2));
            matchFound = true;
        }
        if ( ! matchFound) {
            System.out.println("None");
        }
        
        testCases--;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.