2

I have the following code, which should remove all HTML from the parts of a string that are enclosed in dollar signs (there can be several such parts). The stripping works fine, but I also need to preserve the dollar signs themselves. Any suggestions?

private static String removeMarkupBetweenDollars(String input) {
    if ((input.length() - input.replaceAll("\\$", "").length()) % 2 != 0) {
        throw new RuntimeException("Missing or extra: dollar");
    }
    Pattern pattern = Pattern.compile("\\$(.*?)\\$", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(input);

    StringBuffer sb = new StringBuffer();

    while (matcher.find()) {
        // prepending does NOT work if something is in front of the first dollar
        matcher.appendReplacement(sb, matcher.group(1).replaceAll("\\<.*?\\>", ""));
        sb.append("$"); // note this manual appending
    }
    matcher.appendTail(sb);
    System.out.println(sb.toString());

    return sb.toString();
}

Thanks for help!

String input="<p>$<em>something</em>$</p>  <p>anything else</p>";
String output="<p>$something$</p>  <p>anything else</p>";

More complicated input and output:

String input="<p>$ bar  <b>foo</b>  bar <span style=\"text-decoration: underline;\">foo</span>  $</p><p>another foos</p> $ foo bar <em>bar</em>$";
String output="<p>$ bar  foo  bar foo  $</p><p>another foos</p> $ foo bar bar$";
  • HTML matching should not be done with regular expressions. Commented Jul 19, 2012 at 18:50
  • Can you provide an input/output example please. Commented Jul 19, 2012 at 18:51
  • I know, but regex is the simplest way to get rid of it. I don't need to do anything else with it... Commented Jul 19, 2012 at 18:52
  • I have no idea what you are trying to achieve. Please post sample input and expected output - examples speak much louder and clearer than words or code. And delete all your code - there's probably a better way. Commented Jul 19, 2012 at 18:52
  • Sorry, sample posted. I'm trying to remove all HTML that is inside two dollar signs... Commented Jul 19, 2012 at 18:55

2 Answers

1

Just some minor tweaks to your code:

private static String removeMarkupBetweenDollars(String input) {
    if ((input.length() - input.replaceAll("\\$", "").length()) % 2 != 0) {
        throw new RuntimeException("Missing or extra: dollar");
    }

    Pattern pattern = Pattern.compile("\\$(.*?)\\$", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(input);

    StringBuffer sb = new StringBuffer();

    while (matcher.find()) {
        String s = matcher.group().replaceAll("<[^>]+>", "");
        matcher.appendReplacement(sb, Matcher.quoteReplacement(s));
    }
    matcher.appendTail(sb);

    return sb.toString();
}
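
The two tweaks that matter are `matcher.group()` (the whole match, dollar signs included) instead of `matcher.group(1)`, and `Matcher.quoteReplacement`, which escapes the `$` characters so that `appendReplacement` does not read them as group references. Here is a minimal sketch to check this against both samples from the question. The class and method names (`DollarReplacementDemo`, `stripTagsBetweenDollars`, `rawReplacementFails`) are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the original post.
final class DollarReplacementDemo {
    private static final Pattern BETWEEN_DOLLARS =
            Pattern.compile("\\$(.*?)\\$", Pattern.DOTALL);

    // Strips tags from each whole match, so the surrounding dollars survive.
    static String stripTagsBetweenDollars(String input) {
        Matcher matcher = BETWEEN_DOLLARS.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String cleaned = matcher.group().replaceAll("<[^>]+>", "");
            // Without quoting, "$s..." would be parsed as a group reference.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(cleaned));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    // Returns true if the text is rejected when used as a raw replacement string.
    static boolean rawReplacementFails(String replacement) {
        try {
            "x".replaceAll("x", replacement);
            return false;
        } catch (IllegalArgumentException | IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String simple = "<p>$<em>something</em>$</p>  <p>anything else</p>";
        String complicated = "<p>$ bar  <b>foo</b>  bar <span style=\"text-decoration: underline;\">foo</span>  $</p><p>another foos</p> $ foo bar <em>bar</em>$";

        System.out.println(stripTagsBetweenDollars(simple));
        System.out.println(stripTagsBetweenDollars(complicated));
        System.out.println(rawReplacementFails("$something$"));
    }
}
```

Both sample inputs should come out as the expected outputs given in the question, and the last line shows that the unquoted text is not usable as a replacement string.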

0
String output = input.replaceAll("\\$<.*?>(.*?)<.*?>\\$", "\\$$1\\$");

One key point in the regex is the ? in .*? - it means a "non greedy" match, which in turn means "consume the least possible input you can". Without this, the regex would try to consume as much as possible - up to the end of a subsequent occurrence of $<html>foo</html>$ in the input if one existed.

Here's a test:

public static void main(String[] args) throws Exception {
    String input = "<p>$<em>something</em>$</p> <p>and $<em>anything</em>$ else</p>";
    String output = input.replaceAll("\\$<.*?>(.*?)<.*?>\\$", "\\$$1\\$");
    System.out.println(output);
}

Output:

<p>$something$</p> <p>and $anything$ else</p>

6 Comments

Thank you for your fast answer, but what if the input is more complicated? see my edited question?
This works for a single embedded tag but fails if you have multiple e.g. "<p>$<em>something</em>$</p> <p>and $<em>anything</em>$ else</p>" which returns "<p>$something</em>$</p> <p>and $<em>anything$ else</p>" (wrong).
@davidpeterson Would you believe I left out a single ? from the regex. It's fixed now.
@MartinM I fixed the problem. My group one capture was also greedy - I added an extra ? to the regex and it works properly now. I also updated the sample test.
However note that your new input differs from the original in that html no longer comes immediately after the dollar.