I'm trying to detect <code>...</code> chunks inside an HTML source file in order to remove them from the file. I am using the Java 8 Pattern and Matcher classes. For example, this method prints every <code>...</code> match it finds.

protected void printSourceCodeChunks() {
  // Design a regular expression to detect code chunks
  String patternString = "<code>.*<\\/code>";
  Pattern pattern = Pattern.compile(patternString);
  Matcher matcher = pattern.matcher(source);
  
  // Loop over findings
  int i = 1;
  while (matcher.find())
    System.out.println(i++ + ": " + matcher.group());
}

A typical output would be:

1: <code> </code>
2: <code></code>
3: <code>System.out.println("Hello World");</code>

Because I am using the dot metacharacter and the source code chunks may include line breaks (\n or \r), code blocks that span several lines are not detected. Fortunately, the Pattern class can be told to let the dot match line terminators as well, by passing the Pattern.DOTALL flag:

  Pattern pattern = Pattern.compile(patternString, Pattern.DOTALL);

The problem with this approach is that only one (fake) <code>...</code> block is detected: the one starting at the first occurrence of <code> and ending at the last occurrence of </code> in the HTML file. The output now includes all the HTML code between these two tags.

How can I alter the regular expression to match every single code block?

Solution proposal

As many of you posted, and for the benefit of future readers, it was as easy as changing my regex to

<code>.*?<\\/code>

because a greedy .* consumes every character up to the last </code> it finds, whereas the reluctant .*? stops at the first one.
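
A minimal, self-contained sketch of the fixed version follows. The class name CodeChunkFinder, the method findChunks, and the sample input are made up for illustration; only Pattern, Matcher, and Pattern.DOTALL come from the code above. It assumes plain, non-nested <code> tags without attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
final class CodeChunkFinder {

    // The reluctant .*? stops at the first </code>;
    // DOTALL lets the dot match line terminators too.
    private static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Returns every <code>...</code> chunk found in the source, in order.
    static List<String> findChunks(String source) {
        List<String> chunks = new ArrayList<>();
        Matcher matcher = CODE_CHUNK.matcher(source);
        while (matcher.find()) {
            chunks.add(matcher.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Illustrative input: one multi-line chunk and one empty chunk.
        String source = "<p>a</p><code>int x;\nint y;</code><p>b</p><code></code>";
        int i = 1;
        for (String chunk : findChunks(source)) {
            System.out.println(i++ + ": " + chunk);
        }
    }
}
```

With the sample input, the two chunks are reported separately instead of as one block spanning both.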

5 Comments

  • Be kind to yourself and use an HTML parser 😊 Commented Jan 31, 2019 at 12:17
  • Don't parse HTML with RegExp: stackoverflow.com/a/1732454/345027 Commented Jan 31, 2019 at 12:18
  • Make the match-all expression reluctant, i.e. .*?, which will make it match as little as possible. However, please be aware that code (Java, HTML, etc.) is an irregular problem domain, and regexes are generally not a good fit for it. Commented Jan 31, 2019 at 12:18
  • Aside from @king_nak's link, it may be worth reading "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" and "Using regular expressions to parse HTML: why not?" Commented Jan 31, 2019 at 12:19
  • Thank you Thomas, .*? works fine now. In fact, my source is not HTML but an XML dialect that includes some HTML tags and some non-HTML tags. It is worth using regex for this special case, but I've learned from your comment that it is not a good solution for the general HTML case. Commented Jan 31, 2019 at 12:23

2 Answers

4

Don't use regex to manipulate HTML!

Instead, parse the HTML, for example with jsoup, and remove the elements properly.

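import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p><code>foo</code><code></code><code> </code></body></html>";
Document doc = Jsoup.parse(html);
// Select every <code> element in the body and detach it from the document
Elements codes = doc.body().getElementsByTag("code");
codes.remove();
System.out.println(doc.toString());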

1 Comment

Thank you @bambam. I use Jsoup elsewhere. I agree that Jsoup is the best solution for HTML and XML tags. I was using regular expressions here because my general case mixes HTML tags with some extra non-XML markup, namely Markdown. Generally speaking, my source is SGML compliant but not XHTML compliant. In fact, the code I was trying to fix is part of a validator/compiler that translates Markdown markup into regular XHTML tags for further XHTML and Schema validation.
2

You can do that with the non-greedy (reluctant) quantifier ?:

String patternString = "<code>.*?<\\/code>";

By default, * matches as much as it can, from the first occurrence of <code> to the last occurrence of </code>. With the question mark ?, it stops matching at the first occurrence of </code>.

That said, I highly recommend not "parsing" any structure with regex; it is better to use a dedicated HTML parser.
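
Since the goal in the question is to remove the chunks, the same reluctant pattern can also drive a replacement. This is only a sketch under the assumption of well-formed, non-nested <code> elements without attributes; the class name CodeChunkRemover and the method removeChunks are hypothetical. The inline flag (?s) is the embedded form of Pattern.DOTALL, so chunks spanning several lines are covered as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CodeChunkRemover {

    // (?s) enables DOTALL mode inline; .*? keeps each match as short as possible.
    private static final Pattern CODE_CHUNK = Pattern.compile("(?s)<code>.*?</code>");

    // Returns the source with every <code>...</code> chunk deleted.
    static String removeChunks(String source) {
        return CODE_CHUNK.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // Illustrative input with a multi-line chunk and an empty chunk.
        String source = "a<code>x\ny</code>b<code></code>c";
        System.out.println(removeChunks(source)); // prints "abc"
    }
}
```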

2 Comments

Using regex to parse HTML is not really a good idea. There are many edge cases, such as spaces or attributes inside the tag, nested tags, and tags with no closing tag. If you are certain that you will never hit these cases, you can get away with it, but remember that when you use regex to solve a problem, you can easily end up with two problems.
Yes, I agree with you, Damo. This case of mine is internal and well controlled, but parsing anonymous or external HTML with regex surely leads to the issues you mention. Thank you.
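
Two of the edge cases mentioned above are easy to reproduce. The sketch below uses a hypothetical class name and made-up inputs: an attribute on the opening tag makes the literal <code> fail to match, and a nested element is cut at the first </code>, leaving a dangling remainder.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the inputs are illustrative, not from the question.
final class RegexEdgeCases {

    private static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Deletes every match of the simple pattern from the input.
    static String strip(String source) {
        return CODE_CHUNK.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // Attribute inside the opening tag: nothing matches, nothing is removed.
        System.out.println(strip("<code class=\"java\">int x;</code>"));

        // Nested tags: the match ends at the first </code>, leaving "c</code>" behind.
        System.out.println(strip("<code>a<code>b</code>c</code>"));
    }
}
```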
