Regex to detect <code>...</code> code chunks

Question

I'm trying to detect <code>...</code> chunks inside an HTML source file in order to remove them from the file. I am using the Java 8 Pattern and Matcher classes for the regex. For example, this method prints out every <code>...</code> match.

protected void printSourceCodeChunks() {
  // Design a regular expression to detect code chunks
  String patternString = "<code>.*<\\/code>";
  Pattern pattern = Pattern.compile(patternString);
  Matcher matcher = pattern.matcher(source);
  
  // Loop over findings
  int i = 1;
  while (matcher.find())
    System.out.println(i++ + ": " + matcher.group());
}

A typical output would be:

1: <code> </code>
2: <code></code>
3: <code>System.out.println("Hello World");</code>

Because I am using the dot metacharacter, which by default does not match line terminators, and the code chunks may contain line breaks (\n or \r), code blocks that span several lines are not detected. Fortunately, the Pattern class can be told to let the dot match line terminators too, by compiling with the DOTALL flag:

  Pattern pattern = Pattern.compile(patternString, Pattern.DOTALL);

The problem with this approach is that only one (fake) <code>...</code> block is detected: the one starting at the first occurrence of <code> and ending at the last occurrence of </code> in the HTML file. The output now includes all the HTML between these two tags.

How can I alter the regex to match every single code block?

Solution proposal

As many of you posted, and for the benefit of future readers, it was as easy as changing my regex to

<code>.*?<\\/code>

because a greedy .* consumes everything up to the last </code> it finds, whereas the reluctant .*? stops at the first one.
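To make the difference concrete, here is a minimal sketch that runs both quantifiers over the same input with DOTALL enabled. The class name CodeChunkFinder, its findAll helper, and the sample string are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class CodeChunkFinder {

    // Greedy: .* runs on to the LAST </code> in the input.
    static final Pattern GREEDY =
            Pattern.compile("<code>.*</code>", Pattern.DOTALL);

    // Reluctant: .*? stops at the FIRST </code> after each <code>.
    static final Pattern RELUCTANT =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Collects every match of the given pattern, in source order.
    static List<String> findAll(Pattern pattern, String source) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(source);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Two chunks; the second one spans a line break.
        String source = "<p>a</p><code> </code><p>b</p><code>line1\nline2</code>";
        System.out.println(findAll(GREEDY, source).size());    // prints 1
        System.out.println(findAll(RELUCTANT, source).size()); // prints 2
    }
}
```

The forward slash has no special meaning in Java regexes, so the \\/ escape in the original pattern is harmless but can be dropped, as this sketch does.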

Don't parse HTML with RegExp: stackoverflow.com/a/1732454/345027
– king_nak, Commented Jan 31, 2019 at 12:18
Make the match-all expression reluctant, i.e. .*?, which will make it match as little as possible. However, please be aware that code (Java, HTML, etc.) is an irregular problem domain, and regexes are generally not a good fit for it.
– Thomas, Commented Jan 31, 2019 at 12:18
Aside from @king_nak's link, it may be worth reading "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" and "Using regular expressions to parse HTML: why not?"
– Pshemo, Commented Jan 31, 2019 at 12:19
Thank you, Thomas, .*? works fine now. In fact my source is not HTML but an XML dialect that includes some HTML tags and some non-HTML tags. It is worth using regex for this special case, but I've learned from your comment that it is not a good solution for the general HTML case.
– coterobarros, Commented Jan 31, 2019 at 12:23

baao · Accepted Answer · 2019-01-31 12:21:21Z

4

Don't use regex to manipulate HTML!

Instead, parse the html, for example with jsoup, and remove the elements properly.

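import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p><code>foo</code><code></code><code> </code></body></html>";
Document doc = Jsoup.parse(html);
// Select every <code> element in the body and detach it from the document
Elements codes = doc.body().getElementsByTag("code");
codes.remove();
System.out.println(doc.toString());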


1 Comment

coterobarros Over a year ago

Thank you @bambam. I use Jsoup elsewhere, and I agree it is the best solution for HTML and XML tags. I was using regular expressions here because my general case mixes HTML tags with some extra non-XML markup, namely Markdown. Generally speaking, my source is SGML compliant but not XHTML compliant. In fact, the code I was trying to fix is part of a validator/compiler that translates Markdown markup into regular XHTML tags for further XHTML and Schema validation.

Lino · Accepted Answer · 2019-01-31 12:18:00Z

2

You can do that with a non-greedy (reluctant) quantifier, written by appending ? to the *:

String patternString = "<code>.*?<\\/code>";

By default, * is greedy and matches as much as it can, from the first occurrence of <code> to the last occurrence of </code>. With the question mark ?, it becomes reluctant and stops at the first </code> it reaches.

That said, I highly recommend not "parsing" any structured markup with regex; use a dedicated HTML parser instead.
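Since the stated goal in the question is to remove the chunks rather than print them, the reluctant pattern can be passed to Matcher.replaceAll. This is only a sketch for the well-controlled case described in the question: it assumes plain, attribute-free, non-nested <code> tags, and the class and method names (CodeChunkRemover, removeCodeChunks) are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class CodeChunkRemover {

    // Reluctant match of one <code>...</code> chunk; DOTALL lets . cross line breaks.
    static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Returns the source with every matched chunk replaced by the empty string.
    static String removeCodeChunks(String source) {
        return CODE_CHUNK.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "<p>before</p><code>int x = 1;\nint y = 2;</code>"
                + "<p>after</p><code></code>";
        // prints <p>before</p><p>after</p>
        System.out.println(removeCodeChunks(source));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which matters if the method runs over many files.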


2 Comments

Damo Over a year ago

Using regex to parse HTML is not really a good idea. There are so many edge cases, like spaces or attributes inside the tag, nested tags, and tags with no closing tag. If you are certain that you will never hit these cases you can get away with it, but remember that when you use regex to solve a problem you can easily end up with two problems.
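A few of the edge cases listed above can be reproduced in a short sketch. The TOLERANT pattern below is one possible loosening (attributes, upper-case tags, whitespace before the closing >), not a recommended general solution; it still mishandles nested tags and would also break on an attribute value containing >. All names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
final class CodeChunkEdgeCases {

    // The pattern from the thread: exact, lower-case, attribute-free tags only.
    static final Pattern STRICT =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Looser variant: allows attributes, any letter case, and spaces before ">".
    static final Pattern TOLERANT = Pattern.compile(
            "<code\\b[^>]*>.*?</code\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Counts how many times the pattern matches in the source.
    static int count(Pattern pattern, String source) {
        Matcher matcher = pattern.matcher(source);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Returns the first match, or null when there is none.
    static String firstMatch(Pattern pattern, String source) {
        Matcher matcher = pattern.matcher(source);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String decorated = "<code class=\"java\">x</code><CODE>y</CODE >";
        System.out.println(count(STRICT, decorated));   // prints 0
        System.out.println(count(TOLERANT, decorated)); // prints 2

        // Nested tags: the match ends at the inner </code>,
        // leaving "c</code>" behind in the document.
        String nested = "<code>a<code>b</code>c</code>";
        System.out.println(firstMatch(TOLERANT, nested)); // prints <code>a<code>b</code>
    }
}
```

Regular expressions of this kind cannot track nesting depth, which is why a parser is the safer choice as soon as the input is not fully under your control.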

coterobarros Over a year ago

Yes, I agree with you, Damo. Mine is an internal and well-controlled case, but parsing anonymous or external HTML with regex surely leads to the issues you mention. Thank you.
