Java Regex won't match

Question

I'm trying to write a program that will return all the text between \begin{theorem} and \end{theorem} and between \begin{proof} and \end{proof}.

It seems natural to use regexes, but the delimiters contain several regex metacharacters, which will need to be escaped.

Here's the code I have written:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatexTheoremProofExtractor {

    // This is the LaTeX source that will be processed
    private String source = null;

    // These are the lists of theorems and proofs that are extracted, respectively
    private ArrayList<String> theorems = null;
    private ArrayList<String> proofs = null;

    // These are the patterns to match theorems and proofs, respectively 
    private static final Pattern THEOREM_REGEX = Pattern.compile("\\begin\\{theorem\\}(.+?)\\end\\{theorem\\}");
    private static final Pattern PROOF_REGEX = Pattern.compile("\\begin\\{proof\\}(.+?)\\end\\{proof\\}");

    LatexTheoremProofExtractor(String source) {
        this.source = source;
    }

    public void parse() {
        extractEntity("theorem");
        extractEntity("proof");
    }

    private void extractTheorems() {
        if(theorems != null) {
            return;
        }

        theorems = new ArrayList<String>();

        final Matcher matcher = THEOREM_REGEX.matcher(source);
        while (matcher.find()) {
            theorems.add(new String(matcher.group(1)));
        }   
    }

    private void extractProofs() {
        if(proofs != null) {
            return;
        }

        proofs = new ArrayList<String>();

        final Matcher matcher = PROOF_REGEX.matcher(source);
        while (matcher.find()) {
            proofs.add(new String(matcher.group(1)));
        }       
    }

    private void extractEntity(final String entity) {   
        if(entity.equals("theorem")) {
            extractTheorems();
        } else if(entity.equals("proof")) {
            extractProofs();
        } else {
            // TODO: Throw an exception or something
        }       
    }

    public ArrayList<String> getTheorems() {
        return theorems;
    }

}

and below is my test that fails

@Test 
public void testTheoremExtractor() {
    String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
    LatexTheoremProofExtractor extractor = new LatexTheoremProofExtractor(source);
    extractor.parse();
    ArrayList<String> theorems = extractor.getTheorems();
    assertEquals(theorems.get(0).trim(), "Hello, World!");
}

Clearly my test suggests I'd like there to only be one match in this test, and it should be "Hello, World!" (post trimming).

Currently theorems is an empty, non-null list, so my Matcher is not finding the pattern. Can anyone help me understand why?

Thanks, erip

@stribizhev I made the change, but it gives me the same result: size is 0.
– erip, Commented Oct 28, 2015 at 21:49
If the text is constant, as you said (\begin{theorem}, etc.), I don't think you need a regex for that. Why not just split the data on those delimiters, or use indexOf()?
– LiranBo, Commented Oct 28, 2015 at 21:52
@LiranBo Pretty sure String#split uses Pattern/Matcher under the hood.
– erip, Commented Oct 28, 2015 at 21:58
Have a look at this code, it works. Look at the 2 updated regexes in your extractor.
– Wiktor Stribiżew, Commented Oct 28, 2015 at 22:17

Wiktor Stribiżew · Accepted Answer · 2015-10-28 22:26:57Z

1

Here is the update you need to make to your code: the 2 regexes in the extractor class should be changed to

private static final Pattern THEOREM_REGEX = Pattern.compile(Pattern.quote("\\begin\\{theorem\\}") + "(.+?)" + Pattern.quote("\\end\\{theorem\\}"));
private static final Pattern PROOF_REGEX = Pattern.compile(Pattern.quote("\\begin\\{proof\\}") + "(.+?)" + Pattern.quote("\\end\\{proof\\}"));

The result will be "Hello, World!". See IDEONE demo.

The string you have is actually \begin\{theorem\} Hello, World! \end\{theorem\}. Literal backslashes are doubled in Java string literals, so to match one literal backslash with a regex you need to write \\\\ in the Java source. To avoid this backslash hell, you can use Pattern.quote, which tells the regex engine to treat the whole subpattern passed to it as literal text.
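As a rough illustration of why the original pattern compiles but never matches: the Java literal "\\begin" reaches the regex engine as \begin, which Java reads as a word boundary (\b) followed by "egin", not as a backslash followed by "begin". The sketch below contrasts that with the four-backslash form. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class BackslashLayers {
    // Java source "\\begin" is the regex \begin: a word boundary (\b) followed by "egin".
    static boolean originalPrefixMatches(String text) {
        return Pattern.compile("\\begin").matcher(text).find();
    }

    // Java source "\\\\begin" is the regex \\begin: one literal backslash followed by "begin".
    static boolean escapedPrefixMatches(String text) {
        return Pattern.compile("\\\\begin").matcher(text).find();
    }

    public static void main(String[] args) {
        String latex = "\\begin{theorem}"; // the characters \begin{theorem}
        System.out.println(originalPrefixMatches(latex));  // false
        System.out.println(originalPrefixMatches("egin")); // true
        System.out.println(escapedPrefixMatches(latex));   // true
    }
}
```

The same applies to "\\end": the regex \e stands for the escape character (U+001B), so that half of the original pattern does not look for a backslash either.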

More details about Pattern.quote can be found in the documentation:

Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.

Metacharacters or escape sequences in the input sequence will be given no special meaning.

edited Oct 28, 2015 at 22:26

answered Oct 28, 2015 at 22:23


1 Comment

Wiktor Stribiżew Over a year ago

I added more fun links and explanations. Actually, you can use \\Q and \\E instead, but it is important to apply that to both the regexps.
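A minimal sketch of the \Q...\E variant mentioned in this comment, assuming the test string from the question (with its escaped braces). QuoteSpans and firstTheorem are hypothetical names used only here. Both delimiters are wrapped, as the comment advises.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteSpans {
    // Inline \Q ... \E spans; every backslash is doubled once for the Java string literal.
    static final Pattern THEOREM = Pattern.compile(
            "\\Q\\begin\\{theorem\\}\\E(.+?)\\Q\\end\\{theorem\\}\\E");

    // Returns the trimmed body of the first match, or null if there is none.
    static String firstTheorem(String source) {
        Matcher m = THEOREM.matcher(source);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
        System.out.println(firstTheorem(source)); // Hello, World!
        // For this input, Pattern.quote produces the same \Q...\E wrapping.
        System.out.println(Pattern.quote("\\begin\\{theorem\\}")); // \Q\begin\{theorem\}\E
    }
}
```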

anubhava · Accepted Answer · 2015-10-28 21:52:45Z

0

Your first regex needs to be:

Pattern THEOREM_REGEX = Pattern.compile("\\\\begin\\\\\\{theorem\\\\\\}(.+?)\\\\end\\\\\\{theorem\\\\\\}");

because each literal backslash you are trying to match requires \\\\ in the Java string that holds your regex.

answered Oct 28, 2015 at 21:52


Alan Moore · Accepted Answer · 2015-10-29 01:18:07Z

0

There seems to be an error in your test code that the other answers don't address. You create the test string like this:

String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";

...but in the text you say the source string is supposed to be:

\begin{theorem} Hello, World! \end{theorem}

If that's true, the string literal should be:

"\\begin{theorem} Hello, World! \\end{theorem}"

To create the regex, you would use:

Pattern.quote("\\begin{theorem}") + "(.*?)" + Pattern.quote("\\end{theorem}")

...or escape it manually:

"\\\\begin\\{theorem\\}(.*?)\\\\end\\{theorem\\}"

answered Oct 29, 2015 at 1:18
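Putting the corrected string literal and the Pattern.quote form together, a minimal sketch might look like the following. EnvironmentExtractor, patternFor and extract are hypothetical names, not part of the question's class. The DOTALL flag is an added assumption, on the grounds that real theorem bodies often span several lines and "." does not match line breaks by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnvironmentExtractor {
    // Builds a pattern for \begin{name} ... \end{name}. The delimiters are quoted
    // so that their backslashes and braces are matched literally.
    static Pattern patternFor(String name) {
        return Pattern.compile(
                Pattern.quote("\\begin{" + name + "}")
                + "(.*?)"
                + Pattern.quote("\\end{" + name + "}"),
                Pattern.DOTALL); // assumption: bodies may contain line breaks
    }

    // Collects the trimmed body of every environment with the given name.
    static List<String> extract(String source, String name) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = patternFor(name).matcher(source);
        while (matcher.find()) {
            bodies.add(matcher.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem} Hello, World! \\end{theorem}\n"
                + "\\begin{proof} Trivial.\nQED \\end{proof}";
        System.out.println(extract(source, "theorem")); // [Hello, World!]
        System.out.println(extract(source, "proof").size()); // 1
    }
}
```

Building the pattern from the environment name also means theorems and proofs can share one code path instead of two near-identical methods.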
