I'm trying to write a program that will return all the text between \begin{theorem} and \end{theorem} and between \begin{proof} and \end{proof}.

Regexes seem like the natural tool, but the delimiters contain several metacharacters (backslashes and braces) that will need to be escaped.

Here's the code I have written:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatexTheoremProofExtractor {

    // This is the LaTeX source that will be processed
    private String source = null;

    // These are the list of theorems and proofs that are extracted, respectively 
    private ArrayList<String> theorems = null;
    private ArrayList<String> proofs = null;

    // These are the patterns to match theorems and proofs, respectively 
    private static final Pattern THEOREM_REGEX = Pattern.compile("\\begin\\{theorem\\}(.+?)\\end\\{theorem\\}");
    private static final Pattern PROOF_REGEX = Pattern.compile("\\begin\\{proof\\}(.+?)\\end\\{proof\\}");

    LatexTheoremProofExtractor(String source) {
        this.source = source;
    }

    public void parse() {
        extractEntity("theorem");
        extractEntity("proof");
    }

    private void extractTheorems() {
        if(theorems != null) {
            return;
        }

        theorems = new ArrayList<String>();

        final Matcher matcher = THEOREM_REGEX.matcher(source);
        while (matcher.find()) {
            theorems.add(new String(matcher.group(1)));
        }   
    }

    private void extractProofs() {
        if(proofs != null) {
            return;
        }

        proofs = new ArrayList<String>();

        final Matcher matcher = PROOF_REGEX.matcher(source);
        while (matcher.find()) {
            proofs.add(new String(matcher.group(1)));
        }       
    }

    private void extractEntity(final String entity) {   
        if(entity.equals("theorem")) {
            extractTheorems();
        } else if(entity.equals("proof")) {
            extractProofs();
        } else {
            // TODO: Throw an exception or something
        }       
    }

    public ArrayList<String> getTheorems() {
        return theorems;
    }

}

Below is my test, which fails:

@Test 
public void testTheoremExtractor() {
    String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
    LatexTheoremProofExtractor extractor = new LatexTheoremProofExtractor(source);
    extractor.parse();
    ArrayList<String> theorems = extractor.getTheorems();
    assertEquals(theorems.get(0).trim(), "Hello, World!");
}

As the test shows, I expect exactly one match, and it should be "Hello, World!" after trimming.

Currently theorems is an empty, non-null list, so my Matchers aren't matching the pattern. Can anyone help me understand why?

Thanks, erip

  • Use \\\\ to match one \. Commented Oct 28, 2015 at 21:45
  • @stribizhev I made the change but it gives me the same result - size is 0. Commented Oct 28, 2015 at 21:49
  • If the text is constant, as you said (\begin{theorem}, etc.), I don't think you need a regex for that. Why not split the data on those delimiters, or use indexOf()? Commented Oct 28, 2015 at 21:52
  • @LiranBo Pretty sure String#split uses Pattern/Matcher under the hood. Commented Oct 28, 2015 at 21:58
  • Have a look at this code, it works. Look at the 2 updated regexes in your extractor. Commented Oct 28, 2015 at 22:17
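
The indexOf() idea from the comments could be sketched roughly as follows. This is only an illustration, not code from any commenter: the class name DelimiterScan and the method between are hypothetical, and the sketch assumes the environments are not nested.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterScan {

    // Collects the trimmed text between each begin/end delimiter pair, scanning left to right.
    static List<String> between(String source, String begin, String end) {
        List<String> bodies = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(begin, from);
            if (start < 0) {
                break; // no further opening delimiter
            }
            int bodyStart = start + begin.length();
            int stop = source.indexOf(end, bodyStart);
            if (stop < 0) {
                break; // opening delimiter without a matching close
            }
            bodies.add(source.substring(bodyStart, stop).trim());
            from = stop + end.length();
        }
        return bodies;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem} A \\end{theorem} text \\begin{theorem} B \\end{theorem}";
        System.out.println(between(source, "\\begin{theorem}", "\\end{theorem}")); // [A, B]
    }
}
```

Because the delimiters are plain strings here, the only escaping needed is the usual doubling of backslashes in Java string literals.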

3 Answers

Here is the update you need to make to your code. The two regex constants in the extractor class should be changed to:

private static final Pattern THEOREM_REGEX = Pattern.compile(Pattern.quote("\\begin\\{theorem\\}") + "(.+?)" + Pattern.quote("\\end\\{theorem\\}"));
private static final Pattern PROOF_REGEX = Pattern.compile(Pattern.quote("\\begin\\{proof\\}") + "(.+?)" + Pattern.quote("\\end\\{proof\\}"));

The result will be "Hello, World!". See IDEONE demo.

The string in your test is actually \begin\{theorem\} Hello, World! \end\{theorem\}, because each \\ in a Java string literal produces a single backslash. To match one literal backslash with a regex, you need \\\\ in the Java source: the compiler reduces it to \\, which the regex engine then reads as an escaped backslash. To avoid this backslash hell, you can use Pattern.quote, which tells the regex engine to treat the whole quoted subpattern as a literal.
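
As a rough illustration of the difference, the sketch below runs the question's original pattern and the Pattern.quote version against the test string. The class name BackslashEscapingDemo and the helper firstGroup are made up for this example. The original pattern compiles because the regex engine reads \b as a word boundary and \e as the escape character, but it never matches the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashEscapingDemo {

    // The question's pattern: the regex engine sees \b (word boundary) and \e (escape char).
    static final String BROKEN = "\\begin\\{theorem\\}(.+?)\\end\\{theorem\\}";

    // Pattern.quote wraps each delimiter in \Q...\E, so every character in it is literal.
    static final String QUOTED = Pattern.quote("\\begin\\{theorem\\}") + "(.+?)"
            + Pattern.quote("\\end\\{theorem\\}");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // The test string from the question, with a backslash before each brace.
        String input = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
        System.out.println(firstGroup(BROKEN, input));        // null
        System.out.println(firstGroup(QUOTED, input).trim()); // Hello, World!
    }
}
```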

More details about Pattern.quote can be found in the documentation:

Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.

Metacharacters or escape sequences in the input sequence will be given no special meaning.

1 Comment

I added more fun links and explanations. Actually, you can use \\Q and \\E instead, but it is important to apply that to both the regexps.

Your first regex needs to be:

Pattern THEOREM_REGEX = Pattern.compile("\\\\begin\\\\\\{theorem\\\\\\}(.+?)\\\\end\\\\\\{theorem\\\\\\}");

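because each literal backslash you want to match requires \\\\ in the Java string literal, and your test string also has a backslash in front of every brace (hence \\\\\\{ and \\\\\\}).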
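
To make the escaping layers visible, here is a small sketch that prints the pattern as the regex engine receives it and then applies it. The class name ManualEscapeLayers and the helper capture are hypothetical. Note that this pattern is tailored to the test string with backslash-escaped braces, so it does not match plain \begin{theorem} ... \end{theorem} source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ManualEscapeLayers {

    // In Java source, each backslash the regex engine should receive is written twice.
    static final String REGEX =
            "\\\\begin\\\\\\{theorem\\\\\\}(.+?)\\\\end\\\\\\{theorem\\\\\\}";

    // Returns the trimmed capture of the first match, or null when nothing matches.
    static String capture(String input) {
        Matcher matcher = Pattern.compile(REGEX).matcher(input);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String testString = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
        String plainLatex = "\\begin{theorem} Hello, World! \\end{theorem}";

        System.out.println("regex engine sees: " + REGEX);
        System.out.println("test string is:    " + testString);
        System.out.println(capture(testString)); // Hello, World!
        System.out.println(capture(plainLatex)); // null
    }
}
```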


There seems to be an error in your test code that the other answers don't address. You create the test string like this:

String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";

...but in the text you say the source string is supposed to be:

\begin{theorem} Hello, World! \end{theorem}

If that's true, the string literal should be:

"\\begin{theorem} Hello, World! \\end{theorem}"

To create the regex, you would use:

Pattern.quote("\\begin{theorem}") + "(.*?)" + Pattern.quote("\\end{theorem}")

...or escape it manually:

"\\\\begin\\{theorem\\}(.*?)\\\\end\\{theorem\\}"

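Putting this together, a minimal sketch of an extractor for real LaTeX source might look like the following. The class LatexEnvironmentExtractor and its methods are hypothetical names, not part of the question's code. The sketch assumes environments are not nested, and it adds Pattern.DOTALL on the assumption that theorem and proof bodies can span several lines (by default, . does not match line terminators).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatexEnvironmentExtractor {

    // Builds a pattern for \begin{name} ... \end{name}; Pattern.quote keeps the delimiters literal.
    static Pattern environmentPattern(String name) {
        return Pattern.compile(
                Pattern.quote("\\begin{" + name + "}") + "(.*?)" + Pattern.quote("\\end{" + name + "}"),
                Pattern.DOTALL); // DOTALL lets '.' match line breaks inside the body
    }

    // Returns the trimmed body of every non-nested occurrence of the environment.
    static List<String> extract(String source, String name) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = environmentPattern(name).matcher(source);
        while (matcher.find()) {
            bodies.add(matcher.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem}\nHello, World!\n\\end{theorem}\n"
                + "\\begin{proof} Trivial. \\end{proof}";
        System.out.println(extract(source, "theorem")); // [Hello, World!]
        System.out.println(extract(source, "proof"));   // [Trivial.]
    }
}
```

Building the pattern from the environment name avoids keeping one hand-escaped regex per environment, which is where the original escaping mistakes crept in.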