I'm trying to write a program that will return all the text between \begin{theorem} and \end{theorem} and between \begin{proof} and \end{proof}.
It seems natural to use regexes, but the delimiters contain several regex metacharacters, which will need to be escaped.
Here's the code I have written:
    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LatexTheoremProofExtractor {
        // This is the LaTeX source that will be processed
        private String source = null;
        // These are the list of theorems and proofs that are extracted, respectively
        private ArrayList<String> theorems = null;
        private ArrayList<String> proofs = null;
        // These are the patterns to match theorems and proofs, respectively
        private static final Pattern THEOREM_REGEX = Pattern.compile("\\begin\\{theorem\\}(.+?)\\end\\{theorem\\}");
        private static final Pattern PROOF_REGEX = Pattern.compile("\\begin\\{proof\\}(.+?)\\end\\{proof\\}");

        LatexTheoremProofExtractor(String source) {
            this.source = source;
        }

        public void parse() {
            extractEntity("theorem");
            extractEntity("proof");
        }

        private void extractTheorems() {
            if(theorems != null) {
                return;
            }
            theorems = new ArrayList<String>();
            final Matcher matcher = THEOREM_REGEX.matcher(source);
            while (matcher.find()) {
                theorems.add(new String(matcher.group(1)));
            }
        }

        private void extractProofs() {
            if(proofs != null) {
                return;
            }
            proofs = new ArrayList<String>();
            final Matcher matcher = PROOF_REGEX.matcher(source);
            while (matcher.find()) {
                proofs.add(new String(matcher.group(1)));
            }
        }

        private void extractEntity(final String entity) {
            if(entity.equals("theorem")) {
                extractTheorems();
            } else if(entity.equals("proof")) {
                extractProofs();
            } else {
                // TODO: Throw an exception or something
            }
        }

        public ArrayList<String> getTheorems() {
            return theorems;
        }
    }
Below is my test, which fails:
    @Test
    public void testTheoremExtractor() {
        String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
        LatexTheoremProofExtractor extractor = new LatexTheoremProofExtractor(source);
        extractor.parse();
        ArrayList<String> theorems = extractor.getTheorems();
        assertEquals(theorems.get(0).trim(), "Hello, World!");
    }
As the test shows, I expect exactly one match, and it should be "Hello, World!" after trimming.
Currently theorems is an empty, non-null list, so my Matcher isn't finding any match. Can anyone help me understand why?
Thanks, erip
Use \\\\ to match one \ in \begin{theorem}, etc.
I don't think you need a regex for that. Why not just split the data on those delimiters, or use indexOf()?
String#split uses Pattern/Matcher under the hood.
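To make the comments above concrete: the Java literal "\\begin" reaches the regex engine as \begin, which it reads as a word boundary (\b) followed by egin, and \end likewise starts with \e, the regex escape for the ESC control character. Neither matches a literal backslash, so find() never succeeds. The test input has a separate problem: "\\begin\\{theorem\\}" is the text \begin\{theorem\}, not \begin{theorem}, because braces need no escaping in a plain string. Below is a minimal sketch of two possible fixes, not a definitive implementation. The class EnvironmentExtractor and its members are hypothetical names introduced for illustration, and the use of Pattern.quote and Pattern.DOTALL is an assumption about what is wanted (literal delimiters, bodies that may span several lines).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original post.
final class EnvironmentExtractor {

    // Hand-escaped variant: four backslashes in the Java literal become two in
    // the regex, which match one literal backslash in the input.
    static final Pattern HAND_ESCAPED_THEOREM =
            Pattern.compile("\\\\begin\\{theorem\\}(.+?)\\\\end\\{theorem\\}");

    private EnvironmentExtractor() {}

    // Regex variant: returns the body of every \begin{name} ... \end{name} block.
    static List<String> extract(String source, String name) {
        // Pattern.quote makes the whole delimiter literal, so neither the
        // backslash nor the braces need manual escaping.
        String begin = Pattern.quote("\\begin{" + name + "}");
        String end = Pattern.quote("\\end{" + name + "}");
        // DOTALL lets '.' match line terminators as well.
        Pattern pattern = Pattern.compile(begin + "(.+?)" + end, Pattern.DOTALL);
        List<String> bodies = new ArrayList<>();
        Matcher matcher = pattern.matcher(source);
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    // indexOf variant suggested in the comments: no regex involved.
    // Unlike the (.+?) group above, this also returns empty bodies.
    static List<String> extractWithIndexOf(String source, String name) {
        String begin = "\\begin{" + name + "}";
        String end = "\\end{" + name + "}";
        List<String> bodies = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(begin, from);
            if (start < 0) {
                break; // no further opening delimiter
            }
            int bodyStart = start + begin.length();
            int stop = source.indexOf(end, bodyStart);
            if (stop < 0) {
                break; // unterminated environment: stop scanning
            }
            bodies.add(source.substring(bodyStart, stop));
            from = stop + end.length();
        }
        return bodies;
    }
}
```

With the test input written as "\\begin{theorem} Hello, World! \\end{theorem}", EnvironmentExtractor.extract(source, "theorem") should yield a single element that trims to "Hello, World!". Neither variant handles nested environments of the same name; that would need a real parser.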