I'm just learning how to use regexes.

I'm reading in a text file that is split into sections of two different sorts, demarcated by <:==]:> and <:==}:> . I need to know for each section whether it's a ] or } , so I can't just do

Pattern pattern = Pattern.compile("<:==]:>|<:==}:>"); String[] parts = pattern.split(text);

Doing this:

Pattern pattern = Pattern.compile("<:=="); String[] parts = pattern.split(text);

works, and then I can just look at the first char in each substring, but this seems sloppy to me, and I think I'm only resorting to it because I'm not fully grasping something about regexes.

What would be the best practice here? Also, is there any way to split a string while leaving the delimiter in the resulting strings, such that each begins with the delimiter?

EDIT: the file is laid out like this:

Old McDonald had a farm 
<:==}:> 
EIEIO. And on that farm he had a cow 
<:==]:> 
And on that farm he....
  • My initial solution (enclosing the delimiter in a capturing group) appears not to work in Java (other languages like Python would have worked), so I need to rethink this. Could you provide a small sample file? I'm not quite sure I understand how exactly the sections are delimited. Are they surrounded by pairs of delimiters, or does a section start after one delimiter and end with the next delimiter? Commented Nov 22, 2013 at 11:33
  • @TimPietzcker Yeah I had the same realization. See my edit for an example of how the file's laid out. They are not pairs of delimiters; the end of each is signaled by the start of the next. Also, I should note that <:?:> signifies several other types of tags Commented Nov 22, 2013 at 11:38
  • So what exactly do you want as output? The section of text along with either a ] or }? If so then what do you want for the first/last section that is not delimited? Do you need the section of text or is it enough to just have the delimiters? Commented Nov 22, 2013 at 11:52

1 Answer

It may be a better idea not to use split() for this. You could instead do a match:

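import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// subjectString is assumed to hold the full text of the file.
List<String> delimList = new ArrayList<String>();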
List<String> sectionList = new ArrayList<String>();
Pattern regex = Pattern.compile(
    "(<:==[\\]}]:>)     # Match a delimiter, capture it in group 1.\n" +
    "(                  # Match and capture in group 2:\n" +
    " (?:               # the following group which matches...\n" +
    "  (?!<:==[\\]}]:>) # (unless we're at the start of another delimiter)\n" +
    "  .                # any character\n" +
    " )*                # any number of times.\n" +
    ")                  # End of group 2", 
    Pattern.COMMENTS | Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    delimList.add(regexMatcher.group(1));
    sectionList.add(regexMatcher.group(2));
} 
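
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same matching approach, run against the sample text from the question. The class name `SectionMatchDemo` and the helper `parse` are made up for illustration, and the pattern is the one above written on a single line without `Pattern.COMMENTS`. Note that text before the first delimiter is not captured by this pattern, so the opening "Old McDonald had a farm" line is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionMatchDemo {
    // Group 1: the delimiter. Group 2: everything up to the next delimiter.
    static final Pattern SECTION = Pattern.compile(
        "(<:==[\\]}]:>)((?:(?!<:==[\\]}]:>).)*)", Pattern.DOTALL);

    // Hypothetical helper: returns {delimiter, sectionText} pairs in order.
    static List<String[]> parse(String text) {
        List<String[]> result = new ArrayList<>();
        Matcher m = SECTION.matcher(text);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "Old McDonald had a farm\n"
            + "<:==}:>\n"
            + "EIEIO. And on that farm he had a cow\n"
            + "<:==]:>\n"
            + "And on that farm he....";
        for (String[] pair : parse(sample)) {
            // The fifth character of the delimiter is the ']' or '}' marker.
            char kind = pair[0].charAt(4);
            System.out.println(kind + " -> " + pair[1].trim());
        }
    }
}
```

If the leading undelimited text matters, one option is to take the substring before the first match's start and handle it separately.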

2 Comments

Looks like you grokked this completely. I think the answer to all your questions is Yes. For details, check out this regular expressions tutorial by Jan Goyvaerts, especially the sections on capturing groups and lookaround assertions. As for your last question, can you be more specific? Perhaps in the form of another question since comments are not really well suited for this?
I like this example with the comments, but note that a static regex is usually compiled once (for example, in a static field) and reused multiple times. Also see stackoverflow.com/questions/4935216/… and stackoverflow.com/questions/1360113/is-java-regex-thread-safe
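
To illustrate the two points raised in these comments (lookaround assertions and compiling the pattern once), here is a rough sketch that also covers the second half of the question: splitting so that each piece begins with its delimiter. It splits on a zero-width lookahead, so nothing is consumed by the split. The names `DelimiterSplitDemo`, `BEFORE_DELIMITER` and `splitKeepingDelimiters` are invented for this example. `Pattern` instances are immutable and safe to share between threads, which is why the pattern can live in a static final field; `Matcher` instances are not.

```java
import java.util.regex.Pattern;

class DelimiterSplitDemo {
    // Compiled once and reused. The lookahead matches the empty position
    // just before a delimiter, so the delimiter stays in the next piece.
    private static final Pattern BEFORE_DELIMITER =
        Pattern.compile("(?=<:==[\\]}]:>)");

    // Hypothetical helper: every piece after the first starts with a delimiter.
    static String[] splitKeepingDelimiters(String text) {
        return BEFORE_DELIMITER.split(text);
    }

    public static void main(String[] args) {
        String text = "intro<:==}:>first<:==]:>second";
        for (String part : splitKeepingDelimiters(text)) {
            System.out.println(part);
        }
        // prints:
        // intro
        // <:==}:>first
        // <:==]:>second
    }
}
```

With this approach the text before the first delimiter comes back as the first array element, so the caller still has to check whether a piece starts with `<:==` before reading the marker character.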
