I am trying to replace all instances of sentence terminators such as '.', '?', and '!', but I do not want to replace strings like "dr." and "mr.".

I have tried the following:

text = text.replaceAll("(?![mr|mrs|ms|dr])(\\s*[\\.\\?\\!]\\s*)", "\n");

...but that does not seem to work. Any suggestions would be appreciated.


Edit: After the feedback here and a bit of tweaking, this is the working solution to my problem.

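private String convertText(String text) {
  // Collapse every run of whitespace into a single space.
  text = text.replaceAll("\\s+", " ");
  // Strip parentheses, double quotes, commas, and colons (plus any stray line breaks).
  text = text.replaceAll("[\n\r\\(\\)\"\\,\\:]", "");
  // Replace each terminator with a line break unless it follows a listed
  // abbreviation or a single word character preceded by whitespace (e.g. an initial).
  text = text.replaceAll("(?i)(?<!dr|mr|mrs|ms|jr|sr|\\s\\w)(\\s*[\\.\\?\\!\\;](?:\\s+|$))","\r\n");
  return text.trim();
}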

The code will extract all* compound and single sentences from an excerpt of text, removing all punctuation and extraneous white-space.
*There are some exceptions...
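To make the behaviour concrete, here is a small sketch that wraps the method above in a class and runs it on two sample strings. The class name ConvertTextDemo and both inputs are made up for illustration; the method body is unchanged apart from being static. The second input shows one exception that can be reasoned out from the pattern: any word that merely ends in a listed abbreviation (such as "items", which ends in "ms") also blocks the split.

```java
class ConvertTextDemo {
    // Same three steps as the method above, made static so it can be called directly.
    static String convertText(String text) {
        text = text.replaceAll("\\s+", " ");
        text = text.replaceAll("[\n\r\\(\\)\"\\,\\:]", "");
        text = text.replaceAll(
            "(?i)(?<!dr|mr|mrs|ms|jr|sr|\\s\\w)(\\s*[\\.\\?\\!\\;](?:\\s+|$))", "\r\n");
        return text.trim();
    }

    public static void main(String[] args) {
        // Hypothetical input: a title, two initials, and three real terminators.
        String sample = "Dr. Smith met J. H. Ronaldo.  Is he right? Yes!";
        // Prints three lines: "Dr. Smith met J. H. Ronaldo", "Is he right", "Yes"
        System.out.println(convertText(sample));

        // Hypothetical input: "items" ends in "ms", so the lookbehind treats
        // its period like the one in "ms." and the text is not split there.
        String exception = "I bought items. Then I left.";
        // Prints one line: "I bought items. Then I left"
        System.out.println(convertText(exception));
    }
}
```

Only the final period of the second input is consumed, because it sits at the end of the text and is not preceded by a listed abbreviation.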

3
  • Try removing the brackets, [], from around the list of exceptions: (?!mr|mrs|ms|dr). Brackets denote a character set, not whole strings as you're using them. I don't know if it will entirely solve your problem, but it's worth a try. Commented Dec 6, 2012 at 5:20
  • There are several problems with trying to do that, though. How are you going to handle sequences like "J. H. Ronaldo says that the train is running on time.... Is he right?" Commented Dec 6, 2012 at 5:25
  • @Anthill, I have added support for ignoring single characters that precede a period. Is this the correct way, or is there an even easier method? Commented Dec 7, 2012 at 1:24

2 Answers

2

You need to use a negative lookbehind instead of a negative lookahead, like this:

String x = "dr. house.";
System.out.println(x.replaceAll("(?<!mr|mrs|ms|dr)(\\s*[\\.\\?\\!]\\s*)","\n"));

Also, the list of mr/dr/ms/mrs should not be inside a character class.
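For comparison, the sketch below runs the question's original pattern and the lookbehind version side by side on the same input. The class and method names are invented for this example, and the redundant escapes inside the character class have been dropped, which does not change what it matches. The third method is an assumption on my part rather than something from this answer: it adds a word boundary (\b) inside the lookbehind so that words that merely end in "ms" or "dr" are still split.

```java
class LookaroundDemo {
    // Optional whitespace, one terminator, optional whitespace.
    private static final String TERMINATOR = "(\\s*[.?!]\\s*)";

    // Original attempt: the brackets form a character class, so the lookahead
    // only rejects positions where the NEXT character is one of m, r, |, s, d.
    // The next character is always whitespace or a terminator, so it never fires.
    static String withCharClassLookahead(String text) {
        return text.replaceAll("(?![mr|mrs|ms|dr])" + TERMINATOR, "\n");
    }

    // Negative lookbehind: reject a terminator that directly follows a title.
    static String withLookbehind(String text) {
        return text.replaceAll("(?<!mr|mrs|ms|dr)" + TERMINATOR, "\n");
    }

    // Assumed refinement: require the title to start at a word boundary.
    static String withBoundedLookbehind(String text) {
        return text.replaceAll("(?<!\\b(?:mr|mrs|ms|dr))" + TERMINATOR, "\n");
    }

    public static void main(String[] args) {
        String x = "dr. house.";
        System.out.println(withCharClassLookahead(x)); // "dr" and "house" on separate lines
        System.out.println(withLookbehind(x));         // "dr. house" kept on one line
        System.out.println(withLookbehind("items. done."));        // "items. done"
        System.out.println(withBoundedLookbehind("items. done.")); // "items" then "done"
    }
}
```

Note that these lookbehinds are case-sensitive; prefix the pattern with (?i) if "Dr." and "Mr." should be protected as well.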


1 Comment

I was so close, I remember negative look-behinds vaguely. Thanks.
-1

You're going to need a complete list of the letter combinations that are allowed to precede a period. Then you can replace dr. and mr. (and any other allowed combinations) with something unique, like dr28dsj458sj and mr28dsj458sj. Ideally, you should check that your temporary substitute value exists nowhere else in the document. Next, go through and remove all your sentence terminators, then go through again and replace the occurrences of 28dsj458sj with a period.
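A minimal sketch of those three passes might look like the following. The class name, the method name, and the exact abbreviation list are assumptions for illustration; the placeholder value is the one suggested above. A named group is used in the replacement so that the digits at the start of the placeholder cannot be misread as part of a group number.

```java
import java.util.regex.Matcher;

class PlaceholderDemo {
    // Placeholder value taken from the answer text.
    static final String TOKEN = "28dsj458sj";

    // Assumed abbreviation list; "mrs" comes before "mr" so the longer one is tried first.
    static final String PROTECT = "(?i)\\b(?<abbr>mrs|mr|ms|dr)\\.";

    static String splitSentences(String text) {
        // Make sure the placeholder does not already occur in the input.
        if (text.contains(TOKEN)) {
            throw new IllegalArgumentException("placeholder already present in text");
        }
        // Pass 1: swap the period after each allowed abbreviation for the placeholder.
        String work = text.replaceAll(PROTECT, "${abbr}" + Matcher.quoteReplacement(TOKEN));
        // Pass 2: replace every remaining terminator (and surrounding whitespace).
        work = work.replaceAll("\\s*[.?!]\\s*", "\n");
        // Pass 3: put the protected periods back.
        return work.replace(TOKEN, ".");
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        System.out.print(splitSentences("Dr. House met Mr. Smith. Was he late? No!"));
        // Prints three lines: "Dr. House met Mr. Smith", "Was he late", "No"
    }
}
```

This makes three passes over the text instead of one, which is the performance trade-off compared with a single lookbehind pattern.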

1 Comment

I like this hacky workaround, but it may slow down performance. The negative look-behind was what I was going for. Thanks for your time.
