I am working with XML in an Android app, and the text sometimes ends up with sentences run together, with no space after the period.

Like: First sentence.Another sentence

I know I need to use [a-z] (lowercase letters), [A-Z] (uppercase letters), and all digits ([0-9]?) to search before and after the period, and then add a space after the period.

Maybe something like:

myString = myString.replaceAll("(\\p{Ll})(\\p{Lu})", "$1 $2");

My searches and efforts have been useless so far, so any and all help is welcomed. Thanks

  • 1
    Couldn't you come up with a better title than I can not find this regex? Commented Feb 24, 2014 at 7:29
  • 1
    Your title sounds like you've lost your regex, and you need help finding it. Commented Feb 24, 2014 at 7:31
  • Never parse XML with regex. XML is not a regular language. Use well-known XML parsers instead. See this question: stackoverflow.com/questions/8577060/… Commented Feb 24, 2014 at 7:33
  • By the time I am making edits to the XML, it is already a well-formatted string. Commented Feb 24, 2014 at 7:34
  • At what point are these sentences stuck together without a space? Does the XML itself have sentences joined improperly, with no spaces or tags between them? Commented Feb 24, 2014 at 7:35

1 Answer

3

You were almost there; you just forgot to match the dot:

myString = myString.replaceAll("(\\p{Ll})\\.(\\p{Lu})", "$1. $2");

And since you're not actually doing anything with the letter before and after the dot, you can speed things up a bit by using lookaround assertions:

myString = myString.replaceAll("(?<=\\p{Ll})\\.(?=\\p{Lu})", ". ");
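
To make the two variants above easy to compare, here is a minimal sketch that runs both on the sample string from the question. The class name `SentenceSpacing` and the method names are hypothetical, chosen only for illustration; the two patterns are the ones shown above.

```java
class SentenceSpacing {
    // Capturing-group variant: re-inserts the surrounding letters via $1 and $2.
    static String withGroups(String s) {
        return s.replaceAll("(\\p{Ll})\\.(\\p{Lu})", "$1. $2");
    }

    // Lookaround variant: only the dot itself is matched and replaced.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=\\p{Ll})\\.(?=\\p{Lu})", ". ");
    }

    public static void main(String[] args) {
        String input = "First sentence.Another sentence";
        System.out.println(withGroups(input));      // First sentence. Another sentence
        System.out.println(withLookarounds(input)); // First sentence. Another sentence
    }
}
```

Both calls produce the same result here. Because each pattern requires a lowercase letter before the dot and an uppercase letter after it, a string such as `C.I.A.` is left unchanged.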

6 Comments

Of course, now we're putting extra spaces into acronyms written with periods. We could try to tell whether we're looking at an acronym, but then we run into more edge cases. Natural language correction is messy.
Yes, but this still misses the case where the character before or after the period is a digit, a lowercase letter, or an uppercase letter.
I know this is a messy thing to edit, but I think there will be very few of these cases.
If you also want to replace dots after uppercase letters and digits, just use [\\p{L}\\d] instead of \\p{Ll}, but then you'd also replace C.I.A. with C. I. A..
@TimPietzcker: Didn't see that the lookarounds were specifically lowercase and uppercase. It means we're missing different weird edge cases, but C.I.A. is currently fine.
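
The trade-off discussed in these comments can be sketched as follows. This is an illustration only: it assumes the lookbehind is widened to `[\\p{L}\\d]` as suggested above while the lookahead stays `\\p{Lu}`, and the class and method names (`BroaderSpacing`, `narrow`, `broad`) are hypothetical.

```java
class BroaderSpacing {
    // Original lookbehind: only a lowercase letter may precede the dot.
    static String narrow(String s) {
        return s.replaceAll("(?<=\\p{Ll})\\.(?=\\p{Lu})", ". ");
    }

    // Widened lookbehind: any letter or ASCII digit may precede the dot.
    static String broad(String s) {
        return s.replaceAll("(?<=[\\p{L}\\d])\\.(?=\\p{Lu})", ". ");
    }

    public static void main(String[] args) {
        System.out.println(narrow("C.I.A."));        // C.I.A.
        System.out.println(broad("C.I.A."));         // C. I. A.
        System.out.println(narrow("Version 2.Next")); // Version 2.Next
        System.out.println(broad("Version 2.Next"));  // Version 2. Next
    }
}
```

The wider class catches dots after digits and uppercase letters, at the cost of splitting dotted acronyms, so which one fits depends on how often each case appears in the text being cleaned.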