3

I have the following string:

String str = "Klaße, STRAßE, FUß";

Using a combined regex, I want to replace the German letter ß with ss or SS, depending on the case of the surrounding letters. To do this I have:

String replaceUml = str
        .replaceAll("ß", "ss")
        .replaceAll("A-Z|ss$", "SS")
        .replaceAll("^(?=^A-Z)(?=.*A-Z$)(?=.*ss).*$", "SS");

Expected result:

Klasse, STRASSE, FUSS

Actual result:

Klasse, STRAssE, FUSS

Where am I going wrong?

3 Comments

  • Not sure I understand what you think your expressions do. The first one replaces the ß with ss, the second one takes ss at the end of the string (or the literal text A-Z) and replaces it with SS (which is how FUSS happens to be right), but I cannot figure out what you think the third one is supposed to do... Can you clarify? Commented Aug 20, 2013 at 15:31
  • @Floris The third one is supposed to find a string that starts and ends with upper-case letters and has ss in between. If all of those conditions hold, it should replace the ss with SS. Commented Aug 20, 2013 at 15:51
  • If I understand correctly, you want lower-case ss if the string contains only lower-case characters and upper-case SS if it is all upper-case. I still can't figure out what the third replace is meant to achieve; you are probably confusing character class ranges with literal character matching. Commented Aug 20, 2013 at 15:54

4 Answers

4

First of all, if you're trying to match some character in the range A-Z, you need to put it in square brackets. This

.replaceAll("A-Z|ss$", "SS")

will look for the three characters A-Z in the source, which isn't what you want. Second, I think you're confused about what | means. If you say this:

.replaceAll("[A-Z]|ss$", "SS")

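it will replace every upper-case letter, anywhere in the string, with SS, and it will also replace an ss at the end of the string with SS. That is because | means "match this or that" and splits the whole pattern into two alternatives: [A-Z] on one side and ss$ on the other.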
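To make the difference concrete, here is a minimal sketch that runs both patterns on the string left over after the first replaceAll. The class and method names are made up for illustration.

```java
public class AlternationDemo {
    // Without brackets, "A-Z" is the literal three-character text A-Z.
    static String withoutBrackets(String s) {
        return s.replaceAll("A-Z|ss$", "SS");
    }

    // With brackets, [A-Z] is a character class. The | splits the pattern into
    // "one upper-case letter" OR "ss at the end of the input".
    static String withBrackets(String s) {
        return s.replaceAll("[A-Z]|ss$", "SS");
    }

    public static void main(String[] args) {
        String afterFirstStep = "Klasse, STRAssE, FUss";
        System.out.println(withoutBrackets(afterFirstStep)); // Klasse, STRAssE, FUSS
        System.out.println(withBrackets(afterFirstStep));    // SSlasse, SSSSSSSSssSS, SSSSSS
    }
}
```

Neither variant produces STRASSE, which is why the approach needs to change rather than just the brackets.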

A third problem with your approach is that the second and third replaceAll's will look for any ss that was in the original string, even if it didn't come from a ß. This may or may not be what you want.
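A small sketch of that third problem, using the first two steps from the question on a word that never contained ß. The class and method names are hypothetical; \u00DF is the Unicode escape for ß, used so the source compiles regardless of file encoding.

```java
public class PreexistingSsDemo {
    // The first two replacements from the question, unchanged.
    static String askersFirstTwoSteps(String s) {
        return s.replaceAll("\u00DF", "ss")
                .replaceAll("A-Z|ss$", "SS");
    }

    public static void main(String[] args) {
        // "Kuss" has no sharp s, yet its trailing "ss" is still upper-cased.
        System.out.println(askersFirstTwoSteps("Kuss")); // KuSS
    }
}
```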

Here's what I'd do:

String replaceUml = str
    .replaceAll("(?<=[A-Z])ß", "SS")
    .replaceAll("ß", "ss");

This will first replace all ß by SS if the character before the ß is an upper-case letter; then if there are any ß's left over, they get replaced by ss. Actually, this won't work if the character before ß is an umlaut like Ä, so you probably should change this to

String replaceUml = str
    .replaceAll("(?<=[A-ZÄÖÜ])ß", "SS")
    .replaceAll("ß", "ss");

(There may be a better way to specify an "upper-case Unicode letter"; I'll look for it.)

EDIT:

String replaceUml = str
    .replaceAll("(?<=\\p{Lu})ß", "SS")
    .replaceAll("ß", "ss");
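The three look-behind variants above differ only in which preceding letters count as upper-case. Here is a minimal sketch comparing them on SÜß; the class and method names are invented for illustration, and the literals use Unicode escapes (\u00DF for ß, \u00C4 \u00D6 \u00DC for Ä Ö Ü) so that source encoding does not matter.

```java
public class LookbehindVariants {
    // ASCII-only look-behind: misses upper-case umlauts before the sharp s.
    static String asciiOnly(String s) {
        return s.replaceAll("(?<=[A-Z])\u00DF", "SS")
                .replaceAll("\u00DF", "ss");
    }

    // Character class extended with the three upper-case umlauts.
    static String withUmlauts(String s) {
        return s.replaceAll("(?<=[A-Z\u00C4\u00D6\u00DC])\u00DF", "SS")
                .replaceAll("\u00DF", "ss");
    }

    // \p{Lu} matches any Unicode upper-case letter.
    static String unicodeUpper(String s) {
        return s.replaceAll("(?<=\\p{Lu})\u00DF", "SS")
                .replaceAll("\u00DF", "ss");
    }

    public static void main(String[] args) {
        String input = "S\u00DC\u00DF";
        System.out.println(asciiOnly(input));    // lower-case ss after the umlaut
        System.out.println(withUmlauts(input));  // upper-case SS
        System.out.println(unicodeUpper(input)); // upper-case SS
    }
}
```

For this input the first variant gives SÜss, while the other two give SÜSS.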

One remaining problem: this gives the wrong result if ß is the second character of a word whose first letter is upper-case but whose remaining letters are not. In that case you probably want lower-case "ss".

String replaceUml = str
    .replaceAll("(?<=\\b\\p{Lu})ß(?=\\P{Lu})", "ss")
    .replaceAll("(?<=\\p{Lu})ß", "SS")
    .replaceAll("ß", "ss");

Now the first one will replace ß by ss if it's preceded by an upper-case letter that is the first letter of the word but followed by a character that isn't an upper-case letter. \P{Lu} with an upper-case P will match any character other than an upper-case letter (it's the negative of \p{Lu} with a lower-case p). I also included \b to test for the first character of a word.
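As a rough check, the three-step version can be wrapped in a method and run on the question's input plus the Aßbc edge case. This is only a sketch; the class and method names are hypothetical, and \u00DF stands for ß.

```java
public class ThreeStepReplace {
    static String apply(String s) {
        return s
            // Capitalised word: sharp s right after the initial upper-case
            // letter and followed by something that is not upper-case.
            .replaceAll("(?<=\\b\\p{Lu})\u00DF(?=\\P{Lu})", "ss")
            // Any other sharp s directly after an upper-case letter.
            .replaceAll("(?<=\\p{Lu})\u00DF", "SS")
            // Everything that is left.
            .replaceAll("\u00DF", "ss");
    }

    public static void main(String[] args) {
        System.out.println(apply("Kla\u00DFe, STRA\u00DFE, FU\u00DF, A\u00DFbc"));
        // Klasse, STRASSE, FUSS, Assbc
    }
}
```

One thing to be aware of: the look-ahead (?=\P{Lu}) needs an actual following character, so a two-letter word such as Aß at the very end of the input falls through to the second rule and becomes ASS. A negative look-ahead, (?!\p{Lu}), would be one way to cover that case if it matters.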

2 Comments

+1 but I think you need a look ahead too, in case of Aßbc you want Assbc (don't know if German words exist with ß second letter, but from a pattern aspect, it's an edge case)
Yep, I was thinking about it at the same time you were. And I suspect that words like that do exist where the ß is followed by "t" or "p", but I don't remember for sure, and just to make things interesting the German government recently changed the rules for when you're supposed to use ß, so whatever I learned in high-school German is wrong now.
2
String replaceUml = str
    .replaceAll("(?<=\\p{Lu})ß", "SS")
    .replace("ß", "ss");

The regex matches a ß only when it is preceded by a Unicode upper-case letter (as in "SÜß") and replaces it with capital "SS"; any ß left over is then replaced with "ss".

The (?<= ... ) is a look-behind, a kind of context matching. You could also do

    .replaceAll("(\\p{Lu})ß", "$1SS")

since ß does not occur at the beginning of a word.

Your main trouble was not using brackets [A-Z].
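A quick sketch comparing the look-behind form with the capturing-group form. The class and method names are made up for illustration; \u00DF is ß and \u00DC is Ü.

```java
public class CaptureGroupVariant {
    // Look-behind: the upper-case letter is checked but not consumed.
    static String withLookbehind(String s) {
        return s.replaceAll("(?<=\\p{Lu})\u00DF", "SS")
                .replace("\u00DF", "ss");
    }

    // Capturing group: the upper-case letter is consumed and put back via $1.
    static String withCaptureGroup(String s) {
        return s.replaceAll("(\\p{Lu})\u00DF", "$1SS")
                .replace("\u00DF", "ss");
    }

    public static void main(String[] args) {
        String input = "S\u00DC\u00DF, STRA\u00DFE, Kla\u00DFe";
        System.out.println(withLookbehind(input));
        System.out.println(withCaptureGroup(input));
    }
}
```

Both calls print the same line for this input, with SS after the upper-case letters and ss in Klasse.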

0

Breaking your regex into parts:

Regex

/ß/g

Description

ß Literal ß
g modifier: global. All matches (don't return on first match)

Regex

/([A-Z])ss$/g

Description

1st Capturing group ([A-Z]) 
    Char class [A-Z]  matches:
        A-Z A character range between Literal A and Literal Z
ss Literal ss
$ End of string
g modifier: global. All matches (don't return on first match)

Regex

/([A-Z]+)ss([A-Z]+)/g

Description

1st Capturing group ([A-Z]+) 
    Char class [A-Z] 1 to infinite times [greedy] matches:
        A-Z A character range between Literal A and Literal Z
ss Literal ss
2nd Capturing group ([A-Z]+) 
    Char class [A-Z] 1 to infinite times [greedy] matches:
        A-Z A character range between Literal A and Literal Z
g modifier: global. All matches (don't return on first match)

Specifically, for your case:

String replaceUml = str
    .replaceAll("ß", "ss")
    .replaceAll("([A-Z])ss$", "$1SS")
    .replaceAll("([A-Z]+)ss([A-Z]+)", "$1SS$2");
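The replacement strings above refer to the captured groups as $1 and $2, which is the syntax Java expects. A minimal sketch of the difference follows; the class and method names are hypothetical and \u00DF is ß. In a Java replacement string a backslash just makes the next character literal, so \1 produces the digit 1 rather than the group.

```java
public class BackreferenceSyntax {
    // Group references written the Java way: $1, $2.
    static String withDollarGroups(String s) {
        return s.replaceAll("\u00DF", "ss")
                .replaceAll("([A-Z])ss$", "$1SS")
                .replaceAll("([A-Z]+)ss([A-Z]+)", "$1SS$2");
    }

    // Perl/sed-style \1 in the replacement: the captured letter is lost
    // and a literal "1" is inserted instead.
    static String withBackslashGroup(String s) {
        return s.replaceAll("([A-Z])ss$", "\\1SS");
    }

    public static void main(String[] args) {
        System.out.println(withDollarGroups("Kla\u00DFe, STRA\u00DFE, FU\u00DF")); // Klasse, STRASSE, FUSS
        System.out.println(withBackslashGroup("FUss"));                            // F1SS
    }
}
```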

5 Comments

Ummm, isn't that going to delete the character before the ss?
Yeah, if you click the links you'll see I used the capture groups; I just didn't use them in the solution code. Editing.
You need to double-backslash \1 and \2.
This is Java, not JavaScript.
Yes, I didn't pick up on that earlier. (Also, it isn't Perl.) It should be $1 and $2. \\1 and \\2 don't work.
-1

Use String.replaceFirst() instead of String.replaceAll().

replaceAll("ß", "ss")

This replaces every occurrence of "ß", so the output after this statement is:

Klasse, STRAssE, FUss

Now replaceAll("A-Z|ss$", "SS") replaces the "ss" at the end of the string with "SS", so your final result looks like this:

Klasse, STRAssE, FUSS

To get your expected result for this particular input, where only the first ß needs lower-case "ss", try this:

String replaceUml = str.replaceFirst("ß", "ss").replaceAll("ß", "SS");
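A short sketch showing that this one-liner depends on the order of the words rather than on their case. The class and method names are invented for illustration; \u00DF is ß.

```java
public class ReplaceFirstOrderDemo {
    // First sharp s becomes "ss", every later one becomes "SS".
    static String firstLowerRestUpper(String s) {
        return s.replaceFirst("\u00DF", "ss").replaceAll("\u00DF", "SS");
    }

    public static void main(String[] args) {
        // Works when the only mixed-case word happens to come first.
        System.out.println(firstLowerRestUpper("Kla\u00DFe, STRA\u00DFE, FU\u00DF")); // Klasse, STRASSE, FUSS
        // Fails as soon as the order changes.
        System.out.println(firstLowerRestUpper("STRA\u00DFE, Kla\u00DFe"));           // STRAssE, KlaSSe
    }
}
```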

4 Comments

With your suggestion I have now following result: Klasse, STRAssE, FUss
try using replaceFirst(). It will help. :-)
following your answer we have result: Klasse, STRAssE, FUSS, but I want Klasse, STRASSE, FUSS
see the last part of my answer. I have pasted the code to get your expected result. Hope this will help.
