Java replaceAll: do not replace a specific string

Question

I am parsing through some XML and sanitizing some fields.

I'm trying to do the following in Java:

nameField = nameField.replaceAll("[^a-zA-Z\\d\\s\\.,'&amp;]", "");

I do not want to replace any letter of the alphabet, any digit, any whitespace, any period, any comma, any single quote or (this is where my issue is) the literal string &amp;.

But I do want to replace occurrences of a single & or a single ;

But obviously my regex as it sits won't work. It'll leave in all & and all ;.

For example, if the string K&W@#9$9(AR;.0 O&amp; is found, my expected result would be: KW99AR.0 O&amp;.

How can I achieve this?

&amp; inside a character class will match &, a, m, p or ;.
– Tushar, Commented Sep 9, 2016 at 13:19
Correct, which is not what I'm looking to do. I tried to note that the regex won't work as it sits for that reason: I want to match a string, not the individual characters of that string.
– Jsmith, Commented Sep 9, 2016 at 13:42
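
To make the comment above concrete, here is a minimal sketch that runs the pattern from the question against the sample input. The class and method names (`CharClassDemo`, `sanitize`) are made up for illustration.

```java
class CharClassDemo {
    // The pattern from the question. Inside [...], "&amp;" is read as the
    // individual characters '&', 'a', 'm', 'p' and ';', not as a sequence.
    static String sanitize(String input) {
        return input.replaceAll("[^a-zA-Z\\d\\s\\.,'&amp;]", "");
    }

    public static void main(String[] args) {
        // Every '&' and ';' survives, not only the ones inside "&amp;".
        System.out.println(sanitize("K&W@#9$9(AR;.0 O&amp;")); // K&W99AR;.0 O&amp;
    }
}
```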

Community · Accepted Answer · 2017-05-23 10:29:10Z

2

Why don't you simplify your regular expression and just go with a lookahead/lookbehind:

//                  |"&" not followed by "amp;"
//                  |          | or
//                  |          | ";" not preceded by "&amp"
nameField.replaceAll("&(?!amp;)|(?<!&amp);", "");

The output for "K&W@#9$9(AR;.0 O&amp;" would be:

KW@#9$9(AR.0 O&amp;

Edit

Then, you can chain this with a cleanup, leaving your desired characters only. Here, I added the ; and & to the exclude list, since they're already cleaned up when "standalone" by the previous operation.

Also, you don't need to escape the dot in a custom character class.

.replaceAll("[^a-zA-Z\\d\\s.,;&]", "");

The two chained invocations will return:

KW99AR.0 O&amp;
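
Put together, the two passes could look like the sketch below. The wrapper class and method names are hypothetical, and the single quote is added to the second character class as an assumption, because the question lists it among the characters to keep.

```java
class ChainedSanitizer {
    static String sanitize(String nameField) {
        return nameField
            // Pass 1: remove '&' that does not start "&amp;" and
            // ';' that does not end "&amp;".
            .replaceAll("&(?!amp;)|(?<!&amp);", "")
            // Pass 2: remove everything outside the allowed set. '&' and ';'
            // are allowed here because only the "&amp;" ones are left.
            // The single quote is an addition taken from the question.
            .replaceAll("[^a-zA-Z\\d\\s.,;&']", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("K&W@#9$9(AR;.0 O&amp;")); // KW99AR.0 O&amp;
    }
}
```

The order matters: running the cleanup first would still work here, but the context-dependent pass is easier to reason about while the input is untouched.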

Notes

As mentioned by Tushar, a sequence of characters inside a custom character class is not treated as a sequence but as a set of alternative individual characters.
General rule of thumb: be careful about using regex to parse markup. You may very well end up with a bigger mess. Regular expressions are not made to parse markup or languages with a grammar.
Your specific case is safe enough, but remember there are other XML entities, such as &gt; and &lt;.
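
If the other predefined XML entities need to survive as well, the lookahead/lookbehind idea can be extended with an alternation. This is only a sketch under that assumption, not part of the original answer; `EntityAwareCleaner` and `dropStray` are hypothetical names. It covers the five named entities XML predefines (amp, lt, gt, quot, apos) and does not handle numeric character references such as &#169;.

```java
import java.util.regex.Pattern;

class EntityAwareCleaner {
    // The five entity names that XML predefines.
    private static final String NAMES = "(?:amp|lt|gt|quot|apos)";

    // '&' that does not start a known entity, or ';' that does not end one.
    private static final Pattern STRAY =
        Pattern.compile("&(?!" + NAMES + ";)|(?<!&" + NAMES + ");");

    static String dropStray(String input) {
        return STRAY.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Stray '&' and ';' are removed, the entities stay intact.
        System.out.println(dropStray("a &lt; b; c &amp; d & e &gt;"));
    }
}
```

Java accepts the alternation inside the lookbehind because every alternative has a bounded length.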

edited May 23, 2017 at 10:29

CommunityBot


answered Sep 9, 2016 at 13:20

Mena


8 Comments

Jsmith Over a year ago

I'm assuming you're suggesting that I use your solution to replace all occurrences of & and ; that are not part of the &amp; string, and then run another regex such as [^a-zA-Z\d\s\.,'&;] to replace all of the remaining characters I don't want (such as @ and #), resulting in the desired string KW99AR.0 O&amp;?

Mena Over a year ago

@Jsmith yes, I hadn't read your question closely enough. There's likely a way to inline all this in one regex. Give me a few minutes.

Mena Over a year ago

@Jsmith actually, which are the characters you want replaced, aside from single & and ;? You still have a dot in your expected output, for instance.

Jsmith Over a year ago

Anything that isn't part of a-zA-Z\d\s\.,' or the string &amp;. Meaning I want to keep those, and I'm not sure what other junk characters may be passing through. I've had junk such as � (not sure how the browser will interpret that character) show up.

Mena Over a year ago

The "junk" characters likely denote some encoding issue. I'd go with a chained invocation so you first remove the ones that need context, then clean up with the rest. Let me update the answer...
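
For reference, one possible way to inline both steps in a single regex, as floated earlier in this thread, is to capture "&amp;" in a group and put it back, while any other disallowed character is replaced by nothing. This is a sketch of a different technique, not what the answer ended up using; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SinglePassSanitizer {
    // Alternative 1 matches "&amp;" as a whole and captures it in group 1.
    // Alternative 2 matches any single character outside the allowed set,
    // which includes stray '&' and ';'.
    private static final Pattern JUNK =
        Pattern.compile("(&amp;)|[^a-zA-Z\\d\\s.,']");

    static String sanitize(String input) {
        // "$1" restores "&amp;" when group 1 matched; when it did not,
        // Java appends nothing for the group, so the match is deleted.
        return JUNK.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("K&W@#9$9(AR;.0 O&amp;")); // KW99AR.0 O&amp;
    }
}
```

Because the regex engine tries the "&amp;" alternative first at each position, the entity is consumed as a unit before the character class can see its '&'.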


Oleksiy Grechnyev · Accepted Answer · 2016-09-09 13:52:40Z

1

I think this should do it:

nameField = nameField.replaceAll("[^\\w&\\.\\s';,]","")
           .replaceAll("&amp;","%")
           .replaceAll("[&;]","")
           .replaceAll("%","&amp;");

edited Sep 9, 2016 at 13:52

answered Sep 9, 2016 at 13:30

Oleksiy Grechnyev


2 Comments

J Earls Over a year ago

this only works if you're guaranteed never to have a % in the source. (And your replaceAll("&;","") is wrong anyway, and should be replaceAll("[&;]",""))

Oleksiy Grechnyev Over a year ago

Thanks, corrected to [&;]. Any % is already removed by the first replaceAll, so yes, I can guarantee there is no % after that.
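
The point about % can be checked with a small sketch of the placeholder approach. The class and method names are hypothetical, and the sample input is invented. Note that \w also keeps the underscore, which the whitelist in the question does not include.

```java
class PlaceholderSanitizer {
    static String sanitize(String nameField) {
        return nameField
            // 1. Strip junk. '%' is not in the allowed set, so it is
            //    removed here and is free to serve as a placeholder.
            .replaceAll("[^\\w&\\.\\s';,]", "")
            // 2. Hide every "&amp;" behind the placeholder.
            .replaceAll("&amp;", "%")
            // 3. Remove the remaining stray '&' and ';'.
            .replaceAll("[&;]", "")
            // 4. Restore the hidden entities.
            .replaceAll("%", "&amp;");
    }

    public static void main(String[] args) {
        // The '%' in the input is dropped in step 1, not turned into "&amp;".
        System.out.println(sanitize("50% K&W; O&amp;")); // 50 KW O&amp;
    }
}
```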
