I am parsing through some XML and sanitizing some fields.

I'm trying to do the following in Java:

nameField = nameField.replaceAll("[^a-zA-Z\\d\\s\\.,'&amp;]", "");

I do not want to replace any letter of the alphabet, any digit, any whitespace, any period, any comma, any single quote or (this is where my issue is) the literal string &amp;.

But I do want to remove any standalone & or standalone ;

Obviously my regex as it stands won't work: it leaves in every & and every ;.

For example, if the string K&W@#9$9(AR;.0 O&amp; is found, my expected result would be: KW99AR.0 O&amp;.

How can I achieve this?

  • &amp; inside a character class will match &, a, m, p or ;. Commented Sep 9, 2016 at 13:19
  • Correct, which is what I'm not looking to do. I tried to note that the regex won't work as it stands for that reason: I want to match a string, not all the individual parts of the string. Commented Sep 9, 2016 at 13:42

2 Answers

Why don't you simplify your regular expression and just go with a lookahead/lookbehind:

//                                |"&" not followed by "amp;"
//                                |        | or
//                                |        | ";" not preceded by "&amp"
nameField = nameField.replaceAll("&(?!amp;)|(?<!&amp);", "");

The output for "K&W@#9$9(AR;.0 O&amp;" would be:

KW@#9$9(AR.0 O&amp;

Edit

Then you can chain this with a cleanup that leaves only your desired characters. Here, I added ; and & to the negated character class so that they are kept: any standalone ones have already been removed by the previous operation.

Also, you don't need to escape the dot in a custom character class.

.replaceAll("[^a-zA-Z\\d\\s.,';&]", "");

The two chained invocations will return:

KW99AR.0 O&amp;
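
For reference, here is a minimal, self-contained sketch of the two chained calls. The class name NameSanitizer and the method name sanitize are made up for illustration, and the single quote is included in the keep list because the question asks for it.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class NameSanitizer {
    // Pass 1: "&" not followed by "amp;", or ";" not preceded by "&amp".
    private static final Pattern STRAY = Pattern.compile("&(?!amp;)|(?<!&amp);");
    // Pass 2: anything outside the keep list (letters, digits, whitespace, . , ' ; &).
    private static final Pattern JUNK = Pattern.compile("[^a-zA-Z\\d\\s.,';&]");

    static String sanitize(String input) {
        String noStray = STRAY.matcher(input).replaceAll("");
        return JUNK.matcher(noStray).replaceAll("");
    }

    public static void main(String[] args) {
        // prints KW99AR.0 O&amp;
        System.out.println(sanitize("K&W@#9$9(AR;.0 O&amp;"));
    }
}
```

The order of the passes matters: if the junk were removed first, an input such as &@amp; would collapse into a fresh &amp; that was never in the source.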

Notes

  • As mentioned by Tushar, sequences of characters in a custom character class are not considered as sequences but alternate individual characters.
  • General rule of thumb: be careful about using regex to parse markup. You may very well end up with a bigger mess. Regular expressions are not made to parse markup or languages with a grammar.
  • Your specific case is safe enough, but remember there are other XML entities such as &gt;, &lt; etc.
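
If the other entities ever need to survive as well, the same lookahead/lookbehind idea can be widened. The following is only a sketch under that assumption: it covers the five entities predefined by XML (amp, lt, gt, quot, apos), does not handle numeric references such as &#169;, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class EntityAwareSanitizer {
    // The five entity names predefined by XML.
    private static final String NAMES = "(?:amp|lt|gt|quot|apos)";

    // "&" not starting one of those entities, or ";" not ending one.
    // The lookbehind alternatives have bounded length, which Java requires.
    private static final Pattern STRAY = Pattern.compile(
            "&(?!" + NAMES + ";)|(?<!&" + NAMES + ");");

    static String removeStray(String input) {
        return STRAY.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: a &lt; b c &amp; d  e
        System.out.println(removeStray("a &lt; b; c &amp; d & e"));
    }
}
```

The cleanup call from the answer can be chained after this unchanged, since entity names consist only of letters.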

8 Comments

I'm assuming you're suggesting that I use your solution to replace all occurrences of & and ; that are not part of the &amp; string. And then to do another regex such as [^a-zA-Z\d\s\.,'&;] that will replace all of the remaining characters I don't want (such as @ and #) to result in the desired string of KW99AR.0 O&amp;?
@Jsmith yes, I hadn't read your question carefully enough. There's likely a way to inline all this in one regex. Give me a few minutes.
@Jsmith actually, which are the characters you want replaced, aside from single & and ;? You still have a dot in your expected output, for instance.
Anything that isn't part of a-zA-Z\d\s\.,' or the string &amp;. Meaning I want to keep those, and I'm not sure what other junk characters may be passing through. I've had junk such as (not sure how the browser will interpret that character) show up.
The "junk" characters likely denote some encoding issue. I'd go with a chained invocation so you first remove the ones that need context, then clean up with the rest. Let me update the answer...

I think this should do it:

nameField = nameField.replaceAll("[^\\w&\\.\\s';,]","")
           .replaceAll("&amp;","%")
           .replaceAll("[&;]","")
           .replaceAll("%","&amp;");
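
As a rough, runnable sketch of the same placeholder idea (the class and method names are hypothetical), the two literal steps can use String.replace, which does no regex parsing:

```java
// Hypothetical helper; names are illustrative only.
class PlaceholderSanitizer {
    static String sanitize(String input) {
        return input
                .replaceAll("[^\\w&.\\s';,]", "") // drop junk, including any '%'
                .replace("&amp;", "%")            // park each entity behind a placeholder
                .replaceAll("[&;]", "")           // whatever & and ; remain are stray
                .replace("%", "&amp;");           // restore the entities
    }

    public static void main(String[] args) {
        // prints KW99AR.0 O&amp;
        System.out.println(sanitize("K&W@#9$9(AR;.0 O&amp;"));
    }
}
```

Note that \w also keeps the underscore, which is not in the question's keep list; swapping it for a-zA-Z\\d would match the question more closely.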

2 Comments

this only works if you're guaranteed never to have a % in the source. (And your replaceAll("&;","") is wrong anyway, and should be replaceAll("[&;]",""))
Thanks, corrected to [&;]. Any % is already removed by the first replaceAll, so yes, I can guarantee there is no % after that.
