Java replace all occurrences of regex with another regex

Question

Let's say I have a string containing XML with many occurrences of <tagA>:

String example = " (...) some xml here (...)
                    <tagA>283940</tagA>
                   (...) some xml here (...)
                    <tagA>& 9940</tagA>
                    <tagA>- 99440</tagA>
                    <tagA>< 99440</tagA>
                    <tagA>99440</tagA>
                   (...) more xml here (...) "

The content should contain only digits, but sometimes it has a random character followed by whitespace and then the digits. I want to remove the unwanted character and the whitespace. How can I do that?

So far I know I should be looking for a regex "<tagA>. [0-9]*<\/tagA>" but I am stuck here.

I want to replace the characters because among those characters there are "&", ">", "<" signs which make the xml invalid (which prevents me from treating this as an XML).

I cannot parse the XML because it is not valid. (The ampersand character makes the XML invalid.)
– Simon, Commented Jun 14, 2017 at 15:46
You can replace all & occurrences with something else before parsing it. Or URL-Encode the file.
– Alex Roig, Commented Jun 14, 2017 at 15:49
@AlexRoig there are & occurrences in other places of the string (inside CDATA), so this has to be limited to the tagA tag.
– Simon, Commented Jun 14, 2017 at 15:50

Alex Roig · Accepted Answer · 2017-06-14 16:28:55Z

The regex that you're looking for is: <(\w+)>(\D{0,})(\d+)

In each match, Group 1 captures the tag name, Group 2 captures the unwanted characters (everything that is not a digit), and Group 3 captures the number.
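The group layout of this first regex can be checked with a small sketch. The class name TagGroupsDemo and the helper describe are made up for illustration; only the pattern itself comes from the answer (\D* is the same as \D{0,}).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagGroupsDemo {
    // Pattern from the answer: tag name, non-digit junk, digits.
    static final Pattern TAG_VALUE = Pattern.compile("<(\\w+)>(\\D*)(\\d+)");

    // Hypothetical helper: returns "tag|junk|digits" for the first match,
    // or null when the input contains no match.
    static String describe(String input) {
        Matcher m = TAG_VALUE.matcher(input);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(describe("<tagA>& 9940</tagA>"));  // tagA|& |9940
        System.out.println(describe("<tagA>283940</tagA>"));  // tagA||283940
    }
}
```

Note that \D also matches "<" and ">", so on larger inputs Group 2 may run across a tag that contains no digits and into the next one. Treat this as a way to inspect the groups, not as a complete sanitizer.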

There's an "enhanced version" of this regex that might work in more situations: (\s{0,})(<\w+>)(\D{0,})(\d+)(\D{0,})(<\/\w+>)(\s{0,})

Group 1 captures any whitespace before the tag, and Group 7 captures any trailing whitespace. Groups 2 and 6 match the opening and closing tags. Groups 3 and 5 match any unwanted characters before and after the value. Group 4 contains the value itself.

With String::replaceAll, you can sanitize the input by emitting only Groups 2, 4 and 6 and dropping the rest.

// Input data
String s = "<tagA>283940</tagA>\n"
        + "                    <tagA>& 9940<</tagA>\n"
        + "                    <tagA>- 99440</tagA>\n"
        + "                    <tagA>< 99440</tagA>\n"
        + "                    <tagA>99440</tagA>"
        + "<13243> asdfasdf </>";

// Keep only the opening tag ($2), the digits ($4) and the closing tag ($6)
String replaced = s.replaceAll("(\\s{0,})(<\\w+>)(\\D{0,})(\\d+)(\\D{0,})(<\\/\\w+>)(\\s{0,})", "$2$4$6");
System.out.println(replaced);

Output: <tagA>283940</tagA><tagA>9940</tagA><tagA>99440</tagA><tagA>99440</tagA><tagA>99440</tagA><13243> asdfasdf </>
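If only <tagA> should be touched (the comments mention "&" inside CDATA elsewhere in the string), a narrower variant might be safer. The following is a sketch, not part of the original answer. It assumes the junk is exactly one non-digit character followed by whitespace, as the question describes, and the class name TagACleaner is made up for illustration.

```java
import java.util.regex.Pattern;

class TagACleaner {
    // Assumption: junk is one non-digit character plus whitespace, directly
    // after <tagA>. Any other tag or content is left untouched.
    private static final Pattern DIRTY_TAG_A =
            Pattern.compile("<tagA>(?:\\D\\s+)?(\\d+)</tagA>");

    // Rewrites every matching <tagA> element so it contains only the digits.
    static String clean(String xml) {
        return DIRTY_TAG_A.matcher(xml).replaceAll("<tagA>$1</tagA>");
    }

    public static void main(String[] args) {
        String xml = "<other><![CDATA[a & b]]></other>\n"
                + "<tagA>& 9940</tagA>\n"
                + "<tagA>< 99440</tagA>\n"
                + "<tagA>283940</tagA>";
        // The CDATA line and the line breaks are preserved; only the
        // junk prefix inside <tagA> is removed.
        System.out.println(clean(xml));
    }
}
```

Unlike the broader regex above, this variant would not handle trailing junk such as "9940<", and it keeps the whitespace between elements. Which trade-off fits depends on what the real data looks like.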
