Let's say I have a string containing XML with many occurrences of <tagA>:

String example = " (...) some xml here (...)
                    <tagA>283940</tagA>
                   (...) some xml here (...)
                    <tagA>& 9940</tagA>
                    <tagA>- 99440</tagA>
                    <tagA>< 99440</tagA>
                    <tagA>99440</tagA>
                   (...) more xml here (...) "

The content should contain only digits, but sometimes it has a random character followed by a whitespace and then the digits. I want to remove the unwanted character and the whitespace. How can I do that?

So far I know I should be looking for a regex "<tagA>. [0-9]*<\/tagA>" but I am stuck here.

I want to replace the characters because they include "&", ">" and "<", which make the XML invalid (and prevent me from processing it as XML).

  • Do not use regular expressions to parse XML. Commented Jun 14, 2017 at 15:43
  • Use XPath and the starts-with function in the predicate. Commented Jun 14, 2017 at 15:43
  • I cannot parse the XML because it is not valid. (The ampersand character makes the XML invalid.) Commented Jun 14, 2017 at 15:46
  • You can replace all & occurrences with something else before parsing it. Or URL-encode the file. Commented Jun 14, 2017 at 15:49
  • @AlexRoig there are & occurrences in other places of the string (inside CDATA), so this has to be limited to the tagA tag. Commented Jun 14, 2017 at 15:50

1 Answer

The regex that you're looking for is: <(\w+)>(\D{0,})(\d+)

In a match, group 1 holds the tag name, group 2 holds the unwanted characters (everything that is not a digit), and group 3 holds the number.
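To see what each group captures, here is a minimal sketch using java.util.regex. The class name TagGroups and the helper describe are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagGroups {
    // The three-group pattern from the answer: tag name, non-digit junk, digits.
    private static final Pattern SIMPLE = Pattern.compile("<(\\w+)>(\\D{0,})(\\d+)");

    // Hypothetical helper: returns "tag|junk|digits" for the first match,
    // or null if the pattern does not match anywhere in the input.
    static String describe(String input) {
        Matcher m = SIMPLE.matcher(input);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(describe("<tagA>& 9940</tagA>")); // tagA|& |9940
        System.out.println(describe("<tagA>283940</tagA>")); // tagA||283940
    }
}
```

When the tag content is already clean, group 2 is simply the empty string, so the same pattern covers both the valid and the broken entries.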

There's an "enhanced version" of this regex that might work in more situations: (\s{0,})(<\w+>)(\D{0,})(\d+)(\D{0,})(<\/\w+>)(\s{0,})

Group 1 captures any whitespace before the opening tag, and group 7 captures any trailing whitespace. Groups 2 and 6 match the opening and closing tags. Groups 3 and 5 match any non-digit characters before and after the value. Group 4 contains the value itself.

With String::replaceAll you can sanitize the input by emitting only groups 2, 4 and 6 in the replacement and dropping the rest.

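    // Input data: valid and broken <tagA> entries, plus a trailing fragment
    // that the pattern should leave untouched.
    String s = "<tagA>283940</tagA>\n" +
            "                    <tagA>& 9940<</tagA>\n" +
            "                    <tagA>- 99440</tagA>\n" +
            "                    <tagA>< 99440</tagA>\n" +
            "                    <tagA>99440</tagA>" +
            "<13243> asdfasdf </>";

    // Keep only the opening tag ($2), the digits ($4) and the closing tag ($6).
    String replaced = s.replaceAll("(\\s{0,})(<\\w+>)(\\D{0,})(\\d+)(\\D{0,})(<\\/\\w+>)(\\s{0,})", "$2$4$6");
    System.out.println(replaced);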

Output: <tagA>283940</tagA><tagA>9940</tagA><tagA>99440</tagA><tagA>99440</tagA><tagA>99440</tagA><13243> asdfasdf </>
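The pattern above matches any tag name. Since the comments on the question say that "&" also appears elsewhere (inside CDATA) and only tagA should be touched, a narrower variant may be a safer fit. The following is a sketch under assumptions, not part of the original answer: it assumes the junk is exactly one non-digit character followed by whitespace, as the question describes, and the names TagASanitizer and sanitize are hypothetical.

```java
public class TagASanitizer {
    // Hypothetical helper: strips one non-digit character plus the following
    // whitespace from the start of a <tagA> value, leaving everything else as-is.
    static String sanitize(String xml) {
        return xml.replaceAll("<tagA>\\D\\s+(\\d+)</tagA>", "<tagA>$1</tagA>");
    }

    public static void main(String[] args) {
        String s = "<tagA>& 9940</tagA><tagA>< 99440</tagA>"
                + "<tagB><![CDATA[a & b]]></tagB><tagA>283940</tagA>";
        // Only the two broken <tagA> entries are rewritten.
        System.out.println(sanitize(s));
    }
}
```

Because the tag name is spelled out and the junk part cannot contain digits, the match cannot spill over into neighbouring elements or into the CDATA section.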
