0

I'm having difficulty splitting a string without removing whitespace while still removing all other non-characters. I have a school task to read text in with BufferedReader, and the text contains lots of characters that even Eclipse couldn't display. The elements I read in have the form element1;element 2; element 3 (Element 4; Element 5 $Element 6 etc., and one of the delimiters to remove is ";".

I've tried .split("\\W"), but this also split on all the whitespace, and some elements came out completely empty, although it did remove the unwanted characters.

Right now I'm using .split("[;(),$]"), but this does not work properly because there are still characters that I can't identify.

2 Answers

1

Instead of trying to split on all the characters you don't want, you could list all the characters you do want, e.g.

String[] words = s.split("[^ a-zA-Z0-9]+");

Note: the ^ at the start of the character class means anything but these characters.

BTW: none of the characters are non-characters.
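The comments on this answer ask how to avoid empty elements and stray spaces around the results. As a minimal sketch (not part of the original answer), the whitelist split can be followed by a trim-and-filter pass. The class name WhitelistSplitDemo and the helper splitKeepingSpaces are hypothetical names chosen for illustration, and the sample string is the one from the question with a leading ";" added to provoke a leading empty element.

```java
import java.util.ArrayList;
import java.util.List;

class WhitelistSplitDemo {

    // Hypothetical helper: split on every run of characters that are NOT
    // a space, an ASCII letter or an ASCII digit, then trim each piece
    // and drop the pieces that end up empty.
    static List<String> splitKeepingSpaces(String s) {
        List<String> result = new ArrayList<>();
        for (String part : s.split("[^ a-zA-Z0-9]+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The leading ";" makes split() return a leading empty string,
        // which the filter above removes.
        String data = ";element1;element 2; element 3 (Element 4; Element 5 $Element 6 ";
        for (String element : splitKeepingSpaces(data)) {
            System.out.println("[" + element + "]");
        }
    }
}
```

Because the space is part of the whitelist, spaces inside an element such as "element 2" survive, and trim() only removes the ones at the edges. Note that this whitelist is ASCII-only, so letters such as "ä" would still act as delimiters.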


4 Comments

+1 for simple regex. Anyway, it could be a good idea to exclude all whitespace from the split, to prevent potentially splitting an element such as element\n10.
This option, as well as the one below, leaves empty elements in my arrays. I could fix that by writing a method to remove them (I had one before, but while reworking my code I deleted it, thinking it was no longer necessary). Or is there another way to avoid empty elements?
@Pshemo I would add the whitespace you expect, because there are quite a few whitespace characters and developers don't always think about them.
@charen Added a + to skip empty elements. This won't drop a leading empty element.
0

If \\W worked fine for you and the only problem was that it also split on whitespace, then you can use the intersection of \\W and \\S, which removes all whitespace from \\W.

Use split("[\\W&&\\S]+")

Also, to remove whitespace surrounding results like _element 3 (where _ represents whitespace), you can surround the regex with \\s*. To add Unicode support to the predefined character classes, just add the (?U) flag to the regex.

Demo:

String data = "element1;element 2; element 3 (Element 4; Element 5 $Element 6 ";
for (String s:data.split("(?U)\\s*[\\W&&\\S]+\\s*")){
    System.out.println(s);
}

Output:

element1
element 2
element 3
Element 4
Element 5
Element 6 
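To illustrate what the (?U) flag changes, here is a small sketch comparing the same regex with and without it. The class and method names are hypothetical, and the input combines the Estonian phrase "väga kiire" from the comments with a made-up integer field; "\u00e4" is simply "ä" written as an escape so the example does not depend on the source file's encoding.

```java
import java.util.Arrays;

class UnicodeSplitDemo {

    // Without (?U), \W means "not [a-zA-Z_0-9]", so a letter such as
    // "ä" counts as a non-word character and becomes a delimiter.
    static String[] splitAscii(String s) {
        return s.split("\\s*[\\W&&\\S]+\\s*");
    }

    // (?U) turns on UNICODE_CHARACTER_CLASS, so \W and \S follow the
    // Unicode definitions and "ä" is treated as a word character.
    static String[] splitUnicode(String s) {
        return s.split("(?U)\\s*[\\W&&\\S]+\\s*");
    }

    public static void main(String[] args) {
        String data = "v\u00e4ga kiire;42";
        System.out.println(Arrays.toString(splitAscii(data)));   // [v, ga kiire, 42]
        System.out.println(Arrays.toString(splitUnicode(data))); // [väga kiire, 42]
    }
}
```

If the flagged version still splits inside a word, the character in the string is probably not the letter it appears to be, which would have to be checked against the actual input data.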

7 Comments

This seems to work fine, but now it seems \\W also treats the non-ASCII characters that my language uses as delimiters, so if I read "ä", "ö", "ü" or "ö" from the text file, it splits on them as well. Any idea what to add so it would skip these too?
Hmm, it still splits on a so-called non-ASCII character. vĆ Exception in thread "main" java.lang.NumberFormatException: For input string: "ga kiire"
Regex doesn't throw NumberFormatException; it seems that you are trying to parse ga kiire with Integer.parseInt or something similar. For now I can only guess what the problem with your data/code is. To make your question answerable, please include an example that could be used to reproduce your problem.
No, no, the NumberFormatException occurs because it splits in the wrong place. Without this split inside the Estonian word "väga kiire", there would be another element in this position, which is an integer. Right now the word causing the problem is " vĆga kiire", I believe.
I understand that, but without seeing exactly how your data should be split, I will not be able to help you. I provided an answer that solves your problem as it is written now. As I said earlier, you need to provide an example that lets me reproduce your current problem. Post the data you are trying to split, the expected split result, and how it is actually being split, so I can see what could cause this behaviour.
