1

Let's say I have two Java Patterns, one for finding whitespace at the beginning of the line, and the other for finding non-whitespace at the beginning of the line:

Pattern ws  = Pattern.compile("^\\s+");
Pattern nws = Pattern.compile("^\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";

I want to loop through the text, separating it into blocks of whitespace and blocks of non-whitespace, removing each token from the beginning of text:

while(text.length() > 0) {
    String nextToken = "";
    try {
        //TODO: detect grouping and move it to nextToken.
    } catch (Exception e) {
        //TODO: error handling
    }
    if(nextToken.length() > 0)
        _tokens.add(nextToken);
}

I don't just want to replace stuff. "\tSome \n\t text \n that needs \t parsing." should split to ["\t", "Some", "\n\t ", "text", ...]

How would you accomplish something like this?

3 Answers

2

You could use a Scanner and a single Pattern which matches either kind of token.

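import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Matches either a run of whitespace or a run of non-whitespace.
Pattern tokenPattern = Pattern.compile("\\s+|\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";
List<String> tokens = new ArrayList<>();
Scanner scanner = new Scanner(text);
while (true) {
    // A horizon of 0 means the search is not bounded.
    String token = scanner.findWithinHorizon(tokenPattern, 0);
    if (token == null) break; // no more tokens
    tokens.add(token);
}
System.out.println(tokens);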

1

This removes the run of whitespace or non-whitespace characters at the start of the string:

System.out.println(str.replaceAll("^(?:\\s+|\\S+)", ""));
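
If the removed prefix also needs to be kept, the same pattern can be applied with a Matcher instead of replaceAll, so each leading run is recorded before it is dropped. This is only a minimal sketch: the class name PrefixTokenizer and the method tokenize are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answers.
class PrefixTokenizer {
    // Same pattern as in the replaceAll call above:
    // one run of whitespace or of non-whitespace at the start of the input.
    private static final Pattern LEADING_RUN = Pattern.compile("^(?:\\s+|\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String rest = text;
        while (!rest.isEmpty()) {
            Matcher m = LEADING_RUN.matcher(rest);
            if (!m.find()) {
                // Should be unreachable: \S is the complement of \s.
                throw new IllegalStateException("No leading run in: " + rest);
            }
            tokens.add(m.group());          // keep the run that is about to be removed
            rest = rest.substring(m.end()); // drop it from the front of the text
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world."));
    }
}
```

Each iteration copies the remaining text via substring, so for long inputs a single Matcher walking the whole string with find() is likely the cheaper choice.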

3 Comments

I don't just want to replace stuff. "\tSome \n\t text \n that needs \t parsing." should split to ["\t", "Some", "\n\t ", "text", ...]
@AaronF In that case, why is there "removing" in "I want to loop through the text, removing blocks of whitespace and blocks of non-whitespace"? Did you perhaps mean to write "separating"?
Sorry for the confusion. I said "removing" because I want to successively remove the blocks of characters from text. "Hello world." becomes " world." becomes "world." becomes "", and [] becomes ["Hello"] becomes ["Hello", " "] becomes ["Hello", " ", "world."]
1

After your update it seems that your goal may be to separate whitespace from non-whitespace. In that case, the places at which to split can be described by a regex that uses look-around. In other words, the regex should match positions that have

  • non-whitespace before and whitespace after it
  • or whitespace before and non-whitespace character after it.

Such a regex can look like "(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)", and you can use it with the split method:

String text = "\tSome \n\t text \n that needs \t parsing.";
for (String s:text.split("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)"))
    System.out.println("'"+s+"'");

On the other hand, you may also want to use the alternation operator (OR, written as |) together with the find method of Matcher to iterate over the text and collect the matching substrings.

String text = "\tSome \n\t text \n that needs \t parsing.";

Pattern p = Pattern.compile("\\s+|\\S+");
Matcher m = p.matcher(text);
while(m.find())
    System.out.println("'"+m.group()+"'");

In both cases output will be

'   '
'Some'
' 
     '
'text'
' 
 '
'that'
' '
'needs'
'    '
'parsing.'

(I surrounded the results with ' to show that, for instance, the first result does contain the tab character \t, which is printed here as whitespace.)
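
To check that the two approaches above really agree, the tokens can be collected into lists and compared. The following is a rough sketch; the class TokenizerComparison and the methods bySplit and byFind are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness, not part of the original answer.
class TokenizerComparison {
    // Split at every boundary between whitespace and non-whitespace.
    static List<String> bySplit(String text) {
        return Arrays.asList(text.split("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)"));
    }

    // Collect every run of whitespace or non-whitespace with Matcher.find().
    static List<String> byFind(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\s+|\\S+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\tSome \n\t text \n that needs \t parsing.";
        // Prints whether both approaches produce the same token list.
        System.out.println(bySplit(text).equals(byFind(text)));
    }
}
```

One difference to keep in mind: for an empty input, split returns an array holding a single empty string, while the find loop yields no tokens at all.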
