Parsing a string with multiple regexes

Question

Let's say I have two Java Patterns, one for finding whitespace at the beginning of the line, and the other for finding non-whitespace at the beginning of the line:

Pattern ws  = Pattern.compile("^\\s+");
Pattern nws = Pattern.compile("^\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";

I want to loop through the text, separating it into blocks of whitespace and blocks of non-whitespace, removing each token from the beginning of text:

while(text.length() > 0) {
    String nextToken = "";
    try {
        //TODO: detect grouping and move it to nextToken.
    } catch (Exception e) {
        //TODO: error handling
    }
    if(nextToken.length() > 0)
        _tokens.add(nextToken);
}

I don't just want to replace stuff. "\tSome \n\t text \n that needs \t parsing." should split to ["\t", "Some", "\n\t ", "text", ...]

How would you accomplish something like this?

Robert Tupelo-Schneck · Accepted Answer · 2014-09-04 18:00:16Z

2

You could use a Scanner and a single Pattern which matches either kind of token.

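import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// One pattern: a run of whitespace or a run of non-whitespace.
Pattern tokenPattern = Pattern.compile("\\s+|\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";
List<String> tokens = new ArrayList<String>();
Scanner scanner = new Scanner(text);
while (true) {
    // A horizon of 0 means the search is not bounded.
    String token = scanner.findWithinHorizon(tokenPattern, 0);
    if (token == null) break; // no more tokens
    tokens.add(token);
}
System.out.println(tokens);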

Avinash Raj · Accepted Answer · 2014-09-04 17:45:13Z

1

This removes the run of whitespace or non-whitespace characters at the start of the string:

System.out.println(str.replaceAll("^(?:\\s+|\\S+)", ""));
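
The one-liner above discards the leading token. If each removed token should be kept, as the question describes, one possible sketch is to match at the start of the remaining text with Matcher.lookingAt() and cut off what was matched. The class name LeadingTokenStripper and the method consume are made up for this illustration, and the combined pattern \s+|\S+ is an assumption standing in for the two separate patterns from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingTokenStripper {
    // One run of whitespace or one run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\S+");

    // Repeatedly removes the leading token from text and collects it.
    static List<String> consume(String text) {
        List<String> tokens = new ArrayList<>();
        while (!text.isEmpty()) {
            Matcher m = TOKEN.matcher(text);
            // lookingAt() only succeeds if the match starts at index 0.
            if (!m.lookingAt()) {
                // Not expected for this pattern; guards against an endless
                // loop if the pattern is changed to one that can fail.
                throw new IllegalStateException("No token at start of: " + text);
            }
            tokens.add(m.group());
            text = text.substring(m.end()); // drop the token just consumed
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(consume("Hello world."));
    }
}
```

Each pass creates a new substring, so for long inputs a single Matcher with find() is likely cheaper; the loop above only mirrors the step-by-step removal.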

3 Comments

AaronF Over a year ago

I don't just want to replace stuff. "\tSome \n\t text \n that needs \t parsing." should split to ["\t", "Some", "\n\t ", "text", ...]

Pshemo Over a year ago

@AaronF In that case, why is there "removing" in "I want to loop through the text, removing blocks of whitespace and blocks of non-whitespace"? Did you perhaps mean to write "separating"?

AaronF Over a year ago

Sorry for the confusion. I said "removing" because I want to successively remove the blocks of characters from text. "Hello world." becomes " world." becomes "world." becomes "", and [] becomes ["Hello"] becomes ["Hello", " "] becomes ["Hello", " ", "world."]

Pshemo · Accepted Answer · 2014-09-04 18:02:28Z

After your update, it seems that your goal is to separate whitespace from non-whitespace. In that case, the places where you should split can be described by a regex that uses look-around. In other words, the regex should match positions that have

non-whitespace before and whitespace after, or
whitespace before and non-whitespace after.

Such a regex can look like "(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)", and you can pass it to the split method:

String text = "\tSome \n\t text \n that needs \t parsing.";
for (String s:text.split("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)"))
    System.out.println("'"+s+"'");
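
Because the look-around regex matches zero-width positions, split should not consume any characters. A small sketch to check that, assuming the tokens are wanted as a List (the class name BoundarySplit and the method tokenize are illustrative, not part of any library):

```java
import java.util.Arrays;
import java.util.List;

class BoundarySplit {
    // Zero-width boundary: whitespace behind and non-whitespace ahead, or the reverse.
    private static final String BOUNDARY = "(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)";

    static List<String> tokenize(String text) {
        return Arrays.asList(text.split(BOUNDARY));
    }

    public static void main(String[] args) {
        String text = "\tSome \n\t text \n that needs \t parsing.";
        List<String> tokens = tokenize(text);
        // Nothing is consumed by the split, so joining the tokens restores the input.
        System.out.println(String.join("", tokens).equals(text));
        System.out.println(tokens.size());
    }
}
```

One difference from a find-based loop: for an empty input, split returns a single empty string rather than no tokens.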

Alternatively, you may want to use the alternation operator | (OR) together with the find method of Matcher to iterate over the text and find matching substrings.

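import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "\tSome \n\t text \n that needs \t parsing.";

// Either a run of whitespace or a run of non-whitespace.
Pattern p = Pattern.compile("\\s+|\\S+");
Matcher m = p.matcher(text);
while (m.find())
    System.out.println("'" + m.group() + "'");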

In both cases the output will be

'   '
'Some'
' 
     '
'text'
' 
 '
'that'
' '
'needs'
'    '
'parsing.'

(I surrounded the results with ' to show that, for instance, the first result does in fact contain the tab character \t, which is printed as ' ')
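
If the code also needs to know which kind of token it found, one option is to give each alternative its own capturing group and check which group took part in the match. This is only a sketch: the class name TokenKinds, the method describe, and the "WS"/"TEXT" labels are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenKinds {
    // Group 1 captures a whitespace run, group 2 a non-whitespace run.
    private static final Pattern TOKEN = Pattern.compile("(\\s+)|(\\S+)");

    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // group(1) is null when the whitespace alternative did not match.
            String kind = m.group(1) != null ? "WS" : "TEXT";
            out.add(kind + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : describe("\tSome \n\t text")) {
            // Escape tabs and newlines so whitespace tokens are visible.
            System.out.println(s.replace("\t", "\\t").replace("\n", "\\n"));
        }
    }
}
```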
