
I have a text file which contains lines, and some of them are in the following format:

  • 3 tabs at the start of the line,
  • followed by a few words and a line break at the end.

I need to catch the words in these lines, one by one (with the index of each word in the text).

I thought about a solution using 2 regex patterns and 2 loops (added the code below), but I would like to know if there is a better solution using only one regex pattern.

Here is an example for lines from the text:

            Hello I am studying regex!
            This is a line in the text.
                Don't need to add this line
        nor this line.
            But this line should be included.
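
My current code:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Map<String, Integer> wordsMap = new HashMap<>();

// Outer pattern: the content of each line that starts with three tabs.
Pattern p = Pattern.compile("\\t{3}(.*)\\n");
Matcher m = p.matcher(text);

// Inner pattern: each word (run of non-whitespace characters) in that content.
Pattern p2 = Pattern.compile("(\\S+)");
Matcher m2 = p2.matcher("");

while (m.find()) {
    m2.reset(m.group(1));
    while (m2.find()) {
        // Index in the original text = offset of the line content + offset within it.
        wordsMap.put(m2.group(1), m.start(1) + m2.start(1));
    }
}
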
3 comments
  • How about a solution without regex? 0. Check include-criteria 1. Trim, 2. Split by whitespace, 3. rule out empty/whitespace-only entries. Commented Jun 14, 2019 at 7:06
  • Hey @Flidor and thanks for the fast reply. I need to use regex here for this exercise, and I also need the index of each word in the original text. Commented Jun 14, 2019 at 7:08
  • You should add to the question that using regex is mandatory. Commented Jun 14, 2019 at 7:10

1 Answer

You may use

(?:\G(?!^)\h+|^\t{3})(\S+)

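See the regex demo. Compile the pattern with the Pattern.MULTILINE flag so that ^ matches at the start of every line, not only at the start of the text.

Each word is captured in Group 1.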

Details

  • (?:\G(?!^)\h+|^\t{3}) - either the end of the previous match (but not at the start of a line) followed by 1+ horizontal whitespace chars, or three tabs at the start of a line
  • (\S+) - Group 1: any 1+ non-whitespace chars. A tab is whitespace, so a line that starts with four tabs is not matched.
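
To illustrate why the flag matters, here is a minimal sketch that counts the Group 1 matches with and without Pattern.MULTILINE. The class name MultilineFlagDemo, the helper countWords, and the sample text are made up for this illustration. Without the flag, ^ only matches at the very start of the text, so only the first line can contribute words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFlagDemo {
    // Same pattern as above, written as a Java string literal.
    static final String REGEX = "(?:\\G(?!^)\\h+|^\\t{3})(\\S+)";

    // Hypothetical helper: counts how many words the pattern finds with the given flags.
    static int countWords(String text, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two lines with three tabs, then one line with only two tabs.
        String text = "\t\t\tfirst line\n\t\t\tsecond line\n\t\tskipped line";
        System.out.println(countWords(text, 0));                 // 2: only "first" and "line"
        System.out.println(countWords(text, Pattern.MULTILINE)); // 4: both three-tab lines
    }
}
```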

Java demo:

String s = "\t\t\tHello I am studying regex!\n\t\t\tThis is a line in the text.\n\t\t\t\tDon't need to add this line\n\t\tnor this line.\n\t\t\tBut this line should be included.";
Pattern pattern = Pattern.compile("(?:\\G(?!^)\\h+|^\t{3})(\\S+)", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
    System.out.println("Match: '" + matcher.group(1) + "', Start: " + matcher.start(1)); 
} 

Output:

Match: 'Hello', Start: 3
Match: 'I', Start: 9
Match: 'am', Start: 11
Match: 'studying', Start: 14
Match: 'regex!', Start: 23
Match: 'This', Start: 33
Match: 'is', Start: 38
Match: 'a', Start: 41
Match: 'line', Start: 43
Match: 'in', Start: 48
Match: 'the', Start: 51
Match: 'text.', Start: 55
Match: 'But', Start: 113
Match: 'this', Start: 117
Match: 'line', Start: 122
Match: 'should', Start: 127
Match: 'be', Start: 134
Match: 'included.', Start: 137
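
If the words need to end up in a map, as in the question, the same loop can fill one. The sketch below is only one possible adaptation: the class name WordIndexer and the method indexWords are hypothetical, and it assumes you want every position of a repeated word. That is why it stores a list of indices per word, since a Map<String, Integer> would keep only the last position of words such as "line" that occur more than once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordIndexer {
    private static final Pattern WORDS =
            Pattern.compile("(?:\\G(?!^)\\h+|^\\t{3})(\\S+)", Pattern.MULTILINE);

    // Maps each word to every start index at which it occurs in the text.
    // LinkedHashMap keeps the words in order of first appearance.
    static Map<String, List<Integer>> indexWords(String text) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            index.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.start(1));
        }
        return index;
    }

    public static void main(String[] args) {
        String s = "\t\t\tThis is a line\n\t\tnot this one\n\t\t\tBut this line counts";
        // Prints {This=[3], is=[8], a=[11], line=[13, 45], But=[36], this=[40], counts=[50]}
        System.out.println(indexWords(s));
    }
}
```

If only one position per word is needed, replacing the list with a plain put(m.group(1), m.start(1)) gives the behaviour of the original wordsMap.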