0

I want to parse column data and extract the required column info with a regex. Below are the pattern I tried and a link to it:

^\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\d+)\s*(\d+)

https://regex101.com/r/lwzfQA/1

In the link above, I want to parse only the three data rows, but the pattern also matches other lines such as "Sat Jan 30 15:56:06.144 UTC". The first two matches in the link are wrong, while the last two look fine. Which regex can I use to parse only the column info?

5 Answers

2

In your example data, the columns are separated by more than one whitespace character, but your pattern makes those spaces optional by using \s*. Also note that \S matches any non-whitespace character, which is a very broad match.

As you tagged Java, I would suggest making use of \h{2,} to match 2 or more horizontal whitespace chars as \s can also match a newline and might give unexpected results.

You could also add an anchor $ to assert the end of the string to prevent partial matches.

^\h{2,}(\S+)\h{2,}(\S+)\h{2,}(\S+)\h{2,}(\S+)\h{2,}(\d+)\h{2,}(\d+)$

Regex demo

In Java, with the backslashes doubled:

String regex = "^\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\d+)\\h{2,}(\\d+)$";
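
As a rough sketch of how that string might be used on a whole block of text: the class name, method name, and sample rows below are made up for illustration (the real data is only in the regex101 link), and Pattern.MULTILINE is my addition so that ^ and $ anchor at every line rather than only at the ends of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HorizontalWhitespaceDemo {
    // Pattern from the answer above. MULTILINE makes ^ and $ match at line boundaries.
    static final Pattern ROW = Pattern.compile(
        "^\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\d+)\\h{2,}(\\d+)$",
        Pattern.MULTILINE);

    // Returns one String[6] per matching line; non-matching lines are skipped.
    static List<String[]> parseRows(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            String[] cols = new String[m.groupCount()];
            for (int i = 0; i < cols.length; i++) {
                cols[i] = m.group(i + 1); // groups are 1-based
            }
            rows.add(cols);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the asker's actual data.
        String sample = String.join("\n",
            "Sat Jan 30 15:56:06.144 UTC",
            "  Name      Type      State     Mode      In    Out",
            "  BE100     bundle    up        active    10    20",
            "  BE200     bundle    down      standby   0     5");
        for (String[] row : parseRows(sample)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The date line is skipped because it has no leading whitespace, and the header line is skipped because its last two columns are not digits.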

2 Comments

The column data may also start without a leading space, which is why I added * in my regex. How can I handle that case?
@Prasad Then you could use * for the leading whitespace, or omit it altogether if there is never any leading whitespace: ^\h*(\S+)\h+(\S+)\h+(\S+)\h+(\S+)\h+(\d+)\h+(\d+)$ See regex101.com/r/KbngKK/1 or regex101.com/r/82cl5M/1
1

Try to replace your first "*" by a "+".

What it means:

  • "*" = 0 or more
  • "+" = 1 or more

Since all your data rows begin with some spaces, this excludes the date line, which does not begin with a space.

^\s+(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\d+)\s*(\d+)
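
To see the difference, here is a small sketch that tests one line at a time with both the original pattern and the one above. The class and method names and the sample data row are assumptions for illustration; only the date line is taken from the question.

```java
import java.util.regex.Pattern;

class LeadingWhitespaceDemo {
    // Original pattern from the question: leading whitespace is optional (\s*).
    static final Pattern ORIGINAL = Pattern.compile(
        "^\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\d+)\\s*(\\d+)");

    // Suggested pattern: at least one leading whitespace character (\s+).
    static final Pattern FIXED = Pattern.compile(
        "^\\s+(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\d+)\\s*(\\d+)");

    static boolean matchesOriginal(String line) {
        return ORIGINAL.matcher(line).find();
    }

    static boolean matchesFixed(String line) {
        return FIXED.matcher(line).find();
    }

    public static void main(String[] args) {
        String dateLine = "Sat Jan 30 15:56:06.144 UTC";
        String dataLine = "  BE100  bundle  up  active  10  20"; // hypothetical row

        System.out.println("original, date line: " + matchesOriginal(dateLine));
        System.out.println("fixed,    date line: " + matchesFixed(dateLine));
        System.out.println("fixed,    data line: " + matchesFixed(dataLine));
    }
}
```

The original pattern finds a match in the date line because, with every separator optional, the regex engine is free to split a single word across several \S+ groups.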

1 Comment

The column data may also start without a leading space, which is why I added * in my regex.
1

As Peter mentioned, your regex will function if using the + (one or more) operator on your space matching at the beginning instead of * (zero or more).

I would further encourage you to recognize that this is a "fixed-width" table, meaning each column is simply padded with spaces to a predetermined width. If you will be parsing a large file this way, you will find it much more predictable and easier to debug to use a regex only to match the line of hyphens and chop off everything before it, then go line by line, taking a simple substring at each column's width and trimming it.
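
A minimal sketch of that fixed-width approach follows. The column widths, class name, and sample rows are all assumptions for illustration; you would need to replace WIDTHS with the real widths of your table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FixedWidthTableParser {
    // Hypothetical column widths; adjust to the real table layout.
    static final int[] WIDTHS = {10, 10, 10, 10, 6, 6};

    // A separator line consists only of hyphens (checked after trimming).
    static final Pattern SEPARATOR = Pattern.compile("-+");

    // Skips everything up to and including the hyphen line, then splits each row.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        boolean inBody = false;
        for (String line : lines) {
            if (!inBody) {
                inBody = SEPARATOR.matcher(line.trim()).matches();
                continue;
            }
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines in the body
            }
            rows.add(splitFixed(line));
        }
        return rows;
    }

    // Cuts the line at the fixed column boundaries and trims the padding.
    static String[] splitFixed(String line) {
        String[] cols = new String[WIDTHS.length];
        int start = 0;
        for (int i = 0; i < WIDTHS.length; i++) {
            int from = Math.min(start, line.length());
            int to = Math.min(start + WIDTHS[i], line.length());
            cols[i] = line.substring(from, to).trim();
            start += WIDTHS[i];
        }
        return cols;
    }

    public static void main(String[] args) {
        // Rows are built with String.format so the padding agrees with WIDTHS.
        String fmt = "%-10s%-10s%-10s%-10s%-6s%-6s";
        List<String> lines = Arrays.asList(
            "Sat Jan 30 15:56:06.144 UTC",
            String.format(fmt, "Name", "Type", "State", "Mode", "In", "Out"),
            "----------------------------------------------------",
            String.format(fmt, "BE100", "bundle", "up", "active", "10", "20"));
        for (String[] row : parse(lines)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Unlike the whitespace-separated regex, this version also copes with empty cells, since an empty column simply trims to an empty string.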

If you wish to continue using regex for this, you could also explore other range quantifiers and named groups. This would make the regex a little more clear and help identify issues with formatting later. Please see the following example:

https://regex101.com/r/KEnutx/1

The (?<name>\d+), for example, names the capture group. In many languages, you can then refer to the group by this name, easily pulling out your data without tying your code to the index of each group. It is also much easier to find that name when you are debugging or improving your regex to accommodate changes.
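
In Java, named groups could be used roughly like this. The group names (name, type, state, mode, inCount, outCount) and the sample row are invented for illustration, since the actual column headings are only in the linked demo. Java group names may contain only ASCII letters and digits and must start with a letter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupRowParser {
    // Hypothetical group names; each (?<x>...) can be read back with group("x").
    static final Pattern ROW = Pattern.compile(
        "^\\h*(?<name>\\S+)\\h+(?<type>\\S+)\\h+(?<state>\\S+)\\h+(?<mode>\\S+)"
        + "\\h+(?<inCount>\\d+)\\h+(?<outCount>\\d+)$");

    static final String[] GROUPS = {"name", "type", "state", "mode", "inCount", "outCount"};

    // Returns the columns keyed by group name, or empty if the line is not a data row.
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> cols = new LinkedHashMap<>();
        for (String g : GROUPS) {
            cols.put(g, m.group(g)); // look up by name, not by index
        }
        return Optional.of(cols);
    }

    public static void main(String[] args) {
        System.out.println(parse("  BE100  bundle  up  active  10  20"));
        System.out.println(parse("Sat Jan 30 15:56:06.144 UTC"));
    }
}
```

If a column is later added or reordered, only the pattern needs to change; the code that reads cols.get("state") stays the same.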

0

You can try changing your first whitespace filter from ^\s* to ^\s+. This effectively filters out the date line, since it does not begin with whitespace. Also, if possible, it might be helpful to make the filters more specific to the data you're searching for. For example, for "BE100" you could use \D+\d+, or something even more specific depending on the data.

0

\s also matches line breaks. Exclude them by using [^\S\r\n] instead of \s, and use + instead of *:

^[^\S\r\n]+(\S+)[^\S\r\n]+(\S+)[^\S\r\n]+(\S+)[^\S\r\n]+(\S+)[^\S\r\n]+(\d+)[^\S\r\n]+(\d+)

See proof

Explanation

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \4
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \5:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \5
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \6:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \6
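
The line-break issue only shows up when the pattern is run over the whole text with the multiline flag, so here is a small sketch comparing the two. The class name, method name, and the data row are assumptions for illustration; the \s+ pattern is the one suggested in another answer on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HorizontalOnlyDemo {
    // \s version: whitespace may include \r and \n.
    static final String WITH_S =
        "^\\s+(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\d+)\\s*(\\d+)";

    // [^\S\r\n] version: whitespace that is neither \r nor \n.
    static final String HORIZONTAL =
        "^[^\\S\\r\\n]+(\\S+)[^\\S\\r\\n]+(\\S+)[^\\S\\r\\n]+(\\S+)"
        + "[^\\S\\r\\n]+(\\S+)[^\\S\\r\\n]+(\\d+)[^\\S\\r\\n]+(\\d+)";

    // Counts matches of the regex over the whole text, with ^ anchoring at each line.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A blank line before the date line lets \s+ consume the newline
        // and then continue matching on the date line itself.
        String text = "\nSat Jan 30 15:56:06.144 UTC\n"
            + "  BE100  bundle  up  active  10  20\n"; // hypothetical data row
        System.out.println("\\s version:        " + countMatches(WITH_S, text));
        System.out.println("[^\\S\\r\\n] version: " + countMatches(HORIZONTAL, text));
    }
}
```

With \s, the leading ^\s+ can start on the empty line, swallow the newline, and produce an unwanted match on the date line; the [^\S\r\n] version cannot cross a line break, so only the data row matches.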
