Regex to parse columns data

Question

I want to parse column data and extract the required columns using a regex. This is the pattern I tried:

^\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\d+)\s*(\d+)

https://regex101.com/r/lwzfQA/1

From the link above, I want to parse only the three data rows, but the pattern also matches other lines such as "Sat Jan 30 15:56:06.144 UTC". The first two matches in the link are wrong, but the last two look fine. Which regex can I use to match only the column data?

The fourth bird · Accepted Answer · 2021-01-30 17:41:05Z

2

In your example data, the columns are separated by more than one whitespace character, but your pattern makes those spaces optional by using \s*. Also note that \S matches any non-whitespace character, which is a broad match.

As you tagged Java, I would suggest making use of \h{2,} to match 2 or more horizontal whitespace chars as \s can also match a newline and might give unexpected results.

You could also add an anchor $ to assert the end of the string to prevent partial matches.

^\h{2,}(\S+)\h{2,}(\S+)\h{2,}(\S+)\h{2,}(\S+)\h{2,}(\d+)\h{2,}(\d+)$

In Java, with the backslashes doubled:

String regex = "^\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\d+)\\h{2,}(\\d+)$";
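A minimal sketch of how this pattern might be applied to a multi-line input. The sample text is hypothetical (the original data is only available behind the regex101 link), and the names `ColumnRegexDemo`, `SAMPLE`, and `parse` are illustrative. It assumes `Pattern.MULTILINE`, so that `^` and `$` anchor at each line rather than at the whole input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnRegexDemo {
    // The pattern from the answer. MULTILINE lets ^ and $ match at each line.
    static final Pattern ROW = Pattern.compile(
        "^\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\S+)\\h{2,}(\\d+)\\h{2,}(\\d+)$",
        Pattern.MULTILINE);

    // Hypothetical input: a timestamp, a header, a separator, and two data rows.
    static final String SAMPLE = String.join("\n",
        "Sat Jan 30 15:56:06.144 UTC",
        "  Name     State   Mode    Type    In   Out",
        "  -------  ------  ------  ------  ---  ---",
        "  BE100    up      active  bundle  10   20",
        "  BE200    down    lacp    bundle  0    5");

    // Returns one String[] of six captured columns per matching line.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            String[] row = new String[m.groupCount()];
            for (int i = 0; i < row.length; i++) {
                row[i] = m.group(i + 1); // capture groups are 1-based
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : parse(SAMPLE)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With this sample, the timestamp is skipped because it has no leading whitespace, and the header and separator lines are skipped because their last two columns are not digits.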

answered Jan 30, 2021 at 17:41

The fourth bird


2 Comments

Prasad Over a year ago

There is a possibility that the column data starts without a leading space, which is why I used * in my regex. How can I handle that case?

The fourth bird Over a year ago

@Prasad Then you could use * for the leading whitespace, or omit it altogether if there is never any leading whitespace: ^\h*(\S+)\h+(\S+)\h+(\S+)\h+(\S+)\h+(\d+)\h+(\d+)$ See regex101.com/r/KbngKK/1 or regex101.com/r/82cl5M/1

EricC-59 · Accepted Answer · 2021-01-30 16:44:27Z

1

Try replacing your first "*" with a "+".

What it means:

"*" = 0 or more
"+" = 1 or more

Since all your data rows begin with some spaces, this excludes the date line, which does not begin with a space.

^\s+(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\d+)\s*(\d+)
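A small sketch of the difference between the two quantifiers, checked one line at a time. The data row is hypothetical and the class and method names are illustrative. It shows why the original pattern accepts the date line: with every separator optional, the engine is free to split a single token across several groups (for example, ending the fourth group inside "15:56:06.144" and taking the trailing digits as the two numeric groups).

```java
import java.util.regex.Pattern;

class LeadingQuantifierDemo {
    // Original pattern: leading whitespace is optional (\s*).
    static final Pattern STAR = Pattern.compile(
        "^\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\d+)\\s*(\\d+)");

    // Suggested pattern: at least one leading whitespace character (\s+).
    static final Pattern PLUS = Pattern.compile(
        "^\\s+(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\S+)\\s*(\\d+)\\s*(\\d+)");

    // True if the pattern finds a match in the given single line.
    static boolean matches(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String dateLine = "Sat Jan 30 15:56:06.144 UTC";
        String dataRow = "  BE100    up      active  bundle  10   20"; // hypothetical row

        System.out.println(matches(STAR, dateLine)); // true: unwanted match
        System.out.println(matches(PLUS, dateLine)); // false: no leading whitespace
        System.out.println(matches(PLUS, dataRow));  // true
    }
}
```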

answered Jan 30, 2021 at 16:44

EricC-59


1 Comment

Prasad Over a year ago

There is a possibility that the column data starts without a leading space, which is why I used * in my regex.

Anonymike · Accepted Answer · 2021-01-30 17:04:12Z

As Peter mentioned, your regex will work if you use the + (one or more) quantifier instead of * (zero or more) for the leading whitespace.

I would further encourage you to recognize that this is a "fixed width" table format, meaning each column is simply padded with spaces to a predetermined width. If you will be parsing a large file this way, you will find it much more predictable and easier to debug to use a regex only to find the line of hyphens and chop off everything before it, then go line by line, taking a simple substring at each column's offset and width and trimming it.
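A rough sketch of that fixed-width approach. The column offsets in `STARTS`, the sample rows, and the helper `row` are all hypothetical, since the real column widths are not shown here; in practice the offsets would come from the actual layout of the file.

```java
import java.util.ArrayList;
import java.util.List;

class FixedWidthDemo {
    // Hypothetical column start offsets; the last column runs to the end of the line.
    static final int[] STARTS = {0, 9, 17, 25, 33, 38};

    // Pads each value to the assumed column width so the sample is truly fixed width.
    static String row(String... c) {
        return String.format("%-9s%-8s%-8s%-8s%-5s%s", (Object[]) c);
    }

    // Hypothetical input: a timestamp, a header, a hyphen separator, and two rows.
    static final String SAMPLE = String.join("\n",
        "Sat Jan 30 15:56:06.144 UTC",
        row("Name", "State", "Mode", "Type", "In", "Out"),
        row("-------", "------", "------", "------", "---", "---"),
        row("BE100", "up", "active", "bundle", "10", "20"),
        row("BE200", "down", "lacp", "bundle", "0", "5"));

    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        boolean inBody = false;
        for (String line : text.split("\\R")) {
            if (!inBody) {
                // Skip everything up to and including the line of hyphens and spaces.
                inBody = line.matches("[ \\t]*-[- \\t]*");
                continue;
            }
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines in the body
            }
            String[] cols = new String[STARTS.length];
            for (int i = 0; i < STARTS.length; i++) {
                // Clamp offsets so short lines do not throw.
                int from = Math.min(STARTS[i], line.length());
                int to = (i + 1 < STARTS.length)
                    ? Math.min(STARTS[i + 1], line.length())
                    : line.length();
                cols[i] = line.substring(from, to).trim();
            }
            rows.add(cols);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] cols : parse(SAMPLE)) {
            System.out.println(String.join(" | ", cols));
        }
    }
}
```

One advantage of slicing by offset is that an empty cell, or a value that itself contains spaces, still lands in the right column, which a whitespace-delimited regex cannot guarantee.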

If you wish to continue using regex for this, you could also explore other range quantifiers and named groups. This would make the regex a little more clear and help identify issues with formatting later. Please see the following example:

https://regex101.com/r/KEnutx/1

The (?<name>\d+), for example, names the capture group. In many languages, you can then refer to the group by this name, which makes it easy to pull out your data without tying your code to the index of each group. It is also much easier to find that name when you are debugging or improving your regex to accommodate changes.
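In Java, named groups can be read back with `Matcher.group(String)`. The sketch below is only an illustration: the group names (`name`, `state`, `mode`, `type`, `in`, `out`) and the sample row are hypothetical, because the real column headings are not shown in the question, and the pattern is not necessarily the one behind the regex101 link.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Hypothetical group names, listed in column order.
    static final String[] NAMES = {"name", "state", "mode", "type", "in", "out"};

    static final Pattern ROW = Pattern.compile(
        "^\\h*(?<name>\\S+)\\h+(?<state>\\S+)\\h+(?<mode>\\S+)\\h+(?<type>\\S+)"
            + "\\h+(?<in>\\d+)\\h+(?<out>\\d+)$",
        Pattern.MULTILINE);

    // Hypothetical input: a timestamp followed by one data row.
    static final String SAMPLE = String.join("\n",
        "Sat Jan 30 15:56:06.144 UTC",
        "  BE100    up      active  bundle  10   20");

    // Returns one map per matching line, keyed by group name.
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            Map<String, String> row = new LinkedHashMap<>();
            for (String n : NAMES) {
                row.put(n, m.group(n)); // look up by name instead of by index
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (Map<String, String> row : parse(SAMPLE)) {
            System.out.println(row.get("name") + " in=" + row.get("in") + " out=" + row.get("out"));
        }
    }
}
```

If a column is later inserted or reordered, only the pattern changes; the code that reads `row.get("in")` stays the same.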

Peter Knall · Accepted Answer · 2021-01-30 16:39:37Z

0

You can try changing your first whitespace filter from ^\s* to ^\s+. This effectively filters out the date line, since it does not begin with whitespace. Also, if possible, it might be helpful to make the filters more specific to the data you're searching for. For example, for "BE100" you could use \D+\d+, or something even more specific depending on the data.

answered Jan 30, 2021 at 16:39

Peter Knall


Ryszard Czech · Accepted Answer · 2021-01-30 21:39:11Z

\s also matches line breaks. Exclude them by using [^\S\r\n] in place of \s, and use + instead of *:

^[^\S\r\n]+(\S+)[^\S\r\n]+(\S+)[^\S\r\n]+(\S+)[^\S\r\n]+(\S+)[^\S\r\n]+(\d+)[^\S\r\n]+(\d+)


Explanation

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \4
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \5:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \5
--------------------------------------------------------------------------------
  [^\S\r\n]+               any character except: non-whitespace (all
                           but \n, \r, \t, \f, and " "), '\r'
                           (carriage return), '\n' (newline) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \6:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \6
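To illustrate why excluding line breaks matters, here is a small sketch comparing \s with [^\S\r\n] as the separator. The sample is hypothetical and deliberately contrived: a row with only four columns is followed by a stray line holding two numbers. The helper `build` and the class name are illustrative, and `Pattern.MULTILINE` is assumed so that ^ matches at each line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HorizontalWhitespaceDemo {
    // Builds the six-column pattern using the given whitespace token as separator.
    static Pattern build(String ws) {
        String sep = ws + "+";
        return Pattern.compile(
            "^" + sep + "(\\S+)" + sep + "(\\S+)" + sep + "(\\S+)" + sep + "(\\S+)"
                + sep + "(\\d+)" + sep + "(\\d+)",
            Pattern.MULTILINE);
    }

    static final Pattern LOOSE = build("\\s");            // separators may cross line breaks
    static final Pattern STRICT = build("[^\\S\\r\\n]");  // separators stay on one line

    // Hypothetical input: a short row, a stray numeric line, and one complete row.
    static final String SAMPLE = String.join("\n",
        "  BE100    up      active  bundle",
        "  10       20",
        "  BE200    down    lacp    bundle  0    5");

    // Counts how many matches the pattern finds in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("loose:  " + count(LOOSE, SAMPLE));
        System.out.println("strict: " + count(STRICT, SAMPLE));
    }
}
```

With \s, the first two lines are glued together into a six-column "row" that does not exist on any single line, so the loose pattern reports one more match than the strict one.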
