Regex: Get nested repeating group

Question

How do you match a repeating group within a repeating group?

For example, extracting all valid records from a log file:

---: 
TS : 150602000006S
EC1: 02429.523
EC2: 05604.110
---
---: 
TS : 150603000006S
---: 
TS : 150603000006S
EP1: 3333.523
---

The desired matches would look like this:

[ 
  [
    ['TS ', '150602000006S'], 
    ['EC1', '02429.523'],
    ['EC2', '05604.110']
  ], 
  [
    ['TS', '150603000006S'], 
    ['EP1', '3333.523']
  ]
]

The individual record properties can be retrieved with the following pattern (see on regex101):

([A-Z0-9 ]{3,3}): ([0-9SW]+ )?([0-9\.SW]{3,})\n

However, when this regex is placed inside a record group (as seen here), the property groups stop matching in a repeating fashion.

How is this properly done?
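The behaviour described above can be reproduced with a small sketch. It assumes Python's `re` module and assumes the attempt wrapped the property pattern in a repeated non-capturing group; the names `PROPERTY` and `RECORD` are made up for illustration. In most regex engines a capture group inside a repetition keeps only its last iteration, which is presumably why the earlier properties disappear.

```python
import re

# Property pattern from the question ({3,3} written as {3}).
PROPERTY = r"([A-Z0-9 ]{3}): ([0-9SW]+ )?([0-9.SW]{3,})\n"

# Hypothetical "record" pattern: the property pattern repeated inside a group.
RECORD = re.compile(r"---:\n(?:" + PROPERTY + r")+---")

sample = "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
m = RECORD.search(sample)

# Only the last repetition of each capture group survives.
print(m.group(1), m.group(3))  # EC2 05604.110
```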

I'm not saying it's categorically impossible, but regular expressions are generally good for parsing text that is not contextual to a given grammar (e.g. running a regex against markup is generally a very bad idea). When you have nested elements and rules for nesting, regular expressions become cumbersome very quickly. Assuming you find the right way to match your hierarchical records, the expression itself will be long, likely unreadable, and very hard to maintain. Typically you'd want to implement your own parser for this.
– Mena, Commented Nov 4, 2016 at 13:48

Michael · Accepted Answer · 2016-11-04 17:19:26Z


In order to keep this maintainable, I would try to split this into a couple of regular expressions.

First, you want to do some kind of basic check to ensure the data is in a format you expect. I would count the number of occurrences of each of the following expressions. If they do not match then simply give up*.

---:\n
---(\n|$)
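A minimal sketch of this check, assuming Python's `re` module and a log with no trailing whitespace after `---:`. The helper name `is_balanced` is hypothetical.

```python
import re

# The two patterns from above, used verbatim.
OPENER = re.compile(r"---:\n")
CLOSER = re.compile(r"---(\n|$)")

def is_balanced(log: str) -> bool:
    """Return True when openers and closers occur equally often."""
    return len(OPENER.findall(log)) == len(CLOSER.findall(log))

# The example from the question: the second record is never closed.
log = (
    "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
    "---:\nTS : 150603000006S\n"
    "---:\nTS : 150603000006S\nEP1: 3333.523\n---\n"
)

print(is_balanced(log))  # False: three openers, two closers
```

Note that equal counts are only a basic sanity check; they do not prove that openers and closers alternate correctly.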

Once you know these are equal, you probably want to match the whole string against a pattern to break it into sections, e.g.

---:\n.*?---(\n|$)

This represents a literal ---: followed by a newline, then as little text as possible (*? is lazy), then a literal --- followed by either a newline or the end of the string. You would need to run this with the single-line (dot-all) flag so that . also matches newlines.

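On input that passes the check above, this gives you one match per record. Your example string as posted does not pass: it has three openers and only two closers, so the pattern returns two matches, the second of which swallows the unterminated record together with the one after it. You could then run your property pattern on each of the resultant matches.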
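The two-step approach might look roughly like this in Python, run on a well-formed sample. This is a sketch, not a definitive implementation: `re.DOTALL` plays the role of the single line flag, the property pattern is the one from the question, and the function name `parse_records` is an assumption.

```python
import re

# Step 1: split the log into record sections (DOTALL lets "." cross newlines).
RECORD = re.compile(r"---:\n.*?---(?:\n|$)", re.DOTALL)

# Step 2: the property pattern from the question, applied per section.
PROPERTY = re.compile(r"([A-Z0-9 ]{3}): ([0-9SW]+ )?([0-9.SW]{3,})\n")

def parse_records(log: str) -> list:
    """Return one list of [name, value] pairs per record section."""
    records = []
    for section in RECORD.finditer(log):
        text = section.group(0)
        records.append([[p.group(1), p.group(3)] for p in PROPERTY.finditer(text)])
    return records

# Well-formed sample: every opener has a closer.
log = (
    "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
    "---:\nTS : 150603000006S\nEP1: 3333.523\n---\n"
)

for record in parse_records(log):
    print(record)
```

Keeping the two patterns separate means each one stays short enough to read and test on its own.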

*Giving up may seem like the easy way out here, but it is difficult to make any accurate guesses about incorrectly formatted data. Considering your earlier example, we have two choices if we want to normalise this data, both added as comments:

---:
TS : 150602000006S
EC1: 02429.523
EC2: 05604.110
---
---:
TS : 150603000006S
       // Add a closing tag here?
---:   // Remove this opening tag?
TS : 150603000006S
EP1: 3333.523
---

What are the consequences if we guess incorrectly? Are there any benefits to carrying on in the presence of errors? It will entirely depend on your application.
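If silently dropping unterminated records is acceptable (the expected output in the question suggests it might be), one possible sketch is to forbid `---` at the start of any body line with a negative lookahead, so an opener that is not followed by its own closer simply fails to match. This is an assumption about the desired policy, and the name `valid_records` is hypothetical.

```python
import re

# An opener line, one or more body lines that do not start with "---",
# then a closer line. MULTILINE makes ^ and $ work per line.
VALID_RECORD = re.compile(r"^---: ?\n((?:(?!---).*\n)+)---$", re.MULTILINE)

# The property pattern from the question.
PROPERTY = re.compile(r"([A-Z0-9 ]{3}): ([0-9SW]+ )?([0-9.SW]{3,})\n")

def valid_records(log: str) -> list:
    """Return [name, value] pairs for properly closed records only."""
    return [
        [[p.group(1), p.group(3)] for p in PROPERTY.finditer(m.group(1))]
        for m in VALID_RECORD.finditer(log)
    ]

# The example from the question, including the unterminated second record.
log = (
    "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
    "---:\nTS : 150603000006S\n"
    "---:\nTS : 150603000006S\nEP1: 3333.523\n---\n"
)

for record in valid_records(log):
    print(record)
```

The unterminated record is skipped without any warning, which may or may not be what your application needs.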

edited Nov 4, 2016 at 17:19

answered Nov 4, 2016 at 14:13

Michael


2 Comments

JasperJ Over a year ago

I agree with splitting the regex into two parts, and this does match the individual records. However, it does not take care of invalid records, as seen here. Is there a regex solution for this?

Michael Over a year ago

@JasperJ I have amended my answer. Hope it helps.
