How do you match a repeating group within a repeating group?

For example, getting all valid records in a log file:

---: 
TS : 150602000006S
EC1: 02429.523
EC2: 05604.110
---
---: 
TS : 150603000006S
---: 
TS : 150603000006S
EP1: 3333.523
---

The result should look like the following matches:

[ 
  [
    ['TS ', '150602000006S'], 
    ['EC1', '02429.523'],
    ['EC2', '05604.110']
  ], 
  [
    ['TS ', '150603000006S'],
    ['EP1', '3333.523']
  ]
]

Retrieving the individual record properties can be done with (see on regex101):

([A-Z0-9 ]{3,3}): ([0-9SW]+ )?([0-9\.SW]{3,})\n

However, when placing this regex inside a record group (as seen here), the property groups stop matching in a repeating fashion.
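
The behaviour described above can be reproduced with a short sketch. It assumes Python's re module (the question does not name a regex engine) and a hypothetical wrapper pattern that repeats the property pattern between the record delimiters. In this engine a repeated capture group keeps only its last iteration.

```python
import re

# Property pattern from the question, unchanged.
PROP = r"([A-Z0-9 ]{3,3}): ([0-9SW]+ )?([0-9\.SW]{3,})\n"

record = "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"

# Hypothetical record pattern: the property pattern repeated inside the delimiters.
wrapped = re.compile(r"---:\n(?:" + PROP + r")+---")
m = wrapped.search(record)

# Only the last repetition survives in the capture groups.
print(m.groups())  # ('EC2', None, '05604.110')

# Run on its own, the property pattern finds every property.
print([t[0] for t in re.findall(PROP, record)])  # ['TS ', 'EC1', 'EC2']
```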

How is this properly done?

  • Likely not with regex... Commented Nov 4, 2016 at 13:43
  • @Mena Why would this not be possible? Commented Nov 4, 2016 at 13:45
  • I'm not saying it's categorically not possible, but generally regular expressions are good for parsing text that does not depend on the context of a given grammar (e.g. regex against markup is generally a very bad idea). When you have nested elements and rules for nesting, regular expressions become cumbersome very quickly. Assuming you find the right way to match your hierarchical records, the expression itself will be long, likely unreadable, and very hard to maintain. Typically you'd want to implement your own parser for this. Commented Nov 4, 2016 at 13:48

1 Answer

In order to keep this maintainable, I would try to split this into a couple of regular expressions.

First, do a basic check to ensure the data is in the format you expect. I would count the number of occurrences of each of the following two expressions. If the counts are not equal, simply give up*.

---:\n
---(\n|$)
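
As a minimal sketch of that check, assuming Python's re module and that the delimiter lines carry no trailing whitespace (the function name is_balanced is made up for illustration):

```python
import re

OPEN = re.compile(r"---:\n")
CLOSE = re.compile(r"---(\n|$)")

def is_balanced(text: str) -> bool:
    """Return True when opening and closing delimiters occur equally often."""
    opens = sum(1 for _ in OPEN.finditer(text))
    closes = sum(1 for _ in CLOSE.finditer(text))
    return opens == closes

well_formed = "---:\nTS : 150602000006S\nEC1: 02429.523\n---\n"
malformed = (
    "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
    "---:\nTS : 150603000006S\n"
    "---:\nTS : 150603000006S\nEP1: 3333.523\n---\n"
)

print(is_balanced(well_formed))  # True
print(is_balanced(malformed))    # False: three openings, two closings
```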

Once you know these are equal, you probably want to match the whole string against a pattern to break it into sections, e.g.

---:\n.*?---(\n|$)

This represents a literal ---: followed by a newline, followed by as little text as possible (*? is lazy), followed by a literal --- and then either a newline or the end of the string. You would need to run this with the single-line (dot-all) flag so that . also matches newlines.

On well-formed input this gives you one match per record. You could then run your pattern on each of the resulting matches.
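
The two-step approach could look roughly like the sketch below. It assumes Python's re module, adds a capture group around the section body for convenience, and assumes that only the first and third groups of the property pattern (name and value) are wanted. The name parse_records is illustrative.

```python
import re

# Section pattern from above, with a capture group added around the body.
SECTION = re.compile(r"---:\n(.*?)---(?:\n|$)", re.DOTALL)
# Property pattern from the question, unchanged.
PROP = re.compile(r"([A-Z0-9 ]{3,3}): ([0-9SW]+ )?([0-9\.SW]{3,})\n")

def parse_records(text):
    """Split text into sections, then collect [name, value] pairs per section."""
    records = []
    for section in SECTION.finditer(text):
        body = section.group(1)
        records.append([[m.group(1), m.group(3)] for m in PROP.finditer(body)])
    return records

text = (
    "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
    "---:\nTS : 150603000006S\nEP1: 3333.523\n---\n"
)
result = parse_records(text)
print(result)
```

This only behaves sensibly after the count check has passed. On input with an unclosed record, the lazy match runs on to the next closing ---, so the unclosed record is merged into the one that follows it.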


*Giving up may seem like the easy way out here, but it is difficult to make any accurate guesses about incorrectly formatted data. Considering your earlier example, we have two choices if we want to normalise this data, both added as comments:

---:
TS : 150602000006S
EC1: 02429.523
EC2: 05604.110
---
---:
TS : 150603000006S
       // Add a closing tag here?
---:   // Remove this opening tag?
TS : 150603000006S
EP1: 3333.523
---

What are the consequences if we guess incorrectly? Are there any benefits to carrying on in the presence of errors? It will entirely depend on your application.
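
If carrying on does make sense for your application, the guess can at least be made explicit. The sketch below is not the regex approach from above but a line-by-line parser, written under the assumption that Python is available. The function parse_lenient and its on_unclosed option are hypothetical names. "close" and "merge" correspond to the two commented choices, and "drop" discards the unclosed record, which is what the expected output in the question implies.

```python
import re

PROP = re.compile(r"([A-Z0-9 ]{3}): ([0-9SW]+ )?([0-9.SW]{3,})")

def parse_lenient(text, on_unclosed="drop"):
    """Parse records line by line, applying one policy to unclosed records.

    on_unclosed (hypothetical option):
      "close": act as if a closing --- had been present (keep the partial record)
      "merge": act as if the new ---: were absent (continue the same record)
      "drop":  discard the unclosed record
    """
    records, current = [], None
    for raw in text.splitlines():
        line = raw.rstrip()  # tolerate trailing spaces such as "---: "
        if line == "---:":
            if current is not None:
                if on_unclosed == "merge":
                    continue
                if on_unclosed == "close":
                    records.append(current)
            current = []
        elif line == "---":
            if current is not None:
                records.append(current)
            current = None
        elif current is not None:
            m = PROP.fullmatch(line)
            if m:
                current.append([m.group(1), m.group(3)])
    if current is not None and on_unclosed == "close":
        records.append(current)  # record still open at end of input
    return records

sample = (
    "---:\nTS : 150602000006S\nEC1: 02429.523\nEC2: 05604.110\n---\n"
    "---:\nTS : 150603000006S\n"
    "---:\nTS : 150603000006S\nEP1: 3333.523\n---\n"
)
print(parse_lenient(sample, "drop"))
```

With "drop" the sample yields the two records the question asks for. "close" yields three records and "merge" yields two, the second containing the TS line twice.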

2 Comments

I agree with splitting the regex up into two parts. This does match the individual records; however, it does not take care of invalid records, as seen here. Is there a regex solution for this?
@JasperJ I have amended my answer. Hope it helps.
