I'm trying to write a Python library to parse our version format strings. The (simplified) version string format is as follows:

<product>-<x>.<y>.<z>[-alpha|beta|rc[.<n>][.<extra>]][.centos|redhat|win][.snb|ivb]

This is:

  • product, e.g. foo
  • numeric version, e.g. 0.1.0
  • [optional] pre-release info, e.g. beta, rc.1, alpha.extrainfo
  • [optional] operating system, e.g. centos
  • [optional] platform, e.g. snb, ivb

So the following are valid version strings:

1) foo-1.2.3
2) foo-2.3.4-alpha
3) foo-3.4.5-rc.2
4) foo-4.5.6-rc.2.extra
5) withos-5.6.7.centos
6) osandextra-7.8.9-rc.extra.redhat
7) all-4.4.4-rc.1.extra.centos.ivb

For all of those examples, the following regex works fine:

^(?P<prod>\w+)-(?P<maj>\d).(?P<min>\d).(?P<bug>\d)(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>\w+))?)?(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$

But the problem comes in versions of this type (no 'extra' pre-release information, but with os and/or platform):

8) issue-0.1.0-beta.redhat.snb

With the above regex, for string #8, redhat is being picked up in the pre-release extra info pre_x, instead of the os group.
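
To make the failure concrete, here is a minimal sketch that runs the pattern from the question against string #8 with the standard re module. The name QUESTION_RE is only an illustrative label; the pattern itself is copied unchanged, including its unescaped dots.

```python
import re

# The pattern from the question, copied as-is and split over several lines.
QUESTION_RE = re.compile(
    r'^(?P<prod>\w+)-(?P<maj>\d).(?P<min>\d).(?P<bug>\d)'
    r'(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>\w+))?)?'
    r'(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$'
)

groups = QUESTION_RE.match('issue-0.1.0-beta.redhat.snb').groupdict()

# pre_x greedily takes "redhat", so nothing is left for the os group.
print(groups['pre_x'], groups['os'], groups['plat'])  # redhat None snb
```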

I tried using look-behind to avoid picking the os or platform strings in pre_x:

...(?:\.(?P<pre_x>\w+))?(?<!centos|redhat|win|ivb|snb))...

That is:

^(?P<prod>\w+)-(?P<maj>\d).(?P<min>\d).(?P<bug>\d)(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>\w+))?(?<!centos|redhat|win|ivb|snb))?(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$

This would work if Python's standard re module accepted variable-width look-behind, but it does not. I would rather stick to the standard library than depend on the third-party regex module, because my library is likely to be distributed to a large number of machines and I want to limit dependencies.
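
For reference, a small sketch of what happens when the look-behind fragment is handed to re.compile. The variable name lookbehind_ok is just for illustration, and the exact error text may differ between Python versions, so the sketch only checks that an error is raised.

```python
import re

try:
    # The alternatives have different lengths (6 and 3 characters),
    # which the standard re module does not accept inside a look-behind.
    re.compile(r'(?:\.(?P<pre_x>\w+))?(?<!centos|redhat|win|ivb|snb)')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print('re rejected the pattern:', exc)
```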

I've also looked at similar questions (this, this and this), but they are not applicable.

Any ideas on how to achieve this?

My regex101 link: https://regex101.com/r/bH0qI7/3

[For those interested, this is the full regex I'm actually using: https://regex101.com/r/lX7nI6/2]

  • Could transforming your regex to use lookaheads help with anything? Commented May 27, 2015 at 13:47
  • Yes, I don't mind using lookaheads, I just want to stick to regex and the standard re module. TBH, I'm lost transforming this to lookaheads. Commented May 27, 2015 at 13:49

2 Answers

You need a negative lookahead assertion so that the pre_x group matches any word except centos or redhat:

^(?P<prod>\w+)-(?P<maj>\d)\.(?P<min>\d)\.(?P<bug>\d)(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>(?:(?!centos|redhat)\w)+))?)?(?:\.(?P<os>centos|redhat))?(?:\.(?P<plat>snb|ivb))?$
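
Here is a sketch of how the lookahead idea could be wired into Python's standard re module and checked against the eight strings from the question. It is a variation under stated assumptions, not the exact pattern above: the names VERSION_RE and parse_version are hypothetical, the lookahead is applied once at the start of the token and closed with \b so that only whole words are rejected, win and the platform names are excluded as well, and the numeric fields use \d+ so that multi-digit numbers match.

```python
import re

# Assumption: os/platform names are reserved words that can never be "extra".
VERSION_RE = re.compile(
    r'^(?P<prod>\w+)-(?P<maj>\d+)\.(?P<min>\d+)\.(?P<bug>\d+)'
    r'(?:-(?P<pre>alpha|beta|rc)'
    r'(?:\.(?P<pre_n>\d+))?'
    # Reject the token only if it is exactly one of the reserved words.
    r'(?:\.(?P<pre_x>(?!(?:centos|redhat|win|snb|ivb)\b)\w+))?'
    r')?'
    r'(?:\.(?P<os>centos|redhat|win))?'
    r'(?:\.(?P<plat>snb|ivb))?$'
)


def parse_version(text):
    """Return the named groups as a dict, or None if text does not match."""
    match = VERSION_RE.match(text)
    return match.groupdict() if match else None


samples = [
    'foo-1.2.3',
    'foo-2.3.4-alpha',
    'foo-3.4.5-rc.2',
    'foo-4.5.6-rc.2.extra',
    'withos-5.6.7.centos',
    'osandextra-7.8.9-rc.extra.redhat',
    'all-4.4.4-rc.1.extra.centos.ivb',
    'issue-0.1.0-beta.redhat.snb',
]

for sample in samples:
    print(sample, parse_version(sample))
```

With this variation, string #8 should come back with pre_x unset, os set to redhat and plat set to snb.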

2 Comments

That was easy!! Thanks a lot
A nitpick, @Avinash Raj: shouldn't maj/min/bug use \d+? I followed the link in the demo and tried foo-3.4.15-rc.2, which doesn't match. With "release early and release fast" ;-) it's not hard for version numbers to reach two digits (if not three :-) ).

Actually I'd avoid using the regex, since it looks pretty horrible already, and you told us it's only simplified. It's much more readable to parse it by hand:

def extract(text):
    # Split off the product name; the remainder holds the version and,
    # after an optional second '-', the pre-release info.
    name, _, rest = text.partition('-')
    groups = [part.split('.') for part in rest.split('-', 1)]

    # Platform and OS are always the last tokens, whichever part they end,
    # so strip them from the tail before reading version and extra.
    tail = groups[-1]
    platform = tail.pop() if tail and tail[-1] in ('snb', 'ivb') else None
    os_name = tail.pop() if tail and tail[-1] in ('redhat', 'centos', 'win') else None

    ret = {'name': name, 'version': groups[0]}
    if len(groups) > 1:
        ret['extra'] = groups[1]
    if os_name:
        ret['os'] = os_name
    if platform:
        ret['platform'] = platform
    return ret

tests = [
    'foo-1.2.3',
    'foo-2.3.4-alpha',
    'foo-3.4.5-rc.2',
    'foo-4.5.6-rc.2.extra',
    'withos-5.6.7.centos',
    'osandextra-7.8.9-rc.extra.redhat',
    'all-4.4.4-rc.1.extra.centos.ivb',
    'issue-0.1.0-beta.redhat.snb',
]

for test in tests:
    print(test, extract(test))

Result (Python 3.7+, where dicts keep insertion order):

foo-1.2.3 {'name': 'foo', 'version': ['1', '2', '3']}
foo-2.3.4-alpha {'name': 'foo', 'version': ['2', '3', '4'], 'extra': ['alpha']}
foo-3.4.5-rc.2 {'name': 'foo', 'version': ['3', '4', '5'], 'extra': ['rc', '2']}
foo-4.5.6-rc.2.extra {'name': 'foo', 'version': ['4', '5', '6'], 'extra': ['rc', '2', 'extra']}
withos-5.6.7.centos {'name': 'withos', 'version': ['5', '6', '7'], 'os': 'centos'}
osandextra-7.8.9-rc.extra.redhat {'name': 'osandextra', 'version': ['7', '8', '9'], 'extra': ['rc', 'extra'], 'os': 'redhat'}
all-4.4.4-rc.1.extra.centos.ivb {'name': 'all', 'version': ['4', '4', '4'], 'extra': ['rc', '1', 'extra'], 'os': 'centos', 'platform': 'ivb'}
issue-0.1.0-beta.redhat.snb {'name': 'issue', 'version': ['0', '1', '0'], 'extra': ['beta'], 'os': 'redhat', 'platform': 'snb'}

1 Comment

Thanks, this does look much cleaner, but it gets messy quickly once you add more complexity, e.g. extending it to allow more flexible formats such as foo_1.2.3, foo1.2.3, foo.1.2.3, or even a missing bugfix number: foo-1.2 (= foo-1.2.0). Keep doing this for every token and you end up with a massive piece of code that is even harder to debug than a regex.
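
As a rough illustration of the kind of flexibility mentioned in this comment, here is a sketch that covers only the product and numeric parts. The names FLEX_RE and parse_flexible are hypothetical, and it assumes the product name consists of letters only, so that the optional separator and the first digit are not absorbed into it.

```python
import re

# Assumption: product names are letters only; the separator is optional
# and may be '-', '_' or '.'; the bugfix number may be omitted.
FLEX_RE = re.compile(
    r'^(?P<prod>[A-Za-z]+)[-_.]?'
    r'(?P<maj>\d+)\.(?P<min>\d+)(?:\.(?P<bug>\d+))?$'
)


def parse_flexible(text):
    """Return the parsed fields, treating a missing bugfix number as '0'."""
    match = FLEX_RE.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields['bug'] = fields['bug'] or '0'
    return fields


for sample in ('foo_1.2.3', 'foo1.2.3', 'foo.1.2.3', 'foo-1.2'):
    print(sample, parse_flexible(sample))
```

The pre-release, os and platform groups would still have to be appended to this pattern, so it is a starting point rather than a drop-in replacement.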
