1

I am trying to parse a string with a regex to extract details from it. The string follows a particular format and can come in one of two forms:

First format

The first form is foldername-version.tgz. Here foldername can be any string in any format; it can contain one or more - characters or anything else.

For example:

  • hello-1234.tgz: This should give me FolderName as hello and Version as 1234
  • world-12345.tgz: This should give me FolderName as world and Version as 12345
  • hello-21234-12345.tgz: This should give me FolderName as hello-21234 and Version as 12345
  • hello-21234-a-12345.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Second format

The second form is foldername-version-environment.tgz. Here too, foldername can be any string in any format. The environment string can only be dev, stage, or prod and nothing else, so I need to check for that as well.

For example:

  • hello-1234-dev.tgz: This should give me FolderName as hello and Version as 1234
  • world-12345-stage.tgz: This should give me FolderName as world and Version as 12345
  • hello-21234-12345-prod.tgz: This should give me FolderName as hello-21234 and Version as 12345
  • hello-21234-a-12345-prod.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Problem Statement

Given the two formats above, I need to extract FolderName and Version from my string. I tried the regex below, but it doesn't work on strings in the second format, and I want my code to work on both formats.

import re

# sample string, which can be in the first or second format
exampleString = "hello-21234-12345-prod.tgz"
build_found = re.search(r'[\d.-]+.tgz', exampleString)
version = build_found.group().replace(".tgz", "")
folderName = exampleString.split(version)[0]

What am I doing wrong here?

2
  • Is version always an integer? Commented Sep 2, 2020 at 22:29
  • Yes, version is always an integer. Commented Sep 2, 2020 at 22:49

5 Answers

1

I would use:

import re

inp = "some text hello-21234-a-12345.tgz some more text"
parts = re.findall(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b', inp)
print("FolderName: " + parts[0][0])
print("Version: " + parts[0][1])

This prints:

FolderName: hello-21234-a
Version: 12345

0

Use groups to specify the different sections of the pattern. You can name them for easier extraction later, too:

import re

pattern = re.compile(r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz")

ex = "hello-21234-12345-prod.tgz"
m = pattern.match(ex)
print(m.groups())
# ('hello-21234', '12345', 'prod')
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 12345 prod

ex2 = "hello-21234-1234.tgz" # No environment
m = pattern.match(ex2)
print(m.groups())
# ('hello-21234', '1234', None)
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 1234 None
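
As a rough check, the same pattern can be run against every example from the question. This is only a sketch: the helper name parse_name and the use of fullmatch (so that trailing text is rejected) are my own choices, not something the question asks for.

```python
import re

# Same pattern as above; fullmatch() requires the whole string to fit.
PATTERN = re.compile(
    r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz"
)


def parse_name(filename):
    """Return (folder_name, version, env), or None if the name does not fit.

    parse_name is a hypothetical helper introduced for illustration.
    """
    m = PATTERN.fullmatch(filename)
    if m is None:
        return None
    return m.group("FolderName"), m.group("Version"), m.group("Env")


samples = [
    "hello-1234.tgz",
    "world-12345.tgz",
    "hello-21234-12345.tgz",
    "hello-21234-a-12345.tgz",
    "hello-1234-dev.tgz",
    "world-12345-stage.tgz",
    "hello-21234-12345-prod.tgz",
    "hello-21234-a-12345-prod.tgz",
    "hello-1234-qa.tgz",  # unknown environment, expected to be rejected
]

for name in samples:
    print(name, "->", parse_name(name))
```

Because .+ is greedy, the folder name absorbs as many hyphenated parts as it can, and only the last run of digits before the optional environment ends up as the version.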


0

See if this pattern works

import re
exampleString = 'hello-21234-12345-prod.tgz'
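# The "-environment" suffix is optional as a whole, so names without an environment also match
build_found = re.search(r'([\w-]+)-(\d+)(?:-(dev|stage|prod))?\.tgz$', exampleString)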

folder_name = build_found[1]
version = build_found[2]
environment = build_found[3]

print(folder_name)
print(version)
print(environment)

Output

hello-21234
12345
prod


0

Surely not the best approach, but here's one idea.

Start by determining whether you have the first or second case.

-(dev|stage|prod)\.tgz$

If this regex matches, you have case 2; otherwise you have case 1.

If it's case 1, you can extract the foldername with:

.*-

And you can extract the version with:

-\d+\.tgz$

If it's case 2, you can extract the combined foldername/versionnumber with:

.*-

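From there, after dropping the trailing -, you can extract the foldername with (again):

.*-

And the version number, anchored at the end so that digits inside the foldername are skipped, with:

-\d+$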
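
Put together, the two-step idea could look roughly like this. It is a sketch rather than a polished solution: split_name, ENV_SUFFIX and VERSION_SUFFIX are names I made up, and I strip the suffix first instead of matching .*- so that no trailing hyphen is left on the folder name.

```python
import re

# Step 1 pattern: does the name end in a known environment plus ".tgz"?
ENV_SUFFIX = re.compile(r"-(dev|stage|prod)\.tgz$")
# Step 2 pattern: the version is the final "-<digits>" of what remains.
VERSION_SUFFIX = re.compile(r"-(\d+)$")


def split_name(filename):
    """Return (folder_name, version), or None if the name does not fit.

    split_name is a hypothetical helper introduced for illustration.
    """
    if ENV_SUFFIX.search(filename):
        # Case 2: drop "-<environment>.tgz"
        stem = ENV_SUFFIX.sub("", filename)
    elif filename.endswith(".tgz"):
        # Case 1: drop only ".tgz"
        stem = filename[: -len(".tgz")]
    else:
        return None

    m = VERSION_SUFFIX.search(stem)
    if m is None:
        return None
    # Everything before the final "-<digits>" is the folder name.
    return stem[: m.start()], m.group(1)


print(split_name("hello-1234.tgz"))
print(split_name("hello-21234-12345-prod.tgz"))
```

A name with an unknown environment, such as hello-1234-qa.tgz, falls into case 1 and is then rejected because what remains does not end in digits.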


0

You need to use a regular expression that captures the components you're looking for within the string, then use .groups() to extract the captures. This worked in my testing:

re.search(r'^(.+)-(\d+)\D*$', exampleString)

example in ipython:

In [1]: import re

In [2]: s1 = 'hello-21234-12345-prod.tgz'

In [3]: s2 = 'hello-1234.tgz'

In [4]: re.search(r'^(.+)-(\d+)\D*$', s1).groups()
Out[4]: ('hello-21234', '12345')

In [5]: re.search(r'^(.+)-(\d+)\D*$', s2).groups()
Out[5]: ('hello', '1234')

The trick is the capture groups ((...)) within the regular expression r'^(.+)-(\d+)\D*$'. There are two groups - it's actually easier to decode it by looking at the second capture group first, then the first.

The second part of the regex - r'(\d+)\D*$' matches the final series of \d digits. You know it is the final series of digits, because the \D*$ part will match and swallow up all non-digit characters up to the end of the string.

The first part of the regex - r'^(.+)-' matches everything before the second part. It captures everything up to, but not including, the "-" character that precedes the version, and gives you the FolderName.

Note that you'll need something a bit more complex if you have any digit characters in your environment or in the file ending (such as if you're using bzip2 compression).
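
One possible shape for that more complex version is sketched below. It assumes the environment is limited to dev, stage or prod (as the question states) and that the file ending is one or more dot-separated parts such as .tgz or .tar.bz2; the name STRICT and the .tar.bz2 sample names are mine, purely for illustration.

```python
import re

# The pattern from this answer: fails once digits appear after the version.
LOOSE = re.compile(r"^(.+)-(\d+)\D*$")

# Hypothetical stricter variant: spell out the optional environment and
# allow a dotted extension that may itself contain digits.
STRICT = re.compile(r"^(.+)-(\d+)(?:-(?:dev|stage|prod))?(?:\.\w+)+$")

# No match: the "2" in "bz2" is a digit, so \D* cannot reach the end.
print(LOOSE.search("hello-1234.tar.bz2"))

for name in [
    "hello-1234.tar.bz2",
    "hello-1234-dev.tar.bz2",
    "hello-21234-12345-prod.tgz",
]:
    print(name, "->", STRICT.search(name).groups())
```

The stricter pattern is less forgiving: a name with an unexpected environment no longer matches at all, which may or may not be what you want.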
