Extract a string and a number from a string that can be in multiple formats using regex in Python?

Question

I am trying to use regex to parse a string that follows a particular format and extract details from it. The string can be in one of two formats:

First format

The first format is foldername-version.tgz. Here foldername can be any string in any format; it can contain one or more - characters or anything else.

For example:

hello-1234.tgz: This should give me FolderName as hello and Version as 1234
world-12345.tgz: This should give me FolderName as world and Version as 12345
hello-21234-12345.tgz: This should give me FolderName as hello-21234 and Version as 12345
hello-21234-a-12345.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Second format

The second format is foldername-version-environment.tgz. Here too, foldername can be any string in any format. The environment string can only be dev, stage, or prod and nothing else, so I need to check for that as well.

For example:

hello-1234-dev.tgz: This should give me FolderName as hello and Version as 1234
world-12345-stage.tgz: This should give me FolderName as world and Version as 12345
hello-21234-12345-prod.tgz: This should give me FolderName as hello-21234 and Version as 12345
hello-21234-a-12345-prod.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Problem Statement

Given these two formats, I need to extract FolderName and Version from my string. I tried the regex below, but it doesn't work on strings in the second format, and I want my code to work on both formats.

import re

# sample string, which can be in the first or second format
exampleString = "hello-21234-12345-prod.tgz"
build_found = re.search(r'[\d.-]+.tgz', exampleString)
version = build_found.group().replace(".tgz", "")
folderName = exampleString.split(version)[0]

What am I doing wrong here?

Is version always an integer?
– Matt, Commented Sep 2, 2020 at 22:29

Yes, version is always an integer.
– David Todd, Commented Sep 2, 2020 at 22:49

Tim Biegeleisen · Accepted Answer · 2020-09-02 22:37:13Z

1

I would use:

import re

inp = "some text hello-21234-a-12345.tgz some more text"
parts = re.findall(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b', inp)
print("FolderName: " + parts[0][0])
print("Version: " + parts[0][1])

This prints:

FolderName: hello-21234-a
Version: 12345
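
As a quick check, the same pattern can be tried on a name in the second format. The sketch below wraps it in a hypothetical helper, first_match (not part of the answer), and uses made-up sample names. It suggests the pattern handles an environment suffix, but also that it accepts any trailing segment, so a separate check would be needed if only dev, stage, and prod should be allowed.

```python
import re

# Pattern copied from the answer above; the sample names are illustrative.
PATTERN = re.compile(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b')

def first_match(text):
    """Hypothetical helper: (folder_name, version) of the first match, or None."""
    m = PATTERN.search(text)
    return m.groups() if m else None

print(first_match("hello-21234-12345-prod.tgz"))  # ('hello-21234', '12345')
print(first_match("hello-1234-qa.tgz"))           # ('hello', '1234'): "qa" is not rejected
```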

tzaman · Accepted Answer · 2020-09-02 22:39:17Z

0

Use groups to specify the different sections of the pattern. You can name them for easier extraction later, too:

import re

pattern = re.compile(r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz")

ex = "hello-21234-12345-prod.tgz"
m = pattern.match(ex)
print(m.groups())
# ('hello-21234', '12345', 'prod')
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 12345 prod

ex2 = "hello-21234-1234.tgz" # No environment
m = pattern.match(ex2)
print(m.groups())
# ('hello-21234', '1234', None)
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 1234 None
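
One possible way to package this is a small helper that returns None when a name fits neither format. This is only a sketch: parse_archive_name is a hypothetical name, and using fullmatch (so nothing may follow .tgz) and converting the version to int are assumptions on top of the answer, the latter based on the asker's comment that the version is always an integer.

```python
import re

# Same named-group pattern as in the answer above.
PATTERN = re.compile(
    r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz"
)

def parse_archive_name(name):
    """Hypothetical helper: (folder, version, env), or None if the name fits neither format."""
    m = PATTERN.fullmatch(name)  # assumption: the whole string must be the file name
    if m is None:
        return None
    return m.group("FolderName"), int(m.group("Version")), m.group("Env")

for name in ["hello-1234.tgz", "hello-21234-a-12345.tgz",
             "hello-21234-a-12345-prod.tgz", "hello-1234-qa.tgz"]:
    print(name, "->", parse_archive_name(name))
```

With this pattern an unknown environment such as qa yields None rather than a partial match.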

RichieV · Accepted Answer · 2020-09-02 22:39:55Z

0

See if this pattern works:

import re
exampleString = 'hello-21234-12345-prod.tgz'
# the environment part is optional, so names in the first format match too
build_found = re.search(r'([\w-]+)-(\d+)(?:-(dev|stage|prod))?\.tgz$', exampleString)

folder_name = build_found[1]
version = build_found[2]
environment = build_found[3]

print(folder_name)
print(version)
print(environment)

Output

hello-21234
12345
prod

Matt · Accepted Answer · 2020-09-02 22:41:22Z

0

Surely not the best approach, but here's one idea.

Start by determining whether you have the first or second case.

-(dev|stage|prod)\.tgz$

This regex will determine whether or not you have case 1 or 2.

If it's case 1, you can extract the foldername with:

.*-

And you can extract the version with:

-\d+\.tgz$

If it's case 2, you can extract the combined foldername/versionnumber with:

.*-

From there, you can extract the foldername with (again):

.*-

And the version number with:

-\d+
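
The two-step idea above could be sketched roughly as follows. This is not the answer's exact procedure: the fragments above also match the neighbouring hyphen, so this sketch uses capture groups and slicing to leave the hyphens out. The helper name split_name and the three compiled patterns are assumptions made for illustration.

```python
import re

ENV_SUFFIX = re.compile(r'-(dev|stage|prod)\.tgz$')  # marks case 2
PLAIN_SUFFIX = re.compile(r'\.tgz$')                 # marks case 1
VERSION_TAIL = re.compile(r'-(\d+)$')                # trailing "-<digits>"

def split_name(name):
    """Hypothetical helper: return (folder_name, version) or None."""
    # Step 1: strip the environment suffix (case 2) or the bare extension (case 1).
    stem, n = ENV_SUFFIX.subn('', name)
    if n == 0:
        stem, n = PLAIN_SUFFIX.subn('', name)
        if n == 0:
            return None
    # Step 2: the version is the run of digits after the last hyphen of the stem.
    m = VERSION_TAIL.search(stem)
    if m is None:
        return None
    return stem[:m.start()], m.group(1)

print(split_name('hello-21234-12345-prod.tgz'))  # ('hello-21234', '12345')
print(split_name('hello-1234.tgz'))              # ('hello', '1234')
```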

Drew Shafer · Accepted Answer · 2020-09-02 22:46:08Z

You need to use a regular expression that captures the components you're looking for within the string, then use .groups() to extract the captures. This worked in my testing:

re.search(r'^(.+)-(\d+)\D*$', exampleString)

Example in IPython:

In [1]: import re

In [2]: s1 = 'hello-21234-12345-prod.tgz'

In [3]: s2 = 'hello-1234.tgz'

In [4]: re.search(r'^(.+)-(\d+)\D*$', s1).groups()
Out[4]: ('hello-21234', '12345')

In [5]: re.search(r'^(.+)-(\d+)\D*$', s2).groups()
Out[5]: ('hello', '1234')

The trick is the capture groups ((...)) within the regular expression r'^(.+)-(\d+)\D*$'. There are two groups - it's actually easier to decode it by looking at the second capture group first, then the first.

The second part of the regex - r'(\d+)\D*$' matches the final series of \d digits. You know it is the final series of digits, because the \D*$ part will match and swallow up all non-digit characters up to the end of the string.

The first part of the regex - r'^(.+)-' matches everything before the second part. It captures everything up to, but not including, the "-" that precedes the version, and gives you the FolderName.

Note that you'll need something a bit more complex if you have any digit characters in your environment or in the file ending (such as if you're using bzip2 compression).
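
To illustrate that caveat, the sketch below compares the pattern above (called LOOSE here) with an assumed stricter variant (STRICT) that spells out the allowed environments and the .tgz extension. Both names and the sample file names are made up for this example, and STRICT is just one way the pattern might be tightened.

```python
import re

LOOSE = re.compile(r'^(.+)-(\d+)\D*$')
# Assumed stricter variant: explicit environments and a literal ".tgz" ending.
STRICT = re.compile(r'^(.+)-(\d+)(?:-(?:dev|stage|prod))?\.tgz$')

# A digit in the file ending stops \D*$ from reaching the end of the string.
print(LOOSE.search('hello-1234.tar.bz2'))                    # None

# The loose pattern accepts any non-digit tail, including an unknown environment.
print(LOOSE.search('hello-1234-qa.tgz').groups())            # ('hello', '1234')
print(STRICT.search('hello-1234-qa.tgz'))                    # None
print(STRICT.search('hello-21234-12345-prod.tgz').groups())  # ('hello-21234', '12345')
```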
