1

I need to extract 3 different details out of 1 string.

The pattern is:

  1. "C" followed by 3 digits.
  2. Any letters and a digit; it is always one or two letters followed by a single digit.
  3. "S" followed by digits, which can include special characters like "-" and "_".
  4. The last "_" separates an iterator, which can be discarded.
  5. Sometimes the second or third element is missing.

Examples:

Input                   |      Expected output
---------------------------------------------------
C001F1S15_08            =>     ['C001','F1','S15']
C312PH2S1-06_5-0_12     =>     ['C312','PH2','S1-06_5-0']
C023_05                 =>     ['C023']
C002M5_02               =>     ['C002','M5']

How can this be done?

All the best

3
  • What have you tried so far? Commented May 5, 2022 at 6:35
  • Your second test case doesn't match the rules you describe - it has two characters after the C-[0-9]{3} group. Commented May 5, 2022 at 6:39
  • @kalatabe, you are right, sometimes there are two characters. Sorry for not being clear on this beforehand. Commented May 5, 2022 at 19:20

5 Answers

2

Try this:

(C\d{3})([A-RT-Z\d]+)?(S[\d\-_]+)?(?:_\d+)

Result: https://regex101.com/r/FETn0U/1


1

You can extract values like this (using Avinash's regex)

import re

regex = re.compile(r"(C\d{3})([A-RT-Z\d]+)?(S[\d\-_]+)?(?:_\d+)")
text = "C001F1S15_08"
match = regex.match(text)
print(match.group(1))   # C001
print(match.group(2))   # F1
print(match.group(3))   # S15
print(match.groups())   # ('C001', 'F1', 'S15')
print(list(match.groups()[:3])) # ['C001', 'F1', 'S15']

See here for more information. Keep in mind that .group(0) refers to the entire match, in this case the input string.
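Because the second and third groups are optional, match.groups() contains None for inputs such as C023_05. A minimal sketch that drops those entries, assuming the regex above; the helper name extract_parts and the choice to return an empty list on no match are hypothetical and not part of the original answer:

```python
import re

# Same pattern as above; the trailing non-capturing group consumes the iterator.
PATTERN = re.compile(r"(C\d{3})([A-RT-Z\d]+)?(S[\d\-_]+)?(?:_\d+)")

def extract_parts(text):
    """Return the captured parts as a list, skipping optional groups that did not match."""
    match = PATTERN.match(text)
    if match is None:
        return []  # assumption: an empty list signals "input does not fit the pattern"
    return [part for part in match.groups() if part is not None]

for sample in ["C001F1S15_08", "C312PH2S1-06_5-0_12", "C023_05", "C002M5_02"]:
    print(sample, "=>", extract_parts(sample))
```

Filtering on None rather than slicing keeps the result the same length as the number of elements actually present in the input.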


1
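import re

lines = ["C001F1S15_08",
         "C312PH2S1-06_5-0_12",
         "C023_05",
         "C002M5_02"]

for line in lines:
    # Drop the iterator after the last "_".
    parts = line.split("_")
    if len(parts) > 1:
        parts = parts[:-1]
    line = "_".join(parts)
    print(line)

    # Alternatives are tried in order: "C" + 3 digits, the "S" part, then letters followed by digits.
    print(re.findall(r"C\d{3}|S[A-Za-z0-9_@./#&+-]+|[A-Za-z]+\d+", line))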

1 Comment

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
0

The following pattern will do what you want. We discard the last group.

^(C\d{3})([A-Z]+\d)?([-a-zA-Z\d]+_[\d-]+)?(_\w+)?

See https://regex101.com/r/CKasXZ/2
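A minimal sketch of applying this pattern from Python, assuming re.match and keeping only the first three groups; the helper name split_code is hypothetical. Traced by hand, it reproduces the expected output for the last three sample inputs, while for C001F1S15_08 the third group also captures the trailing _08, so that case may need extra handling:

```python
import re

# Pattern from the answer above; the fourth group holds the part to discard.
PATTERN = re.compile(r"^(C\d{3})([A-Z]+\d)?([-a-zA-Z\d]+_[\d-]+)?(_\w+)?")

def split_code(text):
    """Return the first three groups that matched, ignoring the discarded fourth group."""
    match = PATTERN.match(text)
    if match is None:
        return []  # assumption: an empty list signals "no match"
    return [group for group in match.groups()[:3] if group is not None]

for sample in ["C001F1S15_08", "C312PH2S1-06_5-0_12", "C023_05", "C002M5_02"]:
    print(sample, "=>", split_code(sample))
```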


0
import re

text = "C312PH2S1-06_5-0_12"  # example input
result = []
text = text.rsplit('_', 1)[0]  # Remove everything after the last '_'.
result.append(text[0:4])  # The first part is always the first 4 characters.
# The regex matches each group of 1 or 2 letters followed by digits, '-' and '_'.
for i in re.findall(r'[A-Za-z]{1,2}[0-9_-]+', text[4:]):
    result.append(i)  # Append each match individually to avoid a list inside a list.
print(result)

I guess the loop can be simplified, but I don't know how.

