Extracting multiple substrings from one string

Question

I have the following string, which I am parsing from another file: "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)". I want to split it into individual values, which I do with the .split() function, so I get them as a list:

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

Now I want to split each element further into 3 segments and store them in 3 other variables, so I can write them to Excel like this:

a = CHEM1
b = 5
c = GL

for the first element, then loop again for the second element:

a = CH3M2
b = 55
c = LB

and finally:

a = CHEM3954114
b = 50
c = KG

I am unsure how to go about that, as I am still new to Python. To the best of my knowledge, I would have to call the split function several times, but I believe there has to be a better way to do it than that.

Thank you.

nikeros · Accepted Answer · 2022-01-17 07:12:18Z

4

You should use the re package:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")

for x1 in x:
    m = pattern.search(x1)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
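The loop above rebinds a, b and c on every iteration, so the values have to be used or stored inside the loop. A minimal sketch of one way to keep them, assuming CSV output is acceptable for Excel (the question does not say how the file is written); the column names and the in-memory buffer are illustrative assumptions:

```python
import csv
import io
import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")

rows = []
for x1 in x:
    m = pattern.search(x1)
    if m:
        # Same three groups as above; the quantity is converted to int.
        rows.append((m.group(1), int(m.group(2)), m.group(3)))

# Written to an in-memory buffer here; pass an open file object to save to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "quantity", "unit"])  # hypothetical column names
writer.writerows(rows)
print(buffer.getvalue())
```

Entries that do not match the pattern are skipped silently here; whether that is the right behaviour depends on the input file.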

FOLLOW UP:

The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case. Essentially, there are 3 groups of characters you want to extract:

1. All the characters (letters and numbers) up to the ( - not included
2. The digits after the (
3. The letters after the digits extracted in the previous step - up to the ) - not included.

A group is anything enclosed in brackets (). In this specific case it may become confusing because, as stressed above, brackets are also part of the string you are matching. Those literal brackets need to be escaped with a \ to distinguish them from the ones that delimit groups in the regular expression.

1. The first group is ([^\(]+), which essentially means: match one or more characters which are not ( (the ^ is the negation, and the bracket ( is escaped here, for the reasons described above). Note that these characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more specific if you need to (for example, allowing only letters, digits and underscores with [\w]+).
2. The second group is (\d+), which matches 1 or more (expressed with +) digits (expressed with \d).
3. The last group is (.+), which matches any remaining characters; the final \) makes sure that the match stops at the closing bracket.
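The breakdown above can be sketched directly in code by writing the same pattern with re.VERBOSE, which lets each part carry its own comment. The second pattern and the sample string containing a hyphen are illustrative assumptions, added only to show the difference between the permissive first group and the \w alternative:

```python
import re

# Same pattern as in the answer, spelled out one piece per line.
pattern = re.compile(
    r"""
    ([^\(]+)   # group 1: one or more characters that are not an opening bracket
    \(         # a literal opening bracket (escaped, so it does not start a group)
    (\d+)      # group 2: one or more digits
    (.+)       # group 3: the remaining characters ...
    \)         # ... up to the literal closing bracket
    """,
    re.VERBOSE,
)

# Stricter variant: only letters, digits and underscores in groups 1 and 3.
strict = re.compile(r"(\w+)\((\d+)(\w+)\)")

sample = "CH-3M(5GL)"  # hypothetical name containing a hyphen
print(pattern.fullmatch(sample).groups())  # ('CH-3M', '5', 'GL')
print(strict.fullmatch(sample))            # None
```

fullmatch is used in this sketch so that an entry is accepted only when the whole string has the expected shape.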

edited Jan 17, 2022 at 7:12

answered Jan 17, 2022 at 6:30

nikeros


1 Comment

Ahmed Over a year ago

You're amazing, man! Thank you so much, it worked perfectly. However, I am at a loss as to what this function does. Is there an explanation of how it works, just so I can use it again in the future?

Tim Biegeleisen · Answer · 2022-01-17 06:33:03Z

4

Using re.findall we can try:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
for inp in x:
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)

# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
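Because re.findall returns every non-overlapping match, the same pattern can also be applied to the original, unsplit string from the question, which makes the .split() step unnecessary. A minimal sketch, with the int conversion added as an assumption about how the quantity will be used:

```python
import re

line = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# One tuple of three captured groups per entry found in the string.
matches = re.findall(r'(\w+)\((\d+)(\w+)\)', line)

for a, b, c in matches:
    print(a, int(b), c)
```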

answered Jan 17, 2022 at 6:33

Tim Biegeleisen


3 Comments

Ahmed Over a year ago

This is a great answer as well, thank you very much, however, nikeros's answer fit perfectly with what I was asking. I just can't seem to make sense of how to use the function/package. Could you maybe provide some documentation or a guide?

Tim Biegeleisen Over a year ago

There are some good tutorial sites on the net for Python's re package. Also, read some of the canonical questions and answers on Stack Overflow. Both good places to start. FWIW if you had more than one input to handle at once, my version is probably what you'd want.

Ahmed Over a year ago

Thank you very much once again, I have upvoted the comment and bookmarked it for future reference, I am very grateful, thank you.

Circuit Planet · Answer · 2022-01-17 07:02:14Z

1

Considering the elements you provided in your question, I assume that '(' cannot appear more than once in an element.

Here is the function I wrote.

def deconstruct(chem):
    # Everything before the first '(' is the name.
    name = chem[:chem.index('(')]
    # Everything between '(' and the trailing ')' is the quantity, e.g. '5.4GL'.
    qty = chem[chem.index('(') + 1:-1]
    mag, unit = "", ""
    for char in qty:
        if char.isalpha():
            unit += char
        else:
            mag += char
    # Use int(mag) instead of float(mag) if the magnitude is always a whole number.
    return {"name": name, "mag": float(mag), "unit": unit}

Usage:

x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']

for chem in x:
    d = deconstruct(chem)
    print(d["name"], d["mag"], d["unit"])
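The function above raises a bare ValueError from chem.index when an element has no '(', and it assumes the element ends with ')'. A hedged variation that checks the shape explicitly is sketched below. The name split_entry and the error message are hypothetical, and unlike the original it splits at the first letter instead of sorting characters one by one:

```python
def split_entry(chem):  # hypothetical helper name
    # partition returns ('', '', '') style results instead of raising.
    name, sep, rest = chem.partition("(")
    if not sep or not rest.endswith(")"):
        raise ValueError(f"expected NAME(QTYUNIT), got {chem!r}")
    qty = rest[:-1]  # drop the trailing ')'
    # Index of the first alphabetic character marks the start of the unit.
    split_at = next((i for i, ch in enumerate(qty) if ch.isalpha()), len(qty))
    return {"name": name, "mag": float(qty[:split_at]), "unit": qty[split_at:]}


for chem in ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']:
    d = split_entry(chem)
    print(d["name"], d["mag"], d["unit"])
```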

answered Jan 17, 2022 at 7:02

Circuit Planet


Vladimir Botka · Answer · 2022-01-17 07:03:49Z

0

Use re and create a list of dictionaries

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
keys = ['a', 'b', 'c']
y = []
for s in x:
    vals = re.sub(r'(.*?)\((\d*)(.*?)\)', r'\1 \2 \3', s).split()
    y.append(dict(zip(keys, vals)))

for i in y:
    print("a: %s\nb: %s\nc: %s\n" % (i['a'], i['b'], i['c']))

gives

a: CHEM1
b: 5
c: GL

a: CH3M2
b: 55
c: LB

a: CHEM3954114
b: 50
c: KG
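One thing to be aware of: re.sub returns the string unchanged when the pattern does not match, so a malformed element would produce a dictionary with missing keys. A minimal sketch of a variation that builds the same dictionaries straight from the match groups and skips elements that do not match; using fullmatch and skipping are assumptions about the desired behaviour:

```python
import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
keys = ['a', 'b', 'c']
pattern = re.compile(r'(.*?)\((\d*)(.*?)\)')

y = []
for s in x:
    m = pattern.fullmatch(s)
    if m:  # skip elements that do not have the expected shape
        y.append(dict(zip(keys, m.groups())))

print(y)
```

As in the original answer, the values stay strings; wrap the second one in int() if a number is needed.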

answered Jan 17, 2022 at 7:03

Vladimir Botka
