2

I have the following string, which I am parsing from another file: "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)". I want to split it into individual values, which I do with the .split() function, so I get them as a list:

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

Now I want to split each of them further into 3 segments and store those in 3 other variables, so I can write them to Excel like this:

a = CHEM1
b = 5
c = GL

for the first element, then I will loop back for the second element:

a = CH3M2
b = 55
c = LB

and finally:

a = CHEM3954114
b = 50
c = KG

I am unsure how to go about this, as I am still new to Python. To the best of my knowledge, I could iterate multiple times with the split function, but I believe there has to be a better way to do it than that.

Thank you.

4 Answers

4

You should use the re package:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")

for x1 in x:
    m = pattern.search(x1)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)

FOLLOW UP:

The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case. Essentially, there are 3 groups of characters you want to extract:

  1. All the characters (letters and numbers) up to the ( - not included
  2. The digits after the (
  3. The letters after the digits extracted in the previous step - up to the ) - not included.

A group is anything enclosed in brackets (). In this specific case it can become confusing because, as stressed above, brackets are also part of the string you are matching. Those literal brackets need to be escaped with a \ to distinguish them from the ones that define groups in the regular expression.

  • The first group is ([^\(]+), which means: match one or more characters that are not ( (inside the square brackets, ^ is the negation, and the bracket ( is escaped for the reasons described above). Note that these characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more specific if you need to (for example, allowing only letters, digits and underscores with \w+).
  • The second group is (\d+), which matches one or more (expressed with +) digits (expressed with \d).
  • The last group is (.+), which matches one or more of any character; the final \) then makes sure this group ends just before the closing bracket.
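
The breakdown above can be sketched in code. This is the same pattern, only spelled out with re.VERBOSE so that each piece carries its own comment; the variable names item and parsed are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that
# whitespace is ignored and each piece can be commented.
pattern = re.compile(r"""
    ([^\(]+)   # group 1: one or more characters that are not an opening bracket
    \(         # a literal opening bracket (escaped)
    (\d+)      # group 2: one or more digits
    (.+)       # group 3: the remaining characters ...
    \)         # ... up to the literal closing bracket (escaped)
    """, re.VERBOSE)

parsed = []  # illustrative container for the extracted values
for item in ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']:
    m = pattern.search(item)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
        parsed.append((a, b, c))
        print(a, b, c)
```

Each tuple in parsed holds the name, the quantity as an int, and the unit, in that order.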

1 Comment

You're amazing, man! Thank you so much, it worked perfectly. However, I am at a loss as to what this function does. Is there an explanation of how it works, just so I can use it again in the future?
4

Using re.findall we can try:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
for inp in x:
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)

# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
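
Because re.findall returns every match it finds, the same pattern can probably be applied to the original, unsplit string as well. A minimal sketch, assuming the entries are separated by whitespace as in the question (the names raw and rows are illustrative):

```python
import re

# The unsplit string from the question
raw = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# findall yields one (name, digits, unit) tuple per match;
# the digits are converted to int here for convenience.
rows = [(name, int(qty), unit)
        for name, qty, unit in re.findall(r'(\w+)\((\d+)(\w+)\)', raw)]

for a, b, c in rows:
    print(a, b, c)
```

This skips the separate .split() step, at the cost of silently ignoring any entry that does not fit the pattern.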

3 Comments

This is a great answer as well, thank you very much. However, nikeros's answer fit perfectly with what I was asking. I just can't seem to make sense of how to use the function/package. Could you maybe provide some documentation or a guide?
There are some good tutorial sites on the net for Python's re package. Also, read some of the canonical questions and answers on Stack Overflow. Both good places to start. FWIW if you had more than one input to handle at once, my version is probably what you'd want.
Thank you very much once again, I have upvoted the comment and bookmarked it for future reference, I am very grateful, thank you.
1

Considering the elements you provided in your question, I assume that '(' cannot appear more than once in an element.

Here is the function I wrote.

def deconstruct(chem):
    name = chem[:chem.index('(')]
    qty = chem[chem.index('(') + 1:-1]
    mag, unit = "", ""
    for char in qty:
        if char.isalpha():
            unit += char
        else:
            mag += char
    # Use int(mag) instead of float(mag) if the magnitude is always a whole number.
    return {"name": name, "mag": float(mag), "unit": unit}

Usage:

x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']

for chem in x:
    d = deconstruct(chem)
    print(d["name"], d["mag"], d["unit"])
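
The function above relies on the input being well formed. As a hedged variation (not the answerer's code), the same idea can be sketched with str.partition, raising an explicit error when the brackets are missing; split_chem is a hypothetical name chosen for this illustration.

```python
def split_chem(chem):
    """Hypothetical variant: split 'NAME(QTYUNIT)' without regular expressions."""
    name, sep, rest = chem.partition('(')
    # sep is empty when no '(' was found; rest must end with ')'
    if not sep or not rest.endswith(')'):
        raise ValueError(f"unexpected format: {chem!r}")
    qty = rest[:-1]
    # Index of the first alphabetic character marks where the unit starts
    idx = next((i for i, ch in enumerate(qty) if ch.isalpha()), len(qty))
    return name, float(qty[:idx]), qty[idx:]

print(split_chem('CHEM1(5.4GL)'))
```

Like the original, this converts the magnitude with float, so it also accepts decimal quantities such as 5.4.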

0

Use re and create a list of dictionaries:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
keys =['a', 'b', 'c']
y = []
for s in x:
    vals = re.sub(r'(.*?)\((\d*)(.*?)\)', r'\1 \2 \3', s).split()
    y.append(dict(zip(keys, vals)))

for i in y:
    print("a: %s\nb: %s\nc: %s\n" % (i['a'], i['b'], i['c']))

gives

a: CHEM1
b: 5
c: GL

a: CH3M2
b: 55
c: LB

a: CHEM3954114
b: 50
c: KG
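
As an alternative sketch that is not part of the original answer, named groups combined with groupdict() can build the same dictionaries directly, without the intermediate re.sub and split() step. The group names a, b and c are chosen here only to mirror the keys used above.

```python
import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']

# Named groups a, b, c mirror the dictionary keys used above.
pattern = re.compile(r'(?P<a>.*?)\((?P<b>\d*)(?P<c>.*?)\)')

y = []
for s in x:
    m = pattern.fullmatch(s)
    if m:
        # groupdict() returns {'a': ..., 'b': ..., 'c': ...} with string values
        y.append(m.groupdict())

for i in y:
    print("a: %s\nb: %s\nc: %s\n" % (i['a'], i['b'], i['c']))
```

One possible advantage is that a name containing a space would stay in one piece, whereas splitting the substituted string on whitespace would break it apart.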
