7

I'm trying to create a function (in Python) that takes its input (a chemical formula) and splits it into a list. For example, if the input was "HC2H3O2", it would turn it into:

molecule_list = ['H', 1, 'C', 2, 'H', 3, 'O', 2]

This works well so far, but if I input an element with two letters in it, for example sodium (Na), it would split it into:

['N', 'a']

I'm searching for a way to make my function look through the string for keys found in a dictionary called elements. I'm also considering using regex for this, but I'm not sure how to implement it. This is what my function is right now:

def split_molecule(inputted_molecule):
    """Take the input and split it into a list
    eg: CO2 => ['C', 1, 'O', 2]
    """
    # step 1: convert inputted_molecule to a list
    # step 2a: if there are two periodic elements next to each other, insert a '1'
    # step 2b: if the last element is an element, append a '1'
    # step 3: convert all numbers in list to ints

    # step 1:
    # problem: it splits Na into 'N', 'a'
    # it needs to split by periodic elements
    molecule_list = list(inputted_molecule)

    # because at most, the list can double when "1" is inserted
    max_length_of_molecule_list = 2*len(molecule_list)
    # step 2a:
    for i in range(0, max_length_of_molecule_list):
        try:
            if (molecule_list[i] in elements) and (molecule_list[i+1] in elements):
                molecule_list.insert(i+1, "1")
        except IndexError:
            break
    # step2b:     
    if (molecule_list[-1] in elements):
        molecule_list.append("1")

    # step 3:
    for i in range(0, len(molecule_list)):
        if molecule_list[i].isdigit():
            molecule_list[i] = int(molecule_list[i])

    return molecule_list

3 Answers

6

How about

import re
print(re.findall(r'[A-Z][a-z]?|[0-9]+', 'Na2SO4MnO4'))

result

['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']

Regex explained:

Find everything that is either

    [A-Z]   # A,B,...Z, ie. an uppercase letter
    [a-z]   # followed by a,b,...z, ie. a lowercase letter
    ?       # which is optional
    |       # or
    [0-9]   # 0,1,2...9, ie. a digit
    +       # and perhaps some more of them

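This expression is pretty dumb since it accepts arbitrary "elements", like "Xy". You can improve it by replacing the [A-Z][a-z]? part with the actual list of elements' symbols, separated by |, like Ba|Na|Mn...|C|O. Put the two-letter symbols before the one-letter ones, otherwise "Na" gets matched as "N" and the "a" is left over.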
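A minimal sketch of that idea, assuming Python 3. The SYMBOLS list is only a small illustrative subset, and the names SYMBOLS, TOKEN and tokenize are made up for this example. Sorting by length puts the two-letter symbols first in the alternation.

```python
import re

# Illustrative subset only; a real table would list every element symbol.
SYMBOLS = ['H', 'C', 'N', 'O', 'S', 'Na', 'Mn', 'Ba']

# Try two-letter symbols first, otherwise 'Na' would be matched as 'N'.
alternation = '|'.join(sorted(SYMBOLS, key=len, reverse=True))
TOKEN = re.compile('(?:%s)|[0-9]+' % alternation)

def tokenize(formula):
    """Return the known element symbols and digit runs found in formula."""
    return TOKEN.findall(formula)

print(tokenize('Na2SO4MnO4'))  # ['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']
```

Keep in mind that findall silently skips anything it cannot match, so tokenize('Xy2') returns ['2'] instead of complaining about the unknown "Xy". If you want bad input rejected, compare the joined tokens with the original string.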

Of course, regular expressions can only handle very simple formulas, to parse something like

  8(NH4)3P4Mo12O40 + 64NaNO3 + 149NH4NO3 + 135H2O

you're going to need a real parser, e.g. pyparsing (be sure to check "chemical formulas" under "Examples"). Good luck!
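To give a rough idea of what a parser has to do beyond the regex, here is a hand-rolled sketch, assuming Python 3, that handles parentheses with a stack. It is not pyparsing and not taken from its examples; count_atoms is a hypothetical name, and it only handles a single formula, with no leading coefficients and no "+".

```python
import re
from collections import Counter

def count_atoms(formula):
    """Count the atoms in one formula that may contain parentheses."""
    tokens = re.findall(r'[A-Z][a-z]?|\d+|[()]', formula)
    if ''.join(tokens) != formula:
        raise ValueError('unrecognised text in %r' % formula)

    stack = [Counter()]          # one Counter per open parenthesis level
    i = 0
    while i < len(tokens):
        token = tokens[i]
        i += 1
        if token == '(':
            stack.append(Counter())
            continue
        if token == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced ")"')
            group = stack.pop()
        elif token.isdigit():
            raise ValueError('count without an atom or group before it')
        else:
            group = Counter({token: 1})

        # An optional number right after an atom or ")" multiplies it.
        multiplier = 1
        if i < len(tokens) and tokens[i].isdigit():
            multiplier = int(tokens[i])
            i += 1
        for symbol, n in group.items():
            stack[-1][symbol] += n * multiplier

    if len(stack) != 1:
        raise ValueError('unbalanced "("')
    return dict(stack[0])

print(count_atoms('(NH4)3P4Mo12O40'))
# {'N': 3, 'H': 12, 'P': 4, 'Mo': 12, 'O': 40}
```

The regex is still there, but only as a tokenizer; the nesting is tracked by the stack, which is the part a single regular expression cannot do.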


2 Comments

That's brilliant, thank you! Would you mind explaining the regex?
+1 for mentioning that you would need a real parser, instead of a regex parser
2

An expression like this will match all parts of interest:

[A-Z][a-z]*|\d+

You can use it with re.findall and then add the quantifier for atoms that have none.
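For the first variant, a minimal sketch could look like this, assuming Python 3. It reuses the split_molecule name from the question and also converts the counts to ints, as the question asks.

```python
import re

def split_molecule(formula):
    """Split e.g. 'HC2H3O2' into ['H', 1, 'C', 2, 'H', 3, 'O', 2]."""
    result = []
    for token in re.findall(r'[A-Z][a-z]*|\d+', formula):
        if token.isdigit():
            result.append(int(token))
        else:
            # The previous atom had no explicit count, so give it a 1.
            if result and isinstance(result[-1], str):
                result.append(1)
            result.append(token)
    # A trailing atom also needs its implicit 1.
    if result and isinstance(result[-1], str):
        result.append(1)
    return result

print(split_molecule('HC2H3O2'))  # ['H', 1, 'C', 2, 'H', 3, 'O', 2]
print(split_molecule('NaCl'))     # ['Na', 1, 'Cl', 1]
```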

Or you could use a regex for that as well:

molecule = 'NaHC2H3O2'
print(re.findall(r'[A-Z][a-z]*|\d+', re.sub(r'[A-Z][a-z]*(?![\da-z])', r'\g<0>1', molecule)))

Output:

['Na', '1', 'H', '1', 'C', '2', 'H', '3', 'O', '2']

The sub adds a 1 after all atoms not followed by a number.

1

The non-regex approach, which is a bit hackish and probably not the best, but it works:

formula = 'HC2H3O2Na'
m_list = []
for x in formula:
    if x.islower() and m_list:
        # a lowercase letter continues the previous symbol;
        # appending to the last item also works when a letter repeats
        m_list[-1] += x
    else:
        m_list.append(x)
print(m_list)
['H', 'C', '2', 'H', '3', 'O', '2', 'Na']
