How do I find how many times a substring appears in a string? I have a molecular formula. A single uppercase letter is one element (e.g. H), an uppercase letter followed by a lowercase letter is also one element (e.g. Ba), and if a number follows an element, I have to add that number to the element's count.

Example input: Ba4H2Ba5Li3

If I search for Ba, it should print 9 (I have Ba4 and Ba5, which add up to 9). If I search for H, it should print 2 (one letter H, but with the number 2 after it), and for Li it should print 3.

2
  • What if the compound is mercury(II) hydride? Should the count of H in HgH2 be 2 or 3? Commented Mar 16, 2014 at 2:26
  • @user2357112: my solution will at least count that correctly as {'Hg': 1, 'H': 2}. :-) Commented Mar 16, 2014 at 10:13

3 Answers

3

You can use a regular expression, like this:

import re

data = "Ba4H2Ba5Li3"
result = {}
# Each match is (symbol, digits); an empty digits string means a count of 1.
for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", data):
    result[element] = result.get(element, 0) + int(count or 1)
print(result)
# {'Ba': 9, 'H': 2, 'Li': 3}  (key order may differ on older Python versions)

Now you can get the count of each element from the result, like this:

print(result.get("Ba", 0))
# 9
print(result.get("H", 0))
# 2
print(result.get("Li", 0))
# 3
print(result.get("Sa", 0))  # a symbol that is not in the formula
# 0
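
If you only ever need the total for one symbol, as in the question, the same regular expression can be wrapped in a small helper. This is just a sketch; the name count_element is made up for illustration and is not part of any library:

```python
import re

# Same pattern as above: an uppercase letter, an optional lowercase letter,
# then an optional run of digits.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")

def count_element(formula, symbol):
    """Return the total count of one element symbol in a flat formula."""
    return sum(
        int(count or 1)  # a missing number counts as 1
        for element, count in ELEMENT.findall(formula)
        if element == symbol
    )

print(count_element("Ba4H2Ba5Li3", "Ba"))
# 9
```

Because whole symbols are compared, searching for "B" does not accidentally match the "Ba" entries.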
4 Comments

If the string isn't followed by a number, it's supposed to count as 1 occurrence.
What about hydrogen cyanide? HCN parses as one HC and one N.
@user2357112 Should that be parsed as {H:1, C:1, N:1}?
Yes. Your most recent edit appears to have fixed that, though. Unless the compound contains temporary symbols (Uuo, for example) or other oddities like compound ions (which are beyond the scope of the question), this should work.
2

I'd parse the whole input string into a dictionary; a regular expression would help here:

import re
from collections import defaultdict

molecule = re.compile(r'([A-Z][a-z]?)(\d*)')

def parse_formula(f):
    counts = defaultdict(int)
    for name, count in molecule.findall(f):
        counts[name] += int(count or 1)
    return counts

This counts an element with no digit after its symbol as 1, so 'H3O' is still counted correctly.

Now you can simply look up your elements:

counts = parse_formula('Ba4H2Ba5Li3')
print(counts['Ba'])
print(counts['H'])

Demo:

>>> counts = parse_formula('Ba4H2Ba5Li3')
>>> counts
defaultdict(<type 'int'>, {'H': 2, 'Ba': 9, 'Li': 3})
>>> counts['H']
2
>>> counts['Ba']
9
>>> parse_formula('H3O')
defaultdict(<type 'int'>, {'H': 3, 'O': 1})
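
The comments in this thread raise a few edge cases: HCN, HgH2, and three-letter temporary symbols such as Uuo. Note that findall silently skips anything the pattern cannot match, so a stricter variant may be useful. The following is only a sketch, assuming Python 3.4+ for re.fullmatch; parse_formula_strict is a hypothetical name, not part of the answer above:

```python
import re
from collections import defaultdict

TOKEN = re.compile(r'([A-Z][a-z]?)(\d*)')
# The whole string must consist of symbol-plus-optional-count tokens only.
WHOLE = re.compile(r'(?:[A-Z][a-z]?\d*)+')

def parse_formula_strict(f):
    """Like parse_formula, but reject input the pattern cannot fully cover."""
    if not WHOLE.fullmatch(f):
        raise ValueError("cannot parse formula: %r" % f)
    counts = defaultdict(int)
    for name, count in TOKEN.findall(f):
        counts[name] += int(count or 1)
    return dict(counts)

print(parse_formula_strict('HCN') == {'H': 1, 'C': 1, 'N': 1})
# True
print(parse_formula_strict('HgH2') == {'Hg': 1, 'H': 2})
# True
```

With this check, 'Uuo' and formulas containing parentheses raise ValueError instead of being miscounted.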

1

Here is a somewhat more robust approach which will properly handle formulas with nested sub-expressions, such as Na(OH)2 or Al(NO3)3:

# Loosely based on example code from
# http://pyparsing.wikispaces.com/file/detail/chemicalFormulas.py
from pyparsing import Group, Forward, Literal, nums, oneOf, OneOrMore, Optional, Word

# from http://pyparsing-public.wikispaces.com/Helpful+Expressions
# element("He") => "He"
element = oneOf(
    """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
    Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
    As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
    Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
    Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
    Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
    Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
    Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No"""
)

# integer("123") => 123
to_int = lambda tokens: int(tokens[0])
integer = Word(nums).setParseAction(to_int)

# item("He") => {"He": 1}
# item("O2") => {"O": 2}
item_to_dict = lambda tokens: {a:b for a,b in tokens}
item = Group(element + Optional(integer, default=1)).setParseAction(item_to_dict)

# allow recursive definition of formula
Formula = Forward()

# expr("(OH)2") => {"O": 2, "H": 2}
lpar    = Literal("(").suppress()
rpar    = Literal(")").suppress()
expr_to_dict = lambda tokens: {el: num*tokens[1] for el,num in tokens[0].items()}
expr = (lpar + Formula + rpar + integer).setParseAction(expr_to_dict)

# ... complete the recursive definition
def formula_to_dict(tokens):
    total = {}
    for expr in tokens:
        for el,num in expr.items():
            total[el] = total.get(el, 0) + num
    return total
Formula <<= OneOrMore(item | expr).setParseAction(formula_to_dict)

# Finally, wrap it in an easy-to-use function:
def get_elements(s):
    return Formula.parseString(s)[0]

You can use it like:

>>> get_elements("Na(OH)2")
{'H': 2, 'Na': 1, 'O': 2}

>>> get_elements("Al(NO3)3")
{'Al': 1, 'N': 3, 'O': 9}

>>> get_elements("Ba4H2Ba5Li3")
{'Ba': 9, 'H': 2, 'Li': 3}

1 Comment

Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
