Python count substrings

Question

How do I count how many times a substring appears in a string? I have a molecular formula. A single uppercase letter is one element (e.g. H); an uppercase letter followed by a lowercase letter is also one element (e.g. Ba); and if a number follows an element, that number has to be added to the element's count.

Example input: Ba4H2Ba5Li3

If I search for Ba, it should print 9 (Ba4 and Ba5 add up to 9); if I search for H, it should print 2 (one H, followed by the number 2); and for Li it should print 3.

What if the compound is mercury(II) hydride? Should the count of H in HgH2 be 2 or 3?
– user2357112, Commented Mar 16, 2014 at 2:26
@user2357112: my solution will at least count that correctly as {'Hg': 1, 'H': 2}. :-)
– Martijn Pieters, Commented Mar 16, 2014 at 10:13

thefourtheye · Accepted Answer · 2014-03-16 02:27:39Z

3

You can use a regular expression, like this:

import re

data = "Ba4H2Ba5Li3"
result = {}
# Each match is (element symbol, optional digits); a missing count means 1
for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", data):
    result[element] = result.get(element, 0) + int(count or 1)
print(result)
# {'Ba': 9, 'H': 2, 'Li': 3}

Now you can get the count of each element from the result, like this:

print(result.get("Ba", 0))
# 9
print(result.get("H", 0))
# 2
print(result.get("Li", 0))
# 3
print(result.get("Sa", 0))
# 0

edited Mar 16, 2014 at 2:27

answered Mar 16, 2014 at 2:03

thefourtheye


4 Comments

user2357112 Over a year ago

If the string isn't followed by a number, it's supposed to count as 1 occurrence.

user2357112 Over a year ago

What about hydrogen cyanide? HCN parses as one HC and one N.

thefourtheye Over a year ago

@user2357112 Should that be parsed as {H:1, C:1, N:1}?

user2357112 Over a year ago

Yes. Your most recent edit appears to have fixed that, though. Unless the compound contains temporary symbols (Uuo, for example) or other oddities like compound ions (which are beyond the scope of the question), this should work.
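
The temporary-symbol limitation mentioned in the last comment can be illustrated with a small Python sketch. It assumes that three-letter systematic symbols such as Uuo should be treated as a single element, and it widens the pattern to allow up to two lowercase letters. The name count_symbols is hypothetical and not part of the answer above.

```python
import re
from collections import Counter

# Assumption: an element symbol is one uppercase letter followed by
# zero, one, or two lowercase letters (covers H, Ba, and Uuo).
SYMBOL = re.compile(r"([A-Z][a-z]{0,2})(\d*)")

def count_symbols(formula):
    """Return a Counter mapping each element symbol to its total count."""
    counts = Counter()
    for symbol, digits in SYMBOL.findall(formula):
        # An empty digit string means the symbol appears once.
        counts[symbol] += int(digits or 1)
    return counts

print(count_symbols("HCN"))
print(count_symbols("Uuo2H"))
```

With the original [a-z]? pattern, "Uuo2" would be read as a single Uu and the trailing "o2" would be skipped, which is why the wider pattern is needed for such symbols.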

Martijn Pieters · Accepted Answer · 2014-03-16 02:09:16Z

2

I'd parse the whole input string into a dictionary; a regular expression would help here:

import re
from collections import defaultdict

molecule = re.compile(r'([A-Z][a-z]?)(\d*)')

def parse_formula(f):
    counts = defaultdict(int)
    for name, count in molecule.findall(f):
        counts[name] += int(count or 1)
    return counts

This counts an element with no digit after its symbol as 1, so 'H3O' is still counted correctly.

Now you can simply look up your elements:

counts = parse_formula('Ba4H2Ba5Li3')
print(counts['Ba'])
print(counts['H'])

Demo:

>>> counts = parse_formula('Ba4H2Ba5Li3')
>>> counts
defaultdict(<type 'int'>, {'H': 2, 'Ba': 9, 'Li': 3})
>>> counts['H']
2
>>> counts['Ba']
9
>>> parse_formula('H3O')
defaultdict(<type 'int'>, {'H': 3, 'O': 1})
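
One thing worth knowing about the findall approach is that re.findall silently skips any text the pattern does not match. A formula such as 'Na(OH)2' would therefore come back as Na, O and H with a count of 1 each, with the parentheses and the trailing 2 ignored. The following is a minimal sketch of a stricter variant that rejects such input instead. The name parse_formula_strict and the choice to raise ValueError are assumptions made for illustration, and re.fullmatch needs Python 3.4 or later.

```python
import re
from collections import Counter

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
# The whole string must consist of element/count tokens and nothing else.
WHOLE = re.compile(r"(?:[A-Z][a-z]?\d*)+")

def parse_formula_strict(formula):
    """Count elements, refusing input that contains unsupported characters."""
    if not WHOLE.fullmatch(formula):
        raise ValueError("unsupported formula: %r" % formula)
    counts = Counter()
    for symbol, digits in TOKEN.findall(formula):
        counts[symbol] += int(digits or 1)
    return counts

print(parse_formula_strict("Ba4H2Ba5Li3"))
```

Calling parse_formula_strict("Na(OH)2") raises ValueError rather than returning a misleading count.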

edited Mar 16, 2014 at 2:09

answered Mar 16, 2014 at 2:04

Martijn Pieters


Hugh Bothwell · Accepted Answer · 2014-03-17 00:45:46Z

Here is a somewhat more robust approach which will properly handle formulas with nested sub-expressions, such as Na(OH)2 or Al(NO3)3:

# Loosely based on example code from
# http://pyparsing.wikispaces.com/file/detail/chemicalFormulas.py
from pyparsing import Group, Forward, Literal, nums, oneOf, OneOrMore, Optional, Word

# from http://pyparsing-public.wikispaces.com/Helpful+Expressions
# element("He") => "He"
element = oneOf(
    """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
    Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
    As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
    Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
    Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
    Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
    Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
    Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No"""
)

# integer("123") => 123
to_int = lambda tokens: int(tokens[0])
integer = Word(nums).setParseAction(to_int)

# item("He") => {"He": 1}
# item("O2") => {"O": 2}
item_to_dict = lambda tokens: {a:b for a,b in tokens}
item = Group(element + Optional(integer, default=1)).setParseAction(item_to_dict)

# allow recursive definition of formula
Formula = Forward()

# expr("(OH)2") => {"O": 2, "H": 2}
lpar    = Literal("(").suppress()
rpar    = Literal(")").suppress()
expr_to_dict = lambda tokens: {el: num*tokens[1] for el,num in tokens[0].items()}
expr = (lpar + Formula + rpar + integer).setParseAction(expr_to_dict)

# ... complete the recursive definition
def formula_to_dict(tokens):
    total = {}
    for expr in tokens:
        for el,num in expr.items():
            total[el] = total.get(el, 0) + num
    return total
Formula <<= OneOrMore(item | expr).setParseAction(formula_to_dict)

# Finally, wrap it in an easy-to-use function:
def get_elements(s):
    return Formula.parseString(s)[0]

You can use it like:

>>> get_elements("Na(OH)2")
{'H': 2, 'Na': 1, 'O': 2}

>>> get_elements("Al(NO3)3")
{'Al': 1, 'N': 3, 'O': 9}

>>> get_elements("Ba4H2Ba5Li3")
{'Ba': 9, 'H': 2, 'Li': 3}

Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
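
For readers who cannot install pyparsing, the same nested-group counting can be sketched with only the standard library. This is a different technique from the grammar above: a regex tokenizer plus an explicit stack of counters. The name count_elements is hypothetical, symbols are matched by shape rather than against a list of known elements, and unlike the pyparsing grammar the multiplier after a closing parenthesis is treated as optional.

```python
import re
from collections import Counter

# One token is either: element + optional count, "(", or ")" + optional multiplier.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")

def count_elements(formula):
    """Count atoms per element, honoring parenthesized groups like (OH)2."""
    stack = [Counter()]  # one Counter per currently open group
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        element, count, lpar, rpar, multiplier = match.groups()
        if element:
            stack[-1][element] += int(count or 1)
        elif lpar:
            stack.append(Counter())
        else:
            # Closing parenthesis: scale the group and merge it into its parent.
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            factor = int(multiplier or 1)
            for symbol, n in group.items():
                stack[-1][symbol] += n * factor
        pos = match.end()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return dict(stack[0])

print(count_elements("Al(NO3)3"))
```

Each opening parenthesis pushes a fresh counter and each closing one pops it, so groups can nest to any depth without recursion.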
