Is there a function that counts how many characters differ between two strings of the same length? I mean substitutions only. For example, AAA differs from AAT by 1 character.
4 Answers
This will work:
>>> str1 = "AAA"
>>> str2 = "AAT"
>>> sum(1 for x,y in enumerate(str1) if str2[x] != y)
1
>>> str1 = "AAABBBCCC"
>>> str2 = "ABCABCABC"
>>> sum(1 for x,y in enumerate(str1) if str2[x] != y)
6
>>>
The above solution uses sum, enumerate, and a generator expression.
Because True counts as 1 when summed, you could even do:
>>> str1 = "AAA"
>>> str2 = "AAT"
>>> sum(str2[x] != y for x,y in enumerate(str1))
1
>>>
But I personally prefer the first solution because it is clearer.
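Both forms work because bool is a subclass of int in Python, so each True adds 1 to the total and each False adds 0. A minimal sketch that checks this on the longer pair of strings from above (the variable names `filtered` and `summed` are just for illustration):

```python
str1 = "AAABBBCCC"
str2 = "ABCABCABC"

# bool is a subclass of int, so True behaves as 1 in arithmetic.
assert isinstance(True, int)
assert True + True == 2

# Filtering form: emit a 1 for every position that differs.
filtered = sum(1 for i, ch in enumerate(str1) if str2[i] != ch)

# Boolean form: sum the comparison results directly.
summed = sum(str2[i] != ch for i, ch in enumerate(str1))

print(filtered, summed)  # 6 6
```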
This is a nice use case for the zip function!
def count_substitutions(s1, s2):
    return sum(x != y for (x, y) in zip(s1, s2))
Usage:
>>> count_substitutions('AAA', 'AAT')
1
From the Python 2 docs (in Python 3, zip returns a lazy iterator instead of a list, but truncates the same way):
zip(...)
zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
Return a list of tuples, where each tuple contains the i-th element
from each of the argument sequences. The returned list is truncated
in length to the length of the shortest argument sequence.
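One consequence of the truncation described in the docs: if the two strings differ in length, the extra characters are silently ignored rather than counted. A hedged sketch of a stricter variant follows. The name `count_substitutions_strict` and the choice to raise `ValueError` are assumptions for illustration, not part of the answer above.

```python
def count_substitutions(s1, s2):
    # zip stops at the shorter string, so extra characters are never compared.
    return sum(x != y for x, y in zip(s1, s2))


def count_substitutions_strict(s1, s2):
    # Hypothetical variant: refuse inputs whose lengths differ.
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(s1, s2))


print(count_substitutions("AAA", "AATGG"))  # 1, the trailing "GG" is ignored
print(count_substitutions_strict("AAA", "AAT"))  # 1
```

Whether to raise or to truncate depends on the data; for fixed-length sequences, failing loudly is usually the safer default.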
Building on what poke said, I would suggest the jellyfish package. It has several distance measures similar to what you are asking for. Example from the documentation:
In [1]: jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
Out[1]: 1
or using your example:
In [2]: jellyfish.damerau_levenshtein_distance('AAA','AAT')
Out[2]: 1
This also works for strings of different lengths, because it measures edit distance rather than substitutions alone; for a substitution-only count, jellyfish also has hamming_distance.
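As a rough illustration of how the two kinds of measure can disagree: the documentation example scores 1 because swapping two adjacent characters counts as a single edit, whereas a position-by-position comparison sees two mismatches. A minimal sketch in plain Python (the helper name `count_mismatches` is made up here and is not part of jellyfish):

```python
def count_mismatches(s1, s2):
    # Hypothetical helper: count positions where the characters differ.
    # Assumes both strings have the same length.
    return sum(a != b for a, b in zip(s1, s2))


print(count_mismatches("jellyfish", "jellyfihs"))  # 2
print(count_mismatches("AAA", "AAT"))  # 1
```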
Similar to simon's answer, but you don't need zip just to call a function on each resulting tuple: map already does that when given several iterables (as does itertools.imap in Python 2). And there's a handy function for != in the operator module. Hence:
import operator
sum(map(operator.ne, s1, s2))
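For completeness, a self-contained run of the one-liner on the strings from the question. Like zip, map stops at the end of the shorter input in Python 3, so this sketch assumes equal-length strings.

```python
import operator

s1 = "AAABBBCCC"
s2 = "ABCABCABC"

# map calls operator.ne(a, b) on each pair of characters;
# sum then counts how many of those comparisons were True.
differences = sum(map(operator.ne, s1, s2))
print(differences)  # 6
```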
3 Comments
import lines at the top of your file :-)