
Background

The Python module regex allows fuzzy matching.

You can specify the allowable number of substitutions (s), insertions (i), deletions (d), and total errors (e).

The fuzzy_counts attribute of a match object returns a tuple of three counts, for example (0, 0, 0), where:

match.fuzzy_counts[0] = count for 's' 
match.fuzzy_counts[1] = count for 'i' 
match.fuzzy_counts[2] = count for 'd'
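For readability, the three positions can be given names. This is only a small sketch: FuzzyCounts and label_counts are hypothetical helper names, not part of the regex module, and the tuple passed in below is the one from my example rather than a live match result.

```python
from collections import namedtuple

# Hypothetical helper type (not part of the regex module): names the three
# positions of a fuzzy_counts tuple in the order described above.
FuzzyCounts = namedtuple("FuzzyCounts", ["substitutions", "insertions", "deletions"])


def label_counts(counts):
    """Wrap an (s, i, d) tuple, such as match.fuzzy_counts, with named fields."""
    return FuzzyCounts(*counts)


counts = label_counts((6, 0, 1))
print(counts.substitutions, counts.insertions, counts.deletions)
```

With a real match object, the call would be label_counts(match.fuzzy_counts).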

Problem

The deletions and insertions are counted as expected, but not the substitutions.

In the example below, the only change is a single character deleted in the query, yet the substitutions count is 6 (7 if the BESTMATCH option is removed).

How are the substitutions counted?

I would be grateful if anyone could explain how this works.

>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG" 
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(6, 0, 1)

2 Answers

The issue seems to be related to the error limits set in the fuzzy constraint.

Reducing the substitution limit to s<3 (and the total error limit to e<4) lowers the reported substitution count:

>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<3,i<3,d<3,e<4}" 
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"  
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts) 
(1, 0, 1)

Reducing the allowed error for 's' even further, to s<2, returns the expected substitution count for this match:

>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<2,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG" 
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(0, 0, 1)

Why it behaves in this way is still a mystery to me.


1 Comment

This behaviour seems to have been fixed (regex version 2.4.106): all of the above examples now return the correct substitution/indel counts, (0, 0, 1).

This was caused by what looks to be a bug in the regex module's cost calculations. It was still present up until regex version 2015.10.05, but was fixed in the next version, 2015.10.22, as shown below:

$ sudo pip3 install regex==2015.10.05
Processing /root/.cache/pip/wheels/24/cb/ae/9653e30c8f801544a645e17d26fa6803aeaf76ad0482663c27/regex-2015.10.5-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Successfully installed regex-2015.10.5
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(5, 0, 1)
$ sudo pip3 install regex==2015.10.22
Processing /root/.cache/pip/wheels/60/f6/9a/23e723633e62a79064cb301c54a3b50482b8c690f86c9983ee/regex-2015.10.22-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
  Found existing installation: regex 2015.10.5
    Uninstalling regex-2015.10.5:
      Successfully uninstalled regex-2015.10.5
Successfully installed regex-2015.10.22
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(0, 0, 1)
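To check which side of that boundary an existing installation falls on, something like the following could be used. It is a rough sketch: predates_fix and FIXED_IN are hypothetical names, the threshold is simply the 2015.10.22 release from the session above, and the parsing assumes a date-style version string whose first three components are plain integers.

```python
from importlib import metadata

# First release in which the counts came out right, per the pip session above.
FIXED_IN = (2015, 10, 22)


def predates_fix(dist_version):
    """Return True if a date-style version such as '2015.10.5' is older than FIXED_IN.

    Assumes the first three dot-separated components are plain integers.
    """
    parts = tuple(int(p) for p in dist_version.split(".")[:3])
    return parts < FIXED_IN


# Look up the installed distribution version, if the package is present at all.
try:
    installed = metadata.version("regex")
except metadata.PackageNotFoundError:
    installed = None

if installed is not None:
    status = "predates the fix" if predates_fix(installed) else "includes the fix"
    print(installed, status)
```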

Given these dates, I infer that the commit that fixed the bug was https://bitbucket.org/mrabarnett/mrab-regex/commits/296c1daf86619039c6fe55868e7d861097d01aae, with description

Hg issue 161: Unexpected fuzzy match results

Fixed the bug and did some related tidying up.

The referenced bug is https://bitbucket.org/mrabarnett/mrab-regex/issues/161.
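As a cross-check that (0, 0, 1) really is the expected answer, the minimal edit counts can be computed independently of the regex module. The sketch below is a plain dynamic-programming alignment, not the algorithm regex uses internally. The helper names parse_pattern and minimal_edit_counts are my own, the parser only understands literal characters and simple [..] classes, and the whole query is aligned against the whole pattern (a simplification compared with regex.search, which may match a substring).

```python
def parse_pattern(pattern):
    """Split a pattern of literals and [..] classes into a list of allowed-character sets."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            positions.append(set(pattern[i + 1:end]))
            i = end + 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def minimal_edit_counts(pattern, text):
    """Return (substitutions, insertions, deletions) for a minimal-cost alignment.

    Each cell holds (total, subs, ins, dels); tuples compare by total first,
    so min() picks an alignment with the fewest errors overall.
    """
    # Empty pattern prefix against text[:j]: j extra characters (insertions).
    prev = [(j, 0, j, 0) for j in range(len(text) + 1)]
    for allowed in parse_pattern(pattern):
        t, s, i, d = prev[0]
        curr = [(t + 1, s, i, d + 1)]  # pattern position with no text: deletion
        for j, ch in enumerate(text, start=1):
            t, s, i, d = prev[j - 1]
            diag = (t, s, i, d) if ch in allowed else (t + 1, s + 1, i, d)
            t, s, i, d = prev[j]
            delete = (t + 1, s, i, d + 1)  # pattern position missing from text
            t, s, i, d = curr[j - 1]
            insert = (t + 1, s, i + 1, d)  # extra character in text
            curr.append(min(diag, delete, insert))
        prev = curr
    _, subs, ins, dels = prev[-1]
    return (subs, ins, dels)


pattern = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
query = "TATGGACCAAAGTCTCAAGCCATGTG"
print(minimal_edit_counts(pattern, query))  # (0, 0, 1)
```

The pattern has 27 positions and the query has 26 characters, so at least one deletion is unavoidable, and dropping one G from the leading TATGGGA is enough to make everything else line up.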
