Can this regex be achieved with only string methods in Python?

Question

Say I have a string:

foo: bar:baz : moo:mar:maz

I want to count the number of times a colon appears in this string, with a non-whitespace character immediately to the left or right of it. So foo: counts for one instance, bar:baz count for two more, and moo:mar:maz count for four instances total. We count mar twice because it's on both the right and left of a colon. The lone colon : doesn't count for anything, because it's got no adjacent non-whitespace character.

The count for the above string should therefore be 7.

I can do this by regex, as in:

str = "foo: bar:baz : moo:mar:maz"
left = len(re.findall("\S:", str))
right = len(re.findall(":\S", str))
offset = left + right

But I want to do this without regex, as I'm running a script that needs to be as optimised as possible. Is there any way to do this using only string functions?

Here's one method I tried, which basically splits up the string by spaces, then examines each substring and splits that up by colons, counting the number of elements in the resulting list and adding it to the total.

spl = str.split(" ")
count=0
print(spl)
for element in spl:
    subspl = element.split(':')
    print(subspl)
    if len(subspl) > 1:
        count += len([s for s in subspl if s != ''])

This almost works, but it fails on moo:mar:maz - the [s for s in subspl if s != ''] list comprehension returns ['moo', 'mar', 'maz'], which has three elements. This should add four to the total, not three.

Is there a way to do this using only string methods, or which is faster than regexes?

EDIT: An edge case I hadn't considered was pointed out. If the string is foo::bar foo::::bar or foo: bar: I want the code to count 2 in all cases. A colon adjacent to another colon shouldn't count towards the total, so ::: and :: and ::::::: should all count for 0. I only want to record the number of times where a non-colon, non-whitespace character is immediately adjacent to a colon.

Note that "only string methods" doesn't necessarily mean it'll be faster than using regexes. — user2357112
– user2357112, Commented Dec 15, 2020 at 19:16
Also, what if the string is something like ':::'? How should that be counted? — user2357112
– user2357112, Commented Dec 15, 2020 at 19:17
But I want to do this without regex, as I'm running a script that needs to be as optimised as possible. It is unlikely that such simple regexes are the bottleneck for reasonably sized Python code. — Timur Shtatland
– Timur Shtatland, Commented Dec 15, 2020 at 19:31
@Tarique I just did a speed test (see my answer) and it turns out string methods actually are much faster. I didn't believe it myself at first. — mapf
– mapf, Commented Dec 15, 2020 at 20:54

Booboo · Accepted Answer · 2020-12-15 20:28:40Z

2

You may discover that a regex solution is faster than a non-regex solution. I believe your regex can be, however, improved by using a single regex as follows:

import re

str = "foo: bar:baz : moo:mar:maz"
matches = re.findall(r"(\S(?=:)|(?<=:)\S)", str)
print(len(matches))

By using a lookahead and lookbehind assertion, you are in essence able to support pseudo-overlapping matches without having to use the regex package from the PyPI repository, which supports true overlapping matches. But here you have no need to actually match the colon characters.

Update:

But in case you are interested:

import regex as re

str = "foo: bar:baz : moo:mar:maz"
matches = re.findall(r"(\S:|:\S)", str, overlapped=True)
print(len(matches))

edited Dec 15, 2020 at 20:28

answered Dec 15, 2020 at 20:23

Booboo

45.7k4 gold badges46 silver badges74 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

mapf Over a year ago

Hi, I just did a speed test of your first solution, and it turns out, using string methods actually really is faster. Never believed that myself, but for a large dataset, it would make a difference. Can you cross-check that?

Booboo Over a year ago

@mapf See my comment to your answer.

Lou Over a year ago

Thanks, I was secretly hoping someone would also propose a better solution to my sloppy regex! This looks much neater. Interesting points about the speed tests though ...

Lou Over a year ago

I've just realised I can solve the problem of not counting colons adjacent to other colons by modifying your first regex slightly: if I change it to ([^\s:](?=:)|(?<=:)[^\s:]), it will look for characters adjacent to colons that are neither whitespace nor other colons. So far it's generated all the correct results for me. Not sure how I'd apply it to the non-regex solution, but it's still useful :)

Booboo Over a year ago

The difference between your regex and mine is the difference between "I want to count the number of times a colon appears in this string, with a non-whitespace character immediately to the left or right of it." and "I want to count the number of times a colon appears in this string, with a non-whitespace character other than a colon immediately to the left or right of it."

mapf · Accepted Answer · 2020-12-16 10:45:05Z

1

Here is an approach that only uses string methods. I've run speed tests to compare it to the other two methods proposed by @Booboo and @JeffUK.

import time
import re
from functools import partial


def use_string_method(string):
    working_copy = f' {string} '.replace(' : ', '   ')
    total = 0

    for colon_version in [': ', ' :']:
        total += working_copy.count(colon_version)
        working_copy = working_copy.replace(colon_version, '  ')
    total += 2*working_copy.count(':')

    return total


def use_regex(rex, string):
    matches = rex.findall(string)
    total = len(matches)

    return total


def use_for_loop(string):
    total = 0
    for i in range(0, len(string)):
        if string[i] == ":":
            if i > 0 and string[i - 1] != ' ':
                total += 1

            if i + 1 < len(string) and string[i + 1] != ' ':
                total += 1

    return total


def account_for_double_colons_using_string_method(string):
    working_copy = string
    length = len(working_copy) - 1
    while length < len(working_copy):
        length = len(working_copy)
        working_copy = working_copy.replace('::', ': :')

    return working_copy


def account_for_double_colons_using_regex(string):
    working_copy = re.sub('::', ': :', string)

    return working_copy


text = 'foo:: bar:baz : moo::mar::maz:'*10000

for name, method in {
    'string': account_for_double_colons_using_string_method,
    'regex': account_for_double_colons_using_regex,
}.items():
    start = time.time()
    method(text)
    stop = time.time()
    print(f'{name}: {stop-start} seconds')

regex = re.compile(r"(\S(?=:)|(?<=:)\S)")
for name, method in {
    'string': use_string_method,
    'regex': partial(use_regex, regex),
    'loop': use_for_loop,
}.items():
    start = time.time()
    for _ in range(100):
        method(text)
    stop = time.time()
    print(f'{name}: {stop-start} seconds')

To my surprise I found that the version using string methods seems to be more than 10 times faster than the other two:

# for 1000 runs @ 1000 times the original string:
string: 0.13166213035583496 seconds
regex: 2.456428050994873 seconds
loop: 3.813805341720581 seconds

Note the extra spaces in the replacement strings (' '), to make them as long as the ones they replace. By accident I found this to be important; otherwise unnecessary time is wasted on shortening the text (I guess it was something to do with memory allocation).

UPDATE

Added two approaches to account for double colons, one using string methods, the other using regex. Even though string methods require a while loop, they appear to be superior again:

# for one run @ 10000 times the original string
string: 0.0009975433349609375 seconds
regex: 0.003989458084106445 seconds

Surprisingly, even in an extreme case, the string methods are still slightly faster than regex:

# for one run @ ':'*1000000
string: 0.017957448959350586 seconds
regex: 0.024881601333618164 seconds

edited Dec 16, 2020 at 10:45

answered Dec 15, 2020 at 19:58

mapf

2,0542 gold badges19 silver badges47 bronze badges

11 Comments

Booboo Over a year ago

There is an overhead in compiling the regex expression. This needs to be amortized against a real-world run, which I assume is not just "foo: bar:baz : moo:mar:maz". Otherwise, who cares how fast it runs? So my assumption is that the actual string is large. How large? I don't know. But what if it were megabytes? Did you first compile the regex and then time it from that point? That would be a fairer measurement.

mapf Over a year ago

Yeah, I have no idea. Again, I never thought this would be faster. But it seems to be quite significant and independent of string length and number of method call repetitions.

mapf Over a year ago

Can you elaborate on the compiling part? I'm not an expert in regex, so I probably didn't do that but also don't know how. Maybe run the rest yourself and see what is says?

Booboo Over a year ago

You are timing the wrong thing. You need to separately compile the regex: rex = re.compile(r"(\S(?=:)|(?<=:)\S)"). Then pass rex to function use_regex as an argument and then that function executes rex.findall(string). Or rex can be a global variable. I can update your answer with the change, if you want.

mapf Over a year ago

I edited the code accordingly (I think). If there is something missing / wrong, feel free to make changes.

|

JeffUK · Accepted Answer · 2020-12-15 19:57:58Z

1

Of course it can be done without regex, the simple approach would be like this: (Simplified but you could easily replace !=' ' with a is_non_whitespace() function)

score = 0
for i in range(0,len(teststring)):
    if teststring[i]==":":
        if i>0 and teststring[i-1]!=' ':
            score+=1
            print(i,teststring[i-1],teststring[i])
            
        print(i,len(teststring))
        if i+1<len(teststring) and teststring[i+1]!=' ':
            score+=1
            print(i,teststring[i],teststring[i+1])
            
print(score)

Based on some rough calculations this is exponentially slower than the regex version. On your example it takes about twice as long to run. On a 2500000 character long string regex is 5 times faster.

edited Dec 15, 2020 at 19:57

answered Dec 15, 2020 at 19:28

JeffUK

4,2912 gold badges22 silver badges35 bronze badges

2 Comments

Lou Over a year ago

Thanks, that's useful to know that regex would actually be quicker! You've answered the original question and the XY problem neatly in one ...

mapf Over a year ago

Hi, I've also included your approach in my speed test. On my machine it doesn't seem too bad and is only slightly slower than the regex approach. But feel free to check yourself.

Collectives™ on Stack Overflow

Can this regex be achieved with only string methods in Python?

3 Answers 3

5 Comments

11 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

5 Comments

11 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related