1

Say I have a string:

foo: bar:baz : moo:mar:maz

I want to count the number of times a colon appears in this string, with a non-whitespace character immediately to the left or right of it. So foo: counts for one instance, bar:baz count for two more, and moo:mar:maz count for four instances total. We count mar twice because it's on both the right and left of a colon. The lone colon : doesn't count for anything, because it's got no adjacent non-whitespace character.

The count for the above string should therefore be 7.

I can do this by regex, as in:

str = "foo: bar:baz : moo:mar:maz"
left = len(re.findall("\S:", str))
right = len(re.findall(":\S", str))
offset = left + right

But I want to do this without regex, as I'm running a script that needs to be as optimised as possible. Is there any way to do this using only string functions?

Here's one method I tried, which basically splits up the string by spaces, then examines each substring and splits that up by colons, counting the number of elements in the resulting list and adding it to the total.

spl = str.split(" ")
count=0
print(spl)
for element in spl:
    subspl = element.split(':')
    print(subspl)
    if len(subspl) > 1:
        count += len([s for s in subspl if s != ''])

This almost works, but it fails on moo:mar:maz - the [s for s in subspl if s != ''] list comprehension returns ['moo', 'mar', 'maz'], which has three elements. This should add four to the total, not three.

Is there a way to do this using only string methods, or which is faster than regexes?

EDIT: An edge case I hadn't considered was pointed out. If the string is foo::bar foo::::bar or foo: bar: I want the code to count 2 in all cases. A colon adjacent to another colon shouldn't count towards the total, so ::: and :: and ::::::: should all count for 0. I only want to record the number of times where a non-colon, non-whitespace character is immediately adjacent to a colon.

8
  • Note that "only string methods" doesn't necessarily mean it'll be faster than using regexes. Commented Dec 15, 2020 at 19:16
  • Also, what if the string is something like ':::'? How should that be counted? Commented Dec 15, 2020 at 19:17
  • But I want to do this without regex, as I'm running a script that needs to be as optimised as possible. It is unlikely that such simple regexes are the bottleneck for reasonably sized Python code. Commented Dec 15, 2020 at 19:31
  • 1
    Using regex is faster if you use the compiled version. Commented Dec 15, 2020 at 19:33
  • @Tarique I just did a speed test (see my answer) and it turns out string methods actually are much faster. I didn't believe it myself at first. Commented Dec 15, 2020 at 20:54

3 Answers 3

2

You may discover that a regex solution is faster than a non-regex solution. I believe your regex can be, however, improved by using a single regex as follows:

import re

str = "foo: bar:baz : moo:mar:maz"
matches = re.findall(r"(\S(?=:)|(?<=:)\S)", str)
print(len(matches))

By using a lookahead and lookbehind assertion, you are in essence able to support pseudo-overlapping matches without having to use the regex package from the PyPI repository, which supports true overlapping matches. But here you have no need to actually match the colon characters.

Update:

But in case you are interested:

import regex as re

str = "foo: bar:baz : moo:mar:maz"
matches = re.findall(r"(\S:|:\S)", str, overlapped=True)
print(len(matches))
Sign up to request clarification or add additional context in comments.

5 Comments

Hi, I just did a speed test of your first solution, and it turns out, using string methods actually really is faster. Never believed that myself, but for a large dataset, it would make a difference. Can you cross-check that?
@mapf See my comment to your answer.
Thanks, I was secretly hoping someone would also propose a better solution to my sloppy regex! This looks much neater. Interesting points about the speed tests though ...
I've just realised I can solve the problem of not counting colons adjacent to other colons by modifying your first regex slightly: if I change it to ([^\s:](?=:)|(?<=:)[^\s:]), it will look for characters adjacent to colons that are neither whitespace nor other colons. So far it's generated all the correct results for me. Not sure how I'd apply it to the non-regex solution, but it's still useful :)
The difference between your regex and mine is the difference between "I want to count the number of times a colon appears in this string, with a non-whitespace character immediately to the left or right of it." and "I want to count the number of times a colon appears in this string, with a non-whitespace character other than a colon immediately to the left or right of it."
1

Here is an approach that only uses string methods. I've run speed tests to compare it to the other two methods proposed by @Booboo and @JeffUK.

import time
import re
from functools import partial


def use_string_method(string):
    working_copy = f' {string} '.replace(' : ', '   ')
    total = 0

    for colon_version in [': ', ' :']:
        total += working_copy.count(colon_version)
        working_copy = working_copy.replace(colon_version, '  ')
    total += 2*working_copy.count(':')

    return total


def use_regex(rex, string):
    matches = rex.findall(string)
    total = len(matches)

    return total


def use_for_loop(string):
    total = 0
    for i in range(0, len(string)):
        if string[i] == ":":
            if i > 0 and string[i - 1] != ' ':
                total += 1

            if i + 1 < len(string) and string[i + 1] != ' ':
                total += 1

    return total


def account_for_double_colons_using_string_method(string):
    working_copy = string
    length = len(working_copy) - 1
    while length < len(working_copy):
        length = len(working_copy)
        working_copy = working_copy.replace('::', ': :')

    return working_copy


def account_for_double_colons_using_regex(string):
    working_copy = re.sub('::', ': :', string)

    return working_copy


text = 'foo:: bar:baz : moo::mar::maz:'*10000

for name, method in {
    'string': account_for_double_colons_using_string_method,
    'regex': account_for_double_colons_using_regex,
}.items():
    start = time.time()
    method(text)
    stop = time.time()
    print(f'{name}: {stop-start} seconds')

regex = re.compile(r"(\S(?=:)|(?<=:)\S)")
for name, method in {
    'string': use_string_method,
    'regex': partial(use_regex, regex),
    'loop': use_for_loop,
}.items():
    start = time.time()
    for _ in range(100):
        method(text)
    stop = time.time()
    print(f'{name}: {stop-start} seconds')

To my surprise I found that the version using string methods seems to be more than 10 times faster than the other two:

# for 1000 runs @ 1000 times the original string:
string: 0.13166213035583496 seconds
regex: 2.456428050994873 seconds
loop: 3.813805341720581 seconds

Note the extra spaces in the replacement strings (' '), to make them as long as the ones they replace. By accident I found this to be important; otherwise unnecessary time is wasted on shortening the text (I guess it was something to do with memory allocation).

UPDATE

Added two approaches to account for double colons, one using string methods, the other using regex. Even though string methods require a while loop, they appear to be superior again:

# for one run @ 10000 times the original string
string: 0.0009975433349609375 seconds
regex: 0.003989458084106445 seconds

Surprisingly, even in an extreme case, the string methods are still slightly faster than regex:

# for one run @ ':'*1000000
string: 0.017957448959350586 seconds
regex: 0.024881601333618164 seconds

11 Comments

There is an overhead in compiling the regex expression. This needs to be amortized against a real-world run, which I assume is not just "foo: bar:baz : moo:mar:maz". Otherwise, who cares how fast it runs? So my assumption is that the actual string is large. How large? I don't know. But what if it were megabytes? Did you first compile the regex and then time it from that point? That would be a fairer measurement.
Yeah, I have no idea. Again, I never thought this would be faster. But it seems to be quite significant and independent of string length and number of method call repetitions.
Can you elaborate on the compiling part? I'm not an expert in regex, so I probably didn't do that but also don't know how. Maybe run the rest yourself and see what is says?
You are timing the wrong thing. You need to separately compile the regex: rex = re.compile(r"(\S(?=:)|(?<=:)\S)"). Then pass rex to function use_regex as an argument and then that function executes rex.findall(string). Or rex can be a global variable. I can update your answer with the change, if you want.
I edited the code accordingly (I think). If there is something missing / wrong, feel free to make changes.
|
1

Of course it can be done without regex, the simple approach would be like this: (Simplified but you could easily replace !=' ' with a is_non_whitespace() function)

score = 0
for i in range(0,len(teststring)):
    if teststring[i]==":":
        if i>0 and teststring[i-1]!=' ':
            score+=1
            print(i,teststring[i-1],teststring[i])
            
        print(i,len(teststring))
        if i+1<len(teststring) and teststring[i+1]!=' ':
            score+=1
            print(i,teststring[i],teststring[i+1])
            
print(score)

Based on some rough calculations this is exponentially slower than the regex version. On your example it takes about twice as long to run. On a 2500000 character long string regex is 5 times faster.

2 Comments

Thanks, that's useful to know that regex would actually be quicker! You've answered the original question and the XY problem neatly in one ...
Hi, I've also included your approach in my speed test. On my machine it doesn't seem too bad and is only slightly slower than the regex approach. But feel free to check yourself.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.