6

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of each character can vary, and sometimes there is a dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".

Is there a smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices where it is split, or just get the indices, without looping through every string? If a dash sits between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.

Any idea?

3 Answers

11

Regular expression MatchObject results include indices of the match. What remains is to match repeating characters:

import re

repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

This matches only if a given letter (a-z) is repeated at least once, i.e. occurs at least twice in a row:

>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
... 
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30

The .start() and .end() methods on the match result give you the exact positions in the input string.

A dash that directly follows a run is included in the match, but single (non-repeating) characters are not matched:

>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
... 
bb- 2 5
cccccccc 5 13

If you want the a- part to match as well, replace the + quantifier with *:

repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
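
As a rough sketch of how the * variant behaves (written with Python 3 print syntax; the helper name find_runs is made up for this illustration and is not part of the re module), the single a- now shows up as its own match:

```python
import re

# Same pattern as above, but with * so a single letter also counts as a run.
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

def find_runs(s):
    """Return (substring, start, end) for every run found in s."""
    return [(m.group(), m.start(), m.end()) for m in repeat.finditer(s)]

for run in find_runs("a-bb-cccccccc"):
    print(run)
# ('a-', 0, 2)
# ('bb-', 2, 5)
# ('cccccccc', 5, 13)
```

A leading dash, or any character outside a-z, is simply skipped, because the pattern has to start on a letter.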

3 Comments

How could I keep the dashes? So for example "aaaaa-","bbbbbbbbbbbbbb","ccccccccccc".
@Fritz: Sorry, I thought you didn't want them. On re-reading, I see that you do. I included them with the preceding letters.
Great! Thanks a lot. I guess I should really look deeper into regular expressions.
3

What about using itertools.groupby?

>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This puts each - into its own substring, which can easily be filtered out.

>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

2 Comments

Can you think of a nice way to get the indices too? The best I can think of offhand is grouped = [(k, list(g)) for k,g in groupby(enumerate(s), key=lambda x: x[1])]; [(k, g[0][0], g[-1][0]) for k,g in grouped]. In python 3 I guess you could use accumulate on the lengths too.
@DSM -- Right. I missed the part about indices ... Not sure about a good way to cleanly get that ...
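
One possible way to get the indices with groupby, offered only as a sketch: keep a running position and advance it by the length of each group. The function name runs_with_indices is hypothetical, and the end index is exclusive here (like match.end()), unlike the inclusive g[-1][0] in the comment above.

```python
from itertools import groupby

def runs_with_indices(s):
    """Return (substring, start, end) for each run of identical characters."""
    result = []
    pos = 0  # start index of the current run
    for _, group in groupby(s):
        run = ''.join(group)
        result.append((run, pos, pos + len(run)))
        pos += len(run)
    return result

print(runs_with_indices('aaaaa-bbbbbbbbbbbbbb-ccccccccccc'))
# [('aaaaa', 0, 5), ('-', 5, 6), ('bbbbbbbbbbbbbb', 6, 20), ('-', 20, 21), ('ccccccccccc', 21, 32)]
```

Dashes come out as their own runs, so they can be dropped afterwards without disturbing the indices of the other entries.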
0
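s = "aaaaabbbbbbbbbbbbbbccccccccccc"  # renamed from str to avoid shadowing the built-in
p = [0]  # start index of each run; the first run starts at 0
# Compare each character with its successor; a mismatch marks a boundary.
for i, (a, b) in enumerate(zip(s, s[1:])):
    if a != b:
        p.append(i + 1)
print(p)

# [0, 5, 19]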
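
If the substrings are needed as well, a list of start positions like the one above can be turned into pieces by slicing between consecutive boundaries. This is only a sketch; split_at is a made-up helper name, and it assumes the list is sorted and begins with 0.

```python
def split_at(s, starts):
    """Slice s into pieces beginning at each index in starts."""
    bounds = list(starts) + [len(s)]  # add the end of the string as the final boundary
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]

print(split_at("aaaaabbbbbbbbbbbbbbccccccccccc", [0, 5, 19]))
# ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
```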
