Python split string by pattern

Question

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of each character can vary, and sometimes there is a dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".

Is there a smart way to either split it into "aaaaa","bbbbbbbbbbbbbb","ccccccccccc" and get the indices of where it is split, or just get the indices, without looping through every string? If the dash sits between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.

Any idea?

Martijn Pieters · Accepted Answer · 2013-04-18 15:39:36Z

11

Regular expression MatchObject results include indices of the match. What remains is to match repeating characters:

import re

repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

This matches only when a letter (a-z) is followed by at least one repeat of itself, plus an optional trailing dash:

>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
... 
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30

The .start() and .end() methods on the match result give you the exact positions in the input string.

Dashes are included in the matches, but not non-repeating characters:

>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
... 
bb- 2 5
cccccccc 5 13

If you want the a- part to be a match as well, replace the + quantifier with *:

repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
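To collect the pieces and their positions in one pass, the * variant of the pattern can be wrapped in a small helper. This is only a sketch: the name split_runs and the (substring, start, end) tuple layout are illustrative assumptions, not part of the re module.

```python
import re

# Same pattern as above: a lowercase letter, any further repeats of it,
# then an optional trailing dash that stays with the preceding run.
REPEAT = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

def split_runs(text):
    """Return (substring, start, end) for every run found in text (end is exclusive)."""
    return [(m.group(), m.start(), m.end()) for m in REPEAT.finditer(text)]

pieces = split_runs("aaaaa-bbbbbbbbbbbbbbccccccccccc")
print(pieces)
# [('aaaaa-', 0, 6), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 20, 31)]

# Just the split positions:
print([start for _, start, _ in pieces])
# [0, 6, 20]
```

Characters outside a-z (other than a dash directly after a run) are skipped by finditer, so they will not appear in the result.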

edited Apr 18, 2013 at 15:39

answered Apr 18, 2013 at 15:25

Martijn Pieters


3 Comments

Trollbrot Over a year ago

How could I keep the dashes? So for example "aaaaa-","bbbbbbbbbbbbbb","ccccccccccc".

Martijn Pieters Over a year ago

@Fritz: Sorry, I thought you didn't want them. On re-reading, I see that you do. I included them with the preceding letters.

Trollbrot Over a year ago

Great! Thanks a lot. I guess I should really look deeper into regular expressions.

mgilson · Answer · 2013-04-18 15:30:42Z

3

What about using itertools.groupby?

>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This puts each - into its own substring, which can easily be filtered out.

>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
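The filter above drops the dashes, whereas the question asks for each dash to stay with a neighbouring run. One possible way to do that with groupby, sketched here under the assumption that a dash should always go to the run on its left, is to append dash groups to the previous piece. The helper name runs_with_dashes is made up for illustration.

```python
from itertools import groupby

def runs_with_dashes(s):
    """Split s into runs of equal characters, gluing each dash run onto the piece before it."""
    pieces = []
    for key, group in groupby(s):
        chunk = ''.join(group)
        if key == '-' and pieces:
            # Attach the dash(es) to the preceding run.
            pieces[-1] += chunk
        else:
            # A new run (or a leading dash with nothing before it).
            pieces.append(chunk)
    return pieces

print(runs_with_dashes('aaaaa-bbbbbbbbbbbbbb-ccccccccccc'))
# ['aaaaa-', 'bbbbbbbbbbbbbb-', 'ccccccccccc']
```

A dash at the very start of the string has no run to its left, so in this sketch it stays a piece of its own.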

edited Apr 18, 2013 at 15:30

answered Apr 18, 2013 at 15:25

mgilson


2 Comments

DSM Over a year ago

Can you think of a nice way to get the indices too? The best I can think of offhand is

grouped = [(k, list(g)) for k,g in groupby(enumerate(s), key=lambda x: x[1])]; [(k, g[0][0], g[-1][0]) for k,g in grouped]

In Python 3 I guess you could use accumulate on the lengths too.
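The accumulate idea might look roughly like the sketch below (Python 3). The helper name run_spans is hypothetical, and unlike the one-liner above it reports an exclusive end index rather than the index of the last character.

```python
from itertools import accumulate, groupby

def run_spans(s):
    """Return (char, start, end) for each run of equal characters; end is exclusive."""
    # Length of every run, in order.
    runs = [(k, len(list(g))) for k, g in groupby(s)]
    # Running totals of the lengths are the end offsets of the runs.
    ends = list(accumulate(n for _, n in runs))
    # Each run starts where the previous one ended.
    starts = [0] + ends[:-1]
    return [(k, a, b) for (k, _), a, b in zip(runs, starts, ends)]

print(run_spans('aaaaabbbbbbbbbbbbbbccccccccccc'))
# [('a', 0, 5), ('b', 5, 19), ('c', 19, 30)]
```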

mgilson Over a year ago

@DSM -- Right. I missed the part about indices ... Not sure about a good way to cleanly get that ...

perreal · Answer · 2013-04-18 15:35:21Z

0

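s = "aaaaabbbbbbbbbbbbbbccccccccccc"
p = [0]  # start index of each run; the first run starts at 0
# Compare each character with its successor; a difference marks a new run.
for i, (a, b) in enumerate(zip(s, s[1:])):
    if a != b:
        p.append(i + 1)
print(p)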

# [0, 5, 19]
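If the substrings are wanted as well, the start indices can be turned back into slices. A minimal sketch, assuming the list begins with 0 and is sorted; split_at is an illustrative name, not something from the answer:

```python
def split_at(s, starts):
    """Slice s into pieces beginning at each index in starts."""
    # Add the end of the string as the final boundary, then pair up neighbours.
    bounds = list(starts) + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]

s = "aaaaabbbbbbbbbbbbbbccccccccccc"
print(split_at(s, [0, 5, 19]))
# ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
```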

answered Apr 18, 2013 at 15:35

perreal

