Python regular expression: replace two situations with one command

Question

I want to convert a string like

'''1  2  3  4  5  6 abcde fghij klmno pqrst 7 8 9 10 uvwxyz abcdef 11 12 13'''

to

'''1  2  3  4  5  6
abcde fghij klmno pqrst
7 8 9 10
uvwxyz abcdef
11 12 13'''

This is my current method:

s = re.sub(r'(\d) ([a-z])', r'\1\n\2', s)
s = re.sub(r'([a-z]) (\d)', r'\1\n\2', s)

How can I do this with a single regular expression? I know I can do it with re.findall and groups, but I would like to find an easier way.

Jerry · Accepted Answer · 2015-03-27 12:55:56Z

2

I really think the easiest way would be to match using findall instead of splitting or sub-ing:

result = re.findall(r"\d+(?:\s+\d+)*|[a-z]+(?:\s+[a-z]+)*", text)
print('\n'.join(result))

or in one line:

result = '\n'.join(re.findall(r"\d+(?:\s+\d+)*|[a-z]+(?:\s+[a-z]+)*", text))

Gives:

1  2  3  4  5  6
abcde fghij klmno pqrst
7 8 9 10
uvwxyz abcdef
11 12 13

\d+(?:\s+\d+)* matches the parts with digits and spaces.

[a-z]+(?:\s+[a-z]+)* matches the parts with letters and spaces.
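To see the two alternatives working together, here is a minimal, self-contained sketch using the sample string from the question. The names `RUNS` and `split_runs` are made up for illustration and are not part of the original answer.

```python
import re

# Sample input taken from the question.
text = "1  2  3  4  5  6 abcde fghij klmno pqrst 7 8 9 10 uvwxyz abcdef 11 12 13"

# One alternative per kind of run: numbers separated by whitespace,
# or lowercase words separated by whitespace.
RUNS = re.compile(r"\d+(?:\s+\d+)*|[a-z]+(?:\s+[a-z]+)*")

def split_runs(s):
    """Return the digit runs and letter runs of s, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return RUNS.findall(s)

runs = split_runs(text)
print("\n".join(runs))
```

One thing to keep in mind with this approach: findall only returns what the pattern matches, so any character that is neither a digit, a lowercase letter, nor whitespace between two tokens of the same kind is silently dropped from the result.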

edited Mar 27, 2015 at 12:55

answered Mar 27, 2015 at 12:24

Jerry


3 Comments

WeizhongTu Over a year ago

This just gives me another way to solve the problem, awesome and concise.

XWen Over a year ago

You might want to use [a-z]+(?:\s+[a-z]+)* to match the letters and spaces

Jerry Over a year ago

@XWen Right, that was an omission when moving the code here, good catch :)

FMc · Accepted Answer · 2015-03-27 13:12:38Z

1

Here are two ways to do it with a single regex:

Use a conditional pattern. Capture \1 is straightforward. Capture \4 checks whether we grabbed \2 or \3, and then defines the rest of the pattern accordingly.
```
re.sub(r'((\d)|([a-z])) ((?(2)[a-z]|\d))', r'\1\n\4', s)
```
Replace only the space, and surround it with look-behind and look-ahead assertions.
```
re.sub(r'(?<=\d) (?=[a-z])|(?<=[a-z]) (?=\d)', '\n', s)
```

But your two simple regexes are better than all of this nonsense.
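For anyone who wants to try the two single-regex variants above side by side, here is a small sketch. The function names and the test strings are illustrative assumptions, not part of the original answer. It also shows one difference that seems worth knowing: the conditional pattern consumes the characters on both sides of the space, so it can miss a boundary when a token is a single character, while the lookaround pattern consumes only the space.

```python
import re

# Patterns copied from the two variants above.
CONDITIONAL = r'((\d)|([a-z])) ((?(2)[a-z]|\d))'
LOOKAROUND = r'(?<=\d) (?=[a-z])|(?<=[a-z]) (?=\d)'

def with_conditional(s):
    # Consumes one character on each side of the space and puts them back.
    return re.sub(CONDITIONAL, r'\1\n\4', s)

def with_lookaround(s):
    # Consumes only the space; the neighbours are merely inspected.
    return re.sub(LOOKAROUND, '\n', s)

sample = "12 ab 34"
print(repr(with_conditional(sample)))  # '12\nab\n34'
print(repr(with_lookaround(sample)))   # '12\nab\n34'

# Single-character tokens: the '1' is consumed by the first match,
# so the conditional version cannot reuse it for the second boundary.
edge = "a 1 b"
print(repr(with_conditional(edge)))    # 'a\n1 b'
print(repr(with_lookaround(edge)))     # 'a\n1\nb'
```

On the sample string from the question both variants give the same result, since every run there is long enough that no character has to take part in two matches.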

edited Mar 27, 2015 at 13:12

answered Mar 27, 2015 at 12:55

FMc


2 Comments

Jerry Over a year ago

@GuidoBouman Maybe you should be aware that using conditionals and/or lookarounds takes slightly more time and resources than not using them! It's certainly negligible on a small scale though.

Guido Bouman Over a year ago

@Jerry Thanks for noting. I find the lookahead and lookbehind approach better, as you only replace the part you actually need to replace, which makes your code less error-prone.

Guido Bouman · Accepted Answer · 2015-03-27 12:17:02Z

1

You can use the regular expression alternation ("or") operator:

s = re.sub(r'((\d) ([a-z])|([a-z]) (\d))', r'\2\4\n\3\5', s)

It'll match either groups 2 & 3 or groups 4 & 5. =]
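A note on portability: a replacement template that refers to a group which did not take part in the match raises "unmatched group" before Python 3.5; from 3.5 on, such groups are replaced with an empty string. If the code has to run on older versions too, one possible workaround is to pass a function as the replacement. The sketch below is an illustration only; `PATTERN`, `break_at_boundary` and `split_lines` are hypothetical names, and the outer capturing group has been dropped, so the groups are numbered 1 to 4 here.

```python
import re

# Same alternation idea, without the outer group:
# groups 1 & 2 are digit-then-letter, groups 3 & 4 are letter-then-digit.
PATTERN = re.compile(r'(\d) ([a-z])|([a-z]) (\d)')

def break_at_boundary(match):
    # Only one side of the alternation took part in the match, so the
    # other pair of groups is None; pick whichever pair is populated.
    left = match.group(1) or match.group(3)
    right = match.group(2) or match.group(4)
    return left + '\n' + right

def split_lines(s):
    # A callable replacement avoids referencing unmatched groups in a template.
    return PATTERN.sub(break_at_boundary, s)

s = '1  2  3  4  5  6 abcde fghij klmno pqrst 7 8 9 10 uvwxyz abcdef 11 12 13'
print(split_lines(s))
```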

answered Mar 27, 2015 at 12:17

Guido Bouman


4 Comments

FMc Over a year ago

I get this when I run your code: error: unmatched group.

Guido Bouman Over a year ago

Ouch, which version of Python are you using? This has been fixed recently: hg.python.org/cpython/rev/bd2f1ea04025

WeizhongTu Over a year ago

@GuidoBouman But it does not work on Python 2.7. Thank you all the same.

Guido Bouman Over a year ago

No problem, the other solutions are clearly better for your case. =]

Avinash Raj · Accepted Answer · 2015-03-27 12:31:24Z

1

You could use re.split

>>> s = '''1  2  3  4  5  6 abcde fghij klmno pqrst 7 8 9 10 uvwxyz abcdef 11 12 13'''
>>> for i in re.split(r'(?<=\d)\s+(?=[A-Za-z])|(?<=[A-Za-z])\s+(?=\d)', s):
...     print(i)
...
1  2  3  4  5  6
abcde fghij klmno pqrst
7 8 9 10
uvwxyz abcdef
11 12 13
>>> print('\n'.join(re.split(r'(?<=\d)\s+(?=[A-Za-z])|(?<=[A-Za-z])\s+(?=\d)', s)))

OR

re.sub

>>> print(re.sub(r'(?<=\d)\s+(?=[A-Za-z])|(?<=[A-Za-z])\s+(?=\d)', r'\n', s))
1  2  3  4  5  6
abcde fghij klmno pqrst
7 8 9 10
uvwxyz abcdef
11 12 13

The above re.sub command replaces one or more whitespace characters that sit between a digit and a letter, or between a letter and a digit, with a newline character.
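Because this pattern uses \s+ and [A-Za-z], it also covers mixed whitespace and uppercase letters at a boundary. A short sketch, with an input string invented purely for illustration (`BOUNDARY` and `split_at_boundaries` are hypothetical names):

```python
import re

# Same pattern as above, compiled once for reuse.
BOUNDARY = re.compile(r'(?<=\d)\s+(?=[A-Za-z])|(?<=[A-Za-z])\s+(?=\d)')

def split_at_boundaries(s):
    """Split s wherever whitespace separates a digit from a letter or vice versa."""
    return BOUNDARY.split(s)

# A space-tab-space run and a newline both count as a boundary here;
# the double space between two words is left untouched.
parts = split_at_boundaries("10 \t Abc  def\n42")
print(parts)  # ['10', 'Abc  def', '42']
```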

edited Mar 27, 2015 at 12:31

answered Mar 27, 2015 at 12:13

Avinash Raj


Casimir et Hippolyte · Accepted Answer · 2015-03-27 12:45:24Z

0

You can use a replacement:

re.sub(r'(\d[\d\s]*|[a-z][a-z\s]*)', r'\1\n', s)

To handle trailing whitespace more rigorously, you can do this:

re.sub(r'(\d(?:[\d\s]*\d)?|[a-z](?:[a-z\s]*[a-z])?)\s*', r'\1\n', s).rstrip()
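The difference between the two versions is easiest to see on a short string. The following sketch uses an input made up for illustration and wraps the two calls in hypothetical helper functions:

```python
import re

def simple(s):
    # Each run swallows its trailing whitespace, then a newline is appended.
    return re.sub(r'(\d[\d\s]*|[a-z][a-z\s]*)', r'\1\n', s)

def rigorous(s):
    # Each run must end on a digit or letter; the whitespace after it is
    # matched outside the group and dropped, and the final newline is stripped.
    return re.sub(r'(\d(?:[\d\s]*\d)?|[a-z](?:[a-z\s]*[a-z])?)\s*', r'\1\n', s).rstrip()

sample = "12 ab 34"
print(repr(simple(sample)))    # '12 \nab \n34\n'
print(repr(rigorous(sample)))  # '12\nab\n34'
```

So the first version leaves a trailing space on each line and a newline at the very end, while the second produces clean lines.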

edited Mar 27, 2015 at 12:45

answered Mar 27, 2015 at 12:36

Casimir et Hippolyte
