Regex problem with specific string in python

Question

I have a problem with regex in Python. I have the string:

'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'

And I want to get:

'Aaa Bbb', 'Aaa Bbb Ccc'
and 'One Two st.Three' or 'One Two st. Three'

Generally, I need to insert a space before every capital letter (unless the character before it is already a space) and, if the string contains a . (dot), insert a space two positions before the dot.

I'm a complete beginner with the re library. I tried to do this based on a few Stack Overflow topics about regex, but I couldn't figure it out. Does anyone have an idea how to do this?

Jan · Accepted Answer · 2019-03-08 15:53:28Z

1

You could use

(?<=\S)(?=[A-Z])|(.{2}\.)

The matches need to be replaced using a function; see a demo on regex101.com.

In Python this could be

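import re

data = """
Aaa Bbb
AaaBbbCcc
OneTwost.Three
"""

# Same pattern as above, except that the lookbehind also excludes ".".
# On Python 3.7+ re.sub() also replaces an empty match that directly follows
# a non-empty one, so a plain \S would add a second space after "st. ".
rx = re.compile(r'(?<=[^\s.])(?=[A-Z])|(.{2}\.)')

def replacer(match):
    # Second alternative matched: pad "xx." with a space on each side.
    if match.group(1):
        return " {} ".format(match.group(1))
    # First alternative matched: zero-width position before a capital letter.
    return ' '

data = rx.sub(replacer, data)
print(data)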

Which yields

Aaa Bbb
Aaa Bbb Ccc
One Two st. Three
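
A possible variant, offered as a sketch rather than as part of the answer above: if both cases are written as lookarounds, every match is zero-width and the replacement can be a plain space, so no replacer function is needed. The names BOUNDARY and add_spaces are made up for this example, and applying the (?<=\S) guard to the dot case as well is an assumption about the desired behaviour (no space is inserted at the very start of a line).

```python
import re

# Zero-width pattern: a position that follows a non-space character and is
# followed either by a capital letter or by two characters and a literal dot.
BOUNDARY = re.compile(r'(?<=\S)(?=[A-Z]|.{2}\.)')

def add_spaces(text):
    """Insert one space at every position matched by BOUNDARY (hypothetical helper)."""
    return BOUNDARY.sub(' ', text)

for sample in ('Aaa Bbb', 'AaaBbbCcc', 'OneTwost.Three'):
    print(add_spaces(sample))
```

Because nothing is consumed, the same call handles the capital-letter case and the dot case, and the capital right after the dot gets its space from the first alternative.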

edited Mar 8, 2019 at 15:53

answered Mar 8, 2019 at 15:48

Jan


Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Based on what you said you want and the fact that you said, "I have the string":

I have the string

'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'

These should do it.

Input:

>>> import re
>>> string = """'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'"""

Output:

>>> re.sub(r'((?<![\',\s])[A-Z]+|[\S]{2}\.)', r' \1', string)
"'Aaa Bbb', 'Aaa Bbb Ccc' ,'One Two st. Three'"


Edit

Input (acting on string and on a new variable, string_1, which drops the single quotes)

>>> import re
>>> string = """'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'"""
>>> string_1 = """Aaa Bbb, AaaBbbCcc ,OneTwost.Three"""

Output

>>> re.sub(r'(?<!^)(?<!,)(?<!\s)(?<!\')([A-Z]+|[\S]{2}\.)', r' \1', string)
"'Aaa Bbb', 'Aaa Bbb Ccc' ,'One Two st. Three'"


>>> re.sub(r'(?:(?<!^)(?<!,)(?<!\s)(?<!\'))([A-Z]+|[\S]{2}\.)', r' \1', string)
"'Aaa Bbb', 'Aaa Bbb Ccc' ,'One Two st. Three'"


>>> re.sub(r'(?<!^)(?<!,)(?<!\s)(?<!\')([A-Z]+|[\S]{2}\.)', r' \1', string_1)
'Aaa Bbb, Aaa Bbb Ccc ,One Two st. Three'


>>> re.sub(r'(?:(?<!^)(?<!,)(?<!\s)(?<!\'))([A-Z]+|[\S]{2}\.)', r' \1', string_1)
'Aaa Bbb, Aaa Bbb Ccc ,One Two st. Three'


Explanation of the first:

I treated the input as a single string, as your question suggested.
re.sub returns a new string in which every match has been replaced. Both the pattern and the replacement are raw strings (the r prefix), so backslash sequences such as \s and \1 reach the re module unchanged.
The first "(" opens the capture group, which wraps the whole pattern.
"(?<![\',\s])" is a negative lookbehind: what follows must not be directly preceded by a "'", a "," or whitespace. It belongs only to the first alternative.
"[A-Z]+" matches a run of one or more capital letters (BUT NOTE: this also matches ABC, SDZ, FFRD, ZXF, etc. as a single unit, but it will not match lowercase letters or other symbols).
"|" tells the re engine "or": otherwise try the next alternative.
"[\S]{2}\." matches any 2 non-whitespace characters followed by a literal ".".
The final ")" closes the capture group.
The second argument, r' \1', replaces each match with a single space followed by the text of group 1 (the only capture group here).
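
To make the pieces easier to follow, here is the same first pattern written out with re.VERBOSE, so that each part carries its own comment. This is only a sketch for illustration: the name PATTERN is made up, and [\S] is written as the equivalent \S.

```python
import re

# The pattern from the first example, spelled out with re.VERBOSE.
# In verbose mode, whitespace and "#" comments inside the pattern are ignored.
PATTERN = re.compile(r"""
    (                 # open capture group 1
        (?<![',\s])   # not directly preceded by ', a comma, or whitespace
        [A-Z]+        # one or more capital letters
      |               # or
        \S{2}\.       # two non-whitespace characters and a literal dot
    )                 # close capture group 1
""", re.VERBOSE)

string = """'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'"""
result = PATTERN.sub(r' \1', string)
print(result)
```

Note that the replacement string is not affected by re.VERBOSE, so the leading space in r' \1' is kept.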

Edit: brief explanation of the two edited patterns, which can also act on string_1

I swear, re.sub's behavior with lookarounds is just wonky. Following your comment below: with each (?<!YOUR_IGNORED_CHARACTER), I'm telling re.sub not to match if the text is directly preceded by that character. (?<!^), however, means do not match at the very beginning of the string (without re.MULTILINE, ^ matches only there).
Note also that string_1 in this example is the string you gave with the ' characters removed.
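
A small sketch of what (?<!^) contributes, assuming the three single-character lookbehinds are folded into one character class (that folding is my shorthand, not part of the patterns above; the variable names are made up):

```python
import re

string_1 = "Aaa Bbb, AaaBbbCcc ,OneTwost.Three"

# The three single-character lookbehinds folded into one character class.
guarded = r"(?<!^)(?<![,\s'])([A-Z]+|\S{2}\.)"
unguarded = r"(?<![,\s'])([A-Z]+|\S{2}\.)"   # same, minus (?<!^)

with_guard = re.sub(guarded, r" \1", string_1)
without_guard = re.sub(unguarded, r" \1", string_1)

print(repr(with_guard))     # no space before the first word
print(repr(without_guard))  # the leading "A" gets a space in front of it
```

Without the (?<!^) guard nothing precedes the first "A", so the negative character lookbehind succeeds there and the result starts with a space.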

edited Jun 20, 2020 at 9:12

CommunityBot


answered Mar 8, 2019 at 16:47

FailSafe


4 Comments

Bob Over a year ago

Thanks for the help. It's magic to me. Do you have any idea how to handle specific letters, such as the Polish capitals 'ŚĆŻŹĄĘŁÓŃ'? And is it possible not to insert a space before the first word in the string? For example: string = 'BbbŚaast.Ttt', result = 'Bbb Śaa st.Ttt' or 'Bbb Śaa st. Ttt'.

FailSafe Over a year ago

With regard to Polish words, I'm not sure how to do it, honestly. Regarding your second question, I'm not sure what you mean, but I will post another comment to see if that can help.

FailSafe Over a year ago

I tried it out, but the Polish letters make the regex break. Sorry about that, man.

Bob Over a year ago

What I mean is that your regex currently inserts a space before every capital letter (including the first word) and 2 characters before the dot. I need a solution that skips the first word and then inserts a space before every capital letter (like your code), including the capital Polish letters ("ŚĆŻŹĄĘŁÓŃ"), and 2 characters before the dot (".").
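
One way this might be approached in Python 3, sketched under some assumptions: the capital letters are listed explicitly in the character class, the Polish letters are stored as precomposed characters, and "skip the first word" means "never match at the start of the string". The names CAPITALS, PATTERN and add_spaces are made up for this example.

```python
import re

# Assumption: the capitals to recognise are A-Z plus the Polish letters
# listed in the comments, stored as precomposed characters.
CAPITALS = "A-ZŚĆŻŹĄĘŁÓŃ"

# (?<!^) keeps the first word untouched; the second lookbehind skips
# positions that already follow a comma, whitespace, or a single quote.
PATTERN = re.compile(r"(?<!^)(?<![,\s'])([" + CAPITALS + r"]+|\S{2}\.)")

def add_spaces(text):
    """Hypothetical helper: put a space before capitals and before 'xx.' groups."""
    return PATTERN.sub(r" \1", text)

print(add_spaces("BbbŚaast.Ttt"))
```

In Python 3, str patterns work on Unicode code points, so non-ASCII letters can sit in a character class like any other character; if the text could contain decomposed forms, normalising it first (for example with unicodedata.normalize) would probably be needed.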
