I have a problem with regex in Python. I have the strings:

'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'

And I want to get:

`'Aaa Bbb', 'Aaa Bbb Ccc'` 
and 'One Two st.Three' or 'One Two st. Three'

Generally, I need to insert a space before every capital letter (unless it is already preceded by a space), and if the string contains a . (dot), insert a space two positions before it.

I'm a complete beginner with the re library. I tried to do this based on a few Stack Overflow topics about regex, but I couldn't figure it out. Does anyone have an idea how to do this?

2 Answers

You could use

(?<=\S)(?=[A-Z])|(.{2}\.)

The matches have to be replaced via a function, because the two alternatives need different replacements; see a demo on regex101.com.


In Python this could be

import re

data = """
Aaa Bbb
AaaBbbCcc
OneTwost.Three
"""

rx = re.compile(r'(?<=\S)(?=[A-Z])|(.{2}\.)')

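def replacer(match):
    # Group 1 is set only when the "two characters plus dot" branch matched.
    if match.group(1):
        # Leading space only: on Python 3.7+ the empty match right after the
        # dot (before the next capital) is replaced too and adds that space.
        return " {}".format(match.group(1))
    # Otherwise the match is the empty position before a capital letter.
    return ' '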

data = rx.sub(replacer, data)
print(data)

Which yields

Aaa Bbb
Aaa Bbb Ccc
One Two st. Three
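
As a possible alternative, offered only as a sketch: if both branches are written as zero-width lookarounds, no replacement function is needed, and the result does not depend on how re.sub treats an empty match that directly follows a non-empty one (that behaviour changed in Python 3.7). The names boundary and add_spaces are made up for this example, and the dot branch uses \S{2} rather than .{2}, which is an assumption about the intended input.

```python
import re

# Both branches are zero-width: they mark a position and consume nothing.
#   (?=[A-Z])     -> position just before a capital letter
#   (?=\S{2}\.)   -> position two non-space characters before a dot
# The shared (?<=\S) skips positions that already follow whitespace
# and the very start of the string.
boundary = re.compile(r'(?<=\S)(?:(?=[A-Z])|(?=\S{2}\.))')

def add_spaces(text):
    # Insert a single space at every marked position.
    return boundary.sub(' ', text)

for sample in ('Aaa Bbb', 'AaaBbbCcc', 'OneTwost.Three'):
    print(add_spaces(sample))
```

Because nothing is consumed, every match is replaced by the same single space, so a plain replacement string is enough.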

Based on what you described, and the fact that you said "I have the string":

'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'

These should do it.

Input:

>>> import re
>>> string = """'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'"""

Output:

>>> re.sub(r'((?<![\',\s])[A-Z]+|[\S]{2}\.)', r' \1', string)
"'Aaa Bbb', 'Aaa Bbb Ccc' ,'One Two st. Three'"


Edit

Input (acting on string and on a new variable, string_1, which drops the single quotes)

>>> import re
>>> string = """'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'"""
>>> string_1 = """Aaa Bbb, AaaBbbCcc ,OneTwost.Three"""

Output

>>> re.sub(r'(?<!^)(?<!,)(?<!\s)(?<!\')([A-Z]+|[\S]{2}\.)', r' \1', string)
"'Aaa Bbb', 'Aaa Bbb Ccc' ,'One Two st. Three'"


>>> re.sub(r'(?:(?<!^)(?<!,)(?<!\s)(?<!\'))([A-Z]+|[\S]{2}\.)', r' \1', string)
"'Aaa Bbb', 'Aaa Bbb Ccc' ,'One Two st. Three'"


>>> re.sub(r'(?<!^)(?<!,)(?<!\s)(?<!\')([A-Z]+|[\S]{2}\.)', r' \1', string_1)
'Aaa Bbb, Aaa Bbb Ccc ,One Two st. Three'


>>> re.sub(r'(?:(?<!^)(?<!,)(?<!\s)(?<!\'))([A-Z]+|[\S]{2}\.)', r' \1', string_1)
'Aaa Bbb, Aaa Bbb Ccc ,One Two st. Three'


Explanation of the First:

  • Made it a single string, as your wording suggested
  • I'm using re.sub with raw strings (the r prefix) for both the pattern and the replacement, so that backslashes such as \s and \1 reach the re module unchanged; re.sub returns the edited string
  • With the first "(" I'm opening a capture group around everything up to the matching ")"
  • With "(?<![\',\s])" I'm saying make sure that what follows is not preceded by a " ' ", a "," or whitespace. Because of how "|" binds, this check applies only to the "[A-Z]+" alternative
  • With "[A-Z]+" positioned where it is, I am saying capture any run of capital letters (BUT NOTE: this will also match ABC, SDZ, FFRD, ZXF, etc. as a single run, but will not capture any lowercase letters or other symbols)
  • With "|" I'm telling the re engine "OR": try the next alternative
  • And with "[\S]{2}\." I'm saying capture any 2 non-whitespace characters followed by a "."
  • The final ")" closes the capture group
  • With the second argument, r' \1', I'm saying replace each match with a single space followed by whatever group 1 captured (in this case I only have 1 capture group anyway)
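
The breakdown above can also be written out as an annotated pattern. This is only a sketch: it relies on re.VERBOSE, which the answer itself does not use, to put each piece on its own line, and the variable names pattern and result are chosen for the example. The pattern is meant to be equivalent to the one-line version.

```python
import re

string = """'Aaa Bbb', 'AaaBbbCcc' ,'OneTwost.Three'"""

# With re.VERBOSE, whitespace and "#" comments inside the pattern are
# ignored (outside character classes), so each piece can be annotated.
pattern = re.compile(r"""
    (                 # open capture group 1
      (?<![',\s])     # not preceded by a quote, a comma or whitespace
      [A-Z]+          # one or more capital letters
    |                 # OR
      [\S]{2}\.       # any two non-whitespace characters, then a dot
    )                 # close capture group 1
""", re.VERBOSE)

# r' \1' puts one space in front of whatever group 1 captured.
result = pattern.sub(r' \1', string)
print(result)
```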

Edit: Brief explanation of the 2 later patterns, which can also act on string_1

  • I swear, re.sub's behavior with lookarounds can seem wonky. Given your comment below: with each (?<!YOUR_IGNORED_CHARACTER), I'm telling re.sub not to match if the capture group is preceded by the designated character. (?<!^), however, means do not match if the capture group would start at the beginning of the string (or of a line, if re.MULTILINE is set). Wrapping the lookbehinds in (?:...) changes nothing; both variants give the same output.

  • Note also that in string_1 I've removed the ' characters from the string you gave.
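
To illustrate what (?<!^) excludes, here is a small sketch on a made-up two-line string. The pattern is deliberately simplified to a single capital letter, and the names text, whole_string and per_line are hypothetical.

```python
import re

text = "AaBb\nCcDd"

# Without flags, ^ matches only at the very start of the string, so the
# capital that opens the second line still gets a space in front of it.
whole_string = re.sub(r'(?<!^)([A-Z])', r' \1', text)
print(repr(whole_string))

# With re.MULTILINE, ^ also matches right after each newline, so the
# first capital of every line is left alone.
per_line = re.sub(r'(?<!^)([A-Z])', r' \1', text, flags=re.MULTILINE)
print(repr(per_line))
```

In the patterns from the answer the extra (?<!\s) already rejects a capital that follows a newline, so the flag only matters once that check is dropped.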

4 Comments

Thanks for the help. It's magic to me. Do you have any idea how to handle specific letters, like the Polish capitals 'ŚĆŻŹĄĘŁÓŃ'? And is it possible not to insert a space before the first word in the string? For example: string = 'BbbŚaast.Ttt', result = 'Bbb Śaa st.Ttt' or 'Bbb Śaa st. Ttt'?
With regard to Polish words, I'm not sure how to do it, honestly. Regarding your second question, I'm not sure what you mean, but I will post another comment to see if that can help.
I tried it out, but the Polish letters make the regex break. Sorry about that, man.
I mean that your regex now inserts a space before every capital letter (including the first word) and 2 characters before the dot. I need a solution that skips the first word and then inserts a space before every capital letter (like your code), before every capital Polish letter ("ŚĆŻŹĄĘŁÓŃ"), and 2 characters before the dot sign (".").
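
One way the request in the last comment might be approached, as a minimal sketch rather than a tested solution: list the extra capitals explicitly in the character class and keep the (?<!^) check so the first word is skipped. It assumes Python 3 str input with precomposed characters; UPPER and split_words are names invented for the example.

```python
import re

# Assumption: the extra capitals are exactly the nine Polish letters from
# the comment; extend the class for other alphabets.
UPPER = 'A-ZĄĆĘŁŃÓŚŹŻ'

# Not at the start of the string, not after whitespace, a comma or a quote:
# either a run of capitals, or two non-space characters followed by a dot.
pattern = re.compile(r"(?<!^)(?<![\s,'])([" + UPPER + r"]+|\S{2}\.)")

def split_words(text):
    # Put one space in front of every captured piece.
    return pattern.sub(r' \1', text)

print(split_words('BbbŚaast.Ttt'))
```

Python's re module has no \p{Lu} class, which is why the letters are spelled out here; the third-party regex package supports Unicode property classes if a general solution is needed.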
