I have strings that can contain a varying number of "groups". I need to split them, but I am having trouble doing so. Each group always starts with [A-Z]{2,5} followed by a : and a string of varying length that may contain spaces. There is always a space in front of each group.

Example strings:

"YellowSky AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23"
"Billy Bob Thorton AA:213231 AB:aaaa AC:ddddd 322 AD:hj2ffs   dsfdsfd1jkhjk23"

My code thus far:

import re
D = "Test1 AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23"

# Raw string so that \s reaches the regex engine unchanged
g = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)").split(D)

As you can see, this works when the string starts with a single word, but it fails when the leading part contains several words separated by spaces.

  • What is the expected output? Try (?!^)\s+(?=[A-Z]+:), see regex101.com/r/QTmjkX/1 Commented Jun 7, 2021 at 22:13
  • Don't use split. Write a regexp that matches the groups, and use re.findall() Commented Jun 7, 2021 at 22:15

2 Answers

2

You can use

re.split(r'(?!^)\s+(?=[A-Z]+:)', text)

See this regex demo.

Details:

  • (?!^) - a negative lookahead that matches a location not at the start of string (equal to (?<!^) but one char shorter)
  • \s+ - one or more whitespaces
  • (?=[A-Z]+:) - a positive lookahead that requires one or more uppercase ASCII letters followed with a : char immediately to the right of the current location.
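
As a quick check, here is a minimal sketch that applies this pattern to the two example strings from the question. The names GROUP_BOUNDARY and samples are only illustrative and are not part of the original post.

```python
import re

# Split on whitespace that is not at the start of the string and is
# immediately followed by uppercase letters and a colon (a group prefix).
GROUP_BOUNDARY = re.compile(r'(?!^)\s+(?=[A-Z]+:)')

samples = [
    "YellowSky AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23",
    "Billy Bob Thorton AA:213231 AB:aaaa AC:ddddd 322 AD:hj2ffs   dsfdsfd1jkhjk23",
]

for text in samples:
    print(GROUP_BOUNDARY.split(text))
```

Spaces inside the leading text or inside a group value are left alone, because they are not followed by an uppercase prefix and a colon.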

1
([A-Z]{2,5}:\w+(?: +\w+)*)(?=(?: +[A-Z]+:|$))

You can also use re.findall directly.

See demo.

https://regex101.com/r/6jf8EM/1

This way you don't need to filter unwanted groups later. You get what you need.
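
A minimal sketch of how this could look in Python follows. Note that the pattern only matches the groups, so the text before the first group is not returned by re.findall. The helper split_with_prefix is a hypothetical addition, not part of this answer, and shows one possible way to recover that leading text from the position of the first match.

```python
import re

# Pattern from this answer: a 2-5 letter uppercase prefix, a colon, and
# one or more words, ending before the next group or the end of the string.
GROUP = re.compile(r'([A-Z]{2,5}:\w+(?: +\w+)*)(?=(?: +[A-Z]+:|$))')

def split_with_prefix(text):
    """Hypothetical helper: return the leading text followed by the groups."""
    matches = list(GROUP.finditer(text))
    if not matches:
        return [text]
    # Everything before the first group is treated as the leading section.
    prefix = text[:matches[0].start()].strip()
    groups = [m.group(1) for m in matches]
    return ([prefix] if prefix else []) + groups

text = "Billy Bob Thorton AA:213231 AB:aaaa AC:ddddd 322 AD:hj2ffs   dsfdsfd1jkhjk23"
print(GROUP.findall(text))      # groups only
print(split_with_prefix(text))  # leading text plus groups
```

Because the value part uses \w+, this sketch assumes group values consist of word characters and spaces only; values containing punctuation would need a wider character class.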

1 Comment

Thanks, it does help, though I need the first section too.
