Sorry if the title is a little vague; I can't think of a better one right now.

I'm struggling to find the correct regular expression for a little test of mine:

Input and Output:

"Hello" --------------> ("Hello", "")
"How are you doing?" -> ("How", "are you doing?")
"" -------------------> ("", "")
"!h0w are you?" ------> ("!h0w", "are you?")
"#" ------------------> ("#", "")
":::::::" ------------> (":::::::", "")

The closest regular expression I have so far is r"(\.?)(.*?)((\s+?)(.*?)$|$)", but it returns a lot of unwanted groups:

regex = lambda text: re.search(r"(\.?)(.*?)((\s+?)(.*?)$|$)", text).groups()

# Input and Output
regex("Hello") --------------> ('', 'Hello', '', None, None)
regex("How are you doing?") -> ('', 'How', ' are you doing?', ' ', 'are you doing?')
regex("") -------------------> ('', '', '', None, None)
regex("!h0w are you?") ------> ('', '!h0w', ' are you?', ' ', 'are you?')
regex("#") ------------------> ('', '#', '', None, None)
regex(":::::::") ------------> ('', ':::::::', '', None, None)

What I would prefer is:

x, y = re.search(pattern, string).groups()

If that is not possible, can someone improve upon the existing regular expression? I've been trying to improve it for a bit but I can't seem to make it any better.

I can't use str.split for this; I'm trying to figure out how to do it with regular expressions.

  • Ply has a lexer function that may help you Commented Mar 16, 2014 at 2:13

2 Answers

It looks like you're just splitting into the parts before and after an optional space:

import re
regex = lambda text: re.match(r'(\S*)(?:\s*)(.*)', text).groups()
x, y = regex('this that')

Which gives these results:

regex("Hello")
('Hello', '')
regex("How are you doing?")
('How', 'are you doing?')
regex("")
('', '')
regex("!h0w are you?")
('!h0w', 'are you?')
regex("#")
('#', '')
regex(":::::::")
(':::::::', '')

Basically:

  • r'string here' is a raw string literal, in which backslashes are taken literally, so regex escapes such as \S don't need to be doubled.
  • (\S*) captures every non-whitespace character up to the first whitespace character. If the text starts with whitespace (or is empty), the group is "" rather than None.
  • (?:\s*) matches the first run of whitespace, but the ?: at the beginning makes it a non-capturing group, so it isn't part of the output from groups(). A plain \s* would work just as well here.
  • (.*) at the end captures everything after that whitespace, up to the end of the line (by default . does not match a newline). If nothing follows the whitespace, or there was no whitespace at all, the group is "" rather than None.
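
To check the pattern against every input from the question, here is a small sketch. The function name split_first_word is only an illustrative choice, not something from the question; the pattern itself is the one shown above.

```python
import re

# Same pattern as above, compiled once. Only the two parenthesised
# groups are capturing, so groups() always returns a 2-tuple.
PATTERN = re.compile(r'(\S*)(?:\s*)(.*)')

def split_first_word(text):
    # Hypothetical helper name, used here for illustration only.
    # match() anchors at the start of the string; every part of the
    # pattern can match the empty string, so the match never fails.
    return PATTERN.match(text).groups()

# Inputs and expected outputs taken from the question.
cases = {
    "Hello": ("Hello", ""),
    "How are you doing?": ("How", "are you doing?"),
    "": ("", ""),
    "!h0w are you?": ("!h0w", "are you?"),
    "#": ("#", ""),
    ":::::::": (":::::::", ""),
}

for text, expected in cases.items():
    assert split_first_word(text) == expected

x, y = split_first_word("this that")
```

Because neither group can be None, the x, y = ... unpacking asked for in the question works for all of these inputs.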

The regex way to do this is still basically str.split, but with a regex split:

parts = re.split(r'\s+', text, maxsplit=1)
part1 = parts[0]
part2 = '' if len(parts) == 1 else parts[1]

\s+ matches any run of whitespace. maxsplit=1 says to only split on the first occurrence of the pattern. Note that this may not handle leading or trailing whitespace the way you want.
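
As a rough illustration of that caveat, the snippet can be wrapped in a function and tried on inputs with leading or trailing whitespace. The name split_once is just a placeholder for this sketch.

```python
import re

def split_once(text):
    # Hypothetical wrapper around the snippet above.
    # Split on the first run of whitespace only.
    parts = re.split(r'\s+', text, maxsplit=1)
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else ''
    return first, rest

print(split_once("How are you doing?"))  # ('How', 'are you doing?')
print(split_once("Hello"))               # ('Hello', '')
print(split_once(""))                    # ('', '')

# Leading whitespace counts as the first separator, so the first
# part comes back empty and the whole text lands in the second part.
print(split_once("  hi there"))          # ('', 'hi there')

# Trailing whitespace is consumed by the split, leaving an empty rest.
print(split_once("hi "))                 # ('hi', '')
```

If leading whitespace should be ignored instead, calling text.strip() before splitting is one option, depending on what output you want.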
