2

I am try to design a regex the will parse user input, in the form of full sentences. I am stuggling to get my expression to fully work. I know it is not well coded but I am trying hard to learn. I am currently trying to get it to parse precent as one string see under the code.

My test "sentence" = How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?

text = input("please type somewhat coherently: ")

pattern = r'''(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                       # any sequence of word characters
    # |[\d+(\.\d+)?%]           # percentages, 82%
    |[][\{\}.,;"'?():-_`]       # these are separate tokens
    '''

parsed = re.findall(pattern, text)
print(parsed)

My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'U.S.A.', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']

I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when commented in does absolutly nothing. I searched for resources to help but they have not.

Thank you for you time.

3
  • Why is your percent pattern enclosed in a character class [...]? Commented Aug 30, 2015 at 19:52
  • You have put the \d.\d% placeholder in a character class (without repetition). Furthermore it would likely only take effect if it preceds the word+hyphens rule. Commented Aug 30, 2015 at 19:53
  • Thank you, I have fixed it and learned a lot. It now works, I just do not understand regex that much and am learning as fast as possible. Commented Aug 30, 2015 at 19:57

1 Answer 1

1

If you just want to have the percentage match as a whole entity, you really should be aware that regex engine analyzes the input string and the pattern from left to right. If you have an alternation, the leftmost alternative that matches the input string will be chosen, the rest won't be even tested.

Thus, you need to pull the alternative \d+(?:\.\d+)? up, and the capturing group should be turned into a non-capturing or findall will yield strange results:

(?x)              # set flag to allow verbose regexps
(?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%           # percentages, 82%  <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
|[-.(]+                     # double hyphen, ellipsis, and open parenthesis
|\S\w*                       # any sequence of word characters#
|[][{}.,;"'?():_`-]       # these are separate tokens

See regex demo.

Also, please note I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped, and - was forming an unnecessary range from a colon (decimal code 58) and an underscore (decimal 95) matching ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.