Python - parsing user input using a verbose regex

Question

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see the output under the code.

My test "sentence" = How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?

import re

text = input("please type somewhat coherently: ")

pattern = r'''(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                       # any sequence of word characters
    # |[\d+(\.\d+)?%]           # percentages, 82%
    |[][\{\}.,;"'?():-_`]       # these are separate tokens
    '''

parsed = re.findall(pattern, text)
print(parsed)

My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'U.S.A.', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']

I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when uncommented it does absolutely nothing. I searched for resources to help, but none of them did.

Thank you for your time.

Why is your percent pattern enclosed in a character class [...]?
– msw, Commented Aug 30, 2015 at 19:52
You have put the \d.\d% placeholder in a character class (without repetition). Furthermore, it would likely only take effect if it precedes the word+hyphens rule.
– mario, Commented Aug 30, 2015 at 19:53
Thank you, I have fixed it and learned a lot. It now works; I just do not understand regex that much and am learning as fast as possible.
– Shadowhawk, Commented Aug 30, 2015 at 19:57

Wiktor Stribiżew · Accepted Answer · 2015-08-30 21:40:04Z

If you just want the percentage to match as a whole entity, be aware that the regex engine analyzes both the input string and the pattern from left to right. In an alternation, the leftmost alternative that matches at the current position is chosen, and the remaining alternatives are not even tested.
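The ordering rule can be seen in isolation with a small sketch. The two cut-down patterns below are illustrative only (they are not the full tokenizer from the question); they contain the same two alternatives in opposite order.

```python
import re

sample = "15.5%"

# Word rule first: \w+ succeeds at position 0 and consumes "15", so the
# percentage alternative is never tried at that position.
word_first = re.findall(r"\w+|\d+(?:\.\d+)?%", sample)

# Percentage rule first: the whole "15.5%" is matched as one token.
percent_first = re.findall(r"\d+(?:\.\d+)?%|\w+", sample)

print(word_first)     # ['15', '5']
print(percent_first)  # ['15.5%']
```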

Thus, you need to pull the alternative \d+(?:\.\d+)?% up above the \w+ alternative, and the capturing group should be turned into a non-capturing one; otherwise re.findall returns only the contents of the group instead of the whole match:

(?x)                        # set flag to allow verbose regexps
(?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%             # percentages, e.g. 82%  <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
|[-.(]+                     # double hyphen, ellipsis, and open parenthesis
|\S\w*                      # any non-whitespace character followed by word characters
|[][{}.,;"'?():_`-]         # these are separate tokens

See regex demo.
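For reference, here is a sketch of the reordered pattern run from Python against the test sentence in the question. The sentence is hard-coded instead of read with input() so that the snippet is reproducible; the last few lines use a toy string, chosen only for illustration, to show what re.findall returns when the group is left capturing.

```python
import re

pattern = r'''(?x)              # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\d+(?:\.\d+)?%             # percentages, e.g. 82% or 15.5%
    |\w+(?:[-']\w+)*            # words with internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                      # any other non-whitespace character plus word characters
    |[][{}.,;"'?():_`-]         # these are separate tokens
    '''

text = "How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?"
tokens = re.findall(pattern, text)
print(tokens)

# With a capturing group, findall returns the group's text, not the match.
captured = re.findall(r"\d+(\.\d+)?%", "15.5% and 82%")
# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"\d+(?:\.\d+)?%", "15.5% and 82%")
print(captured)  # ['.5', '']
print(whole)     # ['15.5%', '82%']
```

The token list is the same as the output shown in the question, except that '15', '.', '5', '%' now come back as the single token '15.5%'.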

Also, please note I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped inside a character class, and the - was forming an unintended range from the colon (decimal code 58) to the underscore (decimal code 95), which also matches ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.
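The accidental range is easy to check by probing every ASCII character against the relevant fragment of the class. This is only a quick sketch; the two-to-three character classes below are cut down from the full class for clarity.

```python
import re

ascii_chars = [chr(code) for code in range(128)]

# Fragment of the original class: ":-_" is read as the range 58..95.
accidental = "".join(c for c in ascii_chars if re.fullmatch(r"[:-_]", c))

# Hyphen moved to the end of the class: it is now a literal "-".
intended = "".join(c for c in ascii_chars if re.fullmatch(r"[:_-]", c))

print(accidental)  # 38 characters, from ":" through "_"
print(intended)    # -:_
```

Placing the hyphen first or last in a character class (or escaping it) keeps it from being interpreted as a range operator.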
