Python regex compile

Question

The programmer who wrote the following line probably uses a Python package called regex.

UNIT = regex.compile("(?:{A}(?:'{A})?)++|-+|\S".format(A='\p{Word_Break=ALetter}'))

Can someone help explain what A='\p{Word_Break=ALetter}' and -+ mean?

I've removed the pypi tag; the module may be distributed through PyPI, but this question is not about PyPI itself.
– Martijn Pieters, Commented Sep 2, 2012 at 16:13

Martijn Pieters · Accepted Answer · 2012-09-02 16:24:12Z

3

The \p{property=value} operator matches on Unicode codepoint properties, and is documented on the package index page you linked to:

Unicode codepoint properties, including scripts and blocks
\p{property=value}; \P{property=value}; \p{value} ; \P{value}

The entry matches any Unicode character whose codepoint has a Word_Break property with the value ALetter (there are currently 24941 matches in the Unicode codepoint database; see the Word Boundaries chapter of the Unicode Text Segmentation specification for details).
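As a small illustration in English, the sketch below collects runs of ALetter characters from a sentence. It assumes the third-party regex package is installed; the helper name aletter_runs is made up for this example, and the import is guarded so the snippet still loads where regex is missing.

```python
try:
    import regex  # third-party package (pip install regex), not the stdlib re
except ImportError:
    regex = None  # assumption: tolerate a missing install instead of crashing


def aletter_runs(text):
    """Return the maximal runs of Word_Break=ALetter characters in text."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    return regex.findall(r"\p{Word_Break=ALetter}+", text)


if regex is not None:
    # Letters are ALetter; the comma, spaces, digits and '!' are not.
    print(aletter_runs("Hello, world 123!"))  # ['Hello', 'world']
```

Digits fall under a different Word_Break class, which is why 123 is left out of the result.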

The example you gave also uses Python's standard string formatting to interpolate a partial expression into the regular expression being compiled. The "{A}" part is just a placeholder that .format(A='...') fills in. The end result is:

"(?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S"

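The interpolation step can be checked without compiling anything. This sketch uses raw strings (an assumption; the original line used plain string literals) so the backslashes are unambiguous, and the names template and letter are only illustrative.

```python
# Raw strings keep every backslash literal.
template = r"(?:{A}(?:'{A})?)++|-+|\S"
letter = r"\p{Word_Break=ALetter}"

# str.format replaces each {A}; braces inside the substituted value
# are inserted as-is and are not parsed as further placeholders.
pattern = template.format(A=letter)
print(pattern)
```

The printed text is the expanded pattern, which is what gets handed to regex.compile.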
The -+ sequence simply matches one or more - dashes, just as it would in Python's re module; it is nothing special.
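A quick check with the standard re module shows the same behaviour; the sample sentence is made up for illustration.

```python
import re

# Each run of consecutive dashes comes back as one match.
dash_runs = re.findall(r"-+", "well-known -- or --- not")
print(dash_runs)  # ['-', '--', '---']
```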

Now, the ++ before that is more interesting. It's a possessive quantifier: once the group has matched as many repetitions as it can, the matcher will not backtrack into it to try shorter matches. It's a performance optimization, one that can prevent catastrophic backtracking.
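To see what "no backtracking" means in practice, here is a minimal sketch. The standard re module only accepts ++ from Python 3.11 onward, so instead of the regex package this example swaps in a well-known emulation: a lookahead that captures the greedy run, followed by a backreference, which behaves like a possessive a++ because the matcher does not re-enter a lookahead once it has succeeded.

```python
import re

text = "aaa"

# Greedy a+ first grabs "aaa", then gives one "a" back so the final
# literal "a" can match: the overall match succeeds via backtracking.
greedy = re.fullmatch(r"a+a", text)

# Emulated possessive a++: the lookahead captures "aaa", \1 consumes it,
# and nothing is ever given back, so the trailing "a" has nothing left.
possessive_like = re.fullmatch(r"(?=(a+))\1a", text)

print(greedy is not None)       # True
print(possessive_like is None)  # True
```

With the regex package, the possessive form would be written directly as a++a.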

edited Sep 2, 2012 at 16:24

answered Sep 2, 2012 at 16:05

Martijn Pieters


4 Comments

juju Over a year ago

Is the Word_Break property something like a comma or an exclamation mark? I don't fully understand the linked page.

Martijn Pieters Over a year ago

@juju: Word_Break classifies code points into items that form words and things that come in between, so that software dealing with texts can determine where words start and end in any script. ALetter is one such class, mostly alphabetic characters.

juju Over a year ago

Does the linked page describe the various languages that Word_Break covers? Could you give me an example in English?

Martijn Pieters Over a year ago

Sorry, I don't know what Word_Break=ALetter means in detail either; I imagine it is a more inclusive \w group in this case supporting more Unicode scripts.
