The programmer who wrote the following line probably uses a Python package called regex.

UNIT = regex.compile("(?:{A}(?:'{A})?)++|-+|\S".format(A='\p{Word_Break=ALetter}'))

Can someone help explain what A='\p{Word_Break=ALetter}' and -+ mean?

  • I've removed the pypi tag; the module may be distributed through pypi, but this question is not about pypi itself. Commented Sep 2, 2012 at 16:13

1 Answer

The \p{property=value} operator matches on Unicode codepoint properties, and is documented on the package index page you linked to:

  • Unicode codepoint properties, including scripts and blocks

    \p{property=value}; \P{property=value}; \p{value} ; \P{value}
    

The entry matches any Unicode character whose codepoint has a Word_Break property with the value ALetter. There are currently 24941 matches in the Unicode codepoint database; see the Word Boundaries chapter of the Unicode Text Segmentation specification (UAX #29) for details.

The example you gave also uses standard Python string formatting to interpolate a partial expression into the regular expression being compiled. The "{A}" part is just a placeholder for the .format(A='...') part to fill. The end result is:

"(?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S"
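As a minimal sketch of that interpolation step, plain str.format is enough to reproduce the string above. The names template and pattern are mine, not from the original code, and I use raw strings so the backslashes reach the regex engine untouched:

```python
# Template with two {A} placeholders, written as a raw string so that
# \S is passed through literally instead of being treated as a string escape.
template = r"(?:{A}(?:'{A})?)++|-+|\S"

# str.format only parses braces in the template, not in the substituted
# value, so the braces inside \p{...} are inserted as-is.
pattern = template.format(A=r"\p{Word_Break=ALetter}")

print(pattern)
# → (?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S
```

The resulting string is what gets handed to regex.compile() in the line from the question.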

The -+ sequence just matches one or more - dashes, exactly as it would in the Python re module. It is not anything special.
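To see how the three alternatives divide up a piece of text, here is a rough sketch using the standard re module. Since re has no \p{Word_Break=ALetter}, the ASCII class [A-Za-z] stands in for it, and the ++ is written as a plain +. Both substitutions are simplifications of mine, so this only approximates the original pattern:

```python
import re

# ASCII-only approximation of the pattern from the question:
#   letters, optionally joined by single apostrophes  -> a "word"
#   -+                                                -> a run of dashes
#   \S                                                -> any other non-space character
unit = re.compile(r"(?:[A-Za-z](?:'[A-Za-z])?)+|-+|\S")

tokens = unit.findall("don't stop -- now!")
print(tokens)
# → ["don't", 'stop', '--', 'now', '!']

# -+ on its own simply collects runs of dashes.
dashes = re.findall(r"-+", "a--b---c-")
print(dashes)
# → ['--', '---', '-']
```

Whitespace matches none of the alternatives, so it is skipped between tokens.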

Now, the ++ before that is more interesting. It's a possessive quantifier: once the quantified group has matched as much as it can, the regex matcher never backtracks into it to try shorter matches. It's a performance optimization, one that prevents catastrophic backtracking issues.
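A small sketch of the difference between a greedy and a possessive quantifier follows. I am assuming Python 3.11 or newer here, where the standard re module also accepts ++ (the regex package has supported it for longer); the version guard and variable names are my own:

```python
import re
import sys

text = "aaa"

# Greedy: a+ first grabs all three letters, then gives one back
# so that the trailing "a" in the pattern can match.
greedy = re.fullmatch(r"a+a", text)

# Possessive: a++ grabs all three letters and refuses to give any back,
# so the trailing "a" has nothing left to match and the match fails.
if sys.version_info >= (3, 11):
    possessive = re.fullmatch(r"a++a", text)
else:
    possessive = None  # older versions of `re` cannot compile "++"

print(greedy is not None)  # True
print(possessive is None)  # True
```

Giving up backtracking is what makes the possessive form cheaper on inputs that do not match, at the cost of rejecting strings that a greedy quantifier would accept.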

4 Comments

Is the Word_Break property something like a comma or an exclamation mark? I don't fully understand the linked page.
@juju: Word_Break classifies code points into items that form words and things that come in between, so that software dealing with texts can determine where words start and end in any script. ALetter is one such class, mostly alphabetic characters.
Is the linked page introducing varied languages that Word_Break covers? Would you give me an example in English?
Sorry, I don't know what Word_Break=ALetter means in detail either; I imagine it is a more inclusive \w group in this case supporting more Unicode scripts.
