python regex split any \W+ with some exceptions

Question

It is easy to split text at non-alphanumeric characters using a regex:

tokens=re.split(r'(?u)\W+',text) #to split at any non-alpha unicode character

This answer also provides a way to split at specific characters. However, what I need is:

splitting at any Unicode non-alphanumeric character
while giving the regex the following exceptions, which must not cause a split:
- underscores "_"
- the forward slash "/"
- the ampersand "&" and the at sign "@"
- full stops surrounded by digits (\d+ on both sides)
- full stops preceded by certain arbitrary strings such as "Mr.", "Dr.", etc.

I can easily detect any of these with a regex, but the question is how to tell the regex to treat them as exceptions when splitting at non-alphanumeric characters.

EDIT: Here is an example text I am trying to split:

text="Mr. Jones email [email protected] 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. وبعدها رجعنا إلى المنزل، وقابلنا أصدقاءنا؛ وشربنا الشاي."

Here is the same text as a unicode literal (note the Arabic non-alphanumeric characters u'\u060c' and u'\u061b'):

unicode_text=u'Mr. Jones email [email protected] 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. \u0648\u0628\u0639\u062f\u0647\u0627 \u0631\u062c\u0639\u0646\u0627 \u0625\u0644\u0649 \u0627\u0644\u0645\u0646\u0632\u0644\u060c \u0648\u0642\u0627\u0628\u0644\u0646\u0627 \u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b \u0648\u0634\u0631\u0628\u0646\u0627 \u0627\u0644\u0634\u0627\u064a.'

Here is the result of the regex in the answer provided:

re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+',unicode_text)

[u'Mr.', u'Jones', u'email', u'[email protected]', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man+right', u'more/fun', u'43.35.', u'And', u'so', u'we', u'stopped.', u'And', u'then', u'we', u'started', u'again.', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a.']

Notice that the regex did not split at full stops at the end of words (for example u'43.35.', u'stopped.' and u'again.'). It would be nice to have something that deals with this.

So what have you tried? This is quite simple except for the last parts. Note that \w matches alphanumeric characters and the underscore _, so \W is exactly the reverse of it.
– HamZa, Commented Oct 18, 2013 at 21:24
I tried this: tokens=re.split('(?u)[^\w_@/]|(?<!\d)[,.](?!\d)',string) but it didn't work...
– hmghaly, Commented Oct 18, 2013 at 21:37
I'm not sure what you mean by "comparing"... I want the regex to split around any non-alphanumeric character unless this character is [.,] and it is surrounded by things.
– hmghaly, Commented Oct 18, 2013 at 21:53
When you say "it didn't work", please be specific. What did it match? Anything? Did the script fail with an error?
– SethMMorton, Commented Oct 19, 2013 at 1:17

Kyle Hannon · Accepted Answer · 2013-10-18 22:09:00Z

The key is to use a negative lookahead. I think this covers all the examples on your list, but let me know if there's something I missed.

In [549]: re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+', "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35")
Out[549]: ['Mr.Jones', 'says', 'This', 'is@a&test', 'example_cool', 'man+right', 'more/fun', '43.35']

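The negative lookahead (?!...) is checked at the position where \W+ would start matching: if the next character is one of those listed inside it, the match fails there and no split happens at that position. Let me know if I understood the question correctly.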
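One caveat worth spelling out: the square brackets in the pattern above form a character class, so `[\+&\/@\d+\.\d+Mr\.]` is read as a set of single characters (`+`, `&`, `/`, `@`, any digit, `.`, `M`, `r`) and not as the sequences `\d+\.\d+` or `Mr.`. A minimal sketch of what that means in practice, assuming Python 3 (where `str` patterns are Unicode-aware by default); the names `pattern`, `equivalent` and `sample` are made up for illustration:

```python
import re

# Pattern from the answer above, unchanged.
pattern = r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+'

# A shorter class that lists each protected character once.
# M, r and \d are dropped: they are word characters, so \W+ could
# never start on them anyway, and repeating "." or "+" adds nothing.
equivalent = r'(?![+&/@.])\W+'

sample = "Mr. Jones paid 43.35. Then we stopped."

# Both patterns give the same tokens; every "." is protected,
# including the sentence-final ones.
print(re.split(pattern, sample))
print(re.split(equivalent, sample))
```

Because every full stop is protected, trailing ones stay attached to their token (`'43.35.'`, `'stopped.'`); telling "Mr." apart from a sentence-final full stop needs lookarounds outside the character class.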

edited Oct 18, 2013 at 22:09

answered Oct 18, 2013 at 22:03

2 Comments

hmghaly Over a year ago

Thank you, but it didn't work as desired; please see my edit above.

Kyle Hannon Over a year ago

What I'm getting from you is that the answer worked for the problem you provided, but now you want it to match Arabic? Non-alphanumeric characters in other languages should be handled by the re library. If the standard definition of non-alphanumeric doesn't match yours, simply extend the approach I explained.
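The claim about Unicode handling can be checked directly. A small sketch, assuming Python 3, where `\w` and `\W` in `str` patterns follow Unicode by default (in Python 2, as used in the question, the `(?u)` flag is what turns this on); the variable names are illustrative only:

```python
import re

# Three Arabic words separated by an Arabic comma (U+060C) and an
# Arabic semicolon (U+061B), taken from the question's sample text.
arabic = ("\u0627\u0644\u0645\u0646\u0632\u0644\u060c "
          "\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b "
          "\u0627\u0644\u0634\u0627\u064a")

# Unicode-aware matching: Arabic letters count as word characters,
# the Arabic comma and semicolon do not.
unicode_tokens = re.split(r'\W+', arabic)

# ASCII-only matching: every Arabic character counts as \W, so the
# whole string is consumed as one separator.
ascii_tokens = re.split(r'\W+', arabic, flags=re.ASCII)

print(len(unicode_tokens))  # 3
print(ascii_tokens)         # ['', '']
```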

Armali · Accepted Answer · 2014-09-30 09:24:03Z

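I don't think you want to split e-mail addresses like [email protected] into jones@gmail and com, so I changed your exception from full stops surrounded by digits to full stops followed by an alphanumeric character. In the pattern below, the first alternative splits at any run of non-word characters that does not start with one of /, &, @ or . (the underscore is a word character anyway); the second alternative splits at a full stop that is neither preceded by Mr or Dr nor followed by a word character.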

re.split(r'(?u)(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*', unicode_text)

[u'Mr.', u'Jones', u'email', u'[email protected]', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man', u'right', u'more/fun', u'43.35', u'And', u'so', u'we', u'stopped', u'And', u'then', u'we', u'started', u'again', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a', u'']
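The trailing `u''` in the result appears because the text ends with a separator. A minimal sketch of how the pattern above might be wrapped to drop such empty strings, assuming Python 3; `SPLIT_RE`, `tokenize` and the sample sentence are hypothetical names and data, not part of the original answer:

```python
import re

# Pattern from the answer above. (?u) is omitted on the assumption of
# Python 3, where str patterns are already Unicode-aware.
SPLIT_RE = re.compile(r'(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*')


def tokenize(text):
    """Split text with SPLIT_RE and discard empty strings."""
    return [tok for tok in SPLIT_RE.split(text) if tok]


sample = "Dr. Smith paid 43.35. This is@a&test example_cool more/fun. Done."
print(tokenize(sample))
# ['Dr.', 'Smith', 'paid', '43.35', 'This', 'is@a&test',
#  'example_cool', 'more/fun', 'Done']
```

Two limitations of this sketch: Python's `re` module requires fixed-width lookbehinds, so a title of a different length (say `Mrs`) would need its own `(?<!Mrs)` group rather than another branch inside `(?<!Mr|Dr)`; and the lookahead only inspects the first character of a run, so a protected character that follows whitespace (as in `"a @b"`) is still consumed by the split.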
