It is easy to split text at non-alpha characters using a regex:
import re
tokens = re.split(r'(?u)\W+', text)  # split at any non-alphanumeric Unicode character
This answer also provides a way to split at certain characters. However, what I need is:
- splitting at any Unicode non-alpha character
- while giving the regex the following exceptions:
  - underscores "_"
  - slashes "/"
  - ampersands "&" and at signs "@"
  - fullstops surrounded by digits (\d+ on both sides)
  - fullstops preceded by certain arbitrary strings such as "Mr.", "Dr.", etc.
I can easily detect any of these using regex, but the question is how to tell regex to have them as exceptions to the splitting at non-alpha.
EDIT: Here is an example text I am trying to match:
text="Mr. Jones email [email protected] 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. وبعدها رجعنا إلى المنزل، وقابلنا أصدقاءنا؛ وشربنا الشاي."
And here is its Unicode version (notice the Arabic non-alpha characters u'\u060c' and u'\u061b'):
unicode_text=u'Mr. Jones email [email protected] 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. \u0648\u0628\u0639\u062f\u0647\u0627 \u0631\u062c\u0639\u0646\u0627 \u0625\u0644\u0649 \u0627\u0644\u0645\u0646\u0632\u0644\u060c \u0648\u0642\u0627\u0628\u0644\u0646\u0627 \u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b \u0648\u0634\u0631\u0628\u0646\u0627 \u0627\u0644\u0634\u0627\u064a.'
Here is the result of the regex in the answer provided:
re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+',unicode_text)
[u'Mr.', u'Jones', u'email', u'[email protected]', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man+right', u'more/fun', u'43.35.', u'And', u'so', u'we', u'stopped.', u'And', u'then', u'we', u'started', u'again.', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a.']
Notice that the regex did not split at fullstops at the end of words, so it would be nice to have something that deals with this.
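A likely reason for the trailing fullstops: `[\+&\/@\d+\.\d+Mr\.]` is a character class, so it matches one single character from the set (`+`, `&`, `/`, `@`, a digit, `.`, `M`, `r`) rather than the sequences `\d+\.\d+` or `Mr.`. The lookahead therefore stops a separator from ever starting with a fullstop. One possible way around this is to match the tokens with `re.findall` instead of splitting on the separators. The following is only a minimal sketch, assuming Python 3 (where `str` patterns are Unicode-aware by default); the names `TOKEN_RE` and `tokenize`, and the abbreviation list `Mr|Dr`, are hypothetical. The `+` sign is not treated as a joiner here because it is not in the list of exceptions above.

```python
import re

# Hypothetical sketch: match tokens instead of splitting on separators.
# The abbreviation list (Mr, Dr) and the joiner set (/ & @) are assumptions
# taken from the exceptions listed in the question.
TOKEN_RE = re.compile(
    r"""
    \b(?:Mr|Dr)\.                 # listed abbreviations keep their fullstop
    |
    \w+                           # a run of word characters (includes "_")
    (?:
        (?: [/&@]                 # joiners that must not split a token
          | (?<=\d)\.(?=\d)       # a fullstop with a digit on both sides
        )
        \w+
    )*
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of text; any other non-word character acts as a separator."""
    return TOKEN_RE.findall(text)


sample = "Mr. Jones 12.455 12,254.25 says This is@a&test example_cool more/fun 43.35. And so we stopped."
print(tokenize(sample))
```

With this approach a fullstop at the end of a word or number is simply never part of a match, so "stopped." and "43.35." come out without the trailing dot, while "12,254.25" is still broken at the comma because commas are not among the listed exceptions.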
\w matches alphanumeric characters and the underscore _! So \W is exactly the reverse of it.
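A quick way to check this, as a small sketch assuming Python 3 (the variable names are only for illustration):

```python
import re

# "_" is a word character, so \w matches it and \W does not.
underscore_is_word = re.fullmatch(r"\w", "_") is not None
underscore_is_nonword = re.fullmatch(r"\W", "_") is not None

# Splitting on \W+ therefore keeps "example_cool" together
# but breaks "man+right" at the plus sign.
parts = re.split(r"\W+", "example_cool man+right")

print(underscore_is_word, underscore_is_nonword, parts)
```

So the underscore exception needs no special handling; only the other characters have to be dealt with explicitly.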