It is easy to split text at non-alphanumeric characters using a regex:

tokens = re.split(r'(?u)\W+', text)  # split at any run of non-alphanumeric Unicode characters

and this answer provides a way to split at certain characters. However, what I need is to:

  1. split at any Unicode non-alphanumeric character
  2. make the following exceptions, which should not trigger a split:

    • underscores "_"
    • forward slashes "/"
    • ampersands "&" and at signs "@"
    • full stops surrounded by digits (\d+), as in "12.455"
    • full stops that end certain arbitrary abbreviations such as "Mr.", "Dr.", etc.

I can easily detect any of these with a regex on its own. The question is how to make them exceptions to the split at non-alphanumeric characters.
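One way to read the requirements above is to match the tokens directly instead of splitting around separators. The following is only a sketch of that alternative technique, assuming Python 3 (where `\w` is Unicode-aware for `str` patterns, so `(?u)` is not needed). The names `TOKEN` and `tokenize` and the hard-coded titles `Mr` and `Dr` are illustrative assumptions, not part of the original question.

```python
import re

# Hypothetical token pattern (an assumption for illustration):
#   1. a title abbreviation followed by its full stop, e.g. "Mr." or "Dr."
#   2. otherwise a run of word characters, "/", "&", "@",
#      or a full stop that sits between two digits
TOKEN = re.compile(r'(?:Mr|Dr)\.|(?:[\w/&@]|(?<=\d)\.(?=\d))+')

def tokenize(text):
    """Return the tokens found in text; everything else acts as a separator."""
    return TOKEN.findall(text)

sample = "Mr. Jones paid 12.455 at more/fun, is@a&test example_cool. Done."
print(tokenize(sample))
# ['Mr.', 'Jones', 'paid', '12.455', 'at', 'more/fun', 'is@a&test', 'example_cool', 'Done']
```

Because underscores are already part of `\w`, they need no special handling. Commas between digits, as in 12,254.25, still separate tokens in this sketch because they were not among the listed exceptions.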


EDIT: Here is an example text I am trying to match:

text="Mr. Jones email [email protected] 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. وبعدها رجعنا إلى المنزل، وقابلنا أصدقاءنا؛ وشربنا الشاي."

Here is the same text as a unicode literal (note the Arabic punctuation characters u'\u060c' and u'\u061b'):

unicode_text=u'Mr. Jones email [email protected] 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. \u0648\u0628\u0639\u062f\u0647\u0627 \u0631\u062c\u0639\u0646\u0627 \u0625\u0644\u0649 \u0627\u0644\u0645\u0646\u0632\u0644\u060c \u0648\u0642\u0627\u0628\u0644\u0646\u0627 \u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b \u0648\u0634\u0631\u0628\u0646\u0627 \u0627\u0644\u0634\u0627\u064a.'

Here is the result of the regex in the answer provided:

re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+',unicode_text)

[u'Mr.', u'Jones', u'email', u'[email protected]', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man+right', u'more/fun', u'43.35.', u'And', u'so', u'we', u'stopped.', u'And', u'then', u'we', u'started', u'again.', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a.']

Notice that the regex did not split at full stops at the end of words (for example u'stopped.' and u'again.'), so it would be good to have something that handles this case as well.

  • yes, this is what I want Commented Oct 18, 2013 at 21:21
  • So what have you tried? This is quite simple except for the last parts. Note that \w matches alphanumeric characters and the underscore _, so \W is exactly the reverse of it. Commented Oct 18, 2013 at 21:24
  • I tried this: tokens=re.split('(?u)[^\w_@/]|(?<!\d)[,.](?!\d)',string) but it didn't work... Commented Oct 18, 2013 at 21:37
  • I'm not sure what you mean by "comparing"... I want the regex to split around any non-alpha character unless this character is [.,] and it is surrounded by things Commented Oct 18, 2013 at 21:53
  • When you say "it didn't work" please be specific. What did it match? Anything? Did the script fail with an error? Commented Oct 19, 2013 at 1:17

2 Answers

The key is to use a negative lookahead. I think this covers all the examples on your list, but let me know if there's something I missed.

In [549]: re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+', "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35")
Out[549]: ['Mr.Jones', 'says', 'This', 'is@a&test', 'example_cool', 'man+right', 'more/fun', '43.35']

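A split never starts at a character that appears inside the (?!...) lookahead. Note that the square brackets form a character class, so \d+\.\d+ and Mr\. are read as individual characters rather than as sequences. Let me know if I understood the question correctly.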
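To see what the lookahead actually excludes, here is a small check, assuming Python 3. It compares the pattern from this answer with a shorter one that lists only the punctuation characters; `ORIGINAL`, `SIMPLIFIED`, and `split_with` are names made up for this illustration. Since `\W+` can only start at a non-word character, the digit and letter entries in the class never come into play.

```python
import re

# Pattern exactly as posted in the answer (without the (?u) flag,
# which is the default for str patterns in Python 3).
ORIGINAL = r'(?![\+&\/@\d+\.\d+Mr\.])\W+'
# Same behaviour with only the punctuation entries kept.
SIMPLIFIED = r'(?![+&/@.])\W+'

def split_with(pattern, text):
    """Split text wherever pattern matches."""
    return re.split(pattern, text)

sample = "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35"
print(split_with(ORIGINAL, sample))
print(split_with(SIMPLIFIED, sample))
# Both print:
# ['Mr.Jones', 'says', 'This', 'is@a&test', 'example_cool', 'man+right', 'more/fun', '43.35']

# Every full stop is protected, not only those after "Mr" or between digits:
print(split_with(ORIGINAL, "Dr. Who stopped. Then left"))
# ['Dr.', 'Who', 'stopped.', 'Then', 'left']
```

This is why sentence-final full stops stay attached to the preceding word with this pattern.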

2 Comments

Thank you, but it didn't work as desired, please see my edit above.
What I'm getting from you is that the answer worked for the problem you provided, but now you want it to match Arabic? Non alpha characters in foreign language should be handled by the re library. If the standard definition of non-alpha doesn't match yours, simply extend the methodology I explained.

I don't think you want to split e-mail addresses like [email protected] into jones@gmail and com, so I changed your exception from "full stops surrounded by digits" to "full stops followed by an alphanumeric character".

re.split(r'(?u)(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*', unicode_text)

[u'Mr.', u'Jones', u'email', u'[email protected]', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man', u'right', u'more/fun', u'43.35', u'And', u'so', u'we', u'stopped', u'And', u'then', u'we', u'started', u'again', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a', u'']
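The trailing empty string in the result appears because the text ends with a full stop that matches the second alternative. A minimal sketch of how the pattern could be wrapped to drop such empty tokens, assuming Python 3; `PATTERN` and `split_tokens` are illustrative names, and the sample sentence is made up for this example:

```python
import re

# Pattern from the answer above, without (?u) (the default in Python 3).
#   first alternative:  a run of non-word characters that does not start
#                       with one of _ / & @ .
#   second alternative: a full stop not preceded by "Mr" or "Dr" and not
#                       followed by a word character, plus trailing non-word characters
PATTERN = r'(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*'

def split_tokens(text):
    """Split text with PATTERN and discard empty strings."""
    return [tok for tok in re.split(PATTERN, text) if tok]

sample = "Dr. Smith paid 43.35. We met at more/fun."
print(split_tokens(sample))
# ['Dr.', 'Smith', 'paid', '43.35', 'We', 'met', 'at', 'more/fun']
```

Python's `re` module requires fixed-width lookbehinds, so every title in `(?<!Mr|Dr)` must have the same length; longer abbreviations would need their own lookbehind.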
