1

How do I split on all nonalphanumeric characters, EXCEPT the apostrophe?

re.split(r'\W+', text)

works, but will also split on apostrophes. How do I add an exception to this rule?

Thanks!

3 Answers

3

Try this:

re.split(r"[^\w']+",text)

Note that the w is now lowercase: \w matches any word character (letters, digits, and the underscore), whereas \W matches anything that is not one. The leading ^ negates the character class, so [^\w'] matches any character that is neither a word character (\w) nor an apostrophe.
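To see the difference between the two patterns, here is a minimal sketch. The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to contain contractions.
text = "Don't panic, it's only a test!"

# \W+ treats the apostrophe as a separator, so contractions are broken up.
split_on_nonword = re.split(r"\W+", text)
print(split_on_nonword)
# ['Don', 't', 'panic', 'it', 's', 'only', 'a', 'test', '']

# [^\w']+ excludes the apostrophe from the separators, so it stays in the token.
split_keep_apostrophe = re.split(r"[^\w']+", text)
print(split_keep_apostrophe)
# ["Don't", 'panic', "it's", 'only', 'a', 'test', '']
```

The trailing empty string appears because the text ends with a separator (the "!"); filter it out with something like [t for t in result if t] if it gets in the way.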


2
re.split(r"[^\w']+",text)

Starting a character class with ^ inverts it, so [^\w'] is the inverse of [\w'], which would match any alphanumeric character, underscore, or apostrophe.
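Because the two classes are exact inverses, splitting on [^\w']+ and collecting matches of [\w']+ should give the same tokens once empty strings are dropped. A small sketch, with an invented sample sentence and variable names:

```python
import re

# Hypothetical sample text for illustration.
text = "Don't panic, it's only a test!"

# Split on runs of characters outside the class, dropping empty strings.
tokens_by_split = [t for t in re.split(r"[^\w']+", text) if t]

# Collect runs of characters inside the class instead.
tokens_by_findall = re.findall(r"[\w']+", text)

print(tokens_by_split)
# ["Don't", 'panic', "it's", 'only', 'a', 'test']
print(tokens_by_split == tokens_by_findall)
# True
```

re.findall never produces the empty strings that re.split leaves at the edges, so it can be the more convenient of the two when only the tokens are needed.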


0

The other answers don't handle quoted words: in text like 'quoted', the surrounding apostrophes stay attached to the word.

What works for me is

re.split(r"\W'+|^'+|'+\W|'$|[^\w']+", text)

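i.e. split on: apostrophe(s) after a non-word character, OR apostrophe(s) at the start of the string, OR apostrophe(s) before a non-word character, OR an apostrophe at the end of the string, OR the pattern from the other answers.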
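A short sketch comparing this pattern with the simpler one on a quoted word. The sample string and names are invented, and the result was only checked against this one input, so treat it as an illustration rather than a guarantee for every edge case.

```python
import re

# Hypothetical sample: one quoted word plus one contraction.
text = "'quoted' words don't split"

simple_pattern = r"[^\w']+"
quote_aware_pattern = r"\W'+|^'+|'+\W|'$|[^\w']+"

# The simple pattern keeps the quoting apostrophes on the word.
simple_tokens = [t for t in re.split(simple_pattern, text) if t]
print(simple_tokens)
# ["'quoted'", 'words', "don't", 'split']

# The longer pattern strips quoting apostrophes but keeps the one in "don't".
quote_aware_tokens = [t for t in re.split(quote_aware_pattern, text) if t]
print(quote_aware_tokens)
# ['quoted', 'words', "don't", 'split']
```

The empty-string filter matters here: the apostrophe at the start of the string is itself a separator, so re.split returns a leading '' before 'quoted'.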

