I want to separate punctuation marks and symbols from main text so as to split them as separate tokens. I have a text file containing the following symbols %&()+,-./:;=–‘’“”″ and I want to replace each symbol with \ssymbol\s (the \s means a space), and if two symbols of the same type e.g. .. occur adjacent to each other, I want to replace them with \s..\s. This is what I have tried so far:

>>> punc = "[%&\(\)\+,-./:;=–‘’“”″]+"
>>> import re
>>> pattern = re.compile(punc)
>>> text = "hi. hi.. hi; hi;; 55% good& good&&"
>>> text = re.sub(pattern, ' '+str(pattern)+' ', text)

When I print the text, I get the following:

>>> print(text)
hi <_sre.SRE_Pattern object at 0x00000000035E14E0>  hi <_sre.SRE_Pattern object at 0x00000000035E14E0>  hi <_sre.SRE_Pattern object at 0x00000000035E14E0>  hi <_sre.SRE_Pattern object at 0x00000000035E14E0>  55 <_sre.SRE_Pattern object at 0x00000000035E14E0>  x <_sre.SRE_Pattern object at 0x00000000035E14E0> 

But I want the output to be like this:

hi . hi .. hi ; hi ;; 55 % good & good &&

After several tries, I realized that I cannot compile the right regex. Your kind help is greatly appreciated!

1 Answer

The proper way to deal with what you are attempting to do is to use capturing groups. This will let you refer back to your match. First, let me begin by explaining why your attempt was giving you the output you saw.

Why you saw what you saw

In the re.sub function, when you give it ' '+str(pattern)+' ' as the second argument (the replacement), this gets evaluated to the string " <_sre.SRE_Pattern object at some_memory_location> ", because str(pattern) returns the string representation of the compiled pattern object, not the text that the pattern matched.
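To make the distinction concrete, here is a minimal sketch. It uses a shortened character class purely for illustration (not the full set from the question), and the exact text printed by str() depends on the Python version.

```python
import re

# Shortened character class, for illustration only.
pattern = re.compile(r"[%&.;]+")

# str() describes the compiled object itself; the exact text varies by Python version.
description = str(pattern)
print(description)

# .pattern holds the regex source string that was passed to re.compile().
print(pattern.pattern)

# Neither value is the matched text, which is what the replacement needs.
```

Neither value changes from match to match, which is why every symbol in the question's output was replaced by the same string.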

As an aside, on Python 3.4 and 3.5, str(pattern) returns re.compile('[%&\\(\\)\\+,-./:;=–‘’“”″]+') for me. What version of Python are you using? Is it perhaps a version of Python 2?

Solution

As I alluded to before, your solution requires utilizing capturing groups. To denote a group, you simply use parentheses. In your case, the solution is simple enough because you only need one group:

>>> import re
>>> pattern = re.compile(r"([%&\(\)\+,-./:;=–‘’“”″]+)")

Notice that I put an r before the opening quote of my string literal. This denotes a raw string, in which Python does not process backslash escape sequences. An escape sequence is something like '\t', which denotes a single tab character. If you write r'\t' instead, you get the two-character string made of a backslash followed by t.
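A quick way to check the difference for yourself is to compare lengths. The variable names below are only for illustration.

```python
# Ordinary literal: Python turns backslash-t into one tab character.
plain = '\t'

# Raw literal: the backslash and the t are kept as two separate characters.
raw = r'\t'

print(len(plain))    # 1
print(len(raw))      # 2
print(raw == '\\t')  # True: the same text written with an escaped backslash
```

For regular expressions this matters because the backslash has to reach the re module intact.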

>>> text = "hi. hi.. hi; hi;; 55% good& good&&"
>>> pattern.sub(r' \1 ', text)
'hi .  hi ..  hi ;  hi ;;  55 %  good &  good && '

Notice I simply used the sub method of the pattern object rather than the module-level function re.sub. It's not a big deal, but it just seems cleaner to me. Also, for the replacement argument, I used r' \1 '. This \1 refers to the first group captured by your pattern. If you had more than one group you could use something like \2 \1 to swap two captured parts, for example. Here, too, the raw string matters: without the r prefix, Python would treat '\1' as an escape sequence (the character with code 1) before the regex engine ever sees it.
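The result above still contains doubled spaces and a trailing space, whereas the output requested in the question is single-spaced. One possible way to close that gap, sketched here with a hypothetical helper name (separate_punctuation is not from the original post), is to collapse whitespace after the substitution:

```python
import re

# Same symbols as in the question. "\-" is written explicitly here; in the
# original class, ",-." is a range that happens to cover the hyphen as well.
PUNCT = re.compile(r"([%&()+,\-./:;=–‘’“”″]+)")

def separate_punctuation(text):
    """Pad each punctuation run with spaces, then collapse repeated whitespace."""
    spaced = PUNCT.sub(r" \1 ", text)
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    return " ".join(spaced.split())

print(separate_punctuation("hi. hi.. hi; hi;; 55% good& good&&"))
# hi . hi .. hi ; hi ;; 55 % good & good &&
```

Be aware that this also collapses newlines and tabs into single spaces, so for multi-line files you may prefer to apply it line by line.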

A potential improvement

It was unclear from your specification how you wanted to deal with runs of more than two characters, e.g. three. Your pattern handles that situation as follows:

>>> text2 = "hi. hi.. hi; hi;; 55% good& good&& hi &&& hello,"
>>> pattern.sub(r' \1 ', text2)
'hi .  hi ..  hi ;  hi ;;  55 %  good &  good &&  hi  &&&  hello , '

Perhaps that is what you want, but maybe you want to treat '&&&' as two distinct matches: '&&' and '&'. You can handle that situation using quantifiers:

>>> pattern2 = re.compile(r'([%&\(\)\+,-./:;=–‘’“”″]{1,2})')
>>> pattern2.sub(r' \1 ', text2)
'hi .  hi ..  hi ;  hi ;;  55 %  good &  good &&  hi  &&  &  hello , '

Instead of the + sign, which denotes one or more, you can use the brace notation for more fine-grained control. For example, {1,3} matches 1 to 3 repetitions, {3} matches exactly 3, and {3,} matches 3 or more.
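The quantifier forms can be compared side by side with re.findall. The second half of this sketch goes beyond the answer: since the question asks about adjacent symbols "of the same type", it shows one possible variant that uses a backreference inside the pattern so that a run only continues while the same character repeats. The shortened character class and the name same_run are assumptions for illustration.

```python
import re

sample = "&&&&&"

print(re.findall(r"&{1,2}", sample))  # ['&&', '&&', '&']
print(re.findall(r"&{3}", sample))    # ['&&&']
print(re.findall(r"&{3,}", sample))   # ['&&&&&']
print(re.findall(r"&+", sample))      # ['&&&&&']

# Variant (not from the answer): group 1 captures one symbol and \1* extends
# the match only with further copies of that same symbol.
same_run = re.compile(r"([%&.,;])\1*")
runs = [m.group(0) for m in same_run.finditer("wait.,.. ok&&&")]
print(runs)  # ['.', ',', '..', '&&&']
```

With the plain [...]+ class, ".,.." would instead be treated as one four-character match, so which pattern is right depends on how mixed runs should be tokenized.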

3 Comments

thank you so much for your kind help. I did not think that it was so simple to solve my problem as you adequately explained. Regarding Python version, I was trying to implement it on 3.3.2. I will install 3.4 now. Regarding improvement section, I wanted to deal with one or more, even three characters of the same type should be treated as one match, and that is what you have demonstrated. Thank you again for your help.
However, I want to understand the use of .sub(r' \1 ', ...).
@Mohammed Ah! Yes, that simply refers to the first group captured by your pattern! If you had multiple groups you could refer to the third as something like This is the third group: \3. See my edit.
