
The list is a = ['Aeroplane','Ramanujan','World-king','Pizza/Burger']. I am trying to replace the items that contain - or / so that they become World_king and Pizza_Burger. Whatever the symbol is, it should be replaced by an underscore.

Here is my code:

import re
def replaceStrings(arg):
    txt =arg
    res = re.search(r'(?i)\b([a-z][a-z0-9_]*)([/-]+)([a-z][a-z0-9_]*)\b', txt)
    if res:
        pp = reg.sub(r'\1_\2',txt)
        print(pp)
        return pp



for i in a:
    replaceStrings(i)

But I am not getting the desired output. What is wrong with my regex? I am a beginner with regex. Thank you.

5
  • You don't need to search; do the sub directly. Commented Feb 24, 2017 at 13:37
  • @Nullman I understand. But I have a list of 10,000 items that contain strings like this, which is why I am asking. Thank you. Commented Feb 24, 2017 at 13:38
  • Can you please check my regex? I think I made a slight mistake somewhere. Commented Feb 24, 2017 at 13:39
  • You aren't using reg.sub correctly; it takes 3 parameters and you are giving it two. Commented Feb 24, 2017 at 13:42
  • @Nullman My goodness! Yup, you are correct. I missed that. Thank you. I also got an answer below. I think I am good to go. Commented Feb 24, 2017 at 13:43

1 Answer


A simple way to clean the terms is to loop over them and clean each one separately. You can use something as simple as 'World-king'.replace('/','_').replace('-','_').
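As a minimal sketch of that plain-string approach applied to the whole list (the helper name `clean_term` is made up for illustration; the list is the one from the question):

```python
def clean_term(term):
    """Replace '/' and '-' with '_' using only str.replace (no regex)."""
    return term.replace('/', '_').replace('-', '_')


a = ['Aeroplane', 'Ramanujan', 'World-king', 'Pizza/Burger']

# Build a new list instead of printing inside the helper.
cleaned = [clean_term(term) for term in a]
print(cleaned)
```

Strings without either character come back unchanged, so no check is needed before calling it.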

Or you can use regex for cleaning like this:

import re
def replaceStrings(arg):
    # each individual special character you want to clean can be put in the brackets `[]`
    pp = re.sub(r'[-/]', '_', arg)
    print(pp)
    return pp


a = ['Aeroplane','Ramanujan','World-king','Pizza/Burger']
for i in a:
    replaceStrings(i)

output:

Aeroplane
Ramanujan
World_king
Pizza_Burger
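For a large list (the question mentions 10,000 items) it may be convenient to compile the pattern once and collect the results in a list. This is only a sketch: the names `SEPARATORS` and `clean_terms` are hypothetical, and using `[-/]+` instead of `[-/]` is an assumption that a run of symbols such as `-/` should collapse into a single underscore.

```python
import re

# Hypothetical name; '+' collapses a run of separators into one underscore.
SEPARATORS = re.compile(r'[-/]+')


def clean_terms(terms):
    """Return a new list with every run of '-' or '/' replaced by '_'."""
    return [SEPARATORS.sub('_', term) for term in terms]


result = clean_terms(['Aeroplane', 'World-king', 'Pizza/Burger', 'a-/b'])
print(result)
```

If each symbol should become its own underscore, as in the code above, drop the `+`.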

update: [comment added by OP]

I took a precautionary measure to make sure the string has the required pattern. My question is: is it good practice to add that extra step, as I did, instead of calling re.sub directly?

If you want to make sure a pattern is matched before cleaning it, that can also be done:

import re

pattern = re.compile(r'(?i)\b([a-z][a-z0-9_]*)([/-]+)([a-z][a-z0-9_]*)\b')

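def replaceStrings(arg):
    # only clean strings that start with word<separator>word;
    # anything else falls through and the function returns None
    if pattern.match(arg):
        pp = re.sub(r'[-/]','_', arg)
        print(pp)
        return pp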

a = ['Aeroplane','Ramanujan','World-king','Pizza/Burger']
for i in a:
    replaceStrings(i)

output:

World_king
Pizza_Burger
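Note that `re.sub` returns the string unchanged when nothing matches, so the check is not required for correctness, and the guarded function returns `None` for items like 'Aeroplane'. If the goal is only to know whether something was replaced, `re.subn` reports the number of substitutions. A small sketch (the function name `replace_strings` is illustrative, not from the question):

```python
import re


def replace_strings(arg):
    """Return (cleaned_string, number_of_replacements)."""
    # re.subn works like re.sub but also returns how many matches were replaced.
    return re.subn(r'[-/]', '_', arg)


a = ['Aeroplane', 'Ramanujan', 'World-king', 'Pizza/Burger']
for item in a:
    cleaned, count = replace_strings(item)
    print(cleaned, count)
```

Every item stays in the result this way, and a count of 0 marks the ones that needed no cleaning.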

8 Comments

I took a precautionary measure to make sure the string has the required pattern. My question is: is it good practice to add that extra step, as I did, instead of calling re.sub directly? Thank you.
Great! Thank you. It is clear and concise now. I got the difference.
@BhabaniMohapatra the pattern you are looking for seems quite complicated. If you can tell me what exactly you are looking for maybe we can simplify that.
How do I parse a string like 'Hashimoto's_thyroiditis'? Notice the <<'s>> in the string. How do I remove the <<'>> and get Hashimotos_thyroiditis as the output?
@BhabaniMohapatra What you are asking for is a very custom use case. As your corpus is really large, you will run into more such cases over time. Unless you have a very clear-cut pattern for cleaning text and removing special characters, you will end up writing a lot of if conditions.
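For the specific apostrophe case, one possible sketch is to drop the apostrophe first and then apply the same separator replacement. This assumes apostrophes should be deleted rather than replaced, and the function name is made up for illustration; it does not address the broader point about custom cases piling up.

```python
import re


def strip_apostrophes_and_clean(term):
    """Delete apostrophes, then replace '-' and '/' with '_'."""
    # Assumption: only the plain ASCII apostrophe needs to be removed.
    without_apostrophes = term.replace("'", "")
    return re.sub(r'[-/]', '_', without_apostrophes)


print(strip_apostrophes_and_clean("Hashimoto's_thyroiditis"))
```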