Regex not returning expected output in Python

Question

The list is a = ['Aeroplane','Ramanujan','World-king','Pizza/Burger']. I am trying to replace the items that contain - or / so that they become World_king and Pizza_Burger. Whatever the symbol is, it should be replaced by an underscore.

Here is my code:

import re
def replaceStrings(arg):
    txt =arg
    res = re.search(r'(?i)\b([a-z][a-z0-9_]*)([/-]+)([a-z][a-z0-9_]*)\b', txt)
    if res:
        pp = reg.sub(r'\1_\2',txt)
        print(pp)
        return pp



for i in a:
    replaceStrings(i)

But I am not getting the desired output. What is wrong with my regex? I am a beginner in regex. Thank you.

@Nullman I understand, but I have a list of 10,000 items that contain strings like this, which is why I am asking. Thank you.
– WaterRocket8236, Commented Feb 24, 2017 at 13:38
Can you please check my regex? I think I made a slight mistake somewhere.
– WaterRocket8236, Commented Feb 24, 2017 at 13:39
You aren't using reg.sub correctly: it takes 3 parameters, and you are giving it two.
– Nullman, Commented Feb 24, 2017 at 13:42
@Nullman My goodness! Yes, you are correct. I missed that. Thank you. I also got an answer below. I think I am good to go.
– WaterRocket8236, Commented Feb 24, 2017 at 13:43
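
For reference, here is a minimal sketch of how the approach in the question could be made to run. It assumes three things about the original code: `reg` is never defined (only `re` is imported), `re.sub` expects a pattern, a replacement, and the input string, and group 2 in the pattern is the separator itself, so the replacement should join groups 1 and 3. The names `PATTERN`, `replace_strings`, and `cleaned` are illustrative and not from the question.

```python
import re

# Same pattern as in the question: word, one or more separators, word.
PATTERN = re.compile(r'(?i)\b([a-z][a-z0-9_]*)([/-]+)([a-z][a-z0-9_]*)\b')

def replace_strings(text):
    # Group 2 holds the separator, so keep groups 1 and 3 and join them with "_".
    # Strings without a match are returned unchanged by sub().
    return PATTERN.sub(r'\1_\3', text)

a = ['Aeroplane', 'Ramanujan', 'World-king', 'Pizza/Burger']
cleaned = [replace_strings(item) for item in a]
print(cleaned)  # ['Aeroplane', 'Ramanujan', 'World_king', 'Pizza_Burger']
```

Calling `sub` on the compiled pattern avoids passing the pattern a second time; `re.sub(pattern, repl, string)` with all three arguments would behave the same way.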

Vikash Singh · Accepted Answer · 2017-02-24 13:50:00Z

5

A simple way to clean the terms is to loop over them and clean each one separately. You can use something as simple as 'World-king'.replace('/','_').replace('-','_').
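
Applied to the whole list, the chained `str.replace` idea might look like the sketch below. The variable name `cleaned` is only illustrative.

```python
a = ['Aeroplane', 'Ramanujan', 'World-king', 'Pizza/Burger']

# One str.replace call per separator character; no regex needed.
cleaned = [term.replace('/', '_').replace('-', '_') for term in a]
print(cleaned)  # ['Aeroplane', 'Ramanujan', 'World_king', 'Pizza_Burger']
```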

Or you can use a regex for the cleaning, like this:

import re
def replaceStrings(arg):
    # each individual special character you want to clean can be put in the brackets `[]`
    pp = re.sub(r'[-/]', '_', arg)
    print(pp)
    return pp


a = ['Aeroplane','Ramanujan','World-king','Pizza/Burger']
for i in a:
    replaceStrings(i)

output:

Aeroplane
Ramanujan
World_king
Pizza_Burger

Update [comment added by OP]:

I took a precautionary measure to make sure the string has the required pattern. My question is: is it good practice to write that extra step instead of calling re.sub directly?

If you want to make sure a pattern is matched before cleaning it, that can also be done:

import re

pattern = re.compile(r'(?i)\b([a-z][a-z0-9_]*)([/-]+)([a-z][a-z0-9_]*)\b')

def replaceStrings(arg):
    if pattern.match(arg):
        pp = re.sub(r'[-/]','_', arg)
        print(pp)
        return pp

a = ['Aeroplane','Ramanujan','World-king','Pizza/Burger']
for i in a:
    replaceStrings(i)

output:

World_king
Pizza_Burger
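
One thing to keep in mind with the checked version: for strings that do not match, the function falls through and returns None. If the goal is only to know whether anything was replaced, a possible alternative is `re.subn`, which returns the new string together with the number of substitutions. The sketch below assumes that reporting "changed or not" is enough; `SEPARATORS`, `replace_strings`, and `results` are illustrative names.

```python
import re

SEPARATORS = re.compile(r'[-/]')

def replace_strings(text):
    # subn returns a tuple: (new_string, number_of_substitutions_made).
    cleaned, count = SEPARATORS.subn('_', text)
    # Non-matching input comes back unchanged with count == 0.
    return cleaned, count > 0

a = ['Aeroplane', 'Ramanujan', 'World-king', 'Pizza/Burger']
results = [replace_strings(term) for term in a]
print(results)
# [('Aeroplane', False), ('Ramanujan', False), ('World_king', True), ('Pizza_Burger', True)]
```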

edited Feb 24, 2017 at 13:50

answered Feb 24, 2017 at 13:40

Vikash Singh


8 Comments

WaterRocket8236 Over a year ago

I took a precautionary measure to make sure the string has the required pattern. My question is: is it good practice to write that extra step instead of calling re.sub directly? Thank you.

WaterRocket8236 Over a year ago

Great! Thank you. It is clear and concise now. I got the difference.

Vikash Singh Over a year ago

@BhabaniMohapatra the pattern you are looking for seems quite complicated. If you can tell me what exactly you are looking for maybe we can simplify that.

WaterRocket8236 Over a year ago

How do I parse a string like 'Hashimoto's_thyroiditis'? Notice the 's in the string. How do I remove the ' and get the output Hashimotos_thyroiditis?

Vikash Singh Over a year ago

@BhabaniMohapatra What you are asking for is a very custom use case. As your corpus is really large, you will run into more such cases over time. Unless you have a very clear-cut pattern for cleaning text and removing special characters, you will end up writing a lot of if conditions.
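
For the specific apostrophe case raised above, one possible way to keep such rules in a single place, rather than in scattered if conditions, is a translation table. This is only a sketch under the assumption that every rule is a single-character replacement or deletion; `RULES` and `clean` are hypothetical names, and a real corpus may well need more than this.

```python
# Hypothetical rule table: separators become "_", apostrophes are deleted (None).
RULES = str.maketrans({'-': '_', '/': '_', "'": None})

def clean(term):
    # str.translate applies every single-character rule in one pass.
    return term.translate(RULES)

print(clean("Hashimoto's_thyroiditis"))  # Hashimotos_thyroiditis
print(clean('Pizza/Burger'))             # Pizza_Burger
```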
