Use re.sub to replace a variable that is a regex with a string

Question

I need to replace KEY with VAL. KEY is a regex like import.* and VAL is a string like "important". I know this code is wrong, because the key is a regex and str.replace treats it as literal text, but I couldn't find a solution that works.

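import re

import nltk
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer

# load the stem dictionary: each line is "regex:replacement"
d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        key, val = line.split(":")
        d[key.replace("\n", "")] = val.replace("\n", "")

# define the tokenizer
def custom_tokenizer(text):
    # split the text into word tokens
    tokens = nltk.tokenize.word_tokenize(text)
    # stemmer (this is the problematic part: key is a regex,
    # but str.replace treats it as literal text)
    for key, val in d.items():
        tokens = [token.replace(key, val) for token in tokens]
    # remove special characters
    tokens = [re.sub(r'[^a-zA-Z0-9]', "", token) for token in tokens]
    return tokens

# post_text is the collection of documents (not shown here)
cv = CountVectorizer(tokenizer=custom_tokenizer, analyzer='word', encoding='utf-8', min_df=0, max_df=1.0)
post_textCV = cv.fit_transform(post_text)
df = DataFrame(post_textCV.A, columns=cv.get_feature_names())
print(df.head())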

So, the problem is this line here:

tokens=[token.replace(key,val) for token in tokens]

I don't really agree with the duplicate target. It sounds irrelevant. And it's really a bad original question with below par 0-score accepted answer. I don't even understand the answer...
– Jean-François Fabre ♦, Commented Aug 18, 2017 at 19:52
There is nothing about re.sub replacement. This is really not helping me out.
– user8451312, Commented Aug 18, 2017 at 20:06
@Alexander reopened the question (someone already voted for it). I hope you don't mind.
– Jean-François Fabre ♦, Commented Aug 18, 2017 at 20:06
@Alexander that said I understand perfectly the urge to close a question as a duplicate to avoid dump copied/pasted answers... I'm so frustrated when someone answers before I find the exact dupe.
– Jean-François Fabre ♦, Commented Aug 18, 2017 at 20:21
@hope94 Would you mind posting a small sample of the data? Presumably from "Stem rečnik.txt" and post_text.
– Alexander, Commented Aug 18, 2017 at 23:35

Jean-François Fabre · Accepted Answer · 2017-08-18 20:45:03Z

1

token.replace(key,val) invokes str.replace, which does a plain literal string replacement and never interprets key as a regex.

To do a regex replacement, use re.sub instead:

tokens=[re.sub(key,val,token) for token in tokens]
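
To make the difference concrete, here is a minimal sketch using the example pair from the question. The token "importing" is a made-up sample, not taken from the asker's data.

```python
import re

# Example pair from the question; the token is a hypothetical sample
key, val = "import.*", "important"
token = "importing"

# str.replace searches for the literal text "import.*", which the token
# does not contain, so the token comes back unchanged
literal_result = token.replace(key, val)

# re.sub interprets key as a pattern: "import" followed by any characters
regex_result = re.sub(key, val, token)

print(literal_result)  # importing
print(regex_result)    # important
```

Note that re.sub also interprets backslashes in the replacement string, so a value containing a backslash may need escaping.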

That said, this is rather inefficient: it rebuilds the whole token list once for every key/val pair.
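
One possible way to reduce the overhead, sketched under assumptions: compile each pattern once, outside the tokenizer, and walk the token list a single time. The dictionary contents and the helper name apply_rules are hypothetical. This does the same amount of matching as the one-liner above; it only avoids recompiling patterns (the re module keeps a bounded cache of compiled patterns, which a large dictionary can overflow) and building one intermediate list per rule.

```python
import re

# Hypothetical stand-in for the dictionary loaded from the stem file
d = {"import.*": "important", "bank.*": "bank"}

# Compile every pattern once, at load time rather than per document
compiled_rules = [(re.compile(key), val) for key, val in d.items()]

def apply_rules(tokens):
    """Apply every rule to every token, in dictionary order."""
    result = []
    for token in tokens:
        for pattern, val in compiled_rules:
            token = pattern.sub(val, token)
        result.append(token)
    return result

print(apply_rules(["importing", "banking", "hello"]))
# ['important', 'bank', 'hello']
```

Inside custom_tokenizer, the stemming loop could then be replaced by a single call such as tokens = apply_rules(tokens).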

edited Aug 18, 2017 at 20:45

answered Aug 18, 2017 at 19:48


6 Comments

user8451312 Over a year ago

The code has been running for more than 6 minutes and still hasn't finished; that happens to me every time something is not OK. The rest of the code works fine when I don't run this part.

Jean-François Fabre Over a year ago

that's what I thought. I edited my answer to show a really efficient solution. More complex, but faster.

user8451312 Over a year ago

I am not sure how to implement it. It will take me some time to study it, because I've never used the things you mentioned here. I am new to Python. So I tried just copying those two lines, but I got an error: KeyError: 'intesa'. (intesa is a value in the dictionary)

Jean-François Fabre Over a year ago

Damn, my solution doesn't work when the keys are patterns... It works fine when the keys are literal values (so the matched group IS the key), but not here.

user8451312 Over a year ago

I tried this, but it takes too long, I suppose because there are a lot of iterations: tokens=[re.sub(r'%s' %key, '%s'%val,token ) for token in tokens]
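
A possible single-pass alternative, offered as a sketch rather than as the solution that was edited into the answer: join all patterns into one alternation, wrap each in its own named group, and use the match object's lastgroup attribute to find out which rule matched. This sidesteps the KeyError mentioned above, because the lookup uses the group name instead of the matched text. The sample dictionary, the rule0/rule1 group names and the stem helper are hypothetical, and the sketch assumes the patterns define no groups or backreferences of their own.

```python
import re

# Hypothetical stand-in for the dictionary loaded from the stem file
d = {"import.*": "important", "bank.*": "bank"}

# Give each pattern its own named group so a match identifies its rule
names = [f"rule{i}" for i in range(len(d))]
replacement_for = dict(zip(names, d.values()))
combined = re.compile(
    "|".join(f"(?P<{name}>{key})" for name, key in zip(names, d))
)

def replace_match(match):
    # lastgroup holds the name of the alternative that matched
    return replacement_for[match.lastgroup]

def stem(token):
    return combined.sub(replace_match, token)

print([stem(t) for t in ["importing", "banking", "hello"]])
# ['important', 'bank', 'hello']
```

The behavior differs from the loop over d.items(): at each position only the first matching alternative is applied, and the output of one rule is not fed into the next, so results can diverge when rules overlap or are meant to chain.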
