I need to replace each KEY with its VAL. The key is a regex such as import.* and the value is a string such as "important". I know this code is wrong, because the key is a regex and str.replace treats it as literal text, but I couldn't find a solution that works.

import re
import nltk
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer

# import stem dict: each line of the file is "regex:replacement"
d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        key, val = line.split(":")
        d[key.replace("\n", "")] = val.replace("\n", "")

# define tokenizer
def custom_tokenizer(text):
    # split into word tokens
    tokens = nltk.tokenize.word_tokenize(text)
    # stemmer (this is the step that does not work)
    for key, val in d.items():
        tokens = [token.replace(key, val) for token in tokens]
    # remove special characters
    tokens = [re.sub(r'[^a-zA-Z0-9]', "", token) for token in tokens]
    return tokens

cv = CountVectorizer(tokenizer=custom_tokenizer, analyzer='word', encoding='utf-8', min_df=0, max_df=1.0)
post_textCV = cv.fit_transform(post_text)
df = DataFrame(post_textCV.A, columns=cv.get_feature_names())
print(df.head())

So, the problem is this line here:

tokens=[token.replace(key,val) for token in tokens]
  • I don't really agree with the duplicate target. It sounds irrelevant. And it's really a bad original question with a below-par, 0-score accepted answer. I don't even understand the answer... Commented Aug 18, 2017 at 19:52
  • There is nothing about re.sub replacement. This is really not helping me out. Commented Aug 18, 2017 at 20:06
  • @Alexander reopened the question (someone already voted for it). I hope you don't mind. Commented Aug 18, 2017 at 20:06
  • @Alexander that said I understand perfectly the urge to close a question as a duplicate to avoid dump copied/pasted answers... I'm so frustrated when someone answers before I find the exact dupe. Commented Aug 18, 2017 at 20:21
  • @hope94 Would you mind posting a small sample of the data? Presumably from "Stem rečnik.txt" and post_text. Commented Aug 18, 2017 at 23:35

1 Answer

token.replace(key,val) invokes str.replace, which does plain literal string replacement: it looks for the exact text of key and never treats it as a pattern.

To do a regex replacement, use re.sub instead:

tokens=[re.sub(key,val,token) for token in tokens]
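
To make the difference concrete, here is a minimal sketch using the sample pair from the question (import.* and "important"). The token list is made up for illustration.

```python
import re

# Sample key/val pair taken from the question; the tokens are hypothetical.
key, val = "import.*", "important"
tokens = ["importing", "imports", "export"]

# str.replace searches for the literal text "import.*", which never occurs,
# so every token comes back unchanged.
literal = [token.replace(key, val) for token in tokens]

# re.sub interprets key as a regular expression and replaces each match.
regex = [re.sub(key, val, token) for token in tokens]

print(literal)
print(regex)
```

Note that re.sub also interprets backslashes in the replacement string, so a val containing a backslash would need escaping.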

That said, this seems rather inefficient: it rebuilds the whole token list once for every key/val pair.
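
One way to reduce that cost, sketched here under assumptions rather than offered as a tested fix for the asker's data, is to compile every pattern once up front and cache the result per distinct token, so a word that appears many times in the corpus is only stemmed once. The names stem_rules, compiled_rules, stem_token and stem_tokens are hypothetical, and the two rules are made-up sample data standing in for the dictionary d.

```python
import re
from functools import lru_cache

# Hypothetical stand-in for the dictionary `d` loaded from the stem file.
stem_rules = {"import.*": "important", "bank.*": "bank"}

# Compile each pattern once instead of passing pattern strings to re.sub
# on every call.
compiled_rules = [(re.compile(pattern), replacement)
                  for pattern, replacement in stem_rules.items()]

@lru_cache(maxsize=None)
def stem_token(token):
    """Apply every rule in order to one token; results are cached per token."""
    for pattern, replacement in compiled_rules:
        token = pattern.sub(replacement, token)
    return token

def stem_tokens(tokens):
    """Stem a list of tokens, reusing cached results for repeated words."""
    return [stem_token(token) for token in tokens]

result = stem_tokens(["importing", "banking", "imports", "tree"])
print(result)
```

This keeps the same semantics as the list comprehension above (every rule is applied to every token, in dictionary order); it only avoids repeated work. How much it helps depends on how many distinct tokens and rules there are.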


6 Comments

The code has been running for more than 6 minutes and still hasn't finished. That happens to me every time something is not OK. The rest of the code works fine when I don't run this part.
That's what I thought. I edited my answer to show a really efficient solution. More complex, but faster.
I am not sure how to implement it. It will take some time for me to study it, because I've never used the things you mentioned here before. I am new to Python. So I tried just copying those two lines, but I got an error: KeyError: 'intesa'. (intesa is a value in the dictionary)
Damn, my solution doesn't work when there are patterns... It works fine when the keys are actual values (so the matched group IS the key), but not here.
I tried this, but it takes too long. I suppose because there are a lot of iterations... tokens=[re.sub(r'%s' %key, '%s'%val,token ) for token in tokens]