I would like to replace a list of words using Python's re module, but I have not been able to get it to work after many attempts.

# full_list.txt
# Tab is the delimiter.
# The left column holds the words to search for.
# The right column holds their replacements.

%!あ [×啞]    @#あ [啞]
%!あい きょう [\(愛嬌\)・\(愛▷敬\)]   @#あいきょう [愛嬌]
    .
    .
    .

My code is as follows:

import re

with open('full_list.txt', 'r', encoding='utf-8') as f:
    search_list = [line.strip().split('\t')[0] for line in f]

with open('full_list.txt', 'r', encoding='utf-8') as f:
    replace_list = [line.strip().split('\t')[1] for line in f]

with open('document.txt', 'r', encoding='utf-8') as f:
    content = f.read()

def replace_func(x, content):
    content = re.sub(search_list[x], replace_list[x], content)
    return content

x = 0
while x < 30:
    content = replace_func(x, content)
    x+=1

with open('new_document.txt', 'w', encoding='utf-8') as f:
    f.write(content)

After running the code, some words are replaced while others are not. What could be wrong with the code?

  • Are there only 30 words in the full_list.txt or more? Commented Feb 14, 2015 at 10:54
  • @falsetru No, there are around 70,000 words to be replaced. I only tested the first 30 items to see whether something was wrong, and unfortunately something is. Commented Feb 14, 2015 at 10:56
  • @AvinashRaj Hi Avinash, nice to see you again. My expected output is for the words in search_list to be replaced with those in replace_list. Commented Feb 14, 2015 at 11:06

1 Answer

If you only want to replace literal words, don't use regular expressions; use the replace method of strings instead:

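# Read the tab-separated pairs once: search text on the left, replacement on the right.
with open('full_list.txt', 'r', encoding='utf-8') as f:
    pairs = (line.strip().split('\t') for line in f)
    # Skip blank or malformed lines so the unpacking in the loop below cannot fail.
    search_and_replace = [pair for pair in pairs if len(pair) == 2]

with open('document.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# str.replace treats the search text literally, so [, ( and \ have no special meaning.
for search, repl in search_and_replace:
    content = content.replace(search, repl)

with open('new_document.txt', 'w', encoding='utf-8') as f:
    f.write(content)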

4 Comments

With your code, some lines still cannot be replaced. For example: %!あい ぎ [合い着・▷間着]<sub>アヒ–</sub> @#あいぎ [合い着・間着]
Did you consider that you are using different kinds of square brackets?
I have considered that square bracket; however, I haven't figured out why [ would make the string replacement fail. Could you please explain that? After all, that is the reason I used re.sub in the first place, as that function worked nicely.
First, the bracket character in your data is not the same as the ASCII [. Second, some characters, such as [, have a special meaning in regular expressions.
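
To illustrate the two points in the last comment, here is a minimal sketch rather than a definitive fix. It takes the first sample line from the question as its search/replacement pair. The helper name replace_literal, the sample sentence in text, and the two look-alike brackets in the final loop are assumptions chosen for the demonstration; the actual bracket character in the asker's data is not known.

```python
import re
import unicodedata


def replace_literal(content, search, repl):
    """Replace every literal occurrence of `search` in `content` via re.sub.

    `replace_literal` is a hypothetical helper written for this example.
    """
    # re.escape neutralises metacharacters such as [ ] ( ) and backslashes.
    # The lambda stops backslashes in `repl` from being read as group references.
    return re.sub(re.escape(search), lambda match: repl, content)


search = "%!あ [×啞]"            # left column of the first sample line
repl = "@#あ [啞]"               # right column of the first sample line
text = "見出し %!あ [×啞] 本文"   # made-up document text for the demo

# Unescaped, [×啞] is a character class matching ONE character (× or 啞),
# so the pattern never matches the literal bracketed text.
print(re.sub(search, repl, text) == text)

# Escaped, the whole search string is matched literally and replaced.
print(replace_literal(text, search, repl))

# Look-alike brackets are distinct code points; their names make that visible.
# The fullwidth and tortoise-shell brackets are only examples of look-alikes.
for ch in "[［〔":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

If the search column is always meant literally, content.replace(search, repl) from the answer gives the same result without any escaping; re.escape only matters when re.sub has to stay in the code.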
