
I have a problem I'd be very grateful for help with.

Specifically, I have a gigantic text file; I need to replace specific strings in it with entries from a dictionary. Usefully, the words I need to replace are named in sequential fashion: 'Word1', 'Word2', ... , 'Wordn'.

Now, I'd like to write a 'for' loop that loops across the file, and for all instances of 'Wordx' replaces it with dictionary[x]. The problem, of course, is that 'Wordx' requires the 'x' part to function as a variable, which (so far as I know) can't be done inside a string.

Does anyone have a workaround? I tried looking at regular expressions, but found nothing obvious (possibly because I also found them somewhat confusing).

(Note that when I generate the text file, I have complete control over the form of the words I want to replace: i.e., it need not be 'Word11'; it can be 'Wordeleven' or 'wordXI' or anything ASCII at all.)

Edit: To add more detail, as requested: my text file is an export of the JavaScript behind a survey file. The original survey software only allows me to enter text prompts one at a time (as opposed to piping them in from a CSV), but I have several thousand text prompts to enter (the words). My plan is to manually enter about 100 words ('Word1', ..., 'Word100'), export the survey JavaScript as a text file, write a script to replace the words with dictionary entries, import the resulting files, and join them into a new survey.

However, the issue remains whether I can use the number portion of a string as a variable to loop across.

  • Maybe you need to show a clearer example, with more about your text file and what you want. Commented Jun 4, 2016 at 11:21
  • How big in bytes is this "gigantic" text file? Commented Jun 4, 2016 at 11:50
  • Eh, the size isn't really my point: I say 'gigantic' only to convey that it rewards writing some code, as opposed to doing a 'find' 'replace' one word at a time. Commented Jun 4, 2016 at 12:41

3 Answers

5

With re.sub(), you can pass it a function instead of a replacement string. This function can look up the replacement from a dictionary. For example:

import re

d = {'0': 'foo', '1': 'bar', '2': 'baz'}
print(re.sub(r'word(\d+)',
             lambda match: d[match.group(1)],
             "Hello word0, this is word2. How is word1?"))

Hello foo, this is baz. How is bar?
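
If the file might contain numbered words that have no entry in the dictionary, one possible adjustment to the replacement function is to fall back to the matched text. This is only a minimal sketch, assuming string keys as in d above; the helper name replace_known and the sample sentence are made up for illustration:

```python
import re

d = {'0': 'foo', '1': 'bar', '2': 'baz'}

def replace_known(match):
    # match.group(1) is the captured number; match.group(0) is the whole
    # matched word, returned unchanged when the number is not a key in d.
    return d.get(match.group(1), match.group(0))

result = re.sub(r'word(\d+)', replace_known,
                "Hello word0, this is word2. Who is word7?")
print(result)  # Hello foo, this is baz. Who is word7?
```

Here word7 is left as it is because '7' is not in the dictionary, so no KeyError is raised.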


4 Comments

This is great. The only issue I have with it is that the code throws an error if it encounters words with numbers that are not in the dictionary. This seems sub-optimal given that the author really just wants to replace where the integer is found in the dictionary, not react to all integers in the doc.
I don't know how well re.sub() can handle "gigantic" text files as input.
@Jason: Sure, but that only requires a minor adjustment to the replacement function. But what makes you think there could be words of the "wordx" pattern in the file that aren't in the dictionary? According to the info given, the OP has enough control over the file to prevent that situation from arising.
@Jasper If the performance of sub is a problem with a large string, you could do it line-by-line.
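
The line-by-line idea could look roughly like the sketch below. The function name replace_in_stream and the in-memory streams are assumptions for illustration; with a real file you would pass handles returned by open() instead.

```python
import io
import re

d = {'0': 'foo', '1': 'bar', '2': 'baz'}
pattern = re.compile(r'word(\d+)')

def replace_in_stream(src, dst):
    # Iterating over a file object yields one line at a time, so the whole
    # file never has to be held in memory at once.
    for line in src:
        dst.write(pattern.sub(lambda m: d[m.group(1)], line))

# io.StringIO stands in for real input and output files here.
src = io.StringIO("Hello word0\nthis is word2\n")
dst = io.StringIO()
replace_in_stream(src, dst)
print(dst.getvalue())
```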
2
n = 1
while not done:
    replace_str = 'Word' + str(n)
    # find and replace all instances of replace_str in the file text
    # set variable done if finished
    n += 1

Does that framework meet your needs? A string is not a variable: a string is a value which can be calculated, while a variable is a name, which (usually) is not calculated. With more difficulty you can also build strings like 'WordEleven' and so on.
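
One way to fill in that framework, as a rough sketch only: build each search string with 'Word' + str(n) and call str.replace on the text. The sample text and dictionary below are made up. Note that the numbers are processed from largest to smallest, because replacing 'Word1' first would also rewrite the start of 'Word11'.

```python
replacements = {1: 'apple', 2: 'banana', 11: 'cherry'}  # hypothetical data
text = "Word1 and Word2, then Word11."

# Go from the largest number down so 'Word1' cannot clobber 'Word11'.
for n in sorted(replacements, reverse=True):
    replace_str = 'Word' + str(n)
    text = text.replace(replace_str, replacements[n])

print(text)  # apple and banana, then cherry.
```

This still scans the whole text once per dictionary entry, which is acceptable for a one-off job but slower than a single regular-expression pass.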

3 Comments

Reading through a huge file repeatedly for each n is a very expensive method. It would be better to read through the file once, and do all the replacements on each line.
I agree with those concerns. I ignored them in my answer because the question as originally written gave very few details, and I wanted to concentrate on what seemed to be the main issue, "'Wordx' requires the 'x' part to function as a variable, which (so far as I know) can't be done inside a string".
For what it's worth, Rory Daulton's suggestion does capture what I was looking for in a fairly direct way. The other suggestions are excellent too, but this gives the kind of workaround that the problem needs. It may well be less efficient, though the problem is a one-off.
1

I suppose the text file you were talking about looks like this:

Hi! This is word1

I like to swim, word2 and word3 ....

If so, you can read the file line by line, split each line into words, and replace each matching word with the dictionary value whose key is the integer parsed from the word's numeric suffix.

Here is the code:

from __future__ import print_function

# Avoid naming this `dict`, which would shadow the built-in type.
replacements = {1: 'Aravind', 2: 'eat', 3: 'play'}

def word_gen(file):
    for line in file:
        for word in line.split():
            # Match 'word' followed only by digits, e.g. word1 or word12.
            if word.startswith('word') and word[4:].isdigit():
                # Remove int() if the keys are strings like {'1': 'Mark', ...}.
                print(replacements[int(word[4:])], end=" ")
            else:
                print(word, end=" ")
        print()  # finish the current output line


with open('re.txt', 'r') as f:
    word_gen(f)

Now redirect the terminal output to another file with:

python replace.py > replaced.txt

Hope that helps :)
