I have a string object (type str) called 'corpus_jn'. It is composed of about a hundred sentences. From this object, I'd like to delete the substrings I have in a list called boilerplates. Ex:

boilerplates = ['Contact Number: 444-444-4444.', 'More information provided on request.']
corpus_jn = (corpus_jn.replace(sentence, '') for sentence in boilerplates)

The code executes, but when I try to print it, it outputs a generator object:

print(corpus_jn)

<generator object <genexpr> at 0x0000000012552518>

How can I maintain or output my str object?

1 Answer

replace does not modify the original string. You need to reassign to it for every sentence:

for sentence in boilerplates:
    corpus_jn = corpus_jn.replace(sentence, '')

Or you can use a regex:

import re
regex = '|'.join(map(re.escape, boilerplates))
corpus_jn = re.sub(regex, '', corpus_jn)

This will probably be more efficient, since it scans the string only once instead of once per boilerplate.
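As a rough check, the two approaches above can be wrapped in functions and compared on a small sample. This is only a sketch: the function names and the sample text are made up for illustration, and it does not measure speed.

```python
import re

def strip_with_loop(text, boilerplates):
    # Reassign on every pass so each replacement builds on the previous one.
    for sentence in boilerplates:
        text = text.replace(sentence, '')
    return text

def strip_with_regex(text, boilerplates):
    # re.escape keeps characters such as '.' from acting as regex operators.
    pattern = '|'.join(map(re.escape, boilerplates))
    return re.sub(pattern, '', text)

# Hypothetical sample text built around the boilerplates from the question.
boilerplates = ['Contact Number: 444-444-4444.', 'More information provided on request.']
sample = ('Great product. Contact Number: 444-444-4444. '
          'Ships fast. More information provided on request.')

loop_result = strip_with_loop(sample, boilerplates)
regex_result = strip_with_regex(sample, boilerplates)
print(loop_result == regex_result)
```

The two can differ when one boilerplate is a prefix of another or when they overlap in the text, because the regex tries alternatives in list order at each position, so it is worth checking that case on your own data.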


Just to clarify: your original code does not do any replacement at all. The right-hand side of the assignment is a generator expression, which produces a generator object that does nothing until something iterates over it.

print converts its argument with str, but that conversion does not iterate over the generator; it just produces the <generator object ...> text.

Even if you consumed the generator properly using ''.join or a list comprehension, you would not obtain what you expected:

>>> text = 'hello 123 hello bye'
>>> boilerplates = ['hello', 'bye']
>>> [text.replace(sentence, '') for sentence in boilerplates]
[' 123  bye', 'hello 123 hello ']

As you can see, the first iteration removes the word hello from text, but the second iteration still works on the original value. Hence you get one string with no hello that still contains bye, and another with no bye that still contains hello. To remove both you have to use one of the solutions above; you can't do it with a generator used in that way.
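The difference can be seen side by side in a short sketch. The functools.reduce call is not part of the solutions above; it is just one way to write the same reassignment loop as a single expression, and the variable names are made up.

```python
from functools import reduce

text = 'hello 123 hello bye'
boilerplates = ['hello', 'bye']

# Each element is computed from the ORIGINAL text, so the results are independent.
lazy = (text.replace(sentence, '') for sentence in boilerplates)
independent = list(lazy)

# reduce feeds the result of one replace into the next, like the for loop does.
chained = reduce(lambda acc, sentence: acc.replace(sentence, ''), boilerplates, text)

print(independent)    # [' 123  bye', 'hello 123 hello ']
print(repr(chained))  # ' 123  '
```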


Comments

These generator outputs are rather perplexing and admittedly very frustrating. But I understand they save a lot of space, so I have been trying to employ them as much as possible. I plan to spend a lot of time with these, as I have millions, and possibly around a billion, lines to work with, each with a hundred sentences or so. Thanks.
@spacedustpi That output is the generic result of the method object.__repr__. Basically, when an object "does not know" how to print itself, it simply prints that string, which tells you the type and its id. A generator cannot really do anything else, because it does not know what it contains until it consumes itself, and you probably don't want to consume it just to print it.
@spacedustpi Yes, generators are "one-off". Once you iterate over one, it's done. That is also what allows them to avoid consuming space: they literally do not contain anything until you try to fetch an element. That element is then computed and returned; it is not kept in memory.
@spacedustpi Consider (time.time() for _ in range(100)). Whenever you fetch an item, time.time() is called and its result returned, so you get the system time at the instant you fetch that item. There is no way to compute this beforehand, since you don't know when that iteration will happen, and after it has happened there is no way to compute the same value again. The generator only keeps the state of the expression that defines it. In this case it holds a reference to the iterator over range(100).
@spacedustpi Functions can depend on system state that cannot be reproduced afterwards. So it's not just a matter of memory but of infeasibility.
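The "one-off" behaviour described in these comments is easy to see in a small sketch (the variable names are made up for illustration):

```python
squares = (n * n for n in range(3))

# str()/repr() do not iterate; they only describe the generator object.
description = repr(squares)
print(description.startswith('<generator object'))  # True

first_pass = list(squares)   # computes each element on demand
second_pass = list(squares)  # the generator is already exhausted

print(first_pass)   # [0, 1, 4]
print(second_pass)  # []
```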