I'm trying to build n-grams that don't cross a period symbol. split() only works on strings, and list[index] only works with an index. Is there a way to split a list at a given element, such as a string? Here is a snippet of my current function:

text = ["split","this","stuff",".","my","dear"]

def generate_ngram(rawlist, ngram_order):
    """
    Input: List of words or characters, ngram-order ["this", "is", "an", "example"], 2
    Output: Set of tuples of words or characters {("this", "is"),("is","an"),...}
    """

    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order    
        generated_ngram = rawlist[i : ngram_order_index]

        #if "." in generated_ngram:
            #generated_ngram . . . 

        generated_tuple = tuple(generated_ngram)  
        list_of_tuples.append(generated_tuple)

    return set(list_of_tuples)

generate_ngram(text,3)

currently returns:

{('.', 'my', 'dear'),
 ('stuff', '.', 'my'),
 ('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

but it should ideally return:

{('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

Any idea on how to achieve this? Thanks for your help!

  • Many words that are not in the list appear in your output. Commented Feb 26, 2019 at 12:25
  • Please review your examples and try to explain a bit further what you want it to do. The documentation in the function seems to suggest you are trying to build n-grams. However, the outputs that you say you expect have different sizes. Do you want to build n-grams that do not cross a period symbol? Commented Feb 26, 2019 at 12:30
  • @jdehesa, thank you for your recommendations. I tried to adapt my documentation. Sorry, first time posting here! Yes, I indeed mean building n-grams that don't cross a period symbol/sentence border. Commented Feb 26, 2019 at 12:35

1 Answer

I'm not sure if this is exactly what you need, but this function generates n-grams that can only contain stop words (in this case, the period) at the end:

STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # All ngrams
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Generate only those ngrams that do not contain stop words before the end
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]
print(*generate_ngram(text, 3), sep="\n")
# ('split', 'this', 'stuff')
# ('this', 'stuff', '.')
print(*generate_ngram(text, 2), sep="\n")
# ('split', 'this')
# ('this', 'stuff')
# ('stuff', '.')
# ('my', 'dear')

Note that this function returns a generator. You can convert it to a list by wrapping it in list(...), or you can iterate over it directly.
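For instance, the original function in the question returned a set, and the generator can be passed straight to set(...) to get the same kind of result. This is a minimal sketch that repeats the function defined above so it runs on its own; the names ngram_set and ngram_list are only illustrative:

```python
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # Same function as above: n-grams with no stop word before the last position
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]

# Materialize the generator as a set (unordered, duplicates removed)
ngram_set = set(generate_ngram(text, 3))

# Or as a list, which keeps the order of appearance and any duplicates
ngram_list = list(generate_ngram(text, 3))
```

Keep in mind that a generator can only be consumed once, so call the function again if you need a second pass.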

EDIT: You may find the equivalent syntax below more readable.

def generate_ngram(rawlist, ngram_order):
    # Iterate over all ngrams
    for ngram in zip(*(rawlist[i:] for i in range(ngram_order))):
        # Yield only those not containing stop words before the end
        if not any(w in STOPWORDS for w in ngram[:-1]):
            yield ngram
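
If you would rather split the list at the period first, as the question originally asked, one possible alternative is to cut the tokens into sentences and build the n-grams inside each sentence. This is only a sketch: split_sentences and generate_ngram_by_sentence are hypothetical names, and it assumes each period stays attached to the end of its sentence, which should give the same n-grams as the filter above.

```python
STOPWORDS = {"."}

def split_sentences(tokens):
    # Hypothetical helper: cut the token list right after every stop word
    sentence = []
    for token in tokens:
        sentence.append(token)
        if token in STOPWORDS:
            yield sentence
            sentence = []
    if sentence:
        # Trailing tokens that are not closed by a stop word
        yield sentence

def generate_ngram_by_sentence(rawlist, ngram_order):
    # Build n-grams separately inside each sentence, so none crosses a period
    for sentence in split_sentences(rawlist):
        yield from zip(*(sentence[i:] for i in range(ngram_order)))

text = ["split", "this", "stuff", ".", "my", "dear"]
sentences = list(split_sentences(text))
trigrams = list(generate_ngram_by_sentence(text, 3))
bigrams = list(generate_ngram_by_sentence(text, 2))
```

Here sentences holds the two sub-lists on either side of the period, which may also be useful on its own if you need sentence boundaries elsewhere.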

2 Comments

That's precisely what I needed! Thank you so much.
@Lisa Glad it helped. I have added a syntax variation that you may find more readable. Please consider marking the answer as accepted if you feel it solved your question. Note, by the way, that this method assumes the input is a sequence, like a list or a tuple. If it is another kind of iterable, like a generator, then zip(*(rawlist[i:] for i in range(ngram_order))) would not work; you may look at "Rolling or sliding window iterator?" for alternatives to that line.
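
To illustrate the limitation mentioned in the last comment, here is one possible sliding-window variant that does not slice its input, so it should also accept generators. It is a sketch, not the approach from the linked question: generate_ngram_iter is a hypothetical name, it uses collections.deque with maxlen as the window, and it assumes ngram_order is at least 1.

```python
from collections import deque

STOPWORDS = {"."}

def generate_ngram_iter(tokens, ngram_order):
    # Hypothetical variant: works on any iterable (assumes ngram_order >= 1)
    window = deque(maxlen=ngram_order)  # oldest token drops out automatically
    for token in tokens:
        window.append(token)
        if len(window) == ngram_order:
            ngram = tuple(window)
            # Same filter as in the answer: no stop word before the last position
            if not any(w in STOPWORDS for w in ngram[:-1]):
                yield ngram

text = ["split", "this", "stuff", ".", "my", "dear"]
token_stream = (word for word in text)  # a generator, which cannot be sliced
result = list(generate_ngram_iter(token_stream, 3))
```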
