Splitting a list into sublists by a given separator in Python

Question

I'm trying to build n-grams which don't cross a period symbol. split() only works on strings, and list[index] only works with an index. Is there a way to split a list by giving it a string or an element? Here is a snippet of my current function:

text = ["split","this","stuff",".","my","dear"]

def generate_ngram(rawlist, ngram_order):
    """
    Input: list of words or characters and the n-gram order, e.g. ["this", "is", "an", "example"], 2
    Output: set of tuples of words or characters, e.g. {("this", "is"), ("is", "an"), ...}
    """

    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order    
        generated_ngram = rawlist[i : ngram_order_index]

        #if "." in generated_ngram:
            #generated_ngram . . . 

        generated_tuple = tuple(generated_ngram)  
        list_of_tuples.append(generated_tuple)

    return set(list_of_tuples)

generate_ngram(text,3)

currently returns:

{('.', 'my', 'dear'),
 ('stuff', '.', 'my'),
 ('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

but it should ideally return:

{('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

Any idea on how to achieve this? Thanks for your help!

Your output contains many words that are not in your list.
– Sociopath, Commented Feb 26, 2019 at 12:25
Please review your examples and try to explain a bit further what you want it to do. The documentation in the function seems to suggest you are trying to build n-grams. However, the outputs that you say you expect have different sizes. Do you want to build n-grams that do not cross a period symbol?
– javidcf, Commented Feb 26, 2019 at 12:30
@jdehesa, thank you for your recommendations. I tried to adapt my documentation. Sorry, first time posting here! Yes, I do mean building n-grams that don't cross a period symbol/sentence border.
– Lisa, Commented Feb 26, 2019 at 12:35

javidcf · Accepted Answer · 2019-02-26 12:46:49Z


I'm not sure if this is exactly what you need, but this function generates n-grams that can contain stop words (in this case, the period) only at the end:

STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # All ngrams
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Generate only those ngrams that do not contain stop words before the end
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]
print(*generate_ngram(text, 3), sep="\n")
# ('split', 'this', 'stuff')
# ('this', 'stuff', '.')
print(*generate_ngram(text, 2), sep="\n")
# ('split', 'this')
# ('this', 'stuff')
# ('stuff', '.')
# ('my', 'dear')

Note that this function returns a generator. You can convert it to a list by wrapping it in list(...), or you can iterate over it directly.
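As a small sketch of that point (the function is repeated so the snippet runs on its own, and the variable names are only illustrative): a generator can be consumed just once, so materialize it with list(...) or set(...) if the n-grams are needed more than once.

```python
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # Same function as above, repeated so this snippet is self-contained
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]

gen = generate_ngram(text, 3)
as_list = list(gen)    # consumes the generator
leftover = list(gen)   # the generator is exhausted now, so this list is empty

# set(...) gives the same return type as the function in the question
as_set = set(generate_ngram(text, 3))
```

Using set(...) also drops duplicate n-grams and loses their order, which may or may not be what you want.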

EDIT: You may find the equivalent syntax below more readable.

def generate_ngram(rawlist, ngram_order):
    # Iterate over all ngrams
    for ngram in zip(*(rawlist[i:] for i in range(ngram_order))):
        # Yield only those not containing stop words before the end
        if not any(w in STOPWORDS for w in ngram[:-1]):
            yield ngram
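If you would rather split the list at the separator first, as the question title suggests, something along these lines might also work. This is only a sketch: split_sentences and ngrams_per_sentence are hypothetical helper names, and it assumes each separator should stay at the end of its sublist.

```python
def split_sentences(words, separators=frozenset({"."})):
    """Split a flat word list into sublists, each ending at a separator."""
    sentence = []
    for word in words:
        sentence.append(word)
        if word in separators:
            yield sentence
            sentence = []
    if sentence:
        # Trailing words that are not closed by a separator
        yield sentence

def ngrams_per_sentence(words, ngram_order, separators=frozenset({"."})):
    """Build n-grams inside each sublist, so none of them crosses a separator."""
    for sentence in split_sentences(words, separators):
        for i in range(len(sentence) - ngram_order + 1):
            yield tuple(sentence[i:i + ngram_order])

text = ["split", "this", "stuff", ".", "my", "dear"]
sentences = list(split_sentences(text))
trigrams = list(ngrams_per_sentence(text, 3))
bigrams = list(ngrams_per_sentence(text, 2))
```

On the example list this should give the same n-grams as the functions above. The intermediate sublists are also available on their own if you need them for something else.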

edited Feb 26, 2019 at 12:46

answered Feb 26, 2019 at 12:37

javidcf


2 Comments

Lisa Over a year ago

That's precisely what I needed! Thank you so much.

javidcf Over a year ago

@Lisa Glad it helped. I have added a syntax variation that you may find more readable. Please consider marking the answer as accepted if you feel it solved your question. Note, by the way, that this method assumes the input is a sequence, like a list or a tuple. If it is another kind of iterable, like a generator, then zip(*(rawlist[i:] for i in range(ngram_order))) would not work; you may look at the question "Rolling or sliding window iterator?" for alternatives to that line.
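For what it's worth, one possible sliding-window variant that also accepts generators could look like the sketch below. It is not taken from the linked question; generate_ngram_iter is a hypothetical name, and the window is kept in a collections.deque with a fixed maximum length.

```python
from collections import deque

STOPWORDS = {"."}

def generate_ngram_iter(words, ngram_order):
    """Yield n-grams from any iterable, skipping those with a stop word before the end."""
    # A deque with maxlen drops the oldest item automatically when a new one is appended
    window = deque(maxlen=ngram_order)
    for word in words:
        window.append(word)
        if len(window) == ngram_order:
            ngram = tuple(window)
            if not any(w in STOPWORDS for w in ngram[:-1]):
                yield ngram

text = ["split", "this", "stuff", ".", "my", "dear"]
# Passing a generator instead of a list, which slicing with rawlist[i:] could not handle
result = list(generate_ngram_iter((w for w in text), 3))
```

This reads the input only once and never slices it, so it should work for lists, tuples and generators alike.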
