I have a list named values containing a series of numbers:

values = [0, 1, 2, 3, 4, 5, ... , 351, 0, 1, 2, 3, 4, 5, 6, ... , 750, 0, 1, 2, 3, 4, 5, ... , 559]

I want to create a new list of sublists, each containing the elements from 0 up to a number.

Like this:

new_values = [[0, 1, 2, ... , 351], [0, 1, 2, ... , 750], [0, 1, 2, ... , 559]]

The code I wrote was this:

start = 0
new_values = []
for i, val in enumerate(values):
    if val == 0:
        new_values.append(values[start:i])
        start = i

However, this is what it returns:

new_values = [[], [0, 1, 2, ... , 750], [0, 1, 2, ... , 559]]

How can I fix my code? Any help would be greatly appreciated.

5 Answers

Your problem, with the code as written, is that it includes an empty list at the beginning and omits the final sublist. The minimalist fix for this is to:

  1. Change the test to avoid appending the first list (when i is 0), e.g. if val == 0 and i != 0:

  2. Append the final group after the loop exits

Combining the two fixes, you'd have:

start = 0
new_values = []
for i, val in enumerate(values):
    if val == 0 and i != 0:  # Avoid appending the initial empty list
        new_values.append(values[start:i])
        start = i
if values:  # Guard against empty values, where there's nothing to add
    new_values.append(values[start:])  # Add the final group
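
For a quick sanity check, here's the combined fix run on a small made-up input (not the OP's actual data):

>>> values = [0, 1, 2, 0, 1, 0, 1, 2, 3]
>>> start = 0
>>> new_values = []
>>> for i, val in enumerate(values):
...     if val == 0 and i != 0:
...         new_values.append(values[start:i])
...         start = i
...
>>> if values:
...     new_values.append(values[start:])
...
>>> new_values
[[0, 1, 2], [0, 1], [0, 1, 2, 3]]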

I was going to add the cleaner groupby solution which avoids the special cases for beginning/end of list, but Chris_Rands already handled that, so I'll refer you to his answer.

Somewhat surprisingly, this actually seems to be the fastest solution, asymptotically, at the expense of requiring the input to be a list (where some of the other solutions can accept arbitrary iterables, including pure iterators for which indexing is impossible).

For comparison (using Python 3.5's additional unpacking generalizations, both for brevity and to get optimal performance on modern Python, and using the implicit truthiness of int rather than comparing to 0, which is equivalent for int input but meaningfully faster):

from itertools import groupby, zip_longest

# truth is the same as bool, but unlike the bool constructor, it requires
# exactly one positional argument, which makes a *major* difference
# on runtime when it's in a hot code path
from operator import truth

def method1(values):
    # Optimized/correct OP's code
    # Only works on list inputs, and requires non-empty values to begin with 0,
    # but handles repeated 0s as separate groups properly
    new_values = []
    start = None
    for i, val in enumerate(values):
        if not val and i:
            new_values.append(values[start:i])
            start = i
    if values:
        new_values.append(values[start:])
    return new_values

def method2(values):
    # Works with arbitrary iterables and iterators, but doesn't handle
    # repeated 0s or non-empty values that don't begin with 0
    return [[0, *g] for k, g in groupby(values, truth) if k]

def method3(values):
    # Same behaviors and limitations as method1, but without verbose
    # special casing for begin and end
    start_indices = [i for i, val in enumerate(values) if not val]

    # End indices for all but terminal slice are previous start index
    # so make iterator and discard first value to pair properly
    end_indices = iter(start_indices)
    next(end_indices, None)

    # Pairing with zip_longest avoids the need to explicitly pad end_indices
    return [values[s:e] for s, e in zip_longest(start_indices, end_indices)]

def method4(values):
    # Requires any non-empty values to begin with 0
    # but otherwise handles runs of 0s and arbitrary iterables (including iterators)
    new_values = []
    for val in values:
        if not val:
            curlist = [val]
            new_values.append(curlist)
            # Use pre-bound method in local name for speed
            curlist_append = curlist.append
        else:
            curlist_append(val)
    return new_values

def method5(values):
    # Most flexible solution; similar to method2, but handles all inputs, empty, non-empty,
    # with or without leading 0, with or without runs of repeated 0s
    new_values = []
    for nonzero, grp in groupby(values, truth):
        if nonzero:
            try:
                new_values[-1] += grp
            except IndexError:
                new_values.append([*grp])  # Only happens when values begins with nonzero
        else:
            new_values += [[0] for _ in grp]
    return new_values
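
Before timing, a quick sanity check that all five methods agree (on an input I constructed to satisfy every method's constraints: list input, begins with 0, no repeated 0s):

values = [*range(5), *range(3), *range(4)]
results = [m(values) for m in (method1, method2, method3, method4, method5)]
assert all(r == results[0] for r in results)
print(results[0])  # [[0, 1, 2, 3, 4], [0, 1, 2], [0, 1, 2, 3]]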

Timings on Python 3.6, Linux x64, using ipython 6.1's %timeit magic:

>>> values = [*range(100), *range(50), *range(150)]
>>> %timeit -r5 method1(values)
12.5 μs ± 50.6 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)

>>> %timeit -r5 method2(values)
16.9 μs ± 54.9 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)

>>> %timeit -r5 method3(values)
13 μs ± 18.9 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)

>>> %timeit -r5 method4(values)
16.7 μs ± 9.51 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)

>>> %timeit -r5 method5(values)
18.2 μs ± 25.2 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)

Summary:

Solutions that slice out the runs in bulk (method1, method3) are the fastest, but depend on the input being a sequence (and if the return type must be list, the input must be list too, or conversions must be added).

groupby solutions (method2, method5) are a little slower, but are typically quite succinct (handling all edge cases, as method5 does, requires neither extreme verbosity nor explicit test-and-check LBYL patterns). They also don't need much hackery to run as fast as possible, aside from using operator.truth instead of bool. That swap matters because CPython's bool constructor is very slow thanks to some odd implementation details: bool must accept full varargs, including keywords, dispatching through the object-construction machinery, which costs far more than operator.truth's low-overhead path that takes exactly one positional argument and bypasses the construction machinery entirely. If bool were used as the key function instead of operator.truth, runtimes would more than double (to 36.8 μs and 38.8 μs for method2 and method5 respectively).
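
If you want to reproduce the key-function overhead yourself, here's a minimal sketch (absolute numbers will vary by machine):

from itertools import groupby
from operator import truth
from timeit import timeit

values = [*range(100), *range(50), *range(150)]

for key in (bool, truth):
    elapsed = timeit(lambda: [[0, *g] for k, g in groupby(values, key) if k],
                     number=100000)
    print(key.__name__, elapsed)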

In between sits the slower but more flexible approach using item-by-item appends (method4): it handles arbitrary input iterables, including iterators, and handles runs of 0s with no special casing. The problem is that getting maximum performance out of it requires much more verbose code (because of the need to avoid repeated indexing and method binding); if the loop of method4 is changed to the much more succinct:

for val in values:
    if not val:
        new_values.append([])
    new_values[-1].append(val)

the runtime more than doubles (to ~34.4 μs), thanks to the cost of repeatedly indexing new_values and binding the append method over and over.
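
To see the indexing/binding cost in isolation, here's a standalone sketch (the setup string and names like cur_append are mine, purely for illustration; numbers are machine-dependent):

from timeit import timeit

setup = "data = [0, *range(1, 100)] * 3"

succinct = '''
out = []
for val in data:
    if not val:
        out.append([])
    out[-1].append(val)  # index + attribute lookup on every iteration
'''

prebound = '''
out = []
for val in data:
    if not val:
        cur = [val]
        out.append(cur)
        cur_append = cur.append  # bind once per group
    else:
        cur_append(val)
'''

print(timeit(succinct, setup, number=10000))
print(timeit(prebound, setup, number=10000))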

In any event, personally, if performance wasn't absolutely critical, I'd use one of the groupby solutions using bool as the key just to avoid imports and uncommon APIs. If performance was more important, I'd probably still use groupby, but swap in operator.truth as the key function; sure, it's not as fast as the spelled out version, but for people who know groupby, it's easy enough to follow, and it's generally the most succinct solution for any given level of edge case handling.
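
Concretely, the default version I'd reach for looks like this (split_runs is just an illustrative name, not anything from the question):

from itertools import groupby

def split_runs(values):
    # One sublist per truthy run, each re-prefixed with the 0 that groupby consumed
    return [[0, *g] for k, g in groupby(values, bool) if k]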


You can group the elements with itertools.groupby based on their truthiness (0 is falsy), extract the sublists of nonzero values, and re-prepend the missing 0 in a list comprehension:

 [[0]+list(g) for k, g in groupby(values, bool) if k]

Example:

>>> from itertools import groupby
>>> values = [0, 1, 2, 3, 4, 5, 351, 0, 1, 2, 3, 4, 5, 6, 750, 0, 1, 2, 3, 4, 559]
>>> [[0]+list(g) for k, g in groupby(values, bool) if k]
[[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 559]]
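
As the comments below note, swapping bool for operator.truth as the key function returns the same result with much less per-call overhead; a minimal variant:

>>> from operator import truth
>>> [[0]+list(g) for k, g in groupby(values, truth) if k]
[[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 559]]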

Comments

Minor note: This assumes 0s never occur back-to-back, and values always starts with 0. The OP's example follows this rule, so if that rule is guaranteed, this is by far the simplest/most performant solution.
Correction: It's actually quite slow as written because bool has absurd overhead. Replacing bool with operator.truth as the key function will remove that overhead, making this similar cost to other optimized solutions (slightly slower, but trivially so, where bool can make it take 3x what the optimized non-groupby solutions require). Personally, if performance isn't critical, I'd use this solution as is; if performance is critical, swapping to operator.truth would get you a major speedup with minimal complexity change.
Yeah, I knew bool was slower than it should be; I didn't realize just how much slower until I tested it head-to-head against operator.truth. The problem is that CPython built-in constructors have to use a fairly general dispatch mechanism supporting variable-length positional and keyword arguments. The interpreter builds a tuple and dict, then bool's tp_new calls PyArg_ParseTupleAndKeywords (for some dumb reason they allowed you to actually use a keyword argument for bool, which means they need to check for one every time), and that's just the basic constructor bookkeeping.
By contrast, operator.truth belongs to a specially optimized class of plain functions (it has the argument type flag METH_O, meaning it takes exactly one positional argument, no more, no less, and no keyword args, whereas tp_new is always METH_VARARGS | METH_KEYWORDS). Being a plain function, no constructor-specific machinery is invoked; being METH_O, no tuple or dict must be created and subsequently unpacked. Ultimately, the work they both do is PyBool_FromLong(PyObject_IsTrue(argval)), but it's buried under a ton of cruft in the bool constructor.
Oh, and hey, apparently the CPython folks decided some of these issues should be fixed: as of 3.7, bool no longer attempts to use keyword arguments; it just complains if any are passed and uses PyArg_UnpackTuple to unpack the tuple without any format strings to parse, so it should run faster. I don't have a copy of 3.7 to test with, but it's something.

You can use itertools.groupby, grouping on whether each value is less than or equal to the element that follows it in values:

import itertools

values = [0, 1, 2, 3, 4, 5, 351, 0, 1, 2, 3, 4, 5, 6, 750, 0, 1, 2, 3, 4, 5, 559]

new_vals = [
    [i[-1] for i in b]
    for a, b in itertools.groupby(
        enumerate(values),
        key=lambda x: x[-1] <= values[x[0] + 1] if x[0] + 1 < len(values) else False,
    )
]
final_data = [new_vals[i] + new_vals[i + 1] for i in range(0, len(new_vals), 2)]

Output:

[[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 5, 559]]

Comments

Using map when you need a list anyway, and relying on a lambda to do so, is just slower/more verbose than the equivalent listcomp. And as long as you definitely consume each group before requesting the next, you needn't listify the group up front. You could replace list(map(lambda x: x[-1], list(b))) with [x[-1] for x in b] and it would behave identically (except faster, and easier to read). If you really love map, importing itemgetter from operator and doing list(map(itemgetter(-1), b)) would at least avoid the speed hit, but I'd usually stick with the listcomp.
@ShadowRanger List comprehension is certainly better than map. Please see my recent edit.

This should work:

values = [0, 1, 2, 3, 4, 5, 351, 0, 1, 2, 3, 4, 5, 6, 750, 0, 1, 2, 3, 4, 5, 559]
new_values = []

split_at = 0  # split the list when this value is reached

idx = -1
for value in values:
    if value == split_at:
        idx += 1
        new_values.append([])

    new_values[idx].append(value)

Output:

[[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 5, 559]]

It also handles edge cases such as back-to-back zeros; see the sketch below.
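
With repeated zeros (an input I made up for illustration), each 0 starts its own sublist:

values = [0, 0, 1, 2, 0]
new_values = []
split_at = 0
idx = -1
for value in values:
    if value == split_at:
        idx += 1
        new_values.append([])
    new_values[idx].append(value)

print(new_values)  # [[0], [0, 1, 2], [0]]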

My method is a bit faster than Chris_Rands's, but it's also a bit slower than Vasilis G's method:

from itertools import groupby


values = [
    0, 1, 2, 3, 4, 5, 351,
    0, 1, 2, 3, 4, 5, 6, 750,
    0, 1, 2, 3, 4, 5, 559,
]


def method1():
    new_values = []

    idx = -1
    for value in values:
        if value == 0:
            idx += 1
            new_values.append([])

        new_values[idx].append(value)

    return new_values


def method2():
    new_values = [[0] + list(g) for k, g in groupby(values, bool) if k]
    return new_values


def method3():
    indices = [index for index, value in enumerate(values) if value == 0] + [len(values)]
    new_values = [values[indices[i]:indices[i + 1]] for i in range(len(indices) - 1)]
    return new_values

>>> import timeit
>>> timeit.timeit(method1, number=100000)
0.6725746986698414
>>> timeit.timeit(method2, number=100000)
0.8143814620314903
>>> timeit.timeit(method3, number=100000)
0.6596001360341748

Comments

You can save some work by avoiding the use of idx entirely; you always want to append to the final list in new_values, so just use negative indexing, new_values[-1].append(...), rather than maintaining idx and doing new_values[idx].append(...).
Alternatively, for even faster performance, in your if case you do: new_values.append([]), add_to_cur_list = new_values[-1].append, then change the code outside the if to add_to_cur_list(value), so you avoid the indexing and method binding overhead in common case entirely.
Also, a note on performance: When the runs to be grouped get long, my minimalist change to the OP's original code actually seems to be the fastest (with your method3 code being a close second, though there is room for a little improvement on it, speedwise, specifically, changing [values[indices[i]:indices[i + 1]] for i in range(len(indices) - 1)] to [values[s:e] for s, e in zip(indices, indices[1:])] to remove two explicit index lookups and an addition per item). Try it with 0-based groups that are each ~100 items long to see.
Huh. One last note: I just wrote up a more comprehensive set of performance comparisons for larger groups (which gives a better idea of how the solutions scale, avoiding penalizing a method for moderate fixed setup costs if the method is faster per-item). Most surprising discovery was just how much using bool as the key function slows you down; using operator.truth (does the exact same job as bool in this use case) halves runtime by avoiding the expense associated w/object construction machinery (that ultimately doesn't even construct anything, since True/False are singletons).

You can also do it like this:

values = [0, 1, 2, 3, 4, 5, 351, 0, 1, 2, 3, 4, 5, 6, 750, 0, 1, 2, 3, 4, 5, 559]

# Find all indices whose element is 0.
indices = [index for index, value in enumerate(values) if value == 0] + [len(values)]

# Split the list accordingly
values = [values[indices[i]:indices[i+1]] for i in range(len(indices)-1)]

print(values)

Output:

[[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 5, 559]]
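
A small variant of the slicing step (my tweak, echoing a comment under another answer, not the author's code) pairs each start index with the next one via zip, avoiding two index lookups per group:

# indices already ends with the len(values) sentinel, which closes the final slice
values = [values[s:e] for s, e in zip(indices, indices[1:])]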
