I wrote a script that inserts the elements of two lists into a third list after every group of three elements, but it takes a really long time to complete. Here are my two very long lists:

listOfX = ['567','765','456','457','546'....]  # len(listOfX) == 383656
listOfY = ['564','345','253','234','123'....]  # len(listOfY) == 383656

And here is the third list, which already contains some data and into which I want to insert the values from the other two lists:

cleanData = ['2020-04-28T01:44:59.392043', 'c57', '0', '2020-04-28T01:44:59.392043', 'c57', '1'....]  # len(cleanData) == 1145146

Here is what I want:

cleanData = ['2020-04-28T01:44:59.392043', 'c57', '0', 567, 564, '2020-04-28T01:44:59.392043', 'c57', '1', 765, 345]

Finally, here is my code:

  ## ADDING X AND Y TO ORIGINAL LIST
  addingValue = True
  valueItem = ""
  loopValue = 3
  xIndex = 0
  yIndex = 0
  print(len(listOfX))

  while addingValue:

    if xIndex > len(listOfX):
      break

    try:
      cleanData.insert(loopValue, listOfY[yIndex])
      cleanData.insert(loopValue, listOfX[xIndex])

    except IndexError:
      addingValue = False
      break

    xIndex += 1
    yIndex += 1
    loopValue += 5

Do you have any idea how to make this faster?

9
  • How are you trying to merge the lists? Do you have code? Commented May 8, 2020 at 20:09
  • Can you add the code you wrote to the question please. Commented May 8, 2020 at 20:10
  • yes sure, wait im adding it Commented May 8, 2020 at 20:10
  • I see a problem here: listOfX and listOfY have 383656 items each, and cleanData has only 1145146, which is less than 3*383656; so, if you want to add one item from each of the first two lists after every group of 3 items in cleanData, you'll have unused elements left in listOfX, listOfY. Is that what you intended? Commented May 8, 2020 at 20:23
  • As mentioned above, insertion into an existing list is very expensive. It would be better to loop with for i in range(len(<smallestlist>)): and append the elements to a new list or, as I mentioned in my previous comment, even better, to consume each group of elements as you combine them instead of putting them back into a list. You could do that with a generator, as demonstrated here: realpython.com/introduction-to-python-generators Commented May 8, 2020 at 20:28

4 Answers

2

The main problem with your solution is that it inserts elements into an existing list 2 * 383656 times. Every insertion has to shift all the elements after the insertion point, so the total work grows roughly quadratically with the length of the list.

Thus it's faster to create a new list.
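To make the cost difference concrete, here is a minimal sketch that contrasts repeated list.insert with building a new list. The helper names interleave_by_insert and interleave_by_new_list and the tiny sample data are made up for illustration; they are not from the question.

```python
def interleave_by_insert(data, xs, ys):
    """Slow variant: every insert shifts all later elements (O(n) each)."""
    result = list(data)              # work on a copy of the input
    pos = 3
    for x, y in zip(xs, ys):
        if pos > len(result):        # no complete group left to follow
            break
        result.insert(pos, y)
        result.insert(pos, x)        # x ends up in front of y
        pos += 5
    return result


def interleave_by_new_list(data, xs, ys):
    """Fast variant: append to a fresh list (amortized O(1) per element)."""
    result = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        group = data[3 * i:3 * i + 3]
        if len(group) < 3:           # ran out of complete groups
            break
        result.extend(group)
        result.append(x)
        result.append(y)
    return result


# Hypothetical sample data, shaped like the lists in the question.
sample = ['t0', 'c57', '0', 't1', 'c57', '1']
slow = interleave_by_insert(sample, [567, 765], [564, 345])
fast = interleave_by_new_list(sample, [567, 765], [564, 345])
```

Both functions give the same result on this sample; only the amount of element shifting differs, which is what dominates the runtime on lists with hundreds of thousands of items.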

If for any reason you want cleanData to remain the same object but hold the new data (perhaps because another function or object holds a reference to it and should see the changed data), then write

cleanData[:] = blablabla 

instead of

cleanData = blablabla
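A small sketch of the difference between the two forms, using made-up values and the hypothetical names clean_data and alias:

```python
clean_data = ['a', 'b', 'c']
alias = clean_data                  # a second reference to the same list object

clean_data[:] = ['x', 'y']          # slice assignment mutates the object in place
same_object = alias is clean_data   # the alias still points at the same list
alias_after_slice = list(alias)     # snapshot of what the alias sees now

clean_data = ['p', 'q']             # plain assignment only rebinds the name
rebound = alias is clean_data       # the alias still refers to the old object
```

After the slice assignment the alias sees the new contents; after the plain assignment it does not.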

I wrote the following two solutions (the second, faster one was added only after the answer had been accepted):

import functools
import operator
cleanData = functools.reduce(
    operator.iconcat,
    (list(v) for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
    [])

and

import itertools
cleanData = list(itertools.chain.from_iterable(
    (v for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
    ))

To understand the zip(*([iter(cleanData)] * 3), listOfX, listOfY) construct, you might look at the question "What is meaning of [iter(list)]*2 in python?"
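As a rough illustration of that idiom, the sketch below spells out the three repeated iterator references by hand on a tiny made-up sample (the names data, xs, ys and rows are hypothetical):

```python
data = ['t0', 'c57', '0', 't1', 'c57', '1']
xs = [567, 765]
ys = [564, 345]

it = iter(data)
# [it] * 3 is a list holding the SAME iterator three times, so zip pulls
# three consecutive items from `data` for every tuple it builds.
rows = list(zip(it, it, it, xs, ys))
```

Each tuple in rows holds one group of three items from data followed by the matching x and y values. Note that zip stops as soon as any of its inputs is exhausted.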

A potential downside of my first solution, depending on the context: functools.reduce with operator.iconcat always builds a complete list, not a generator.

The second solution also returns a list. If you want lazy evaluation instead, just remove list( and one trailing ); what remains is an itertools.chain iterator that produces the items on demand.
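A minimal sketch of that lazy form, on a tiny made-up sample (the snake_case names are hypothetical stand-ins for the lists in the question):

```python
import itertools

clean_data = ['t0', 'c57', '0', 't1', 'c57', '1']
list_of_x = [567, 765]
list_of_y = [564, 345]

# No list() around the expression: nothing is merged until items are requested.
merged = itertools.chain.from_iterable(
    zip(*([iter(clean_data)] * 3), list_of_x, list_of_y))

first_five = list(itertools.islice(merged, 5))  # pulls only the first group
rest = list(merged)                             # pulls whatever is left
```

Keep in mind that such an iterator can be consumed only once; if the merged data is needed several times, building the list is the simpler choice.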

The second solution is about 2x faster than the first one.

Then I wrote some code to compare the performance and the results of the two other answers with mine. The difference is not very big (about 2.5x), but my second solution seems to be somewhat faster than @Błotosmętek's first solution and Alain T.'s solution:

from contextlib import contextmanager
import functools
import itertools
import operator
import time

@contextmanager
def measuretime(comment):
    """Print the wall-clock time spent inside the `with` block."""
    print("=" * 76)
    t0 = time.time()
    yield comment
    print("%s: %5.3fs" % (comment, time.time() - t0))
    print("-" * 76 + "\n")


N = 383656  # same length as listOfX / listOfY in the question
with measuretime("create listOfX"):
    listOfX = list(range(N))

with measuretime("create listOfY"):
    listOfY = list(range(1000000, 1000000 + N))

print("listOfX", len(listOfX), listOfX[:10])
print("listOfY", len(listOfY), listOfY[:10])

with measuretime("create cleanData"):
    origCleanData = functools.reduce(
        operator.iconcat,
        (["2020-010-1T01:00:00.%06d" % i, "c%d" % i, "%d" %i] for i in range(N)),
        [])

print("cleanData", len(origCleanData), origCleanData[:12])

cleanData = list(origCleanData)
with measuretime("funct.reduce operator icat + zip"):
    newcd1 = functools.reduce(
        operator.iconcat,
        (list(v) for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
        [])

print("NEW", len(newcd1), newcd1[:3*10])

cleanData = list(origCleanData)
with measuretime("itertools.chain + zip"):
    cleanData = list(itertools.chain.from_iterable(
        (v for v in zip(*([iter(cleanData)] * 3), listOfX, listOfY)),
        ))

print("NEW", len(cleanData), cleanData[:3*10])
assert newcd1 == cleanData

cleanData = list(origCleanData)
with measuretime("blotosmetek"):
    tmp = []
    n = min(len(listOfX), len(listOfY), len(cleanData)//3)
    for i in range(n):
       tmp.extend(cleanData[3*i : 3*i+3])
       tmp.append(listOfX[i])
       tmp.append(listOfY[i])
    cleanData = tmp

print("NEW", len(cleanData), cleanData[:3*10])
assert newcd1 == cleanData


cleanData = list(origCleanData)
with measuretime("alainT"):
    cleanData = [ v for i,x,y in zip(range(0,len(cleanData),3),listOfX,listOfY)
                for v in (*cleanData[i:i+3],x,y) ]

print("NEW", len(cleanData), cleanData[:3*10])
assert newcd1 == cleanData


The output on my old PC looks like this:

============================================================================
create listOfX: 0.013s
----------------------------------------------------------------------------

============================================================================
create listOfY: 0.013s
----------------------------------------------------------------------------

listOfX 383656 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
listOfY 383656 [1000000, 1000001, 1000002, 1000003, 1000004, 1000005, 1000006, 1000007, 1000008, 1000009]
============================================================================
create cleanData: 0.454s
----------------------------------------------------------------------------

cleanData 1150968 ['2020-010-1T01:00:00.000000', 'c0', '0', '2020-010-1T01:00:00.000001', 'c1', '1', '2020-010-1T01:00:00.000002', 'c2', '2', '2020-010-1T01:00:00.000003', 'c3', '3']
============================================================================
funct.reduce operator icat + zip: 0.240s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]
============================================================================
itertools.chain + zip: 0.109s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]
============================================================================
blotosmetek: 0.370s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]
============================================================================
alainT: 0.258s
----------------------------------------------------------------------------

NEW 1918280 ['2020-010-1T01:00:00.000000', 'c0', '0', 0, 1000000, '2020-010-1T01:00:00.000001', 'c1', '1', 1, 1000001, '2020-010-1T01:00:00.000002', 'c2', '2', 2, 1000002, '2020-010-1T01:00:00.000003', 'c3', '3', 3, 1000003, '2020-010-1T01:00:00.000004', 'c4', '4', 4, 1000004, '2020-010-1T01:00:00.000005', 'c5', '5', 5, 1000005]

3 Comments

Just in case, I added a few more comments to my answer.
I found an even faster solution using itertools.chain.from_iterable() and zip(); I will adapt my answer.
I also corrected the comparison code. I forgot to reset cleanData before each test, so the 3rd and 4th tests didn't have correct input values.
1

This is an implementation of shelister's suggestion:

tmp = []
n = min(len(listOfX), len(listOfY), len(cleanData)//3)
for i in range(n):
   tmp.extend(cleanData[3*i : 3*i+3])
   tmp.append(listOfX[i])
   tmp.append(listOfY[i])
cleanData = tmp

1

This should be much faster:

cleanData = [ v for i,x,y in zip(range(0,len(cleanData),3),listOfX,listOfY) 
                for v in (*cleanData[i:i+3],x,y) ]

If you use parentheses instead of brackets, the expression becomes a generator that you can iterate over (e.g. with a for loop) without actually creating a copy of the merged data in a new list.

0

Building on Błotosmętek's answer, but using a generator, you could do something like this:

def get_next_group():
    n = min(len(listOfX), len(listOfY), len(cleanData)//3)
    for i in range(n):
        tmp = cleanData[3*i : 3*i+3]
        tmp.append(listOfX[i])
        tmp.append(listOfY[i])

        yield tmp

# in your main code:

for x in get_next_group():
    # do something with x
    pass

The advantage of the above code is that the combination is done piece by piece, only as you request it. If you process each group and don't store the results in a list, memory overhead is reduced. You can also start processing the first group immediately instead of waiting for everything to be combined first.
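As a variation on the same idea, here is a sketch that passes the lists in as arguments instead of reading globals. The function name iter_groups, its size parameter and the sample data are assumptions made for this example only.

```python
def iter_groups(data, xs, ys, size=3):
    """Yield one merged group at a time; only the current group is in memory."""
    n = min(len(xs), len(ys), len(data) // size)
    for i in range(n):
        group = data[size * i:size * (i + 1)]   # the slice is a new small list
        group.append(xs[i])
        group.append(ys[i])
        yield group


# Hypothetical sample data, shaped like the lists in the question.
clean_data = ['t0', 'c57', '0', 't1', 'c57', '1']
list_of_x = [567, 765]
list_of_y = [564, 345]

# Process each group as it is produced, without building the merged list.
total = 0
for group in iter_groups(clean_data, list_of_x, list_of_y):
    total += group[3] + group[4]
```

Because the inputs are parameters, the same generator can be reused for other lists or group sizes.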
