I have the following code:

def task1():
    for url in splitarr[0]:
        print(url)  # these are supposed to be scrape_induvidual_page(); print is just for debugging
def task2():
    for url in splitarr[1]:
        print(url)
def task3():
    for url in splitarr[2]:
        print(url)
def task4():
    for url in splitarr[3]:
        print(url)
def task5():
    for url in splitarr[4]:
        print(url)
def task6():
    for url in splitarr[5]:
        print(url)
def task7():
    for url in splitarr[6]:
        print(url)     
def task8():
    for url in splitarr[7]:
        print(url)   

splitarr=np.array_split(urllist, 8)
t1 = threading.Thread(target=task1, name='t1') 
t2 = threading.Thread(target=task2, name='t2')   
t3 = threading.Thread(target=task3, name='t3')
t4 = threading.Thread(target=task4, name='t4') 
t5 = threading.Thread(target=task5, name='t5')
t6 = threading.Thread(target=task6, name='t6')
t7 = threading.Thread(target=task7, name='t7')
t8 = threading.Thread(target=task8, name='t8')

t1.start() 
t2.start()
t3.start() 
t4.start() 
t5.start()
t6.start() 
t7.start()
t8.start() 

t1.join()
t2.join()
t3.join()
t4.join()
t5.join()
t6.join()
t7.join() 
t8.join() 

This produces the desired output, with no duplicates:

https://kickasstorrents.to/big-buck-bunny-1080p-h264-aac-5-1-tntvillage-t115783.html
https://kickasstorrents.to/big-buck-bunny-4k-uhd-hfr-60fps-eng-flac-webdl-2160p-x264-zmachine-t1041079.html
https://kickasstorrents.to/big-buck-bunny-4k-uhd-hfr-60-fps-flac-webrip-2160p-x265-zmachine-t1041689.html
https://kickasstorrents.to/big-buck-bunny-2008-720p-bluray-x264-don-no-rars-t11623.html
https://kickasstorrents.to/tkillaahh-big-buck-bunny-dvd-720p-2lions-team-t87503.html
https://kickasstorrents.to/big-buck-bunny-2008-720p-bluray-nhd-x264-nhanc3-t127050.html
https://kickasstorrents.to/big-buck-bunny-2008-brrip-720p-x264-mitzep-t172753.html

However, I feel like the code is a bit redundant with all the repeated def taskx(): definitions, so I attempted to compact it down to a single task:

x=0
def task1():
    global x
    for url in splitarr[x]:
        print(url)
        x=x+1
t1 = threading.Thread(target=task1, name='t1') 
t2 = threading.Thread(target=task1, name='t2')   
t3 = threading.Thread(target=task1, name='t3')
t4 = threading.Thread(target=task1, name='t4') 
t5 = threading.Thread(target=task1, name='t5')
t6 = threading.Thread(target=task1, name='t6')
t7 = threading.Thread(target=task1, name='t7')
t8 = threading.Thread(target=task1, name='t8')

t1.start() 
t2.start()
t3.start() 
t4.start() 
t5.start()
t6.start() 
t7.start()
t8.start() 

t1.join()
t2.join()
t3.join()
t4.join()
t5.join()
t6.join()
t7.join() 
t8.join() 

However, this gives undesired output with duplicates:

https://kickasstorrents.to/big-buck-bunny-1080p-h264-aac-5-1-tntvillage-t115783.html
https://kickasstorrents.to/big-buck-bunny-1080p-h264-aac-5-1-tntvillage-t115783.html
https://kickasstorrents.to/big-buck-bunny-4k-uhd-hfr-60-fps-flac-webrip-2160p-x265-zmachine-t1041689.html
https://kickasstorrents.to/big-buck-bunny-2008-720p-bluray-x264-don-no-rars-t11623.html
https://kickasstorrents.to/big-buck-bunny-2008-720p-bluray-x264-don-no-rars-t11623.html
https://kickasstorrents.to/tkillaahh-big-buck-bunny-dvd-720p-2lions-team-t87503.html
https://kickasstorrents.to/big-buck-bunny-2008-brrip-720p-x264-mitzep-t172753.html
https://kickasstorrents.to/big-buck-bunny-2008-brrip-720p-x264-mitzep-t172753.html

How do I make the x increment properly in a program with multiple threads?

  • You can send parameters to the function executed in a thread. Use this to send the index. Commented Sep 15, 2020 at 4:31
  • Have you tried passing the list index as a function argument? For example, define def taskx(x): and then, when setting the target for threading.Thread, pass the argument x using partial from functools. You are probably getting duplicates because you are using a global variable and each thread takes a different amount of time to finish. Commented Sep 15, 2020 at 4:31
  • Ah thanks, passing index as parameter worked. I have updated the answer Commented Sep 15, 2020 at 4:45
  • You should find out how lists and loops work. They would make your code less repetitive. Commented Sep 15, 2020 at 4:51

1 Answer

for url in splitarr[x]: creates an iterator for the sequence in splitarr[x]. It doesn't matter that you increment x later - the iterator is already built. Since you have a print in there, it's very likely that all of the threads will grab x when it's still zero and iterate the same sequence.
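The point about the iterator can be checked without any threads. The following is a minimal single-threaded sketch; the contents of splitarr here are made-up placeholder strings, not the real URLs, and the seen list is only there for inspection.

```python
# Placeholder data (assumption): three short "columns" of fake URLs.
splitarr = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]

x = 0
seen = []
for url in splitarr[x]:  # splitarr[x] is evaluated once, while x is 0
    seen.append(url)
    x = x + 1            # changes x, but not the sequence being iterated

print(seen)  # only items from splitarr[0]
print(x)     # x was incremented once per item
```

So the loop walks all of splitarr[0], and x ends up counting URLs rather than selecting a column.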

One solution is to pass incrementing values to task1 via the args argument in threading.Thread. But a thread pool is even easier.

from multiprocessing.pool import ThreadPool

# generate test array
splitarr = []
for i in range(8):
    splitarr.append([f"url_{i}_{j}" for j in range(4)])

def task(splitarr_column):
    for url in splitarr_column:
        print(url)

with ThreadPool(len(splitarr)) as pool:
    result = pool.map(task, splitarr)

In this example, len(splitarr) is used to create one thread per sequence in splitarr. Then each of those sequences is mapped to the task function. Since we created the right number of threads to handle all of the sequences, they all run at once. When the map completes, the with clause exits and the pool is closed, terminating the threads.
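For comparison, the other option mentioned above (passing the index through the args argument of threading.Thread) might look roughly like this. It is a sketch under assumptions: the test array is generated the same way as above, and the results list is a hypothetical stand-in for whatever the real scraping function would produce.

```python
import threading

# Generate a test array (assumption: same placeholder layout as above).
splitarr = [[f"url_{i}_{j}" for j in range(4)] for i in range(8)]

# Hypothetical per-thread result slots; each thread only touches its own list.
results = [[] for _ in splitarr]

def task(index):
    # The index arrives as a parameter, so no shared global counter is needed.
    for url in splitarr[index]:
        results[index].append(url)  # stand-in for the real scraping call

# Build the threads in a loop instead of naming t1..t8 by hand.
threads = [
    threading.Thread(target=task, args=(i,), name=f"t{i + 1}")
    for i in range(len(splitarr))
]

for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each thread receives its own index when it is created, the order in which the threads are scheduled no longer affects which sequence each one processes.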
