2

I am trying to make a webscraper with multithreading to make it faster. I want to make the value increase every execution. but sometimes the value is skipping or repeating on itself.

import threading
num = 0

def scan():
    while True:
        global num
        num += 1
        print(num)
        open('logs.txt','a').write(str(f'{num}\n'))

for x in range(500):
    threading.Thread(target=scan).start()

Result:

2
2
5
5
7
8
10
10
12
13
13
13
16
17
19
19
22
23
24
25
26
28
29
29
31
32
33
34

Expected result:

1
2
3
4
5
6
7
8
9
10
0

3 Answers 3

7

so since the variable num is a shared resource, you need to put a lock on it. This is done as follows:

num_lock = threading.Lock()

Everytime you want to update the shared variable, you need your thread to first acquire the lock. Once the lock is acquired, only that thread will have access to update the value of num, and no other thread will be able to do so while the current thread has acquired the lock.

Ensure that you use wait or a try-finally block while doing this, to guarantee that the lock will be released even if the current thread fails to update the shared variable.

Something like this:

num_lock.acquire()
try:
        num+=1
finally:
   num_lock.release()

using with:

 with num_lock:
   num+=1
Sign up to request clarification or add additional context in comments.

5 Comments

I agree with your try... finally version, but I don't understand your use of with here (num_lock.acquire(); with num: num += 1) - the target of with should be the lock itself (e.g. num_lock) rather than the resource that the lock is protecting (e.g. num), and the use of with is instead of doing acquire -- so the whole thing is with num_lock: num+=1
right,my bad! I intented to write with num_lock:, thanks for pointing that out!
Do you agree also that the num_lock.acquire() is not necessary when using with? I must admit that I am fairly new to it myself, but this is based on example at bogotobogo.com/python/Multithread/… and seemed to work when I tried it myself.
yes, you are correct! using with eliminates the need to explicity acquire the lock.
with closes any resources for you, so you don't have to close them explicitly. It can also be used with files or anything else that defines a __exit__ function.
3

An important callout in addition to threading.Lock:

  • Use join to make the parent thread wait for forked threads to complete.
  • Without this, threads would still race.

Suppose I'm using the num after threads complete:

import threading

lock, num = threading.Lock(), 0


def operation():
    global num
    print("Operation has started")
    with lock:
        num += 1


threads = [threading.Thread(target=operation) for x in range(10)]
for t in threads:
    t.start()

for t in threads:
    t.join()

print(num)

Without join, inconsistent (9 gets printed once, 10 otherwise):

Operation has started
Operation has started
Operation has started
Operation has started
Operation has startedOperation has started

Operation has started
Operation has started
Operation has started
Operation has started9

With join, its consistent:

Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
10

2 Comments

Many thanks for pointing out the use of join. I'd forgotten to do that in my own answer, so I will also edit it to add it for sake of a tidier exit. However, the way that you are doing it here (where the t.start() and t.join() are in the same loop over threads) means that your threads are not actually executing in parallel, so you may as well just be running serial code. Don't you mean to start them in one loop and then join them in a separate loop?
@alani oops! thanks man, you got it right but the behavior still holds, as you know. updated :)
2

Seems like a race condition. You could use a lock so that only one thread can get a particular number. It would make sense also to use lock for writing to the output file.

Here is an example with lock. You do not guarantee the order in which the output is written of course, but every item should be written exactly once. In this example I added a limit of 10000 so that you can more easily check that everything is written eventually in the test code, because otherwise at whatever point you interrupt it, it is harder to verify whether a number got skipped or it was just waiting for a lock to write the output.

The my_num is not shared, so you after you have already claimed it inside the with num_lock section, you are free to release that lock (which protects the shared num) and then continue to use my_num outside of the with while other threads can access the lock to claim their own value. This minimises the duration of time that the lock is held.

import threading

num = 0
num_lock = threading.Lock()
file_lock = threading.Lock()    

def scan():
    global num_lock, file_lock, num
    
    while num < 10000:
        with num_lock:
            num += 1
            my_num = num

        # do whatever you want here using my_num
        # but do not touch num

        with file_lock:
            open('logs.txt','a').write(str(f'{my_num}\n'))
        
threads = [threading.Thread(target=scan) for _ in range(500)]

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.