I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction. Some of these require significant computation time, so I tried to take advantage of my multicore CPU.
Here is the function which I need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
import numpy as np

def foo(eps):  # need an (unused) argument to use pool.map()
    # initialising
    # the true values of the actions
    q = np.random.normal(0, 1, size=10)
    # the estimated values
    q_est = np.zeros(10)
    # the counter of how many times each of the 10 actions was chosen
    n = np.zeros(10)
    rewards = []
    for i in range(1000):
        # choose an action based on its estimated value
        a = np.argmax(q_est)
        # get the normally distributed reward
        rewards.append(np.random.normal(q[a], 1))
        # increment the chosen action counter
        n[a] += 1
        # update the estimated value of the action
        q_est[a] += (rewards[-1] - q_est[a]) / n[a]
    return rewards
I execute this function 2000 times to get a (2000, 1000) array:
reward = np.array([foo(0) for _ in range(2000)])
Then I plot the mean reward across 2000 experiments:
import matplotlib.pyplot as plt
plt.plot(np.arange(1000), reward.mean(axis=0))
which fully matches the expected result (it looks the same as in the book). But when I try to execute it in parallel, I get a much greater standard deviation of the average reward:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
    reward_p = np.array(pool.map(foo, [0]*2000))
plt.plot(np.arange(1000), reward_p.mean(axis=0))
I suppose this is due to the parallelization of the loop inside foo. As I reduce the number of cores allocated to the task, the reward plot approaches the expected shape.
Is there a way to take advantage of multiprocessing here while still getting the correct results?
UPD: I ran the same code on Windows 10, and there the sequential and parallel results turned out to be the same! What could be the reason?
Ubuntu 20.04, Python 3.8.5, jupyter
Windows 10, Python 3.7.3, jupyter
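A possible explanation for the Linux/Windows difference, offered as an assumption and not a confirmed diagnosis: on Linux, multiprocessing starts workers with fork by default, so each worker inherits a copy of the parent's global NumPy random state and the workers draw overlapping random sequences. On Windows, workers are spawned as fresh interpreters and get their own state. Under that assumption, a minimal sketch of a fix is to give every run its own independent generator. The names run_bandit and run_parallel and the seed value are hypothetical and not part of the code above.

```python
import multiprocessing as mp

import numpy as np

N_ARMS = 10
N_STEPS = 1000


def run_bandit(seed_seq):
    """One greedy bandit run that draws only from its own generator."""
    # Each task builds a private Generator from the SeedSequence it receives,
    # so nothing depends on the global state inherited from the parent process.
    rng = np.random.default_rng(seed_seq)
    q = rng.normal(0, 1, size=N_ARMS)   # true action values
    q_est = np.zeros(N_ARMS)            # estimated action values
    n = np.zeros(N_ARMS)                # how often each action was chosen
    rewards = np.empty(N_STEPS)
    for i in range(N_STEPS):
        a = np.argmax(q_est)
        rewards[i] = rng.normal(q[a], 1)
        n[a] += 1
        q_est[a] += (rewards[i] - q_est[a]) / n[a]
    return rewards


def run_parallel(n_runs=2000, seed=12345, processes=None):
    """Run n_runs independent experiments in a pool (hypothetical helper)."""
    # spawn() derives one statistically independent child sequence per run.
    children = np.random.SeedSequence(seed).spawn(n_runs)
    with mp.Pool(processes) as pool:
        return np.array(pool.map(run_bandit, children))
```

Calling run_parallel() should then return a (2000, 1000) array whose column means can be plotted as before. Passing a seed sequence per task also removes the need for the unused eps argument, and the same seed reproduces the same set of runs regardless of how many workers are used.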