I am trying to label images organized as brand -> product -> individual product images. Labeling the images one at a time is slow, so I decided to use multiprocessing to speed up the job. It definitely speeds up the labeling, but the code doesn't behave the way I intended.
Code:
import json
import multiprocessing

# label_product_images() is defined elsewhere in my project (not shown).


def multiprocessing_func(line):
    json_line = json.loads(line)
    product = json_line['groupid']
    active_urls = set(json_line['urls'])
    try:
        # Drop this brand's URL from brand_dic, if present.
        active_urls.remove(brand_dic[brand])
    except KeyError:
        pass
    if product in saved_product_dict and active_urls == saved_product_dict[product]:
        keep_products.append(product)
        print('True')
    else:
        with open(new_images_filename, 'a') as save_file:
            labels = label_product_images(line)
            save_file.write('{}\n'.format(json.dumps(labels)))
        print('False')


active_images_filename = 'data/input/image_urls.json'
new_images_filename = 'data/output/new_labeled_images.json'
saved_images_filename = 'data/output/saved_labeled_images.json'
brand_dic = {'a': 'https://www.a.com/imgs/ab/images/dp/m.jpg',
             'b': 'https://www.b.com/imgs/ab/images/wcm/m.jpg',
             'c': 'https://www.c.com/imgs/ab/images/dp/m.jpg',}

if __name__ == '__main__':
    brands = ['a', 'b', 'c']
    for brand in brands:
        active_images_filename = 'data/input/brands/' + brand + '/image_urls.json'
        new_images_filename = 'data/output/brands/' + brand + '/new_labeled_images.json'
        saved_images_filename = 'data/output/brands/' + brand + '/saved_labeled_images.json'
        print(new_images_filename)
        # Create (or truncate) the output file for this brand.
        with open(new_images_filename, 'w'): pass
        # Load the previously labeled URLs for each product.
        saved_product_dict = {}
        with open(saved_images_filename) as in_file:
            for line in in_file:
                json_line = json.loads(line)
                saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
                saved_product_dict[json_line['groupid']] = set(saved_urls)
        print(saved_product_dict)
        keep_products = []
        labels_list = []
        # Start one process per input line.
        with open(active_images_filename, 'r') as in_file:
            processes = []
            for line in in_file:
                p = multiprocessing.Process(target=multiprocessing_func, args=(line,))
                processes.append(p)
                p.start()
        print('complete stage 1')
        for i in range(0,2):
            print('running stage 2')
Output:
data/output/brands/mg/new_labeled_images.json
{}
complete stage 1
running stage 2
running stage 2
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202010/0027/anchor-hope-and-protect-necklace-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202007/0003/patterned-folded-notecards-set-of-25-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202005/0003/patterned-folded-notecards-set-of-25-t.jpg
silo : https://a/mgimgs/rk/images/dp/wcm/202007/0002/patterned-folded-notecards-set-of-25-1-m.jpg
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0013.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0002.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0003.jpg
False
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0022.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202019/454.jpg
False
lifestyle - Lif1 : https://a.com/mgimgs/rk/images/dp/wcm/202025/0011.jpg
False
False
I noticed that the multiprocessing step appears to run last and to skip some of my code, and I'm not sure why. I'm also not sure why the first part didn't seem to run: when I printed saved_product_dict, the dictionary came up empty.
The code I have both before and after the multiprocessing step runs before it. My question is: how do I force the multiprocessing step to run in the order in which I have written my code? Any explanation of what's going on would be greatly appreciated. I'm new to multiprocessing and still learning how it works.
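For reference, here is a minimal sketch of the ordering I am after. It assumes the later stages run early because the parent never waits for the child processes: Process.start() returns immediately, and the parent only blocks when it calls join(). The names label_line and run_stage are hypothetical stand-ins, not my real functions. The "fork" start method is also an assumption (POSIX only), chosen because my workers seem to rely on reading globals that are set under the __main__ guard. Each child gets its own copy of those globals, so the sketch sends results back through a queue instead of appending to a list such as keep_products.

```python
import multiprocessing


def label_line(line, results):
    # Hypothetical stand-in for the real per-line labeling work.
    results.put(line.strip().upper())


def run_stage(lines):
    # Assumption: "fork" is POSIX-only; children inherit a copy of the
    # parent's globals, but changes made in a child never reach the parent.
    ctx = multiprocessing.get_context("fork")
    results = ctx.Queue()
    processes = []
    for line in lines:
        p = ctx.Process(target=label_line, args=(line, results))
        processes.append(p)
        p.start()  # returns immediately; the child runs in the background

    # Read one result per child first, then wait for every child to exit.
    labeled = sorted(results.get() for _ in processes)
    for p in processes:
        p.join()
    return labeled


if __name__ == "__main__":
    print(run_stage(["a\n", "b\n", "c\n"]))
    # Printed only after every child process has finished.
    print("complete stage 1")
```

The results are sorted because the children can finish in any order. The queue is drained before join() is called so that a child still writing to the queue is not left waiting on the parent.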