I wrote a large program that processes a data set of 70k documents. Each document takes about 5 seconds, so I want to parallelize the procedure. The code doesn't work and I can't figure out why. I tried it with only one worker to make sure it's not a memory issue.
Code:
from doc_builder import DocBuilder
from glob import glob
from tqdm import tqdm
import threading
path = "/home/marcel/Desktop/transformers-master/examples/token-classification/CORD-19-research-challenge/document_parses/test_collection"
paths = [x for x in glob(path + '/**/*.json', recursive=True)]
workers_amount = 1
def main(paths):
    doc_builder = DocBuilder()
    for path in tqdm(paths):
        data, doc = doc_builder.get_doc(path)
        doc_builder.write_doc(path, data, doc)
threads = []
for i in range(workers_amount):
    worker_paths = paths[int((i-1/workers_amount)*len(paths)):int((i/workers_amount)*len(paths))]
    t = threading.Thread(target=main, args=[worker_paths])
    t.start()
    threads.append(t)
for t in threads:
    t.join()
It just finishes executing after a while, seemingly at random. CPU usage spikes when it starts, but besides that nothing really happens. Is there something wrong with the code? In case it matters, I am running this on a Ryzen 7 3700X (so 16 threads should be possible).
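As an aside on the slicing: `i-1/workers_amount` parses as `i - (1/workers_amount)` rather than `(i-1)/workers_amount`, so with one worker the first slice is `paths[-len(paths):0]`, which is empty. A minimal sketch of splitting a list into near-equal contiguous per-worker slices using integer division (the names here are illustrative, not from the original code):

```python
def chunk(items, workers):
    # Split items into `workers` contiguous slices of near-equal size.
    # Integer division keeps the bounds exact and avoids float rounding.
    return [items[i * len(items) // workers:(i + 1) * len(items) // workers]
            for i in range(workers)]

parts = chunk(list(range(10)), 3)
# parts -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Every element lands in exactly one slice, and the slices concatenate back to the original list regardless of whether the length divides evenly.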
/edit: At first I thought the problem might be that each thread initializes a large PyTorch model and a trainer like this:
self.tokenizer = AutoTokenizer.from_pretrained(self.pretrained_dir) #, cache_dir=cache_dir)
self.splitter = spacy.load(cd_dir + "/en_core_sci_md-0.2.4/en_core_sci_md/en_core_sci_md-0.2.4")
self.model = AutoModelForTokenClassification.from_pretrained(self.pretrained_dir, config=self.config_dir) #,cache_dir=cache_dir)
self.model.load_state_dict(torch.load(self.model_dir))
self.trainer = Trainer(model=self.model, args=TrainingArguments(output_dir=self.output_dir))
This could be shared amongst the threads so that I don't need to initialize a new one per thread (initializing those is very costly). But as I said, I tried using 1 worker, so that really shouldn't be the problem, right?
/edit: Execution doesn't even reach doc_builder = DocBuilder()!
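The sharing idea above could be sketched with a single builder instance fed from a work queue instead of pre-sliced path lists. `StubBuilder` below is a hypothetical stand-in for `DocBuilder` (whose internals aren't shown here), and the sketch assumes its methods can be made thread-safe with a lock:

```python
import threading
from queue import Queue

class StubBuilder:
    """Hypothetical stand-in for DocBuilder: records processed paths."""
    def __init__(self):
        self.lock = threading.Lock()
        self.processed = []

    def process(self, path):
        with self.lock:  # guard shared state across threads
            self.processed.append(path)

def worker(builder, q):
    # All threads share one builder, so it is initialized only once.
    while True:
        path = q.get()
        if path is None:   # sentinel: no more work
            q.put(None)    # re-post so the other workers also stop
            break
        builder.process(path)

paths = [f"doc_{i}.json" for i in range(100)]
builder = StubBuilder()
q = Queue()
for p in paths:
    q.put(p)
q.put(None)  # single sentinel, re-posted by each exiting worker

threads = [threading.Thread(target=worker, args=(builder, q))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A shared queue also balances the load automatically: a thread that draws slow documents simply pulls fewer of them, which pre-slicing can't do. Note that for CPU-bound PyTorch inference, threads contend on the GIL in pure-Python sections, so `multiprocessing` may parallelize better than `threading`.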