I am new to Python. I have a program that loads one big CSV file with over 100k lines, each with 4 columns. In a for loop I check every row against the same list (dlist) for duplicates. dlist is a list of DsRef objects, which I load with another function.
DsRef class:
from tqdm import tqdm
from multiprocessing import Pool, cpu_count, freeze_support

class DsRef:
    def __init__(self, pn, comp, comp_name, type, diff):
        self.pn = pn
        self.comp = comp
        self.comp_name = comp_name
        self.type = type
        self.diff = diff

    def __str__(self):
        return f'{self.pn} {get_red("|")} {self.comp} {get_red("|")} {self.comp_name} {get_red("|")} {self.type} {get_red("|")} {self.diff}\n'

    def __repr__(self):
        return str(self)

    def __iter__(self):
        return iter(self.__dict__.items())
Duplication class:
class Duplication:
    def __init__(self, pn, comp, cnt):
        self.pn = pn
        self.comp = comp
        self.cnt = cnt

    def __str__(self):
        return f'{self.pn};{self.comp};{self.cnt}\n'

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(('pn', self.pn,
                     'comp', self.comp))

    def __eq__(self, other):
        return self.pn == other.pn and self.comp == other.comp
Sample data for testing:
dlist = []
dlist.append(DsRef("TTT_XXX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_XCX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_XXX", "CCC_VCV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_XXX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_XYX", "CCC_YYY", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TAT_XQX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("ATT_XXX", "CCC_VQV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_EEE", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_XWX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_XXX", "CCC_VWV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef("TTT_EEE", "CCC_VVV", "CTYPE", "CTYPE", "text"))
Function to find and return the rows with duplicated values:
def FindDuplications(dlist):
    duplicates = []
    for pn, comp in enumerate(dlist):
        matches = [xpn for xpn, xcomp in enumerate(dlist) if pn == xpn and comp == xcomp]
        duplicates.append(Duplication(pn, comp, len(matches)))
    return duplicates
If row.pn == x.pn and row.comp == x.comp is true, I have found a duplicate: I compare the first two attributes of each object with those of every object in the list.
Now I am trying something like the following to use all processors for a faster result. Right now it takes over 15 minutes:
if __name__ == '__main__':
    freeze_support()
    p = Pool(cpu_count())
    duplicates = p.map(FindDuplications, dlist)
    p.close()
    p.join()
At first I got an error saying the class is not iterable, so I added an __iter__ method to the first class. After that I got an error saying a tuple object has no pn or comp attribute, so I switched the for loop to enumerate(dlist), but it still does not work.
Could you please help me?
I would also like to use tqdm to show the progress of the function that finds duplicates.
Here is the original working function, without multiprocessing:
def CheckDuplications(dlist):
    print(get_yellow("========= CHECK CROSS DUPLICATIONS ========="))
    duplicates = []
    for r in tqdm(dlist):
        matches = [x for x in dlist if r.pn == x.pn and r.comp == x.comp]
        duplicates.append(Duplication(r.pn, r.comp, len(matches)))
    results = [d for d in duplicates if d.cnt > 1]
    results = set(results)
    return results
From FindDuplications I get back a list of DsRef objects (a plain copy), but it should return a list of Duplication objects. Something is wrong.
Thank you
pool.map() calls the given function on every item independently. FindDuplications doesn't receive the full list, so it can't access the rest of the list to find other duplicates.
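Since the comparison only involves pn and comp, a single pass that counts each (pn, comp) pair may be enough, with no process pool at all. Below is a minimal sketch of that idea, not a drop-in replacement: the Row class is a hypothetical stand-in for DsRef, and the function name find_duplications and its dict return type are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for DsRef; only pn and comp matter here."""
    pn: str
    comp: str


def find_duplications(rows):
    """Return {(pn, comp): count} for every pair that occurs more than once."""
    # One pass over the list: count how often each (pn, comp) pair occurs.
    counts = Counter((row.pn, row.comp) for row in rows)
    # Keep only the pairs that were seen more than once.
    return {key: cnt for key, cnt in counts.items() if cnt > 1}


if __name__ == "__main__":
    rows = [
        Row("TTT_XXX", "CCC_VVV"),
        Row("TTT_XCX", "CCC_VVV"),
        Row("TTT_XXX", "CCC_VVV"),
        Row("TTT_EEE", "CCC_VVV"),
        Row("TTT_EEE", "CCC_VVV"),
    ]
    # Print in the same pn;comp;cnt layout that Duplication.__str__ uses.
    for (pn, comp), cnt in find_duplications(rows).items():
        print(f"{pn};{comp};{cnt}")
```

This does work proportional to the number of rows, instead of comparing every row with every other row, so for around 100k rows it should finish far faster than the nested comparison. If Duplication objects are needed downstream, each entry can be converted with Duplication(pn, comp, cnt), and wrapping the iterable in tqdm(rows) would give a progress bar as in the original function.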