I have been trying to speed up the code below, which looks up an index, uses it to get a string from the list "name", and finally counts the number of exact matches of that string in two sections of data.
This process has been very slow. I have read about replacing for loops when using numpy arrays, but I am not sure how to approach creating a vectorized version with the regex matching.
import re
import numpy as np

x = np.empty([38000, 8000])
y = np.empty([38000, 8000])
for i in range(0, 38000):
    for j in range(0, 8000):
        # Count whole-word occurrences of name[index[j]] in each text column.
        x[i, j] = len(re.findall('\\b'+name[index[j]]+'\\b', data[i][1]))
        y[i, j] = len(re.findall('\\b'+name[index[j]]+'\\b', data[i][2]))
Any insight is greatly appreciated.
name itself contains regexes, it looks like you could first filter the possible candidate cells with simple string matching and then run regexes against the candidate cells...
index = [ 0, 123, 454, ...] #1-by-8000
index holds an index of name that is deemed interesting
name = ['dog', 'cat', ...]
name holds a large list of strings (1-by-50000) which we only want the index numbered values of.
count or np.char.count are faster than re.findall if you don't need the \b separation.
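The filtering idea from the comments (a cheap substring test first, the regex only on candidate cells) could look roughly like the sketch below. It is not a measured benchmark. It assumes every entry of name is a literal string, so the substring test is valid and re.escape is safe; if name holds real regexes, the prefilter would have to go. The helper count_matches and the toy data are made up for illustration, and only the data[i][1] / data[i][2] layout is taken from the question. Each pattern is also compiled once rather than being rebuilt inside the inner loop.

```python
import re

def count_matches(texts, name, index):
    """Count whole-word matches of name[index[j]] in texts[i] (hypothetical helper)."""
    words = [name[k] for k in index]
    # Compile each pattern once, outside the loops.
    # Assumption: names are literal strings, so re.escape is appropriate.
    patterns = [re.compile(r'\b' + re.escape(w) + r'\b') for w in words]
    counts = [[0] * len(words) for _ in texts]
    for i, text in enumerate(texts):
        for j, (w, pat) in enumerate(zip(words, patterns)):
            # Cheap substring test; skip the regex when the word cannot match.
            if w in text:
                counts[i][j] = len(pat.findall(text))
    return counts

# Toy data in the same layout as the question (illustrative values only).
name = ['dog', 'cat', 'bird']
index = [0, 1]
data = [
    (0, 'dog cat dog', 'hotdog'),
    (1, 'concatenate', 'cat cat'),
]
x = count_matches([row[1] for row in data], name, index)
y = count_matches([row[2] for row in data], name, index)
```

The results are plain nested lists; np.array(x) turns them into an array if the rest of the code expects one. How much this helps depends on how often the substring test fails, which is likely to be most of the time when 8000 words are checked against each text.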