TLDR: Here is an attempt that takes an estimated ~2 hours on my computer for the scale of data you describe.
import numpy as np
import pandas as pd
def substring_search(fullstrings, sublen=4):
'''
fullstrings: array like of strings
sublen: length of substring to search
'''
# PART 1: FIND SUBSTRINGS
# length of full strings, assumes all are same
strsize = len(fullstrings[0])
# get unique strings, # occurences
strs, counts = np.unique(fullstrings, return_counts=True)
fullstrings = pd.DataFrame({'string':strs,
'count':counts})
unique_n = len(fullstrings)
# create array to hold substrings
substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
substrings = pd.Series(substrings)
# slice to find each substring
c = 0
while c + sublen <= strsize:
sliced = fullstrings['string'].str.slice(c, c+sublen)
s = c * unique_n
e = s + unique_n
substrings[s: e] = sliced
c += 1
# take the set of substrings, save in output df
substrings = np.unique(substrings)
output = pd.DataFrame({'substrings':substrings,
'repeats': 0,
'unique_sources': 0})
# PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS
for i, s in enumerate(output['substrings']):
# check which fullstrings contain each substring
idx = fullstrings['string'].str.contains(s)
count = fullstrings['count'][idx].sum()
output.loc[i, 'repeats'] = count
output.loc[i, 'unique_sources'] = idx.sum()
print('Finished!')
return output
Applied to your example:
>>> example = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
>>> substring_search(example)
substrings repeats unique_sources
0 AAAA 1 1
1 AAAB 1 1
2 ABBB 1 1
3 BBBA 2 1
4 BBBB 3 2
Explanation
The basic idea in the above code is to loop over all the unique substrings, and (for each of them) check against the list of full strings using pandas str methods. This saves one for loop (i.e. you don't loop over each full string for each substring). The other idea is to only check unique full strings (in addition to unique substrings); you save the number of occurrences of each full string beforehand and correct the count at the end.
The basic structure is:
- Get the unique strings in the input, and record the number of times each occurs.
- Find all unique substrings in the input (I do this using
pandas.Series.str.slice)
- Loop over each substring, and use
pandas.Series.str.contains to (element-wise) check the full strings. Since these are unique and we know the number of times each occurred, we can fill both repeats and unique_sources.
Testing
Here is code I used to create larger input data:
n = 100
size = 12
letters = list(string.ascii_uppercase[:20])
bigger = [''.join(np.random.choice(letters, size)) for i in range(n)]
So bigger is n size-length strings:
['FQHMHSOIEKGO',
'FLLNCKAHFISM',
'LDKKRKJROIRL',
...
'KDTTLOKCDMCD',
'SKLNSAQQBQHJ',
'TAIAGSIEQSGI']
With modified code that prints the progress (posted below), I tried with n=150000 and size=12, and got this initial output:
Starting main loop...
5%, 344.59 seconds
10.0%, 685.28 seconds
So 10 * 685 seconds / 60 (seconds/minute) = ~114 minutes. So 2 hours is not ideal, but practically more useful than 1-week. I don't doubt that there is some much cleverer way to do this, but maybe this can be helpful if nothing else is posted.
If you do use this code, you may want to verify that the results are correct with some smaller examples. One thing I was unsure of is whether you want to count whether a substring just appears in each full string (i.e. contains), or if you wanted the number of times it appears in a full string (i.e. count). That would at least hopefully be a small change.
Here is the additional code for printing progress while doing the search; there are just additional statements in #PART 2:
def substring_search_progress(fullstrings, sublen=4):
'''
fullstrings: array like of strings
sublen: length of substring to search
'''
# PART 1: FIND SUBSTRINGS
# length of full strings, assumes all are same
strsize = len(fullstrings[0])
# get unique strings, # occurences
strs, counts = np.unique(fullstrings, return_counts=True)
fullstrings = pd.DataFrame({'string':strs,
'count':counts})
unique_n = len(fullstrings)
# create array to hold substrings
substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
substrings = pd.Series(substrings)
# slice to find each substring
c = 0
while c + sublen <= strsize:
sliced = fullstrings['string'].str.slice(c, c+sublen)
s = c * unique_n
e = s + unique_n
substrings[s: e] = sliced
c += 1
# take the set of substrings, save in output df
substrings = np.unique(substrings)
output = pd.DataFrame({'substrings':substrings,
'repeats': 0,
'unique_sources': 0})
# PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS
# for marking progress
total = len(output)
every = 5
progress = every
# main loop
print('Starting main loop...')
start = time.time()
for i, s in enumerate(output['substrings']):
# progress
if (i / total * 100) > progress:
now = round(time.time() - start, 2)
print(f'{progress}%, {now} seconds')
progress = (((i / total * 100) // every) + 1) * every
# check which fullstrings contain each substring
idx = fullstrings['string'].str.contains(s)
count = fullstrings['count'][idx].sum()
output.loc[i, 'repeats'] = count
output.loc[i, 'unique_sources'] = idx.sum()
print('Finished!')
return output