Following on from this question, I want a per-row count of specific words or phrases in a DataFrame, where the phrases are extremely similar and one of them is a substring of the others. When I expanded and diversified the set of search strings, I ran into trouble and could not generalize the earlier solution. The set of strings I'm looking for is as follows:
Set of Strings
'burgers',
'burgers.extreme',
'burgers legacy',
'burgers 2',
'burgers V2',
'greasy burgers',
'ultimate burgers',
'chewy burgers',
'rotten burgers'
Expected Output
Additional columns appended to the DataFrame, one per search string, with the strings as the respective column names. Each column holds the count of that string per row.
Data: DataFrame column with "Text" as the column name:
Text
burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4
burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4
burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3
greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4
chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4
burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4
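For reproducibility, the sample column above can be rebuilt like this (variable name `df` is mine):

```python
import pandas as pd

# Reconstruct the sample "Text" column shown above.
df = pd.DataFrame({
    "Text": [
        "burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4",
        "burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4",
        "burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3",
        "greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4",
        "chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4",
        "burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4",
    ]
})
```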
Issues to Note
- As in the other question linked above, the obvious problem is that the string 'burgers' is a substring of the other strings, so something like `df['burgers'] = df['Text'].str.count('burgers')` over-counts 'burgers'.
- To delineate and count 'burgers' as its own entity, I believe I have to rely on what immediately surrounds that string: if 'burgers' is not immediately preceded by `'greasy '`, `'ultimate '`, `'chewy '`, or `'rotten '`, and not immediately followed by `'.extreme'`, `' legacy'`, `' 2'`, or `' V2'`, then it should be counted as a unique item in that row.
- All of the above prefixes and suffixes contain a space (whitespace) except `'.extreme'`, meaning the item `'burgers.extreme'` is its own unique item and, unlike the others, has no separating space.
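To make that delineation rule concrete, here is a minimal sketch with plain `re` (variable names are mine) that matches only the bare 'burgers' by excluding each prefix with a negative lookbehind and each suffix with a negative lookahead:

```python
import re

# Bare 'burgers': no qualifying prefix before it, no qualifying suffix after it.
bare = re.compile(
    r"(?<!greasy )(?<!ultimate )(?<!chewy )(?<!rotten )"  # not preceded by a prefix
    r"burgers"
    r"(?!\.extreme)(?! legacy)(?! V2)(?! 2)"              # not followed by a suffix
)

row = ("greasy burgers do not meet requirements 1; burgers satisfies "
       "requirements 2; burgers.extreme meets requirement 3; "
       "ultimate burgers satisfies requirements 4")
print(len(bare.findall(row)))  # 1: only the middle 'burgers' is bare
```

Each lookbehind is individually fixed-width, which is all Python's `re` requires.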
What I attempted
Using Andrej's wonderful answer as a starting point, I attempted to modify it for this broader set of strings, but I have been running into trouble with the regex and with generalizing it. For example, trying:
tmp = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers\s*(legacy|V2|2)?").explode()
df = pd.concat([df, pd.crosstab(tmp.index, tmp).add_prefix("burgers ")], axis=1)
df.columns = map(str.strip, df.columns)
but adding `(greasy|ultimate|chewy|rotten)?\s*` raises `TypeError: Cannot broadcast np.ndarray with operand of type <class 'list'>` (with two capture groups, `findall` returns tuples instead of strings). It also no longer makes sense to label the columns with `.add_prefix("burgers ")`, since text is now being matched on either side, and the pattern simply omits 'burgers.extreme' as well.
I thought that perhaps doing it piece-wise might be the way to go, something like
tmp1 = df["Text"].str.findall(r"burgers\s*(legacy|V2|2)?").explode()
tmp2 = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers").explode()
tmp3 = df["Text"].str.findall(r"burgers.extreme").explode()
but I'm not able to merge or concatenate the three in a way that lets the crosstab work; it returns a dimensionality error. It also seems like a very inefficient approach.
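For concreteness, here is a brute-force sketch of the output I'm after, counting each exact phrase with `str.count` and using lookarounds to isolate the bare 'burgers' (pattern dictionary and names are mine; I'm hoping there is a cleaner, more general version of this):

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4",
    "greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4",
]})

# One anchored pattern per target string; bare 'burgers' excludes all
# qualifying prefixes/suffixes via negative lookbehinds/lookaheads.
patterns = {
    "burgers": (r"(?<!greasy )(?<!ultimate )(?<!chewy )(?<!rotten )"
                r"burgers(?!\.extreme)(?! legacy)(?! V2)(?! 2)"),
    "burgers.extreme":  r"burgers\.extreme",
    "burgers legacy":   r"burgers legacy",
    "burgers 2":        r"burgers 2\b",
    "burgers V2":       r"burgers V2",
    "greasy burgers":   r"greasy burgers",
    "ultimate burgers": r"ultimate burgers",
    "chewy burgers":    r"chewy burgers",
    "rotten burgers":   r"rotten burgers",
}

for col, pat in patterns.items():
    df[col] = df["Text"].str.count(pat)

print(df.loc[0, "burgers"])  # 2: 'burgers is' and 'burgers answers' are bare
```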
What can I do to resolve this? Or what generalized approach allows searching for and counting strings per row when one of the search strings is a substring of the others?