I have a dataframe column of strings and a separate list of substrings. I want to test every substring against each string in the column, returning True when the string contains one of the substrings. The following works but is very slow.
import pandas as pd
import time
t0 = time.time()
df = pd.DataFrame({
'FullName': ['C:/historical Dog analysis/Digger.doc', 'C:/historical Dog analysis/Roscoe.doc', 'C:/2024/Budgie requests/pipsqueak.csv', 'C:/text4.doc', 'C:/text5.doc'],
})
new_columns = {"_Outreach/Website design": df['FullName'].str.contains(
    "/historical Dog analysis/|"
    "/Budgie requests/|"
    "Dog analysis/best practices", case=False)
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1).reindex(df.index)
t1 = time.time()
print(t1-t0)
print(df)
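For reference, here is a minimal sketch of the same test driven by a Python list rather than a hand-written pattern. The `substrings` list and the `mask` name are assumptions for illustration, and `re.escape` is used so that any regex metacharacters in a substring are matched literally. It shows only how the pattern can be built; nothing here has been timed, so it is not claimed to be faster.

```python
import re

import pandas as pd

# Hypothetical list of substrings; the values mirror the pattern used above.
substrings = [
    "/historical Dog analysis/",
    "/Budgie requests/",
    "Dog analysis/best practices",
]

df = pd.DataFrame({
    "FullName": [
        "C:/historical Dog analysis/Digger.doc",
        "C:/historical Dog analysis/Roscoe.doc",
        "C:/2024/Budgie requests/pipsqueak.csv",
        "C:/text4.doc",
        "C:/text5.doc",
    ],
})

# Escape each substring so it is treated as literal text,
# then join everything into one alternation pattern.
pattern = "|".join(re.escape(s) for s in substrings)

# A single str.contains call; case=False mirrors the example above.
mask = df["FullName"].str.contains(pattern, case=False, regex=True)
df["_Outreach/Website design"] = mask
print(df)
```

Assigning the boolean Series straight to a new column also avoids the intermediate `pd.DataFrame` and `pd.concat` steps.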
In an effort to find a faster approach, I tried isin, but it appears to match only whole strings, not substrings.
t0 = time.time()
df = pd.DataFrame({
'FullName': ['C:/historical Dog analysis/Digger.doc', 'C:/historical Dog analysis/Roscoe.doc', 'C:/2024/Budgie requests/pipsqueak.csv', 'C:/text4.doc', 'C:/text5.doc'],
})
# works, but not useful because it requires a full-string match
new_columns = df["FullName"].isin(["C:/historical Dog analysis/Digger.doc","C:/2024/Budgie requests/pipsqueak.csv"])
# doesn't work: isin compares whole values, so this returns False for every row
# new_columns = df["FullName"].isin([".*/historical Dog analysis/.*"])
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1).reindex(df.index)
t1 = time.time()
print(t1-t0)
print(df)
I also tried filter, but the way I have written it, it can only test one substring at a time.
t0 = time.time()  # restart the timer so this attempt is measured on its own
col_one_list = df['FullName'].tolist()
# doesn't work: TypeError: 'in <string>' requires string as left operand, not list
# b = ["/historical Dog analysis/","/Budgie requests/"]
# doesn't work: TypeError: unsupported operand type(s) for |: 'str' and 'str'
# b = ("/historical Dog analysis/"|"/Budgie requests/")
# works, but can only search one substring at a time
b = "/historical Dog analysis/"
new_columns = list(filter(lambda x: b in x, col_one_list))
print(new_columns)
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1).reindex(df.index)
t1 = time.time()
print(t1-t0)
print(df)
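To make the goal concrete, the membership test can be extended to several substrings by wrapping it in `any()`. The sketch below is plain Python under a few assumptions: the names `full_names`, `substrings`, `lowered`, and `matches` are illustrative, and both sides are lower-cased to mimic `case=False` from the first attempt. Unlike `filter`, it returns one boolean per input string, so the result has the same length as the column. Its speed has not been measured.

```python
full_names = [
    "C:/historical Dog analysis/Digger.doc",
    "C:/historical Dog analysis/Roscoe.doc",
    "C:/2024/Budgie requests/pipsqueak.csv",
    "C:/text4.doc",
    "C:/text5.doc",
]

# Hypothetical list of substrings to look for.
substrings = ["/historical Dog analysis/", "/Budgie requests/"]

# Lower-case the substrings once so the comparison ignores case.
lowered = [s.lower() for s in substrings]

# any() stops at the first substring found in a given name.
matches = [any(s in name.lower() for s in lowered) for name in full_names]
print(matches)
```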
Does anyone know a fast way to match a list of substrings to strings?