Delete substrings in a pandas DataFrame with Python

Question

I want my Python script to delete a row in a DataFrame if the term at the current index is a substring of the following term, or if the following term is a substring of the term at the current index.

In the following example, only the last row with the term 'A 600 Strom' and the row with the term 'Silent' should be left.

    term            timestamp
83  A 6             2018-09-27 18:26:46
85  A 60            2018-09-27 18:26:46
86  A 600           2018-09-27 18:26:46
89  A 600           2018-09-27 18:26:47
91  A 600 S         2018-09-27 18:26:47
93  A 600 Str       2018-09-27 18:26:48
95  A 600 Stro      2018-09-27 18:26:49
97  A 600 Str       2018-09-27 18:26:53
98  A 600 Strom     2018-09-27 18:26:5
99  S               2018-09-27 18:26:48
100 Sil             2018-09-27 18:26:49
101 Silen           2018-09-27 18:26:53
102 Silent          2018-09-27 18:26:5

Is there an elegant and efficient solution or do I have to process a series of if-statements in a loop?

Is the term always in the same format, e.g. 'A 600 Storm' or 'B 250 Rain', where 'B 2' would be a subset of it?
– Umar.H, Commented Jun 22, 2020 at 15:57
It is not. It could also be something like "weather", and "weat" would be a subset. For a better understanding: the data comes from an application that gathers all search queries from the users, so the term could be in any format.
– Finito, Commented Jun 22, 2020 at 16:07
Yes, but unfortunately it is inconsistent and therefore not really usable.
– Finito, Commented Jun 22, 2020 at 16:23

Shubham Sharma · Accepted Answer · 2020-06-22 16:12:03Z

Use Series.shift to shift the term column up by one row and assign it to a new column s_1, so that each row holds its own term next to the following term. Then use DataFrame.agg along axis=1 to build a boolean mask that is True when the current term is a substring of the following term (s_1) or the following term is a substring of the current term. Finally, use the inverted mask to filter the DataFrame:

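# s_1 holds the term from the following row
mask = (
    df.assign(s_1=df['term'].shift(-1).astype(str))
    # True when either term is a substring of the other
    .agg(lambda s: s['term'] in s['s_1'] or s['s_1'] in s['term'], axis=1)
)

# Keep only the rows whose term does not overlap with the following term
df1 = df[~mask]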

Result:

# print(df1)
           term            timestamp
98  A 600 Strom  2018-09-27 18:26:53
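
The result above matches the sample rows up to index 98. On the full sample from the question, a plain substring test also drops row 98, because 'S' occurs inside 'A 600 Strom'. If the intent is to collapse keystroke-by-keystroke search queries, a prefix test may be closer to the expected output. The following is only a sketch under that assumption: the helper name `overlaps` and its `prefix_only` flag are hypothetical and not part of the answer, and the timestamps are left out for brevity.

```python
import pandas as pd

# Terms copied from the question; timestamps omitted for brevity.
df = pd.DataFrame(
    {"term": ["A 6", "A 60", "A 600", "A 600", "A 600 S", "A 600 Str",
              "A 600 Stro", "A 600 Str", "A 600 Strom",
              "S", "Sil", "Silen", "Silent"]},
    index=[83, 85, 86, 89, 91, 93, 95, 97, 98, 99, 100, 101, 102],
)


def overlaps(current, following, prefix_only):
    """Hypothetical helper: True when one term is contained in the other.

    With prefix_only=True, only a match at the start of the term counts.
    """
    if following is None:  # the last row has no following term
        return False
    if prefix_only:
        return current.startswith(following) or following.startswith(current)
    return current in following or following in current


terms = df["term"].tolist()
following_terms = terms[1:] + [None]  # each term paired with the next one

# Mask using the plain substring rule from the answer.
substring_mask = pd.Series(
    [overlaps(c, f, prefix_only=False) for c, f in zip(terms, following_terms)],
    index=df.index,
)
# Mask using the assumed prefix rule.
prefix_mask = pd.Series(
    [overlaps(c, f, prefix_only=True) for c, f in zip(terms, following_terms)],
    index=df.index,
)

kept_substring = df[~substring_mask]
kept_prefix = df[~prefix_mask]

print(kept_substring.index.tolist())  # [102]
print(kept_prefix.index.tolist())     # [98, 102]
```

Pairing the terms with `zip` instead of `shift` avoids having to convert the missing successor of the last row to a string. The prefix rule would miss a case such as "weat" followed by "my weather", so whether it fits depends on how the queries are actually typed.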

Exactly what I needed. Thank you very much!
– Finito, Commented over a year ago
