I'm playing around with regular expression in Python for the below data.
Random
0 helloooo
1 hahaha
2 kebab
3 shsh
4 title
5 miss
6 were
7 laptop
8 welcome
9 pencil
I would like to delete the words which have patterns of repeated letters (e.g. blaaaa), repeated pair of letters (e.g. hahaha) and any words which have the same adjacent letters around one letter (e.g.title, kebab, were).
Here is the code:
import pandas as pd
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)
print(df)
Below is the output for the above with a Warning message:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Random
0 hahaha
1 kebab
2 shsh
3 title
4 were
5 laptop
6 welcome
7 pencil
However, I expect to see this:
Random
0 laptop
1 welcome
2 pencil