Delete words with regex patterns in Python from a dataframe

Question

I'm experimenting with regular expressions in Python on the data below.

     Random
0  helloooo
1    hahaha
2     kebab
3      shsh
4     title
5      miss
6      were
7    laptop
8   welcome
9    pencil

I would like to delete the words that contain repeated letters (e.g. blaaaa), repeated pairs of letters (e.g. hahaha), or the same letter on both sides of a single letter (e.g. title, kebab, were).

Here is the code:

import pandas as pd

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}

df = pd.DataFrame(data)

df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)

print(df)

This produces the following output, along with a warning message:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
    Random
0   hahaha
1    kebab
2     shsh
3    title
4     were
5   laptop
6  welcome
7   pencil

However, I expect to see this:

    Random
0   laptop
1  welcome
2   pencil

Wiktor Stribiżew · Accepted Answer · 2021-09-07 15:23:07Z

4

You can use Series.str.contains directly to create a mask, disabling the user warning before the call and re-enabling it afterwards:

import pandas as pd
import warnings

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
# Filter the whole frame; assigning the filtered Series back to the column would leave NaN in the dropped rows
df = df[~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning

Output:

>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>     
7     laptop
8    welcome
9     pencil
Name: Random, dtype: object

The regex you have contains an issue: the quantifier is placed outside the group, so the group captures only a single letter and \1 was looking for the wrong repeated string. Also, the \b word boundary is superfluous. The ([a-z]+)[a-z]?\1 pattern matches one or more letters, then one optional letter, and then the same substring the group captured.
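To make the difference concrete, here is a small sketch that runs both patterns over the sample words with the standard re module. It uses re.search, which looks for a match anywhere in the string, as str.contains does. The variable names are only illustrative.

```python
import re

words = ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
         'miss', 'were', 'laptop', 'welcome', 'pencil']

original = re.compile(r"([a-z])+\1{1,}\b")   # pattern from the question
corrected = re.compile(r"([a-z]+)[a-z]?\1")  # pattern from this answer

# Words in which each pattern finds a match somewhere in the string
hit_original = [w for w in words if original.search(w)]
hit_corrected = [w for w in words if corrected.search(w)]

# Words that survive the filter with the corrected pattern
kept = [w for w in words if not corrected.search(w)]

print(hit_original)   # ['helloooo', 'miss']
print(hit_corrected)
print(kept)           # ['laptop', 'welcome', 'pencil']
```

The question's pattern only catches words that end in a doubled letter, which is why only helloooo and miss were dropped there.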

See the regex demo.

We can safely disable the user warning because the capturing group is deliberate: the backreference in this pattern requires it. Re-enabling the warning afterwards means it will still flag capturing groups that are used unnecessarily in other parts of the code.
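As an alternative sketch (not what the code above does), the suppression can be scoped with the standard warnings.catch_warnings() context manager, which restores the previous filters on exit so no explicit re-enabling is needed. The message filter ".*match groups" is an assumption: the exact wording of the warning differs between pandas versions, so a looser regex is used here.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'Random': ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
                              'miss', 'were', 'laptop', 'welcome', 'pencil']})

with warnings.catch_warnings():
    # Ignore only the "match groups" UserWarning, and only inside this block
    warnings.filterwarnings("ignore", message=".*match groups", category=UserWarning)
    mask = df['Random'].str.contains(r"([a-z]+)[a-z]?\1")

# Keep the rows that do not match and renumber the index
result = df[~mask].reset_index(drop=True)
print(result)
```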

edited Sep 7, 2021 at 15:23

answered Sep 7, 2021 at 14:06

Wiktor Stribiżew


6 Comments

user01 Over a year ago

thanks for the explanation! I got an error for the warning --> NameError: name 'warnings' is not defined

Wiktor Stribiżew Over a year ago

@user01 Sorry, lost when copy/pasting. Added import warnings.

user01 Over a year ago

Also, if I want to check for the repeated digits. Where do I add '\d+' to the regex?

Wiktor Stribiżew Over a year ago

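@user01 It depends: ([a-z\d]+)[a-z\d]?\1 for lowercase letters and digits, (\w+)\w?\1 for any word characters, or ([^\W_]+)[^\W_]?\1 for letters and digits without the underscore. Can you specify which characters you want to match? If any character at all, then use (.+).?\1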

Wiktor Stribiżew Over a year ago

@user01 Then use ([a-z2-5]+)[a-z2-5]?\1
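A quick way to compare the character-class variants from the comments above is to run them over a few strings with the re module. The sample strings and dictionary keys below are made up for illustration and are not from the original data.

```python
import re

# Variants suggested in the comments; the keys are illustrative labels
patterns = {
    "letters_or_digits": r"([a-z\d]+)[a-z\d]?\1",
    "word_chars": r"(\w+)\w?\1",
    "alnum_no_underscore": r"([^\W_]+)[^\W_]?\1",
    "any_char": r"(.+).?\1",
}

# Hypothetical test strings: repeated digits, repeated underscores, repeated hyphens
samples = ["ab1212", "a__b", "x-y-z", "laptop"]

# For each variant, list the samples in which it finds a match
results = {name: [s for s in samples if re.search(p, s)]
           for name, p in patterns.items()}

for name, matched in results.items():
    print(name, matched)
```

Only the \w and . variants treat the doubled underscore as a repetition, and only the . variant reacts to the hyphens.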


fsimonjetz · Accepted Answer · 2021-09-07 14:03:56Z

2

If I understand correctly, you can use something like the pattern r'(\w+)(\w)?\1', i.e., one or more word characters, an optional word character, and then the text captured by the first group again. This gives the right result:

df[~df.Random.str.contains(r'(\w+)(\w)?\1')]
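If the "match groups" warning from str.contains is unwanted, one possible workaround (a sketch, assuming the column holds only strings with no missing values) is to compile the pattern with re and build the mask with Series.map, which does not go through str.contains at all. The optional character is left uncaptured here because nothing refers back to it.

```python
import re
import pandas as pd

df = pd.DataFrame({'Random': ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
                              'miss', 'were', 'laptop', 'welcome', 'pencil']})

# Same idea as the pattern above, without a second capturing group
pat = re.compile(r'(\w+)\w?\1')

# True where the word contains a repetition; assumes every value is a str
mask = df['Random'].map(lambda s: pat.search(s) is not None)

filtered = df[~mask].reset_index(drop=True)
print(filtered)
```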

answered Sep 7, 2021 at 14:03

fsimonjetz
