I'm playing around with regular expressions in Python on the data below.

     Random
0  helloooo
1    hahaha
2     kebab
3      shsh
4     title
5      miss
6      were
7    laptop
8   welcome
9    pencil

I would like to delete words that contain a repeated letter (e.g. blaaaa), a repeated pair of letters (e.g. hahaha), or the same letter on both sides of a single letter (e.g. title, kebab, were).

Here is the code:

import pandas as pd

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}

df = pd.DataFrame(data)

df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(axis=1)].reset_index(drop=True)

print(df)

This is the output, along with a warning message:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
    Random
0   hahaha
1    kebab
2     shsh
3    title
4     were
5   laptop
6  welcome
7   pencil

However, I expect to see this:

    Random
0   laptop
1  welcome
2   pencil

2 Answers

You can use Series.str.contains directly to create a mask and disable the user warning before and enable it after:

import pandas as pd
import warnings

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
df = df[~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")] # Keep only the rows that do not match
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning

Output:

>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>     
7     laptop
8    welcome
9     pencil
Name: Random, dtype: object

The regex in the question has two problems. First, the quantifier sits outside the group: ([a-z])+ repeats a one-letter group, so \1 only ever holds the last letter matched, not a multi-letter substring. Second, the \b word boundary forces the match to end at the end of a word, which is why only helloooo and miss were removed. The ([a-z]+)[a-z]?\1 pattern matches one or more letters, then one optional letter, and then the same substring again.
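To make the difference concrete, here is a minimal sketch using only Python's built-in re module on the same word list. The helper name matched_words is made up for illustration; re.search is used because Series.str.contains also looks for a match anywhere in the string.

```python
import re

words = ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
         'miss', 'were', 'laptop', 'welcome', 'pencil']

# Pattern from the question: one-letter group, match must end at a word boundary
original = re.compile(r"([a-z])+\1{1,}\b")
# Pattern from this answer: multi-letter group, optional letter in between
fixed = re.compile(r"([a-z]+)[a-z]?\1")

def matched_words(pattern, items):
    """Return the items in which the pattern matches somewhere (illustrative helper)."""
    return [w for w in items if pattern.search(w)]

removed_by_original = matched_words(original, words)
removed_by_fixed = matched_words(fixed, words)
kept = [w for w in words if not fixed.search(w)]

print(removed_by_original)  # ['helloooo', 'miss']
print(kept)                 # ['laptop', 'welcome', 'pencil']
```

The original pattern only catches words that end in a doubled letter, while the corrected one catches every word in the list except the three expected ones.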

See the regex demo.

We can safely disable the user warning because the capturing group is deliberate here: the backreference \1 needs it. Re-enable the warning afterwards so that it still flags unnecessary capturing groups in other parts of the code.
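As an alternative sketch (not what the code above does), the suppression can be scoped with the standard library's warnings.catch_warnings context manager, so the previous filter state is restored automatically. This version ignores UserWarning by category instead of by message text, on the assumption that the exact wording of the message may differ between pandas versions; the names mask and result are illustrative.

```python
import warnings

import pandas as pd

data = {'Random': ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
                   'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)

# Filters changed inside the block are undone when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    mask = df['Random'].str.contains(r"([a-z]+)[a-z]?\1")

# Drop the matching rows and renumber the index from 0
result = df[~mask].reset_index(drop=True)
print(result)
```

Calling reset_index(drop=True) gives the 0, 1, 2 index shown in the expected output of the question.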

6 Comments

Thanks for the explanation! I got an error for the warning --> NameError: name 'warnings' is not defined
@user01 Sorry, lost when copy/pasting. Added import warnings.
Also, what if I want to check for repeated digits too? Where do I add '\d+' in the regex?
@user01 Maybe ([a-z\d]+)[a-z\d]?\1, or (\w+)\w?\1, or ([^\W_]+)[^\W_]?\1? Can you specify which characters you want to match? If any character, then use (.+).?\1
@user01 Then use ([a-z2-5]+)[a-z2-5]?\1
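
The character-class variants suggested in these comments differ mainly in whether digits and underscores count. A small sketch with the re module shows this; the sample strings are made up for illustration and are not from the question's data.

```python
import re

# Hypothetical test strings, not part of the original data
samples = ['tag1212', 'abc123', 'a_b_']

patterns = {
    'letters only': r"([a-z]+)[a-z]?\1",
    'letters and digits': r"([a-z\d]+)[a-z\d]?\1",
    'word characters': r"(\w+)\w?\1",                   # letters, digits, underscore
    'word characters, no underscore': r"([^\W_]+)[^\W_]?\1",
}

# For each variant, collect the samples in which the pattern matches somewhere
results = {
    name: [s for s in samples if re.search(rx, s)]
    for name, rx in patterns.items()
}

for name, hits in results.items():
    print(f"{name}: {hits}")
```

Only the variants that include digits flag 'tag1212', and only plain \w treats the underscores in 'a_b_' as a repeated character.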

IIUC, you can use something like the pattern r'(\w+)(\w)?\1', i.e., one or more word characters (letters, digits or underscore), an optional word character, and then the text captured by the first group again. This gives the right result:

df[~df.Random.str.contains(r'(\w+)(\w)?\1')]
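
As a side note, the second capturing group is not needed for the filter, since only \1 is referenced. A quick sketch with the re module, on the same word list, suggests that dropping it leaves the result unchanged; the variable names are illustrative.

```python
import re

words = ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
         'miss', 'were', 'laptop', 'welcome', 'pencil']

two_groups = re.compile(r"(\w+)(\w)?\1")  # pattern from this answer
one_group = re.compile(r"(\w+)\w?\1")     # same pattern without the second group

kept_two = [w for w in words if not two_groups.search(w)]
kept_one = [w for w in words if not one_group.search(w)]

print(kept_two)
print(kept_one)
```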
