How to remove unique character based on the same index via regex

Question

While working through a Stack Overflow question about using regex to extract values, I ran into the following problem.

I am wondering how to write a regex that removes every character that is the same in every row at the same index position.

Below is the DataFrame:

print(df)
   column1
0  [b,e,c]
1  [e,a,c]
2  [a,b,c]

Regex:

df['column1'] = df.column1.str.extract(r'(\w,\w)', expand=False)

print(df)
  column1
0     b,e
1     e,a
2     a,b

The regex above extracts the characters I need, but I also want to preserve the surrounding [ ].

Are there strings in column1? Do you actually have '[b,e,c]' there?
– Wiktor Stribiżew, Commented Aug 18, 2021 at 10:37
If there are strings in column1, try df['column1'].str.replace(r'\[(\w,\w).*', r'[\1]', regex=True)
– Wiktor Stribiżew, Commented Aug 18, 2021 at 10:38
It produced an error: sre_constants.error: unterminated character set at position 0
– user2023, Commented Aug 18, 2021 at 10:40
No, it works well, you must have left out the escape before [
– Wiktor Stribiżew, Commented Aug 18, 2021 at 10:41
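
For illustration, the error discussed in the comments above can be reproduced with Python's re module alone. This is only a sketch: the unescaped pattern is a guess at what was actually run, based on the last comment, and the variable names are made up for the example.

```python
import re

# Assumed reconstruction of the failing pattern: the leading '[' is not escaped,
# so the regex engine reads it as the start of a character class that never closes.
try:
    re.compile(r'[(\w,\w).*')
except re.error as exc:
    message = str(exc)
    print(message)

# With the bracket escaped, '\[' matches a literal '[' and the pattern compiles.
fixed = re.compile(r'\[(\w,\w).*')
print(fixed.sub(r'[\1]', '[b,e,c]'))
```

The same escaping rule applies to the pattern passed to .str.replace with regex=True.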

Wiktor Stribiżew · Accepted Answer · 2021-08-18 11:01:38Z

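You can use either of the following:

# Option 1: replace the whole string with "[" + Group 1 + "]"
df['column2'] = df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
# Option 2: extract the pair and add the brackets back
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)') + ']'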

In the .str.replace approach, (?s).*?\[(\w,\w).* matches zero or more characters, as few as possible (with (?s), the dot also matches line breaks), then a literal [, then captures a word character, a comma and a word character into Group 1 (\1), and finally matches the rest of the string. The whole match is replaced with [ + the Group 1 value + ].

In the second approach, [ and ] are added around the extracted value. This is the simplest solution for your toy examples.
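
The pattern from the .str.replace approach can also be written out piece by piece. The following is a sketch using plain re with re.VERBOSE rather than pandas; the re.DOTALL flag stands in for the inline (?s), and the variable name is only illustrative.

```python
import re

# Same pattern as in the .str.replace call, one component per line.
pattern = re.compile(r"""
    .*?        # any characters, as few as possible
    \[         # a literal opening bracket
    (\w,\w)    # Group 1: word character, comma, word character
    .*         # the rest of the string
""", re.DOTALL | re.VERBOSE)

# The whole match is replaced with the Group 1 value wrapped in brackets.
result = pattern.sub(r'[\1]', '[b,e,c]')
print(result)
```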

Here is a Pandas test:

>>> import pandas as pd
>>> df = pd.DataFrame({'column1':['[b,e,c]']})
>>> df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
0    [b,e]
Name: column1, dtype: object

>>> '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
       0
0  [b,e]
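
If the actual requirement is to drop whichever position holds the same character in every row, a position-wise comparison may be easier than a regex. The following is a minimal sketch, not part of the regex approaches above, and it assumes every row is a bracketed, comma-separated string with the same number of items; the names parts and varying are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['[b,e,c]', '[e,a,c]', '[a,b,c]']})

# Strip the brackets and split on commas: one column per index position.
# Assumption: all rows have the same number of items.
parts = df['column1'].str.strip('[]').str.split(',', expand=True)

# Keep only the positions whose value differs between rows.
varying = parts.loc[:, parts.nunique() > 1]

# Join the remaining items and add the brackets back.
df['column2'] = '[' + varying.agg(','.join, axis=1) + ']'
print(df['column2'].tolist())
```

Here the third position is c in every row, so only the first two positions are kept.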

edited Aug 18, 2021 at 11:01

answered Aug 18, 2021 at 10:46

Wiktor Stribiżew


2 Comments

user2023 Over a year ago

Thanks for the answer, it satisfies the sample requirement, but I need a future-proof solution that checks whether the character is the same at the same index position.

Wiktor Stribiżew Over a year ago

@kulfi You need to provide exact specs. Otherwise, df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True) might suffice. The second solution adding [ and ] around extracted values seems the simplest.
