
While working through a Stack Overflow question about using regex to extract values, I ran into a related problem.

I am wondering how to write a regex that removes every character that is the same in every row and sits at the same index position.

Below is the DataFrame:

print(df)
   column1
0  [b,e,c]
1  [e,a,c]
2  [a,b,c]

Regex:

 df['column1'] = df['column1'].str.extract(r'(\w,\w)', expand=False)

 print(df)
  column1
0     b,e
1     e,a
2     a,b

The regex above extracts the characters I need, but I also want to preserve the square brackets [ ].

  • Are there strings in column1? Do you actually have '[b,e,c]' there? Commented Aug 18, 2021 at 10:37
  • @WiktorStribiżew, yes these are all strings. Commented Aug 18, 2021 at 10:38
  • If there are strings in column1, try df['column1'].str.replace(r'\[(\w,\w).*', r'[\1]', regex=True) Commented Aug 18, 2021 at 10:38
  • It produced an error: sre_constants.error: unterminated character set at position 0 Commented Aug 18, 2021 at 10:40
  • No, it works well; you must have left out the escape before [ Commented Aug 18, 2021 at 10:41

1 Answer

You can use

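# Approach 1: rewrite the whole string, keeping only the captured pair inside brackets
df['column2'] = df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
# Approach 2 (alternative): extract the pair, then add the brackets back
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)') + ']'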

In the .str.replace approach, the pattern (?s).*?\[(\w,\w).* matches zero or more characters, as few as possible, up to a [ (the (?s) flag lets . match line breaks too), then captures a word character + comma + a word character into Group 1 (\1), and finally matches the rest of the string. The whole match is replaced with [ + the Group 1 value + ].
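
To see the pieces of this pattern in isolation, here is a minimal sketch using Python's built-in re module instead of Pandas. The sample string is taken from the question. The last part shows what most likely happens when the backslash before [ is left out, as in the error reported in the comments.

```python
import re

# Same pattern as in the .str.replace call; (?s) lets . match line breaks too.
pattern = re.compile(r'(?s).*?\[(\w,\w).*')

m = pattern.fullmatch('[b,e,c]')
captured = m.group(1)                        # Group 1: word char, comma, word char
replaced = pattern.sub(r'[\1]', '[b,e,c]')   # whole match replaced by [ + Group 1 + ]
print(captured)   # b,e
print(replaced)   # [b,e]

# Without the escape, [ opens a character class that is never closed.
try:
    re.compile(r'[(\w,\w).*')
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print(exc)
```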

In the second approach, [ and ] are simply added around the result of the extraction; this is the simplest solution for the toy examples here.

Here is a Pandas test:

>>> import pandas as pd
>>> df = pd.DataFrame({'column1':['[b,e,c]']})
>>> df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
0    [b,e]
Name: column1, dtype: object

>>> '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
       0
0  [b,e]
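
If the real goal is to drop every position whose value is identical in all rows, a regex alone is awkward because it only sees one row at a time. Below is a rough sketch of one possible way to do it without regex, under the assumption that every row is a bracketed, comma-separated list with the same number of items. The names parts and varying are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['[b,e,c]', '[e,a,c]', '[a,b,c]']})

# Strip the brackets and split each row into its items;
# expand=True gives one DataFrame column per index position.
parts = df['column1'].str.strip('[]').str.split(',', expand=True)

# Keep only the positions where the rows do not all share the same value.
varying = parts.loc[:, parts.nunique() > 1]

# Join the surviving items and put the brackets back.
df['column2'] = '[' + varying.apply(','.join, axis=1) + ']'
print(df['column2'].tolist())  # ['[b,e]', '[e,a]', '[a,b]']
```

Rows of unequal length would leave missing values in parts and make the join fail, so that case would need extra handling.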

2 Comments

Thanks for the answer. It satisfies the sample requirement, but I need a future-proof solution that checks whether the character is the same at the same index position in every row.
@kulfi You need to provide exact specs. Otherwise, df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True) might suffice. The second solution adding [ and ] around extracted values seems the simplest.
