Removing a specific pattern from a string using regex in Python

Question

I am trying to remove a pattern using the following code:

import re

x = "mr<u+092d><u+093e><u+0935><u+0941><u+0915>"
pattern = '[<u+0-9de>]'
re.sub(pattern, '', x)

Output

mr

This output is correct for the given sample string, but when I run this code on my corpus, it removes every occurrence of 'd' and 'e' as well as digits, etc. I want these characters to be removed only when they appear inside < >.

azro · Accepted Answer · 2020-05-29 08:17:18Z


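You need to put the < and > outside the character class, because the structure is always:

start with <
followed by u\+ (the + is escaped so that it matches a literal plus sign)
4 hexadecimal characters, [0-9a-f]{4}, as in the Unicode U+XXXX notation
end with >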

pattern = r'<u\+[0-9a-f]{4}>'  # raw string, so \+ reaches the regex engine unchanged
re.sub(pattern, '', x)
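
To see the difference between the two patterns, here is a minimal sketch. The sample string `text` is made up for illustration and is not taken from the asker's corpus; it simply contains ordinary letters and digits that should survive the cleanup.

```python
import re

# Hypothetical sample: "made" and "2020" should be left untouched.
text = "made in 2020 mr<u+092d><u+093e>"

# A character class matches any ONE of the listed characters, anywhere.
char_class = r'[<u+0-9de>]'
# The literal pattern matches only a complete <u+XXXX> token.
token = r'<u\+[0-9a-f]{4}>'

over_stripped = re.sub(char_class, '', text)
cleaned = re.sub(token, '', text)

print(repr(over_stripped))  # 'ma in  mr'
print(repr(cleaned))        # 'made in 2020 mr'
```

The character class strips every d, e, u, digit and angle bracket on its own, while the token pattern only removes whole <u+XXXX> sequences.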


edited May 29, 2020 at 8:17

answered May 29, 2020 at 8:04


5 Comments

user7571182 Over a year ago

Because of hex characters.

azro Over a year ago

@Binh This is the Unicode definition: the 4 characters are hexadecimal.

Binh Over a year ago

@Mandy8055 Oh, I didn't notice they are hex characters. I understand it now. Thanks, Mandy.

Rajendra Nayal Over a year ago

@azro Another question: as I understand it, {4} is used because there are always 4 characters. If the character count is not uniform, we could use '|' and create multiple patterns. What would be a better way to handle that?

azro Over a year ago

@RajendraNayal To accept a range of lengths, use an interval quantifier, for example <u\+[0-9a-f]{2,5}>, which accepts 2 to 5 hex characters, such as 023, a5d0 and 5d9c0.
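
As a rough illustration of the interval quantifier, the sketch below checks a few made-up tokens against the {2,5} pattern. The list `samples` is hypothetical and only serves to show which lengths match.

```python
import re

# {2,5} means "between 2 and 5 repetitions", inclusive on both ends.
pattern = re.compile(r'<u\+[0-9a-f]{2,5}>')

# Hypothetical tokens with 1 to 6 hex characters.
samples = ["<u+3>", "<u+5d>", "<u+023>", "<u+a5d0>", "<u+5d9c0>", "<u+5d9c01>"]

# fullmatch requires the whole token to fit the pattern.
matched = [s for s in samples if pattern.fullmatch(s)]

print(matched)  # ['<u+5d>', '<u+023>', '<u+a5d0>', '<u+5d9c0>']
```

Tokens with 1 or 6 hex characters are rejected, so a single quantifier covers the range without needing '|' alternatives.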
