I am trying to remove a pattern using the following code:

import re

x = "mr<u+092d><u+093e><u+0935><u+0941><u+0915>"
pattern = '[<u+0-9de>]'
re.sub(pattern, '', x)

Output

mr

This output is correct for the sample string, but when I run the code on my corpus it also removes every occurrence of 'd' and 'e', as well as digits and other characters. I want these characters to be removed only when they appear inside < >.

1 Answer

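You need to put the < and > outside the character class. Inside [...], every listed character is matched on its own, anywhere in the text. The structure you want to match is always:

  • starts with <
  • followed by u\+ (the + is escaped so it is taken literally)
  • 4 hexadecimal characters, [0-9a-f]{4}, as in the Unicode code point notation
  • ends with >

pattern = r'<u\+[0-9a-f]{4}>'
re.sub(pattern, '', x)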
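To make the difference concrete, here is a minimal sketch comparing the two patterns on a line of mixed text. The sample string and the variable names are made up for illustration and are not taken from the asker's corpus.

```python
import re

# Hypothetical corpus line mixing ordinary text with <u+XXXX> placeholders.
text = "order 66 made mr<u+092d><u+093e> glad"

# Character class: matches any ONE of the listed characters, anywhere.
char_class = re.compile(r"[<u+0-9de>]")

# Sequence pattern: matches only a complete <u+XXXX> token.
token = re.compile(r"<u\+[0-9a-f]{4}>")

class_result = char_class.sub("", text)
token_result = token.sub("", text)

print(repr(class_result))  # 'd', 'e', 'u' and digits vanish from normal words too
print(repr(token_result))  # only the <u+XXXX> tokens are removed
```

The character class strips letters and digits out of ordinary words as well, which is the behavior described in the question, while the sequence pattern leaves the surrounding text untouched.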


5 Comments

Because of hex characters.
@Binh This follows the Unicode notation: the 4 characters are hexadecimal digits.
@Mandy8055 Oh, I didn't notice they are hex characters. I understand it now. Thanks, Mandy.
@azro Another doubt: as I understand it, {4} is used because there are always 4 characters. If the character count is not uniform, we could use '|' and create multiple patterns. What would be a better way to handle this?
@RajendraNayal To accept a range of lengths, use a bounded quantifier, for example <u\+[0-9a-f]{2,5}>, which accepts any length from 2 to 5, like 023, a5d0 and 5d9c0.
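As a quick check of how the {m,n} quantifier behaves, the sketch below tries a few made-up tokens against a {2,5} pattern. The sample tokens are illustrative only; both bounds are inclusive.

```python
import re

# {2,5} means "between 2 and 5 repetitions", inclusive on both ends.
pattern = re.compile(r"<u\+[0-9a-f]{2,5}>")

# Hypothetical tokens with 1 to 6 hex characters.
samples = ["<u+a>", "<u+a5>", "<u+023>", "<u+a5d0>", "<u+5d9c0>", "<u+5d9c01>"]

# fullmatch requires the whole token to fit the pattern.
matched = [s for s in samples if pattern.fullmatch(s)]
print(matched)
```

Tokens with 1 or 6 hex characters are rejected, so a single quantifier covers the range without needing several alternatives joined by '|'.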
