Regex expression for a particular pattern

Question

My data contains strings that start with \u followed by a 4-character hexadecimal value. (They are not unicode objects but literal strings in the data, which is why I would like to clean them up.) I would like to replace these occurrences with whitespace.

Example textfile: Hello \u2022 Created, reviewed, \u00e9executed and maintained

For example, the text contains the strings \u2022 and \u00e9. I would like to find each \u and remove it along with the 4 characters that follow it (here 2022 and 00e9). I'm looking for a suitable regex for this pattern.

Example Code:

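import io
import json
import re
from glob import glob

files = glob('Candidate Profile Data/*')

for file_ in files:
    with io.open(file_, 'r', encoding='us-ascii') as json_file:
        # io.open already returns decoded text, so no .decode() is needed
        json_data = json_file.read()
        # replace any run of non-ASCII characters with a space
        json_data = re.sub(r'[^\x00-\x7F]+', ' ', json_data)
        # replace literal backslash-n sequences with a space
        json_data = json_data.replace('\\n', ' ')
        # replace literal \uXXXX escape sequences with a space
        json_data = re.sub(r'\\u[0-9a-f]{,4}', ' ', json_data)

        print(json_data)
        json_data = json.loads(json_data)
        print(json_data)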

If I'm getting it right, you need to remove unicode characters from a string?
– lch, Commented Apr 22, 2017 at 16:01
@LeonardoChirivì No, which is why I have explicitly mentioned that they are not ACTUAL unicode characters, but in form of strings in the data itself.
– burglarhobbit, Commented Apr 22, 2017 at 16:03

Kind Stranger · Accepted Answer · 2017-04-22 16:13:42Z

2

Really, we need an example of your code, but as a pointer, the regex I think you'll need is something like r'\\u[0-9a-f]{,4}'.

Here is an example of it in use:

>>> import re
>>> my_string='Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
>>> my_string
'Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
>>> re.sub(r'\\u[0-9a-f]{,4}',"",my_string)
'Hello  Created, reviewed, executed and maintained'

I would still like to see an example of your code so that we can provide a more accurate answer.
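
One caveat about the pattern above: {,4} means "zero to four", so it also matches a bare \u that has fewer than four hex digits after it, and [0-9a-f] does not cover uppercase digits. The sketch below compares it with a stricter variant. It assumes the escapes in the data always have exactly four hex digits in either case; the names LOOSE, STRICT and sample are made up for illustration.

```python
import re

# Pattern from the answer: "\u" followed by zero to four lowercase hex digits.
LOOSE = re.compile(r'\\u[0-9a-f]{,4}')
# Assumed stricter variant: exactly four hex digits, upper or lower case.
STRICT = re.compile(r'\\u[0-9a-fA-F]{4}')

# Sample text with a lowercase escape, an uppercase escape,
# and a backslash-u that is not an escape at all (a path).
sample = 'Hello \\u2022 Created, \\u00E9executed, C:\\users'

loose_result = LOOSE.sub(' ', sample)
strict_result = STRICT.sub(' ', sample)

print(loose_result)
print(strict_result)
```

With the loose pattern the uppercase escape is only partly removed and the \u in the path is eaten; the strict pattern removes both escapes and leaves the path alone. If the data really can contain shorter escapes, the original {,4} may still be the right choice.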

edited Apr 22, 2017 at 16:13

answered Apr 22, 2017 at 16:01

burglarhobbit Over a year ago

Yes it worked after I added the preceding 'r', thought it wasn't needed. I just added an example code of what I'm trying to do. If you could merge my code into a single regex expression, I would be grateful for that.
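
On merging the steps into one expression: a possible approach is to join the three patterns with | so that a single re.sub call handles non-ASCII characters, literal \n sequences and literal \uXXXX sequences. This is only a sketch, not a tested drop-in replacement for the code in the question. It assumes every match should become a single space and that the escapes have exactly four hex digits; CLEANUP, clean and raw are hypothetical names.

```python
import re

# Hypothetical combined pattern; alternatives are tried left to right:
#   [^\x00-\x7F]+        one or more non-ASCII characters
#   \\n                  a literal backslash followed by "n"
#   \\u[0-9a-fA-F]{4}    a literal backslash-u escape with four hex digits
CLEANUP = re.compile(r'[^\x00-\x7F]+|\\n|\\u[0-9a-fA-F]{4}')


def clean(text):
    """Replace every match of CLEANUP in text with a single space."""
    return CLEANUP.sub(' ', text)


# Literal backslash sequences plus one real non-ASCII character at the end.
raw = 'Hello \\u2022 Created,\\nreviewed, \\u00e9executed caf\u00e9'
cleaned = clean(raw)
print(cleaned)
```

The r prefix matters here for the same reason as in the answer: without it, Python processes the backslashes in the string literal before the regex engine ever sees the pattern. Compiling the pattern once with re.compile also avoids re-parsing it for every file in the loop.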
