My data contains strings that start with \u followed by four hexadecimal characters. They are not unicode objects but literal text in the data, which is why I want to clean them up. I would like to replace each occurrence with whitespace.

Example textfile: Hello \u2022 Created, reviewed, \u00e9executed and maintained

For example, the strings \u2022 and \u00e9 occur in the text above. I would like to find each \u and remove it along with the four characters that follow it (2022 and 00e9 here). I'm looking for a suitable regex for this pattern.

Example Code:

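import io
import json
import re
from glob import glob

files = glob('Candidate Profile Data/*')

for file_ in files:
    # io.open in text mode already returns decoded text, so no .decode() is needed
    with io.open(file_, 'r', encoding='us-ascii') as json_file:
        json_data = json_file.read()

    # Replace runs of non-ASCII characters with a space
    json_data = re.sub('[^\x00-\x7F]+', ' ', json_data)
    # Replace literal "\n" sequences (backslash followed by n) with a space
    json_data = json_data.replace('\\n', ' ')
    # Replace literal "\uXXXX" sequences (exactly four hex digits) with a space
    json_data = re.sub(r'\\u[0-9a-fA-F]{4}', ' ', json_data)

    print(json_data)
    json_data = json.loads(json_data)
    print(json_data)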
  • If I'm getting it right, you need to remove unicode characters from a string? Commented Apr 22, 2017 at 16:01
  • @LeonardoChirivì No, which is why I have explicitly mentioned that they are not ACTUAL unicode characters, but in form of strings in the data itself. Commented Apr 22, 2017 at 16:03

1 Answer

We really need an example of your code, but as a pointer, I think the regex you need is something like r'\\u[0-9a-f]{4}'. The {4} requires exactly four hex digits after \u, whereas {,4} would also match a \u followed by fewer digits, or by none at all.

Here is an example of it in use:

>>> import re
>>> my_string='Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
>>> my_string
'Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
>>> re.sub(r'\\u[0-9a-f]{4}',"",my_string)
'Hello  Created, reviewed, executed and maintained'

I would still like to see an example of your code so that we can provide a more accurate answer.
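
As a side note, here is a minimal sketch of why the quantifier matters. The sample string is made up for illustration (it is not from the question's data) and assumes the text might also contain a path that merely includes the characters \u:

```python
import re

# Hypothetical sample: one literal \uXXXX escape plus a path that contains "\u"
sample = r'Hello \u2022 Created in C:\users\docs'

# {,4} accepts zero to four hex digits, so the bare "\u" in the path also matches
loose = re.sub(r'\\u[0-9a-f]{,4}', ' ', sample)

# {4} requires exactly four hex digits, so only the real escape is replaced
strict = re.sub(r'\\u[0-9a-fA-F]{4}', ' ', sample)

print(loose)
print(strict)
```

With the looser pattern the path is damaged, while the stricter one leaves it alone. Adding A-F to the character class is an assumption that uppercase hex digits may also appear in the data.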

1 Comment

Yes, it worked after I added the preceding 'r'; I thought it wasn't needed. I have just added example code showing what I'm trying to do. If you could merge my code into a single regex, I would be grateful.
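
One possible way to merge the three substitutions from the question into a single regex is an alternation. This is only a sketch: the names CLEANUP and clean are hypothetical, the sample string is made up, and it assumes Python 3 and that every match should become a single space, as in the question's code.

```python
import re

# One alternation covering the three separate substitutions:
#   runs of non-ASCII characters | literal "\n" | literal "\uXXXX"
CLEANUP = re.compile(r'[^\x00-\x7F]+|\\n|\\u[0-9a-fA-F]{4}')


def clean(text):
    """Replace every match with a single space (hypothetical helper)."""
    return CLEANUP.sub(' ', text)


# The doubled backslashes are literal text; the final \u00e9 is a real non-ASCII character
sample = 'Hello \\u2022 Created,\\nreviewed, \\u00e9executed caf\u00e9'
print(clean(sample))
```

Compiling the pattern once avoids rebuilding it for every file. Note that a single pass can differ from three sequential passes in edge cases, for example when one replacement would create text that a later pattern matches.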
