I need to write a very simple function that reads a JSON file and writes some of its content to a CSV file. The trouble is that the input JSON file has an odd encoding, for example:

{
"content": "b\"Comment minimiser l'impact environnemental d\\xe8s la R&D des proc\\xe9d\\xe9s micro\\xe9lectroniques."
}

I would like to write back

Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.

The first problem is the leading 'b': the content should be read as a byte array, but it is read as a string. The second is how to replace the escaped characters such as \xe8 and \xe9. Thank you.

2 Comments
  • Did you generate the JSON? It was generated incorrectly. It would be better to fix the JSON at the source. Commented Feb 16, 2022 at 17:21
  • I can't regenerate the JSON, unfortunately. Commented Feb 17, 2022 at 13:09

1 Answer

You could use something like this:

import json

json_file_path = 'your_json_file.json'

with open(json_file_path, 'r', encoding='utf-8') as f:
    # Remove the stray "b\ prefix left over from the bytes repr
    raw = f.read().replace('"b\\', '')

# Parse the cleaned text as JSON
contents = json.loads(raw)

# Turn literal escapes such as \xe8 into the characters they represent
output = contents['content'].encode('utf-8').decode('unicode_escape')

print(output)
# Output
# Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
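
The approach above edits the raw file text before parsing. A slightly more defensive variant, offered only as a sketch, parses the JSON first and then cleans each value. The helper name decode_bytes_repr is hypothetical, and the sketch assumes the original bytes were either UTF-8 or Latin-1 text (the \xe8 in the sample is Latin-1 for è). Text that is valid under both encodings would be read as UTF-8, so check the result on your own data.

```python
import json


def decode_bytes_repr(value: str) -> str:
    """Hypothetical helper: turn a stringified bytes literal into text."""
    # Drop the leading b" or b' and the matching trailing quote, if present.
    if value[:2] in ('b"', "b'"):
        quote = value[1]
        value = value[2:]
        if value.endswith(quote):
            value = value[:-1]
    # unicode_escape reads its input as Latin-1, so encode that way;
    # backslashreplace protects any character outside Latin-1.
    text = value.encode("latin-1", "backslashreplace").decode("unicode_escape")
    try:
        # Assumption: if the escaped bytes form valid UTF-8, decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Otherwise keep each escaped byte as a Latin-1 character.
        return text


# Shortened version of the sample from the question.
raw = r'{"content": "b\"d\\xe8s la R&D des proc\\xe9d\\xe9s"}'
cleaned = decode_bytes_repr(json.loads(raw)["content"])  # 'dès la R&D des procédés'
```

Working on the parsed value also copes with entries that do have a closing quote, which a plain text replace on the file would leave behind.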

3 Comments

Thank you for your answer. The replace part works and removes the "b\ part, but I still have the weird characters. output = contents['content'].encode('utf-8').decode('unicode_escape') gives me this error: UnicodeEncodeError: 'charmap' codec can't encode character '\x92'. So I had to change it to content.encode('utf-8').decode('unicode_escape'), but the output is: Comment minimiser l'impact environnemental d\xe8s la R&D des proc\xe9d\xe9s micro\xe9lectroniques
@AudVid It seems you have some odd encoding in other lines of the file. Try using my updated answer. If you are generating the JSON yourself, make sure to encode it as UTF-8 when writing it as well.
It works! Thank you very much. I tried so many things, but apparently not this combination.
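
For the CSV half of the question, which the thread does not cover, a minimal sketch might look like the following. The function name json_content_to_csv and the one-column layout are assumptions for illustration, and the file is assumed to hold a single JSON object like the sample. Opening the output with an explicit encoding is one way to avoid 'charmap' errors that come from a platform default encoding such as cp1252.

```python
import csv
import json


def json_content_to_csv(json_path, csv_path):
    """Hypothetical helper: copy the cleaned 'content' field into a CSV file."""
    with open(json_path, "r", encoding="utf-8") as src:
        # Same cleanup as in the answer: drop the stray "b\ prefix.
        record = json.loads(src.read().replace('"b\\', ""))
    # Turn literal escapes such as \xe8 into the characters they stand for.
    content = (
        record["content"]
        .encode("latin-1", "backslashreplace")
        .decode("unicode_escape")
    )
    # newline="" lets the csv module control line endings; the explicit
    # encoding avoids falling back to the platform default.
    with open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["content"])  # header row (assumed layout)
        writer.writerow([content])
```

Calling json_content_to_csv('your_json_file.json', 'output.csv') would then write a header row followed by the decoded sentence; the output file name here is only a placeholder.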
