Python 3: JSON File Load with Non-ASCII Characters

Question

I'm trying to load this JSON file (with non-ASCII characters) as a Python dictionary with Unicode encoding, but I still get this error:

return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)

JSON file content = "tooltip":{ "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",}

import sys  
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8','replace')))

json.loads has an argument encoding. What is the real content of the pt-PT.json file: are there lines of valid JSON data, or is it one long JSON file? In the latter case it would be better to load it directly as a file, not line by line.
– Jan Vlcinsky, Commented Apr 8, 2016 at 17:19
The string you show as JSON file content is not valid JSON; it is only a fragment of a larger object.
– Jan Vlcinsky, Commented Apr 8, 2016 at 17:21
I tried loading it as a file too, but the same error is shown.
– min2bro, Commented Apr 8, 2016 at 17:29
Try validating the JSON file with a JSON validator first. There are online tools and some command-line ones.
– Jan Vlcinsky, Commented Apr 8, 2016 at 17:32
Check the modified question now. It's due to some line in the JSON file, and I'm not sure how to fix it.
– min2bro, Commented Apr 8, 2016 at 18:08

tdelaney · Accepted Answer · 2016-04-08 18:33:02Z

11

You have several problems, as near as I can tell. The first is the file encoding. When you open a file in text mode without specifying an encoding, Python uses the locale's preferred encoding, that is, whatever locale.getpreferredencoding(False) returns. Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Because of your error message, I suspect that the file was opened with an ASCII encoding.
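A quick way to see which encoding open() would pick on a given machine is to ask the locale module. The sketch below is only illustrative: the temporary file and its content are made up, standing in for the real pt-PT.json.

```python
import json
import locale
import os
import tempfile

# Encoding that text-mode open() falls back to when none is given
# (platform and locale dependent).
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Hypothetical content standing in for the real file.
payload = {"tooltip": {"navbar": "Operações de grupo"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pt-PT.json")
    # Write the file as UTF-8 without escaping the non-ASCII characters.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False)
    # Read it back with an explicit encoding, so the result does not
    # depend on the machine's locale.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == payload)
```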

Next, the file is decoded from its encoding (UTF-8 here) into Python strings as it is read by the file object. Each line has already been decoded to a str and is ready for json to read. When you do line.encode('utf-8','replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") could not handle before Python 3.6.
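To make the str versus bytes distinction concrete, here is a small sketch. The sample line is invented for illustration; it stands for one complete JSON object as a text-mode file would hand it over.

```python
import json

# A line as returned by a text-mode file: already a str.
line = '{"navbar": "Operações de grupo"}'

# Encoding it again only turns it back into bytes; nothing is gained.
encoded = line.encode("utf-8", "replace")
print(type(line).__name__, type(encoded).__name__)

# The str can go straight into json.loads.
data = json.loads(line)
print(data["navbar"])
```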

Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid JSON on its own, but it does look like one line of a pretty-printed JSON file containing a single JSON object. My guess is that you should read the entire file as one JSON object.
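The difference between a fragment and a complete object can be checked directly. In this sketch, wrapping the fragment in braces is an assumption about what the surrounding file looks like, not something taken from the actual file.

```python
import json

fragment = '"tooltip":{ "navbar":"Operações de grupo"}'

# On its own the fragment is rejected by the parser.
try:
    json.loads(fragment)
    fragment_is_valid = True
except json.JSONDecodeError:
    fragment_is_valid = False

# Wrapped in braces (assumed shape of the enclosing object), it parses.
whole = json.loads("{" + fragment + "}")
print(fragment_is_valid, whole)
```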

Putting it all together you get:

import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)

From its name, it's possible that this file is encoded in a Portuguese code page instead. If so, the "cp860" (DOS Portuguese) or "cp1252" (Windows Western European) encoding may work better.
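If the encoding really is unknown, one option is to try UTF-8 first and fall back to a code page. The helper below is hypothetical (its name and the candidate list are assumptions for illustration), and it works on raw bytes so that the decoding step is explicit.

```python
import json

def load_json_bytes(raw, encodings=("utf-8", "cp860")):
    """Hypothetical helper: parse JSON from bytes using the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        return json.loads(text), enc
    raise ValueError("none of the candidate encodings could decode the data")

sample = '{"navbar": "Operações de grupo"}'
utf8_result = load_json_bytes(sample.encode("utf-8"))
cp860_result = load_json_bytes(sample.encode("cp860"))
print(utf8_result)
print(cp860_result)
```

UTF-8 goes first on purpose: single-byte code pages accept almost any byte sequence, so trying them first would silently produce wrong characters instead of an error.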

edited Apr 8, 2016 at 18:33

answered Apr 8, 2016 at 18:27

tdelaney


3 Comments

min2bro Over a year ago

It's not because of the Portuguese content but due to JSON file content = "tooltip":{ "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",}

tdelaney Over a year ago

I see you've changed the string causing problems in your question from one that had non-ASCII characters. The new string doesn't contain a 0xc3 byte (the first byte of many two-byte UTF-8 sequences), so I don't see how it can produce the "can't decode byte 0xc3" error. Regardless, that string isn't valid JSON but does look like a fragment of valid JSON. Are you saying that the entire file contains just that one line?
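The 0xc3 byte is easy to reproduce. The sketch below uses the Portuguese word from the earlier fragment and the ASCII-only string from the revised question; only the first contains that byte.

```python
# "ç" and "õ" each become two bytes in UTF-8, and both start with 0xc3.
for ch in "çõ":
    print(ch, ch.encode("utf-8"))

portuguese = "Operações".encode("utf-8")
revised = r'"Sort\"{0}\"byThisRow"'.encode("utf-8")

# Decoding the Portuguese bytes as ASCII fails at the first 0xc3 byte.
try:
    portuguese.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(ascii_error)

# The string from the revised question is plain ASCII and decodes fine.
print(revised.decode("ascii"))
```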

Xiaofu Over a year ago

This was a life saver when having problems with a script on a Chinese colleague's Windows machine that was fine on Macs.

Giovanni Rescia · Answer · 2016-04-08 17:25:29Z

0

I had the same problem. What worked for me was creating a regular expression and cleaning every line from the JSON file with it:

import re
# Replace every run of characters outside this whitelist with a single space.
REGEXP = r"[^A-Za-z0-9':.;\-?!]+"
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()

answered Apr 8, 2016 at 17:25

Giovanni Rescia


1 Comment

tdelaney Over a year ago

This strips all non-English characters which is likely not what OP wants.
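A short check shows what the substitution does to a line like the ones in this file. The sample line is made up for illustration; the pattern is the one from the answer above.

```python
import re

REGEXP = r"[^A-Za-z0-9':.;\-?!]+"
line = '"navbar": "Operações de grupo"'

# Quotes and accented letters are all outside the whitelist,
# so they are replaced by spaces.
cleaned = re.sub(REGEXP, " ", line).strip()
print(cleaned)
```

Besides losing the accented letters, the result no longer contains the double quotes, so it is not JSON anymore.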

Jan Vlcinsky · Answer · 2016-04-08 17:26:32Z

0

Having a file with content similar to yours I can read the file in one simple shot:

>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
...     data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}

answered Apr 8, 2016 at 17:26

Jan Vlcinsky


2 Comments

min2bro Over a year ago

After a lot of analysis I found that it gives this error because of this data in the JSON file:

min2bro Over a year ago

"dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
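One thing that does stand out in that line is the comma before the closing brace, which JSON does not allow. The sketch below wraps the fragment in braces (an assumption about the enclosing file) and compares it with a copy that has the comma removed. Note that a trailing comma produces a JSONDecodeError, not the UnicodeDecodeError from the question.

```python
import json

# Raw strings keep the \" escapes exactly as they appear in the file.
with_trailing_comma = r'{"tooltip":{ "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",}}'
without_trailing_comma = r'{"tooltip":{ "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow"}}'

try:
    json.loads(with_trailing_comma)
    trailing_comma_ok = True
except json.JSONDecodeError as exc:
    trailing_comma_ok = False
    print("rejected:", exc.msg)

parsed = json.loads(without_trailing_comma)
print(parsed["tooltip"]["dxPivotGrid-sortRowBySummary"])
```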

Magnun Leno · Answer · 2016-04-08 17:30:17Z

0

You don't need to read each line. You have two options:

import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))

Or, you can load all lines and pass them to the json module:

import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))

Obviously, the first suggestion is the best.

answered Apr 8, 2016 at 17:30

Magnun Leno

