
I am reading a JSON file in Python which has lots of fields and values (~8000 records). Environment: Windows 10, Python 3.6.4. Code:

import json
json_data = json.load(open('json_list.json'))
print(json_data)

With this I get an error. Below is the stack trace:

  json_data = json.load(open('json_list.json'))
  File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>

Along with this, I have tried:

import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
    print(json_data)

With this, my program runs for a long time and then hangs with no output.

I have searched almost all topics related to this and could not find a solution.

Note: The JSON data itself is valid; when I view it in Postman or any other REST client, it doesn't report any anomalies.

Any help with this, or an alternative way to load my JSON data (even by converting it to a string and back to JSON, etc.), would be greatly appreciated.

Here is what the file looks like around the reported error:

>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
 b'":"2017-04-10","storage_size_gb":"84.747')
  • Something is wrong if you are somehow ending up decoding cp1252; JSON is specifically UTF-8. Troubleshooting is hard, as the Python traceback doesn't show you the problematic data. If you can use try/except you can at least print the problematic input as a first step towards debugging this (see the sketch after these comments), but with a large input, just waiting for it to reproduce is slow and painful. Commented Jan 15, 2018 at 8:56
  • Thanks for the response. Is there any way I can change the data file to some other extension, then read it and convert it back to JSON? Commented Jan 15, 2018 at 9:11
  • Also, how can I use try/except? Do you want me to check other encoding formats? If you could help, it would be great. Commented Jan 15, 2018 at 9:13
  • Python doesn't care what the file name is. If you don't try to decode it as JSON, you can do something else with it first, or instead. Commented Jan 15, 2018 at 9:30
  • Okay, so what could the solution be? Commented Jan 15, 2018 at 9:32
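
A minimal sketch of the try/except idea from the comments above, using the json_list.json filename from the question: read the raw bytes, attempt a UTF-8 decode, and on failure print the offending offset along with some surrounding context.

import json

# Read the raw bytes so the exact failing offset can be inspected.
with open('json_list.json', 'rb') as fd:
    raw = fd.read()

try:
    json_data = json.loads(raw.decode('utf-8'))
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the first undecodable byte;
    # print some context around it as a first debugging step.
    print(exc)
    print(raw[max(exc.start - 50, 0):exc.start + 50])

If the decode succeeds, the file is valid UTF-8 and the problem lies elsewhere (for example mojibake, as the answer below explains).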

1 Answer


The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.

Of course, "converting" data which is already UTF-8, while telling the computer that it's Latin-1, just produces mojibake.

The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
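
This is easy to verify in a REPL; the round trip reproduces the byte sequence from the file exactly:

>>> 'INGLÉS'.encode('utf-8')
b'INGL\xc3\x89S'
>>> # Wrongly decode those UTF-8 bytes as Latin-1, then re-encode as UTF-8:
>>> 'INGLÉS'.encode('utf-8').decode('latin-1').encode('utf-8')
b'INGL\xc3\x83\xc2\x89S'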

Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.

Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding. However, an error this far into the file makes me suspect it's a local problem with just one or a few records -- maybe they were merged from multiple input files, only one of which had this error. In that case, fixing it requires a fair bit of detective work and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply remove the erroneous records by hand.
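
If the entire file did turn out to be uniformly double-encoded, undoing it would be a one-liner. Here is a minimal sketch, assuming Latin-1 as the intermediate encoding and the file name from the question; if only some records are affected, one of the steps will fail partway with an exception:

import json

with open('json_list.json', 'rb') as fd:
    raw = fd.read()

# Undo one spurious round of re-encoding: decode the UTF-8 bytes,
# map each resulting code point back to the Latin-1 byte it came from,
# then decode the recovered bytes as the UTF-8 they originally were.
repaired = raw.decode('utf-8').encode('latin-1').decode('utf-8')
json_data = json.loads(repaired)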


2 Comments

So can I convert it back to Latin-1 and then parse the data? Or are there other alternatives?
Updated the answer with an elaboration on this topic.
