
I am reading a JSON file in Python which has lots of fields and values (~8000 records). Environment: Windows 10, Python 3.6.4. Code:

import json
json_data = json.load(open('json_list.json'))
print(json_data)

With this I get an error. Below is the stack trace:

  json_data = json.load(open('json_list.json'))
  File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>

Along with this, I have tried:

import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
    print(json_data)

With this, my program runs for a long time and then hangs with no output.

I have searched almost all topics related to this and could not find a solution.

Note: The JSON data itself is valid; when I view it in Postman or any other REST client, it doesn't report any anomalies.

Any help with this, or an alternative way to load my JSON data (even by converting it to a string and back to JSON, etc.), would be greatly appreciated.

Here is what the file looks like around the reported error:

>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
 b'":"2017-04-10","storage_size_gb":"84.747')
  • Something is wrong if you are somehow ending up decoding cp1252; JSON is specifically UTF-8. Troubleshooting is hard, as the Python traceback doesn't show you the problematic data. If you can use try/except you can at least print the problematic input as a first step towards debugging this (see the sketch after these comments), but with a large input, just waiting for it to reproduce is slow and painful. Commented Jan 15, 2018 at 8:56
  • Thanks for the response. Is there any way I can change the data file to some other extension, then read it and convert it back to JSON? Commented Jan 15, 2018 at 9:11
  • Also, how can I use try/except? Do you want me to check other encoding formats? If you could help, it would be great. Commented Jan 15, 2018 at 9:13
  • Python doesn't care what the file name is. If you don't try to decode it as JSON, you can do something else with it first, or instead. Commented Jan 15, 2018 at 9:30
  • Okay, so what could the solution be? Commented Jan 15, 2018 at 9:32
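
A minimal sketch of the try/except idea from the comments above, using the json_list.json filename from the question: read the raw bytes, attempt a UTF-8 decode, and on failure print the offending offset along with some surrounding context.

import json

# Read the raw bytes so the exact failing offset can be inspected.
with open('json_list.json', 'rb') as fd:
    raw = fd.read()

try:
    json_data = json.loads(raw.decode('utf-8'))
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the first undecodable byte;
    # print some context around it as a first debugging step.
    print(exc)
    print(raw[max(exc.start - 50, 0):exc.start + 50])

If the decode succeeds, the file is valid UTF-8 and the problem lies elsewhere (for example mojibake, as the answer below explains).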

1 Answer


The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.

Of course, "converting" data which is already UTF-8, while telling the computer that it's Latin-1, just produces mojibake.

The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
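
This is easy to verify in a REPL; the round trip reproduces the byte sequence from the file exactly:

>>> 'INGLÉS'.encode('utf-8')
b'INGL\xc3\x89S'
>>> # Wrongly decode those UTF-8 bytes as Latin-1, then re-encode as UTF-8:
>>> 'INGLÉS'.encode('utf-8').decode('latin-1').encode('utf-8')
b'INGL\xc3\x83\xc2\x89S'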

Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.

Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding. However, an error this far into the file makes me suspect it's a local problem with just one or a few records -- maybe they were merged from multiple input files, only one of which had this error. In that case, fixing it requires a fair bit of detective work and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply remove the erroneous records by hand.
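
If the entire file did turn out to be uniformly double-encoded, undoing it would be a one-liner. Here is a minimal sketch, assuming Latin-1 as the intermediate encoding and the file name from the question; if only some records are affected, one of the steps will fail partway with an exception:

import json

with open('json_list.json', 'rb') as fd:
    raw = fd.read()

# Undo one spurious round of re-encoding: decode the UTF-8 bytes,
# map each resulting code point back to the Latin-1 byte it came from,
# then decode the recovered bytes as the UTF-8 they originally were.
repaired = raw.decode('utf-8').encode('latin-1').decode('utf-8')
json_data = json.loads(repaired)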


2 Comments

So can I convert it back to Latin-1 and then parse the data? Or are there other alternatives?
Updated the answer with an elaboration on this topic.
