Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Question

Similar to this question, but my CSV has a slightly different format. Here is an example:

id,employee,details,createdAt  
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"  
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"

I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:

id  employee                details    createdAt              Unnamed: 1  Unnamed: 2 
 1      John        {Country":"USA"  Salary:5000           Review:null}"  2018-09-01 
 2     Sarah  {Country":"Australia"  Salary:6000  Review:"Hardworking"}"  2018-09-05

My desired output:

id  employee                                                       details   createdAt
 1      John                 {"Country":"USA","Salary":5000,"Review":null}  2018-09-01 
 2     Sarah  {"Country":"Australia","Salary":6000,"Review":"Hardworking"}  2018-09-05

I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?

The problem is not with the quotes, it is with the commas: while reading the CSV, all entries separated by a comma are treated as the next column.
– Gahan, Commented Sep 8, 2018 at 7:36
@Gahan single columns in a CSV can contain commas. The issue is probably the enclosing " on the string, which causes the commas to be interpreted as new columns rather than as part of a dictionary structure.
– roganjosh, Commented Sep 8, 2018 at 7:38
@roganjosh, I tried it, and the structure is also responsible: the quotes enclose "{", then Country is left unquoted, then ":" is quoted, then USA" is followed by a comma, which is interpreted as the start of the next column value.
– Gahan, Commented Sep 8, 2018 at 7:41
I suspect it can only be solved with regex, which rules me out of helping, sorry :/
– roganjosh, Commented Sep 8, 2018 at 7:42
Instead of trying to parse this, you should avoid mixing two badly interacting metaformats (CSV, JSON) when writing the data in the first place. Just use JSON all the way as a default. If you must use this, you need to escape the quotes.
– Ulrich Eckhardt, Commented Sep 8, 2018 at 7:48
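
The escaping suggested in the last comment can be illustrated with a minimal sketch using Python's standard csv module. It assumes you control how the file is written; the in-memory buffer and the single sample row are illustrative only. With its default settings, csv.writer doubles any quotation mark inside a quoted field, and a reader that follows the same convention gets the JSON back as one column:

```python
import csv
import io
import json

# Illustrative data: one header row and one record with a JSON column.
rows = [
    ["id", "employee", "details", "createdAt"],
    [1, "John",
     json.dumps({"Country": "USA", "Salary": 5000, "Review": None}),
     "2018-09-01"],
]

# Write the rows; the default dialect doubles inner quotation marks.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Read the text back: the JSON column survives as a single field.
parsed = list(csv.reader(io.StringIO(text)))
details = json.loads(parsed[1][2])
print(details)
```

pandas.read_csv uses the same doubled-quote convention by default (its doublequote parameter), so a file written this way should load without any manual splitting.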

Martin Evans · Accepted Answer · 2018-09-08 11:13:20Z

As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the dataframe. This works by splitting each row both forwards and backwards to get the non-problematic columns and then taking the remaining part:

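import pandas as pd

data = []

with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        # Split off the first two columns (id, employee) from the left
        split = row.split(',', 2)
        # Split the last column (createdAt) off from the right and
        # drop the quotation marks surrounding each remaining cell
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

# The first parsed row holds the column names
df = pd.DataFrame(data[1:], columns=data[0])
print(df)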

This would display your data as:

  id employee                                            details   createdAt
0  1     John      {"Country":"USA","Salary":5000,"Review":null}  2018-09-01
1  2    Sarah  {"Country":"Australia", "Salary":6000,"Review"...  2018-09-05
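
As a rough check that the split-from-both-ends idea leaves valid JSON in the details column, here is a self-contained sketch that applies the same logic to the two sample rows from the question and decodes the result with json.loads. The helper name split_row and the in-memory sample list are hypothetical, introduced only for illustration. It assumes, as the answer does, that neither the first two columns nor the last one contain a comma:

```python
import json

def split_row(line):
    """Split one malformed line into [id, employee, details, createdAt]."""
    head = line.strip().split(',', 2)   # id, employee, remainder
    tail = head[-1].rsplit(',', 1)      # details, createdAt
    return head[:2] + [cell.strip('"') for cell in tail]

# Sample lines copied from the question.
sample = [
    'id,employee,details,createdAt',
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"',
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"',
]

header, *records = [split_row(line) for line in sample]

# Decode the details column into dictionaries.
details = [json.loads(record[2]) for record in records]
print(header)
print(details)
```

The header and records lists can then be passed to pd.DataFrame(records, columns=header) in the same way as in the answer's code.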

1 Comment

hotchocolate Over a year ago

Yes I'm doing something similar to that: read the file as raw text, then do modification so that the JSON format is readable by pandas. Thanks!

Richard Rublev · Accepted Answer · 2018-09-08 07:47:37Z

I have reproduced your file with:

import pandas as pd

df = pd.read_csv('e1.csv', index_col=None)
print(df)

Output

     id    emp                                            details      createdat
0   1   john    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01" 
1   2  sarah  "{"Country":"Australia", "Salary":6000,"Review...   "2018-09-05"

3 Comments

Gahan Over a year ago

could you specify what more tricks you did, tried your solution and it throws exception: ParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6

hotchocolate Over a year ago

Hi, I tried using header=None but it still gives me the same result.

Jarad Over a year ago

index_col is None by default so your above solution is equivalent to what the OP has already tried.

Jarad · Accepted Answer · 2018-09-08 10:07:41Z

I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.

Here is a less than optimal option:

df = pd.read_csv('s083838383.csv', sep='@#$%^', engine='python')
header = df.columns[0]
print(df)

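Why sep='@#$%^'? This is just a garbage separator that never matches anything in the file, so each line is read into a single column. Any string that does not occur in the data would do; because it is longer than one character, it is treated as a regular expression, which is why engine='python' is specified. It is used only as a means to import the data into a df object to work with.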

df looks like this:

                       id,employee,details,createdAt
0  1,John,"{"Country":"USA","Salary":5000,"Review...
1  2,Sarah,"{"Country":"Australia", "Salary":6000...

Then you could use str.extract to apply regex and expand the columns:

result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)

result.columns = header.strip().split(',')
print(result)

result is:

  id employee                                            details     createdAt
0  1     John    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01"
1  2    Sarah  "{"Country":"Australia", "Salary":6000,"Review...  "2018-09-05"

If you need the starting and ending quotes stripped off of the details string values, you could do:

result['details'] = result['details'].str.strip('"')

If the items in the details column need to be dicts instead of strings, you could do:

from json import loads
result['details'] = result['details'].apply(loads)
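
For comparison, the same regex idea can be sketched with Python's re module alone. This is a variation on the pattern above, not the pattern itself: it is anchored to the whole line and places the surrounding quotation marks outside the capture groups, so no separate stripping step is needed. The names LINE_PATTERN and sample are hypothetical, and the sketch assumes that the id, employee and createdAt values contain no commas:

```python
import json
import re

# Four groups: id, employee, the JSON object, and the date.
# The quotation marks around the last two fields sit outside the groups.
LINE_PATTERN = re.compile(r'^([^,]+),([^,]+),"(\{.*\})","?([^",]+)"?$')

# Sample lines copied from the question.
sample = [
    'id,employee,details,createdAt',
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"',
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"',
]

header = sample[0].split(',')
rows = []
for line in sample[1:]:
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"Unexpected line format: {line!r}")
    record = list(match.groups())
    record[2] = json.loads(record[2])  # details as a dict
    rows.append(record)

print(header)
print(rows)
```

Raising on a non-matching line is a design choice for this sketch: it surfaces rows that do not follow the expected layout instead of silently producing missing values.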
