Similar to this question, but my CSV has a slightly different format. Here is an example:

id,employee,details,createdAt  
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"  
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"

I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:

id  employee                details    createdAt              Unnamed: 1  Unnamed: 2 
 1      John        {Country":"USA"  Salary:5000           Review:null}"  2018-09-01 
 2     Sarah  {Country":"Australia"  Salary:6000  Review:"Hardworking"}"  2018-09-05

My desired output:

id  employee                                                       details   createdAt
 1      John                 {"Country":"USA","Salary":5000,"Review":null}  2018-09-01 
 2     Sarah  {"Country":"Australia","Salary":6000,"Review":"Hardworking"}  2018-09-05 

I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?

  • The problem is not with the quotes but with the commas: when reading a CSV, every comma-separated entry is treated as the next column. Commented Sep 8, 2018 at 7:36
  • @Gahan single columns in a CSV can contain commas. The issue probably is the enclosing " on the string, causing the commas to be interpreted as new columns rather than part of a dictionary structure Commented Sep 8, 2018 at 7:38
  • @roganjosh, I tried it. The structure itself is responsible: the quote encloses "{", then Country is unquoted, then ":" is quoted, then USA" follows, and the comma after it is interpreted as the start of the next column value. Commented Sep 8, 2018 at 7:41
  • I suspect it can only be solved with regex, which rules me out of helping sorry :/ Commented Sep 8, 2018 at 7:42
  • Instead of trying to parse this, you should rather not use a mix of two badly-interacting metaformats (CSV, JSON) to write the data in the first place. Just use JSON all the way as a default. If you must use this, you need to escape quotes. Commented Sep 8, 2018 at 7:48
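
The quote escaping mentioned in the last comment can be sketched as follows. This is a minimal example that assumes you control how the file is written; the rows and the in-memory buffer are illustrative stand-ins for the real data and file. Python's csv.writer doubles any embedded quotation marks by default, and pandas.read_csv then reads the JSON column back without extra parameters.

```python
import csv
import io
import json

import pandas as pd

# Illustrative rows mirroring the question's data (assumed, not read from a file).
rows = [
    (1, "John", {"Country": "USA", "Salary": 5000, "Review": None}, "2018-09-01"),
    (2, "Sarah", {"Country": "Australia", "Salary": 6000, "Review": "Hardworking"}, "2018-09-05"),
]

# Write the CSV into an in-memory buffer; the default dialect quotes fields
# containing commas or quotes and doubles each embedded quotation mark.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "employee", "details", "createdAt"])
for id_, employee, details, created_at in rows:
    writer.writerow([id_, employee, json.dumps(details), created_at])

escaped_csv = buffer.getvalue()

# A correctly escaped file needs no special handling when read back.
df = pd.read_csv(io.StringIO(escaped_csv))
df["details"] = df["details"].apply(json.loads)
print(df)
```

With the quotes doubled on the way out, the details column arrives as one field per row and can be decoded with json.loads.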

3 Answers

0

As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the dataframe. This works by splitting each row from both the front and the back to get the non-problematic columns, and then taking the remaining part as the JSON column:

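import pandas as pd

data = []

with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        # Split off the first two columns (id, employee) from the left.
        split = row.split(',', 2)
        # Split the last column (createdAt) off from the right and
        # strip the enclosing quotes from both remaining cells.
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

# The first parsed row is the header.
df = pd.DataFrame(data[1:], columns=data[0])
print(df)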

This would display your data as:

  id employee                                            details   createdAt
0  1     John      {"Country":"USA","Salary":5000,"Review":null}  2018-09-01
1  2    Sarah  {"Country":"Australia", "Salary":6000,"Review"...  2018-09-05
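
As a variation on the same idea, the split logic can be wrapped in a small function and the details column decoded into dicts. This is only a sketch: parse_line is a hypothetical helper name, the sample text is inlined instead of being read from e1.csv, and it assumes that id, employee, and createdAt never contain commas.

```python
import io
import json

import pandas as pd


def parse_line(line):
    """Split one raw line into [id, employee, details, createdAt]."""
    line = line.strip()
    head = line.split(",", 2)                      # id, employee, rest
    details, created_at = head[2].rsplit(",", 1)   # JSON part, date part
    return head[:2] + [details.strip().strip('"'), created_at.strip().strip('"')]


# Inline copy of the question's sample data (stands in for the real file).
sample = io.StringIO(
    'id,employee,details,createdAt\n'
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"\n'
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"\n'
)

rows = [parse_line(line) for line in sample if line.strip()]
df = pd.DataFrame(rows[1:], columns=rows[0])

# Decode the JSON strings into Python dicts.
df["details"] = df["details"].apply(json.loads)
print(df)
```

All values other than details stay as strings here; convert id or createdAt afterwards if typed columns are needed.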

1 Comment

Yes I'm doing something similar to that: read the file as raw text, then do modification so that the JSON format is readable by pandas. Thanks!
0

I reproduced your file with:

import pandas as pd

df = pd.read_csv('e1.csv', index_col=None)
print(df)

Output

     id    emp                                            details      createdat
0   1   john    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01" 
1   2  sarah  "{"Country":"Australia", "Salary":6000,"Review...   "2018-09-05"

3 Comments

could you specify what more tricks you did, tried your solution and it throws exception: ParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
Hi, I tried using header=None but it still gives me the same result.
index_col is None by default so your above solution is equivalent to what the OP has already tried.
0

I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.

Here is a less than optimal option:

df = pd.read_csv('s083838383.csv', sep='@#$%^', engine='python')
header = df.columns[0]
print(df)

Why sep='@#$%^'? It is just a garbage string that never matches anything in the file, so each line is read whole into a single column. Any string that does not occur in the data would do; it is only a means of getting the data into a df object to work with.
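
If the placeholder separator feels fragile, the same one-column frame can be built from the raw lines directly. This is a sketch under the assumption that the file fits in memory; the inlined sample text stands in for the real file.

```python
import io

import pandas as pd

# Inline copy of the question's sample data (stands in for the real file).
sample = io.StringIO(
    'id,employee,details,createdAt\n'
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"\n'
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"\n'
)

# Keep every line whole: the first becomes the column name, the rest the rows.
lines = sample.read().splitlines()
header = lines[0]
df = pd.DataFrame({header: lines[1:]})
print(df)
```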

df looks like this:

                       id,employee,details,createdAt
0  1,John,"{"Country":"USA","Salary":5000,"Review...
1  2,Sarah,"{"Country":"Australia", "Salary":6000...

Then you could use str.extract to apply regex and expand the columns:

result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)

result.columns = header.strip().split(',')
print(result)

result is:

  id employee                                            details     createdAt
0  1     John    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01"
1  2    Sarah  "{"Country":"Australia", "Salary":6000,"Review...  "2018-09-05"

If you need the starting and ending quotes stripped off of the details string values, you could do:

result['details'] = result['details'].str.strip('"')

If the details values need to be dicts instead of strings, you could do:

from json import loads
result['details'] = result['details'].apply(loads)
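
The steps of this answer can also be combined into one self-contained sketch. It makes a few assumptions: the sample text is inlined rather than read from a file, the lines are loaded as a plain Series instead of through read_csv, and the pattern is a slightly adjusted variant that drops the enclosing quotes during extraction, so no separate strip step is needed.

```python
import io
import json

import pandas as pd

# Inline copy of the question's sample data (stands in for the real file).
raw = (
    'id,employee,details,createdAt\n'
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"\n'
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"\n'
)

lines = pd.Series(raw.splitlines())
header = lines.iloc[0]
body = lines.iloc[1:].reset_index(drop=True)

# Groups: id, employee, the JSON object without its enclosing quotes,
# and createdAt without its optional enclosing quotes.
pattern = r'^(.+?),(.+?),"(\{.+\})","?(.+?)"?\s*$'
result = body.str.extract(pattern, expand=True)
result.columns = header.strip().split(",")

# Decode the JSON strings into Python dicts.
result["details"] = result["details"].apply(json.loads)
print(result)
```

Like the original pattern, this relies on the details column being the only one that starts with "{ and ends with }".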
