I need to convert a CSV file to JSON, but the output is not what I expect. The address fields (line1, line2, etc.) are repeated in the JSON output, and I need to remove the repeated part.

Input data

7,priya,kannan,[email protected],07-12-1994,"123","456",67,mdu,tn,india
7,priya,kannan,[email protected],07-12-1994,"123","456",67,mdu,tn,india

Expected output

[ {
    "source_id": 7,
    "fname": "priya",
    "lname": "kannan",
    "date_of_birth": "07-12-1994",
    "email": ["[email protected]", "[email protected]"],
    "address": [{
        "line1": 123,
        "line2": 456,
        "line3": 67,
        "city": "mdu",
        "state": "tn",
        "country": "india"
    }]
}]

Actual output

[ {
    "source_id": 7,
    "fname": "priya",
    "lname": "kannan",
    "date_of_birth": "07-12-1994",
    "email": ["[email protected]", "[email protected]"],
    "address": [{
        "line1": 123,
        "line2": 456,
        "line3": 67,
        "city": "mdu",
        "state": "tn",
        "country": "india"
    }, {
        "line1": 123,
        "line2": 456,
        "line3": 67,
        "city": "mdu",
        "state": "tn",
        "country": "india"
    }]
}]

Code tried

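import pandas as pd

# path: location of the input CSV file
g_cols = ['source_id', 'fname', 'lname', 'email', 'date_of_birth']
df = pd.read_csv(path, sep=",", header=0)

# every column that is not an identifying column belongs to the address
cols = df.columns[~df.columns.isin(g_cols)]
g_cols.remove('email')

# replace each row's email with the tuple of all unique emails for that person
df = (df.sort_values(g_cols)
      .set_index(g_cols)
      .assign(email=df.groupby(g_cols)['email'].agg(lambda x: tuple(pd.unique(x))))
      .reset_index())

g_cols.append('email')
# nest the address columns as a list of records per person
df1 = df.groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(name='address').to_dict('records')
print(df1)
df2 = pd.DataFrame(df1)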

2 Answers

In the last step, call drop_duplicates() before the groupby so that identical rows collapse into one before the address records are built:

df1 = df.drop_duplicates().groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(name='address').to_dict('records')

Output of df1:

[{'source_id': 7,
  'fname': 'priya',
  'lname': 'kannan',
  'date_of_birth': '07-12-1994',
  'email': ('[email protected]', '[email protected]'),
  'address': [{'ln1': 123,
    'ln2': 456,
    'ln3': 67,
    'cty': 'mdu',
    'state': 'tn',
    'cntry': 'india'}]}]
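
Why this works: once every row carries the same tuple of emails, two CSV rows that differed only in their email become fully identical, and each of them contributes one address record. The sketch below reproduces that effect on the sample data. The header row and the two email addresses are assumptions made for illustration (the question shows neither), and the merge is just one simple way to attach the tuple to each row, not the code from the question.

```python
import io
import pandas as pd

# Hypothetical header and placeholder emails: the question does not show them.
csv_text = (
    "source_id,fname,lname,email,date_of_birth,line1,line2,line3,city,state,country\n"
    '7,priya,kannan,a@example.com,07-12-1994,"123","456",67,mdu,tn,india\n'
    '7,priya,kannan,b@example.com,07-12-1994,"123","456",67,mdu,tn,india\n'
)
df = pd.read_csv(io.StringIO(csv_text))

key_cols = ['source_id', 'fname', 'lname', 'date_of_birth']

# One tuple of unique emails per person.
emails = (df.groupby(key_cols)['email']
            .agg(lambda x: tuple(pd.unique(x)))
            .reset_index())

# Attach that tuple to every row of the person; the rows are now identical.
merged = df.drop(columns='email').merge(emails, on=key_cols)

rows_before = len(merged)                    # still one row per CSV line
rows_after = len(merged.drop_duplicates())   # identical rows collapsed
print(rows_before, rows_after)
```

With the sample data this goes from two rows to one, which is why only a single address record remains after the groupby.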

1 Comment

Can you please help me with this? stackoverflow.com/questions/68921025/…

import pandas as pd

g_cols = ['source_id', 'fname', 'lname', 'email', 'date_of_birth']
df = pd.read_csv(path, sep=",", header=0)

cols = df.columns[~df.columns.isin(g_cols)]
g_cols.remove('email')

df = (df.sort_values(g_cols)
      .set_index(g_cols)
      .assign(email=df.groupby(g_cols)['email'].agg(lambda x: tuple(pd.unique(x))))
      .reset_index())

g_cols.append('email')
# drop_duplicates() removes the identical rows before the addresses are nested
df1 = df.drop_duplicates().groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(name='address').to_dict('records')
print(df1)
df2 = pd.DataFrame(df1)
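
Neither snippet above writes the JSON itself. As a hedged alternative, the nested structure can also be built with an explicit loop and serialized with the standard json module. This is a sketch, not the approach from either answer: the header row, the placeholder emails, and the helper name to_builtin are all assumptions made for illustration.

```python
import io
import json
import pandas as pd

# Hypothetical header and placeholder emails: the question does not show them.
csv_text = (
    "source_id,fname,lname,email,date_of_birth,line1,line2,line3,city,state,country\n"
    '7,priya,kannan,a@example.com,07-12-1994,"123","456",67,mdu,tn,india\n'
    '7,priya,kannan,b@example.com,07-12-1994,"123","456",67,mdu,tn,india\n'
)
df = pd.read_csv(io.StringIO(csv_text))

key_cols = ['source_id', 'fname', 'lname', 'date_of_birth']
addr_cols = [c for c in df.columns if c not in key_cols + ['email']]


def to_builtin(value):
    """Fallback for json.dumps: unwrap NumPy scalars into plain Python values."""
    if hasattr(value, 'item'):
        return value.item()
    raise TypeError(f"cannot serialize {type(value).__name__}")


records = []
for _, group in df.groupby(key_cols, sort=False):
    # Person-level fields are the same on every row of the group.
    person = group[key_cols].iloc[:1].to_dict('records')[0]
    # Unique emails and unique addresses, each deduplicated on its own.
    person['email'] = group['email'].drop_duplicates().tolist()
    person['address'] = group[addr_cols].drop_duplicates().to_dict('records')
    records.append(person)

json_text = json.dumps(records, indent=4, default=to_builtin)
print(json_text)
```

Deduplicating emails and addresses separately means a person with two emails and two distinct addresses keeps both addresses, while repeated addresses are still collapsed.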
