Creating Nested Json from CSV with incomplete rows

Question

I've cobbled together a few disparate excel/csv files with pandas for a database I am trying to build. I've seen a few examples here of creating nested Jsons from csvs, and while those have helped to partially replicate what I need, they ultimately have fallen short.

Rather that being flat, my data is stepwise like this where data is 'joined' by a subject id #, but information on individual visits and samples are on separate rows with 'NaN' for unrelated columns.

in csv format:

subject_id,name,dob,gender,visit_date,date_entered,entered_by,sample_id,collected_by,collection_date
1,Bob,1/1/00,M,,,,,,
1,,,,1/1/18,1/2/18,Sally,,,
1,,,,1/2/18,1/2/18,Tim,,,
1,,,,,,,XXX123,Sally,1/3/18
2,Mary,1/2/00,F,,,,,,
2,,,,1/3/18,1/4/18,Sally,,,
2,,,,,,,YYY456,Sally,1/5/18
2,,,,,,,ZZZ789,Tim,1/6/18

I'm trying to get an output like this:

{
'subject_id': '1'
'name': 'Bob',
'dob': '1/1/00',
'gender': 'M',
'visits': { 
    '1/1/18': {
        'date_entered': '1/2/18',
        'entered_by': 'Sally',
        }
    '1/2/18': {
        'date_entered': '1/2/18',
        'entered_by': 'Tim',
        }
    }
'samples': {
    'XXX123': {
        'collected_by': 'Sally',
        'collection_date': '1/3/18',
        }
    }
}
{
'subject_id': '2'
'name': 'Mary',
'dob': '1/2/00',
'gender': 'F',
'visits': { 
    '1/3/18': {
        'date_entered': '1/4/18',
        'entered_by': 'Sally',
        }
    }
'samples': {
    'YYY456': {
        'collected_by': 'Sally',
        'collection_date': '1/5/18',
        }
    'ZZZ789': {
        'collected_by': 'Tim',
        'collection_date': '1/6/18',
        }   
    }
}

Where information on visits and samples are nested under the more general information. This is obviously a simplified dataset of what I am trying to accomplish, but any advice would be greatly appreciated.

Thanks.

EDIT: More accurate reflection of csv data. Not as streamlined or complete as original example.

'subid,firstvisit,name,contact,dob,gender,visitdate1,age,visitcategory,samplenumber,label_on_sample,completed_by
    1,12/31/11,Bob,,12/31/00,Male,,,,,,
    1,,,,,,12/31/15,17,Baseline Visit,,,
    1,,,,,,12/31/16,18,Follow Up Visit,,,
    1,,,,,,12/31/17,18,Follow Up Visit,,,
    1,,,,12/31/00,Male,,17,,XXX123,1,Sally
    2,1/1/12,,,1/1/01,Female,,,,,,
    2,,,,,,1/1/11,10,Baseline Visit,,,
    2,,,,,,1/1/12,11,Follow Up Visit,,,
    2,,,,,,1/1/13,12,Follow Up Visit,,,
    2,,,,,,1/1/14,13,Follow Up Visit,,,
    2,,,,,,1/1/15,14,Follow Up Visit,,,
    2,,,,1/1/01,Female,,15,,YYY456,2,
    2,,,,1/1/01,Female,,15,,ZZZ789,2,Sally'

G_M · Accepted Answer · 2018-05-24 22:00:19Z

0

Although I would guess the pandas wizards on SO have a different way, here is one way to achieve your example output with pure Python (I wrote this using Python 3.6.5).

Hopefully this can help you get started!

EDIT:

I modified the code to hopefully account for the new example csv data provided. Since the structure of the new csv is not exactly the same, I had to guess a bit as to the final output structure.

from collections import defaultdict
from csv import DictReader


def solution(csv_filename):
    by_subject_id = defaultdict(lambda: {
        'name': None,
        'dob': None,
        'gender': None,
        'visits': {},
        'samples': {}
    })

    with open(csv_filename) as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            non_empty = {k: v for k, v in row.items() if v}
            subject_id = non_empty['subid']  # must have to group by
            first_visit = non_empty.get('firstvisit')  # optional
            sample = non_empty.get('samplenumber')  # optional
            visit = non_empty.get('visitdate1')  # optional

            if first_visit:
                by_subject_id[subject_id].update({
                    'name': non_empty.get('name'),
                    'dob': non_empty.get('dob'),
                    'gender': non_empty.get('gender')
                })
            elif visit:
                by_subject_id[subject_id]['visits'][visit] = {
                    'age': non_empty.get('age'),
                    'visit_category': non_empty.get('visitcategory')
                }
            elif sample:
                by_subject_id[subject_id]['samples'][sample] = {
                    'completed_by': non_empty.get('completed_by'),
                    'label_on_sample': non_empty.get('label_on_sample')
                }
    return [{'subject_id': k, **v} for k, v in by_subject_id.items()]

Output:

[
    {
        "subject_id": "1",
        "name": "Bob",
        "dob": "12/31/00",
        "gender": "Male",
        "visits": {
            "12/31/15": {
                "age": "17",
                "visit_category": "Baseline Visit"
            },
            "12/31/16": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            },
            "12/31/17": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "XXX123": {
                "completed_by": "Sally",
                "label_on_sample": "1"
            }
        }
    },
    {
        "subject_id": "2",
        "name": null,
        "dob": "1/1/01",
        "gender": "Female",
        "visits": {
            "1/1/11": {
                "age": "10",
                "visit_category": "Baseline Visit"
            },
            "1/1/12": {
                "age": "11",
                "visit_category": "Follow Up Visit"
            },
            "1/1/13": {
                "age": "12",
                "visit_category": "Follow Up Visit"
            },
            "1/1/14": {
                "age": "13",
                "visit_category": "Follow Up Visit"
            },
            "1/1/15": {
                "age": "14",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "YYY456": {
                "completed_by": null,
                "label_on_sample": "2"
            },
            "ZZZ789": {
                "completed_by": "Sally",
                "label_on_sample": "2"
            }
        }
    }
]

edited May 24, 2018 at 22:00

answered May 23, 2018 at 20:05

G_M

3,3821 gold badge11 silver badges24 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

jester_in_yellow Over a year ago

Thanks so much. This works perfectly on the test data. Unfortunately, I am running into issues on the real set, largely, I believe, because of unexpected NaN's. For instance, Bob's 'dob' might be missing or a 'date_entered', and it seems to be throwing everything off.

jester_in_yellow Over a year ago

added a more accurate reflection of the data. Apologies for the inconvenience.

jester_in_yellow Over a year ago

This works like a charm thank you so much! I wanted to let you know you've helped a lot of people in a very meaningful way with this.

jester_in_yellow Over a year ago

Thanks again for all your help, been busy on trying to expand this to the larger dataset. One thing I noticed, is that it seems that this returns everything as a string. If I wanted ages, say, to be an int, is there some way to account for this?

jester_in_yellow Over a year ago

Ah, it seems to be an issue with reading in from the csv, it reads everything as a string. You wouldn't happen to know a good way to convert this to read from a dataframe would you?

|

Collectives™ on Stack Overflow

Creating Nested Json from CSV with incomplete rows

1 Answer 1

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related