0

I've cobbled together a few disparate excel/csv files with pandas for a database I am trying to build. I've seen a few examples here of creating nested Jsons from csvs, and while those have helped to partially replicate what I need, they ultimately have fallen short.

Rather that being flat, my data is stepwise like this where data is 'joined' by a subject id #, but information on individual visits and samples are on separate rows with 'NaN' for unrelated columns.

in csv format:

subject_id,name,dob,gender,visit_date,date_entered,entered_by,sample_id,collected_by,collection_date
1,Bob,1/1/00,M,,,,,,
1,,,,1/1/18,1/2/18,Sally,,,
1,,,,1/2/18,1/2/18,Tim,,,
1,,,,,,,XXX123,Sally,1/3/18
2,Mary,1/2/00,F,,,,,,
2,,,,1/3/18,1/4/18,Sally,,,
2,,,,,,,YYY456,Sally,1/5/18
2,,,,,,,ZZZ789,Tim,1/6/18

I'm trying to get an output like this:

{
'subject_id': '1'
'name': 'Bob',
'dob': '1/1/00',
'gender': 'M',
'visits': { 
    '1/1/18': {
        'date_entered': '1/2/18',
        'entered_by': 'Sally',
        }
    '1/2/18': {
        'date_entered': '1/2/18',
        'entered_by': 'Tim',
        }
    }
'samples': {
    'XXX123': {
        'collected_by': 'Sally',
        'collection_date': '1/3/18',
        }
    }
}
{
'subject_id': '2'
'name': 'Mary',
'dob': '1/2/00',
'gender': 'F',
'visits': { 
    '1/3/18': {
        'date_entered': '1/4/18',
        'entered_by': 'Sally',
        }
    }
'samples': {
    'YYY456': {
        'collected_by': 'Sally',
        'collection_date': '1/5/18',
        }
    'ZZZ789': {
        'collected_by': 'Tim',
        'collection_date': '1/6/18',
        }   
    }
}

Where information on visits and samples are nested under the more general information. This is obviously a simplified dataset of what I am trying to accomplish, but any advice would be greatly appreciated.

Thanks.

EDIT: More accurate reflection of csv data. Not as streamlined or complete as original example.

'subid,firstvisit,name,contact,dob,gender,visitdate1,age,visitcategory,samplenumber,label_on_sample,completed_by
    1,12/31/11,Bob,,12/31/00,Male,,,,,,
    1,,,,,,12/31/15,17,Baseline Visit,,,
    1,,,,,,12/31/16,18,Follow Up Visit,,,
    1,,,,,,12/31/17,18,Follow Up Visit,,,
    1,,,,12/31/00,Male,,17,,XXX123,1,Sally
    2,1/1/12,,,1/1/01,Female,,,,,,
    2,,,,,,1/1/11,10,Baseline Visit,,,
    2,,,,,,1/1/12,11,Follow Up Visit,,,
    2,,,,,,1/1/13,12,Follow Up Visit,,,
    2,,,,,,1/1/14,13,Follow Up Visit,,,
    2,,,,,,1/1/15,14,Follow Up Visit,,,
    2,,,,1/1/01,Female,,15,,YYY456,2,
    2,,,,1/1/01,Female,,15,,ZZZ789,2,Sally'

1 Answer 1

0

Although I would guess the pandas wizards on SO have a different way, here is one way to achieve your example output with pure Python (I wrote this using Python 3.6.5).

Hopefully this can help you get started!


EDIT:

I modified the code to hopefully account for the new example csv data provided. Since the structure of the new csv is not exactly the same, I had to guess a bit as to the final output structure.

from collections import defaultdict
from csv import DictReader


def solution(csv_filename):
    by_subject_id = defaultdict(lambda: {
        'name': None,
        'dob': None,
        'gender': None,
        'visits': {},
        'samples': {}
    })

    with open(csv_filename) as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            non_empty = {k: v for k, v in row.items() if v}
            subject_id = non_empty['subid']  # must have to group by
            first_visit = non_empty.get('firstvisit')  # optional
            sample = non_empty.get('samplenumber')  # optional
            visit = non_empty.get('visitdate1')  # optional

            if first_visit:
                by_subject_id[subject_id].update({
                    'name': non_empty.get('name'),
                    'dob': non_empty.get('dob'),
                    'gender': non_empty.get('gender')
                })
            elif visit:
                by_subject_id[subject_id]['visits'][visit] = {
                    'age': non_empty.get('age'),
                    'visit_category': non_empty.get('visitcategory')
                }
            elif sample:
                by_subject_id[subject_id]['samples'][sample] = {
                    'completed_by': non_empty.get('completed_by'),
                    'label_on_sample': non_empty.get('label_on_sample')
                }
    return [{'subject_id': k, **v} for k, v in by_subject_id.items()]

Output:

[
    {
        "subject_id": "1",
        "name": "Bob",
        "dob": "12/31/00",
        "gender": "Male",
        "visits": {
            "12/31/15": {
                "age": "17",
                "visit_category": "Baseline Visit"
            },
            "12/31/16": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            },
            "12/31/17": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "XXX123": {
                "completed_by": "Sally",
                "label_on_sample": "1"
            }
        }
    },
    {
        "subject_id": "2",
        "name": null,
        "dob": "1/1/01",
        "gender": "Female",
        "visits": {
            "1/1/11": {
                "age": "10",
                "visit_category": "Baseline Visit"
            },
            "1/1/12": {
                "age": "11",
                "visit_category": "Follow Up Visit"
            },
            "1/1/13": {
                "age": "12",
                "visit_category": "Follow Up Visit"
            },
            "1/1/14": {
                "age": "13",
                "visit_category": "Follow Up Visit"
            },
            "1/1/15": {
                "age": "14",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "YYY456": {
                "completed_by": null,
                "label_on_sample": "2"
            },
            "ZZZ789": {
                "completed_by": "Sally",
                "label_on_sample": "2"
            }
        }
    }
]
Sign up to request clarification or add additional context in comments.

7 Comments

Thanks so much. This works perfectly on the test data. Unfortunately, I am running into issues on the real set, largely, I believe, because of unexpected NaN's. For instance, Bob's 'dob' might be missing or a 'date_entered', and it seems to be throwing everything off.
added a more accurate reflection of the data. Apologies for the inconvenience.
This works like a charm thank you so much! I wanted to let you know you've helped a lot of people in a very meaningful way with this.
Thanks again for all your help, been busy on trying to expand this to the larger dataset. One thing I noticed, is that it seems that this returns everything as a string. If I wanted ages, say, to be an int, is there some way to account for this?
Ah, it seems to be an issue with reading in from the csv, it reads everything as a string. You wouldn't happen to know a good way to convert this to read from a dataframe would you?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.