Python parsing large text files and capturing multi-level data

Question

First let me appologize if my description of this is completely retarded, still learning most of this on-the-fly.

I have several large text-files (.txt) (~600,000 lines) of general hospital information, which I'm parsing with python. I have been using default dicts (python2.7) to get counts and sub-counts one level deep of pt. diagnoses. For example, if looking to catch Heart Attacks and then differentiate based on type (pseudocode):

if 'heart attack' in line[65:69]: 
    defaultdict['heart attack'] +=1
    if [65:69] == 'worst kind':
        defaultdict['worst'] += 1
    else: 
        defaultdict['not worst'] +=1

This way I catch heart attacks and whether they were the specific one of interest. That all works fine. However, now I also want to collect information (from the same line) of the age of the patient (reported in coded- ranges), sex (M,F,U), and race, etc. I'm realizing my technique is not so well suited for this - it seems to be growing in complexity at a rapid rate. So, before I dig myself in too deep - is there another way I should be tackling this?

Eventually I'm planning on getting all these files into an actual database, but this is basically the last piece of info. I need for the current project so I'm comfortable just dumping it into excel and graphing it for now.

Appreciate any advice!

EDIT: Sample line is like -

02032011JuniorHospital       932220320M09A228393
03092011MassGeneralHospitals 923392818F09B228182

So all lines will be fixed length, where line[0:8] is always the date, etc. There is a seperate file (dictionary?) that explains what the numbers mean - so a diagnoses would be something like 410.22, ages will be in a range 0 = 0-1 yr old, 1 = 2-3 yr old, etc...

GOAL: For each diagnoses I want, also want to know is that particular diagnoses a sub-type of interst (no problem getting this far with above code), what are the various ages associated with that diagnoses (i.e., how many in each age range). I currently have this outputing to an excel file (csv), so I want various multiple columns that I can plot as I need.

Again, I CAN figure out how to do this all just creating a few extra default dicts - it just seems like there should be an easier way to group them all together into one main object!

Phew

If you give us a hint as to the format of your file, we might be able to give you a hint how to parse it. (A few lines of example data would be great.) — Sven Marnach
– Sven Marnach, Commented Jun 6, 2011 at 22:04
This is all highly dependent on the formats of input and output that you want. But a mature DB-ORM library might come to your rescue. — Santa
– Santa, Commented Jun 6, 2011 at 22:07
1. It's a simple .txt file, fixed-line (all lines same length) 2. I'm outputting to .csv (using excel) and use the Dictwriter. — chris
– chris, Commented Jun 6, 2011 at 23:53
'heart attack' in line[65:69] is always False -- it doesn't fit. [65:69] == 'worst kind' isn't correct Python syntax. If this actually should be line[65:69] == 'worst kind', it's alway False again. This is a bit confusing. Would it be possible to just post a few lines of example data? — Sven Marnach
– Sven Marnach, Commented Jun 7, 2011 at 0:10
See edit. That was just pseudo code - an actual snippet is, for instance - if '410.2' in line[65:70] - the overall logic is what I'm after. I can't post the actual sample data but its nearly identical (but much longer) than what I edited in. — chris
– chris, Commented Jun 7, 2011 at 0:15

Sven Marnach · Accepted Answer · 2012-09-16 12:04:30Z

You can generalise the concept of hierarchic counts to get clearer code that is easier to modify. A basic example of an hierachic counter class is

class HierarchicCounter(object):
    def __init__(self, key, hierarchy):
        self.key = key
        self.hierarchy = hierarchy
        self.counts = defaultdict(int)
        self.subcounters = defaultdict(self._create_subcounters)

    def _create_subcounters(self):
        return {key: HierarchicCounter(key, hierarchy)
                for key, hierarchy in self.hierarchy.iteritems()}

    def count(self, line):
        key = self.key(line)
        if key is None:
            return
        self.counts[key] += 1
        for subcounter in self.subcounters[key].itervalues():
            subcounter.count(line)

The constructor of this class accepts two parameters. The key parameter is the "key function", which tells the counter what it is supposed to count. If a line is fed to the counter, it applies the key functition to it and increments the count corresponding to the retrieved key. The hierarchy parameter is a dictionary mapping key functions of the desired subcounters to their respective hierarchies.

Example usage:

def diagnosis_major(line):
    return line[0:3]

def diagnosis_minor(line):
    return line[3:5]

def age(line):
    return int(line[5:7])

def sex(line):
    return line[7]

counter = HierarchicCounter(
    diagnosis_major, {diagnosis_minor: {sex: {}}, age: {}})

This creates a few simple key functions extracting different fields from a line. In your application, the key functions might become more complex. You might also filter out keys here – if a key function returns None, the counter will simply ignore the line. The last two lines make a HierarchicCounter instance with the following counts hierarchy:

diagnosis_major
|-- diagnosis_minor
|   \-- sex
\-- age

So the counter counts the number of cases of each major diagnosis. For each major diagnosis, it counts the minor diagnoses and ages corresponding to this major diagnosis. The sexes are counted for each major diagnosis for each minor diagnosis.

Of course this example is not complete. You would need to add some code to actually output the counts collected in the counter hierarchy in some format -- this is just to give you an idea how this could be abstracted in a more general way.

You sir, have gone above and beyond. This is the type of thing I'd imagined though it was a bit beyond my mental grasp. Thanks!

score 2 · Accepted Answer · 2011-06-07 00:35:10Z

2

If you get each line into a tuple you can sort on any field or combo of fields in that tuple. Provide a custom comparator function that compares the elements. e.g. your comparison func would sort on field 1 then field 3. If field 1 was ailment and field 3 was age, it would sort y age and within each age category you would have a sorted list of ailments. Fairly easy to adapt to break into sections e.g.: ailment then age, and split the list after each ailment.

http://wiki.python.org/moin/HowTo/Sorting/

That being said, without SQL or spreadsheet, you're probably looking at multiple dicts.

edited Jun 7, 2011 at 0:35

answered Jun 7, 2011 at 0:26

user774340

Comments

eyquem · Accepted Answer · 2011-06-06 22:53:02Z

0

It sounds to me that you are in the same situation I was in when I was struggling to treat texts and extract data from them without knowing the existence of regexes ( see module re )

Give precisions on what you want to do and we'll help you to do what you want. Personally, it will be with the power of regexes.

answered Jun 6, 2011 at 22:53

eyquem

27.7k7 gold badges43 silver badges46 bronze badges

1 Comment

chris Over a year ago

Sorry - it's a fixed length format so I can easily just do line[66:67] and check for values like that; figured that would be easier than regex (correct me if wrong). Will update my orig post a bit.

Collectives™ on Stack Overflow

Python parsing large text files and capturing multi-level data

3 Answers 3

1 Comment

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related