First, let me apologize if my description is unclear; I'm still learning most of this on the fly.

I have several large text files (.txt, ~600,000 lines) of general hospital information, which I'm parsing with Python. I have been using defaultdicts (Python 2.7) to get counts and sub-counts, one level deep, of patient diagnoses. For example, to catch heart attacks and then differentiate based on type (pseudocode):

if 'heart attack' in line[65:69]: 
    defaultdict['heart attack'] +=1
    if [65:69] == 'worst kind':
        defaultdict['worst'] += 1
    else: 
        defaultdict['not worst'] +=1

This way I catch heart attacks and whether they were the specific one of interest. That all works fine. However, now I also want to collect information (from the same line) about the age of the patient (reported in coded ranges), sex (M, F, U), race, etc. I'm realizing my technique is not well suited for this: it seems to be growing in complexity at a rapid rate. So, before I dig myself in too deep, is there another way I should be tackling this?

Eventually I'm planning on getting all these files into an actual database, but this is basically the last piece of info I need for the current project, so I'm comfortable just dumping it into Excel and graphing it for now.

Appreciate any advice!

EDIT: Sample line is like -

02032011JuniorHospital       932220320M09A228393
03092011MassGeneralHospitals 923392818F09B228182

So all lines will be fixed length, where line[0:8] is always the date, etc. There is a separate file (dictionary?) that explains what the numbers mean, so a diagnosis would be something like 410.22, and ages will be in a range: 0 = 0-1 yr old, 1 = 2-3 yr old, etc.
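A minimal sketch of slicing such a fixed-width line into named fields. Only the date at line[0:8] is stated above; every other field name and offset here is a guess from the two sample lines and purely illustrative.

```python
# Hypothetical field layout: only "date" (0:8) comes from the description;
# the other names and offsets are guessed from the sample lines.
FIELDS = {
    "date": slice(0, 8),
    "hospital": slice(8, 29),
    "sex": slice(38, 39),
    "age_code": slice(39, 41),
}


def parse_line(line):
    """Return a dict mapping each field name to its stripped text."""
    return {name: line[span].strip() for name, span in FIELDS.items()}


# Build a record laid out like the first sample line.
sample = ("02032011" + "JuniorHospital".ljust(21) + "932220320"
          + "M" + "09" + "A" + "228393")
record = parse_line(sample)
print(record["date"], record["hospital"], record["sex"], record["age_code"])
```

Keeping the offsets in one table means a new field is one extra entry rather than another hard-coded slice in the parsing loop.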

GOAL: For each diagnosis I want, I also want to know whether that particular diagnosis is a sub-type of interest (no problem getting this far with the above code) and what the various ages associated with that diagnosis are (i.e., how many in each age range). I currently have this outputting to a CSV file (opened in Excel), so I want multiple columns that I can plot as I need.

Again, I CAN figure out how to do all this by just creating a few extra defaultdicts. It just seems like there should be an easier way to group them all together into one main object!

Phew

  • If you give us a hint as to the format of your file, we might be able to give you a hint how to parse it. (A few lines of example data would be great.) Commented Jun 6, 2011 at 22:04
  • This is all highly dependent on the formats of input and output that you want. But a mature DB-ORM library might come to your rescue. Commented Jun 6, 2011 at 22:07
  • 1. It's a simple .txt file, fixed-line (all lines same length) 2. I'm outputting to .csv (using Excel) and use the DictWriter. Commented Jun 6, 2011 at 23:53
  • 'heart attack' in line[65:69] is always False -- it doesn't fit. [65:69] == 'worst kind' isn't correct Python syntax. If this actually should be line[65:69] == 'worst kind', it's always False again. This is a bit confusing. Would it be possible to just post a few lines of example data? Commented Jun 7, 2011 at 0:10
  • See edit. That was just pseudocode. An actual snippet is, for instance, if '410.2' in line[65:70]. The overall logic is what I'm after. I can't post the actual sample data, but it's nearly identical to (but much longer than) what I edited in. Commented Jun 7, 2011 at 0:15

3 Answers

You can generalise the concept of hierarchic counts to get clearer code that is easier to modify. A basic example of a hierarchic counter class is

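from collections import defaultdict

class HierarchicCounter(object):
    def __init__(self, key, hierarchy):
        self.key = key              # function mapping a line to the key to count
        self.hierarchy = hierarchy  # dict: subcounter key function -> its hierarchy
        self.counts = defaultdict(int)
        # Subcounters are created lazily, one set per distinct key value.
        self.subcounters = defaultdict(self._create_subcounters)

    def _create_subcounters(self):
        # .items()/.values() work on both Python 2.7 and Python 3.
        return {key: HierarchicCounter(key, hierarchy)
                for key, hierarchy in self.hierarchy.items()}

    def count(self, line):
        key = self.key(line)
        if key is None:
            # A key function returns None to skip the line.
            return
        self.counts[key] += 1
        for subcounter in self.subcounters[key].values():
            subcounter.count(line)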

The constructor of this class accepts two parameters. The key parameter is the "key function", which tells the counter what it is supposed to count. When a line is fed to the counter, it applies the key function to it and increments the count corresponding to the retrieved key. The hierarchy parameter is a dictionary mapping the key functions of the desired subcounters to their respective hierarchies.

Example usage:

def diagnosis_major(line):
    return line[0:3]

def diagnosis_minor(line):
    return line[3:5]

def age(line):
    return int(line[5:7])

def sex(line):
    return line[7]

counter = HierarchicCounter(
    diagnosis_major, {diagnosis_minor: {sex: {}}, age: {}})

This creates a few simple key functions extracting different fields from a line. In your application, the key functions might become more complex. You might also filter out keys here – if a key function returns None, the counter will simply ignore the line. The last two lines make a HierarchicCounter instance with the following counts hierarchy:

diagnosis_major
|-- diagnosis_minor
|   \-- sex
\-- age

So the counter counts the number of cases of each major diagnosis. For each major diagnosis, it counts the minor diagnoses and ages corresponding to this major diagnosis. The sexes are counted separately for each minor diagnosis within each major diagnosis.
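As a hedged illustration of filtering inside a key function: the sketch below only yields a key for diagnosis codes starting with 410, using the line[65:70] slice mentioned in the question's comments. The function name and the choice of prefix are assumptions made for illustration.

```python
def heart_attack_code(line):
    """Key function: return the diagnosis code, or None to skip the line."""
    # Slice taken from the question's comments; adjust to the real layout.
    code = line[65:70]
    if code.startswith("410"):
        return code
    return None  # a counter would ignore lines whose key is None


# Two made-up lines, padded so the code sits at columns 65-69.
matching = " " * 65 + "410.2"
other = " " * 65 + "250.0"
print(heart_attack_code(matching))
print(heart_attack_code(other))
```

A function like this could be passed as the top-level key, so that only the diagnoses of interest are counted at all.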

Of course this example is not complete. You would need to add some code to actually output the counts collected in the counter hierarchy in some format -- this is just to give you an idea how this could be abstracted in a more general way.
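One possible way to get the counts back out, sketched under assumptions: the helper iter_rows below is hypothetical (not part of the class above) and flattens the hierarchy into (path, count) pairs, where each path alternates key-function name and value. The class is repeated, using .items() and .values() so it runs on both Python 2.7 and 3, only to keep the sketch self-contained; the four-character input lines are made up.

```python
from collections import defaultdict


class HierarchicCounter(object):
    # Same class as above, repeated so this sketch runs on its own.
    def __init__(self, key, hierarchy):
        self.key = key
        self.hierarchy = hierarchy
        self.counts = defaultdict(int)
        self.subcounters = defaultdict(self._create_subcounters)

    def _create_subcounters(self):
        return {key: HierarchicCounter(key, hierarchy)
                for key, hierarchy in self.hierarchy.items()}

    def count(self, line):
        key = self.key(line)
        if key is None:
            return
        self.counts[key] += 1
        for subcounter in self.subcounters[key].values():
            subcounter.count(line)


def iter_rows(counter, path=()):
    """Yield (path, count) pairs; path alternates field name and value."""
    for value in sorted(counter.counts):
        here = path + (counter.key.__name__, value)
        yield here, counter.counts[value]
        # .get() avoids creating empty subcounters while reading.
        subs = counter.subcounters.get(value, {})
        for sub in sorted(subs.values(), key=lambda c: c.key.__name__):
            for row in iter_rows(sub, here):
                yield row


def diagnosis(line):
    return line[0:3]


def sex(line):
    return line[3]


counter = HierarchicCounter(diagnosis, {sex: {}})
for line in ["410M", "410F", "410M", "250F"]:  # made-up input lines
    counter.count(line)

rows = list(iter_rows(counter))
for path, n in rows:
    print(path, n)
```

Each pair could then be written out as one CSV row, for example with csv.writer, with the path spread across columns.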


1 Comment

You sir, have gone above and beyond. This is the type of thing I'd imagined though it was a bit beyond my mental grasp. Thanks!

If you get each line into a tuple, you can sort on any field or combination of fields in that tuple. Provide a custom comparator function that compares the elements, e.g. your comparison function would sort on field 1 then field 3. If field 1 was ailment and field 3 was age, it would sort by age, and within each age category you would have a sorted list of ailments. It is fairly easy to adapt this to break the list into sections, e.g. sort by ailment then age, and split the list after each ailment.
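A minimal sketch of this idea. It swaps the comparator for a key function passed to sorted() (Python 3's sorted() no longer accepts cmp) and uses itertools.groupby to split the list after each ailment. The tuples and field positions are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Made-up records: (date, ailment, sex, age_code)
records = [
    ("02032011", "410.2", "M", 9),
    ("03092011", "250.0", "F", 4),
    ("03102011", "410.2", "F", 4),
    ("03112011", "410.2", "M", 4),
]

# Sort on ailment (index 1), then age code (index 3).
by_ailment_age = sorted(records, key=itemgetter(1, 3))

# groupby needs sorted input; it yields one run of rows per ailment.
age_counts = {}
for ailment, rows in groupby(by_ailment_age, key=itemgetter(1)):
    ages = [row[3] for row in rows]
    age_counts[ailment] = {age: ages.count(age) for age in sorted(set(ages))}

print(age_counts)
```

Here age_counts maps each ailment to a small dict of age code to number of cases, which is close to the columns the question wants to plot.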

http://wiki.python.org/moin/HowTo/Sorting/

That being said, without SQL or spreadsheet, you're probably looking at multiple dicts.

It sounds to me like you are in the same situation I was in when I was struggling to process texts and extract data from them without knowing that regexes existed (see the re module).

Give more details about what you want to do and we'll help you do it. Personally, I would do it with the power of regexes.

1 Comment

Sorry - it's a fixed length format so I can easily just do line[66:67] and check for values like that; figured that would be easier than regex (correct me if wrong). Will update my orig post a bit.
