First, let me apologize if my description of this is unclear; I'm still learning most of this on the fly.
I have several large text files (.txt) (~600,000 lines) of general hospital information, which I'm parsing with Python. I have been using defaultdicts (Python 2.7) to get counts and sub-counts, one level deep, of patient diagnoses. For example, if I'm looking to catch heart attacks and then differentiate based on type (pseudocode):
    if 'heart attack' in line[65:69]:
        defaultdict['heart attack'] += 1
        if [65:69] == 'worst kind':
            defaultdict['worst'] += 1
        else:
            defaultdict['not worst'] += 1
This way I catch heart attacks and whether they were the specific type of interest. That all works fine. However, now I also want to collect, from the same line, the patient's age (reported in coded ranges), sex (M, F, U), race, etc. I'm realizing my technique is not well suited for this - its complexity is growing rapidly. So, before I dig myself in too deep - is there another way I should be tackling this?
Eventually I'm planning on getting all these files into an actual database, but this is basically the last piece of info I need for the current project, so I'm comfortable just dumping it into Excel and graphing it for now.
Appreciate any advice!
EDIT: Sample lines look like this -
02032011JuniorHospital 932220320M09A228393
03092011MassGeneralHospitals 923392818F09B228182
So all lines are fixed length, where line[0:8] is always the date, etc. There is a separate file (a dictionary?) that explains what the numbers mean - so a diagnosis would be something like 410.22, and ages are coded as ranges: 0 = 0-1 yr old, 1 = 2-3 yr old, etc.
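To make the layout concrete, here is a rough sketch of how one line could be sliced into named fields. Only the date slice line[0:8] is a real offset; the other offsets, the names FIELDS, AGE_RANGES and parse_line, and the sample line are made up for illustration. The real values would come from the separate code-book file.

```python
# Hypothetical layout: only the date slice (0:8) matches the real files;
# the remaining offsets are placeholders to be filled in from the code book.
FIELDS = {
    'date': slice(0, 8),
    'sex': slice(8, 9),
    'age_code': slice(9, 10),
    'diagnosis': slice(10, 16),
}

# Assumed excerpt of the code book: age code -> age range in years.
AGE_RANGES = {'0': '0-1', '1': '2-3'}


def parse_line(line):
    """Cut one fixed-width line into a dict of named, stripped fields."""
    return dict((name, line[span].strip()) for name, span in FIELDS.items())


# Made-up line that follows the hypothetical layout above (not real data).
record = parse_line('02032011M1410.22')
age_range = AGE_RANGES[record['age_code']]
```

Keeping the offsets in one table means a wrong column position only has to be fixed in one place.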
GOAL: For each diagnosis I'm interested in, I also want to know whether it is a sub-type of interest (no problem getting this far with the above code) and which ages are associated with it (i.e., how many patients fall in each age range). I currently have this outputting to an Excel file (CSV), so I want multiple columns that I can plot as needed.
Again, I CAN figure out how to do this by just creating a few extra defaultdicts - it just seems like there should be an easier way to group them all together into one main object!
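For what it's worth, one possible shape for such a "main object" is a single Counter keyed by a tuple of fields, which can then be collapsed to whichever columns are needed. This is only a minimal sketch: it assumes each line has already been parsed into a dict, and the field names, the totals_by helper, and the sample records are all made up.

```python
from collections import Counter

# Made-up records standing in for parsed lines; field names are assumptions.
records = [
    {'diagnosis': '410.22', 'age_code': '1', 'sex': 'M'},
    {'diagnosis': '410.22', 'age_code': '1', 'sex': 'F'},
    {'diagnosis': '410.22', 'age_code': '0', 'sex': 'M'},
    {'diagnosis': '250.00', 'age_code': '1', 'sex': 'F'},
]

# One counter keyed by (diagnosis, age_code, sex) instead of one
# defaultdict per question.
counts = Counter()
for rec in records:
    counts[(rec['diagnosis'], rec['age_code'], rec['sex'])] += 1


def totals_by(counts, positions):
    """Collapse the tuple keys down to the given positions and re-sum."""
    out = Counter()
    for key, n in counts.items():
        out[tuple(key[i] for i in positions)] += n
    return out


by_diagnosis = totals_by(counts, [0])        # counts per diagnosis
by_diag_and_age = totals_by(counts, [0, 1])  # counts per diagnosis and age code
```

Each collapsed counter can be written out as one CSV column group, and adding race or sub-type later only means widening the tuple.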
Phew
'heart attack' in line[65:69] is always False -- it doesn't fit. [65:69] == 'worst kind' isn't correct Python syntax. If this actually should be line[65:69] == 'worst kind', it's always False again. This is a bit confusing. Would it be possible to just post a few lines of example data?