
The long (winded) version: I'm gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.

For a typical experiment, I'll have 30 files, one per user. The field count is fixed within an experiment (but varies between experiments, 10-20 fields). Files are typically 700-1000 records long with a header row. Records are tab separated (see the sample, which has 4 integers, 3 strings, and 10 floats).

I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.

Once they're in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).

The essentials: I need to parse data like the sample below into categories. Categories are combos of the non-float values in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories; I'll make a new post for this.

I'm looking for suggestions on how to do both.

    # data is a list of tab separated records
    # fields is a list of my field names

    # get a list of fieldtypes via gettype on our first data row (data[0] is the header)
    # gettype is a function to get the type from a string without changing the data
    fieldtype = [gettype(n) for n in data[1].split('\t')]

    # get the indexes for fields that aren't floats
    mask = [i for i, field in enumerate(fieldtype) if field != "float"]

    # for each row of data[skipping first and last empty lists] we split(on tabs)
    # and take the ith element of that split where i is taken from the list mask
    # which tells us which fields are not floats
    records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]

    # we now get a unique set of combos
    # since set doesn't happily take a list of lists, we join each row of values
    # together in a comma separated string. So we end up with a list of strings.
    uniquerecs = set([",".join(row) for row in records])


    print len(uniquerecs)
    quit()

def gettype(s):
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"
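The grouping-plus-stats step described above could be sketched roughly like this (made-up rows and a reduced field layout, not the real data; the category key is just everything in a row that isn't a float):

```python
from collections import defaultdict
import statistics

# Hypothetical rows: (ints..., strings..., floats...) — not the real layout
rows = [
    (10, 0, "Right", 5.76, 0.031),
    (3, 1, "Left", 8.01, 0.031),
    (10, 0, "Right", 4.69, 0.031),
]

# Category key = every non-float value in the row
by_category = defaultdict(list)
for row in rows:
    key = tuple(f for f in row if not isinstance(f, float))
    by_category[key].append([f for f in row if isinstance(f, float)])

# Column-wise mean of the float fields within each category
for key, float_rows in sorted(by_category.items()):
    means = [statistics.mean(col) for col in zip(*float_rows)]
    print(key, means)
```

This never needs to know the category count up front; `len(by_category)` gives it afterwards.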

Sample Data:

field0  field1  field2  field3  field4  field5  field6  field7  field8  field9  field10 field11 field12 field13 field14 field15
10  0   2   1   Right   Right   Right   5.76765674196   0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3   1   3   0   Left    Left    Right   8.00982745764   0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5   19  1   0   Right   Left    Left    4.69440026591   0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3   1   4   2   Left    Right   Left    9.58648184552   0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9   0   0   7   Left    Left    Left    7.65374257547   0.030318719717  0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397
  • Get rid of that gettype and just do type(something) if you need the type itself; if you just want to compare it, use isinstance, e.g. isinstance(something, int). Commented May 22, 2011 at 18:11
  • The fields are all unicode strings. gettype differentiates them. Commented May 22, 2011 at 18:14
  • I regularly use both python and SPSS and would love to help, but I don't actually understand what your problem is. Commented May 22, 2011 at 18:22
  • I think part of my problem may be that I don't know that I understand what my problem is. Once I've got my list of categories, what's the best way to sort 30 files of 1000 lines each into 100 categories? Commented May 22, 2011 at 18:31
  • You may be interested in the "".isdigit() function. Commented May 22, 2011 at 19:11

4 Answers


Not sure if I understand your question, but here are a few thoughts:

For parsing the data files, you would usually use Python's csv module.

For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:

from collections import defaultdict
import csv

reader = csv.reader(open('data.file', 'rb'), delimiter='\t')
data_of_category = defaultdict(list)
lines = [line for line in reader]
mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
for line in lines[1:]:
    category = ','.join([line[i] for i in mask])
    data_of_category[category].append(line)

This way you don't have to calculate the categories in a separate first pass and can process the data in one go.
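Once `data_of_category` is populated, the number crunching follows naturally. A minimal sketch with made-up rows (the key string and the column index are hypothetical, not the asker's real layout):

```python
from collections import defaultdict

# Hypothetical stand-in for data_of_category, built as in the snippet above
data_of_category = defaultdict(list)
data_of_category['3,1,3,0,Left,Left,Right'] = [
    ['3', '1', '3', '0', 'Left', 'Left', 'Right', '8.0', '0.03'],
    ['3', '1', '3', '0', 'Left', 'Left', 'Right', '9.6', '0.03'],
]

# Per-category mean of the first float column (index 7 in this layout)
means = {}
for category, lines in data_of_category.items():
    values = [float(line[7]) for line in lines]
    means[category] = sum(values) / len(values)

print(means)
```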

And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".


Comments

This is fantastic and really addresses the splitting into category issue for me. Two minuscule typos line[1] became lines[1] and join[] became join(). Thank you! As for the dynamic graphical bit, I'll see about fleshing out the idea more completely and make a new post of it.
Ups! Seem to have been in a hurry. But at least you could figure out what I was trying to say.

For at least part of your question, have a look at Named Tuples
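For the unfamiliar, a quick illustration (the field names here are made up):

```python
from collections import namedtuple

# Hypothetical layout: four ints, three strings, one float
Record = namedtuple('Record', 'a b c d s1 s2 s3 f1')

row = Record(10, 0, 2, 1, 'Right', 'Right', 'Right', 5.76765674196)
print(row.s1)    # fields are accessible by name...
print(row[:4])   # ...or by position, like a plain tuple
```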



Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.

Step 2: Turn that into a dict of first entry: rest of entries.

with open("...", "rb") as data_file:
    lines = csv.reader(data_file, some_custom_dialect)
    categories = {line[0]: line[1:] for line in lines}

Step 3: Iterate over the items() of the data and do something with each line.

for category, line in categories.items():
    do_stats_to_line(line)



Some useful answers already but I'll throw mine in as well. Key points:

  1. Use the csv module
  2. Use collections.namedtuple for each row
  3. Group the rows using a tuple of int field values as the key

If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby. This would likely reduce memory consumption. Given your example data, and the fact that your files contain >= 1000 rows, this is probably not an issue to worry about.
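A rough sketch of that groupby variant, with made-up pre-sorted rows (groupby only merges *adjacent* rows with equal keys, so the data must be sorted by the same key first):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, already sorted by the two leading integer columns
rows = [
    (3, 1, 'Left', 8.01),
    (3, 1, 'Right', 9.59),
    (5, 19, 'Right', 4.69),
]

# Group rows that share the same (col0, col1) key
for key, group in groupby(rows, key=itemgetter(0, 1)):
    print(key, list(group))
```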

def coerce_to_type(value):
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

import csv
from collections import namedtuple, defaultdict

with open(datafile) as srcfile:
    data = csv.reader(srcfile, delimiter='\t')

    ## Read headers, create namedtuple
    headers = next(srcfile).strip().split('\t')
    datarow = namedtuple('datarow', headers)

    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)

    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

    ## DO SOMETHING INTERESTING WITH THE GROUPS
    import pprint
    pprint.pprint(dict(grouped_rows))

EDIT You may find the code at https://gist.github.com/985882 useful.

