Create separate lists for each opened file in 'for' loop

Question

I'm trying to create lists from multiple opened files and am having some issues. I need two separate lists for each file, but right now my code only creates two lists for the last file iterated. Any suggestions for how to fix this and create unique 'sample_genes' and 'sample_values' lists for each file in 'file_list'?

Alternatively, creating a single unified list for 'gene_names' from all files and 'sample_values' from all files would work as well.

# Parse csv files for samples, creating lists of gene names and expression values.
file_list =  ['CRPC_278.csv', 'PCaP_470.csv', 'CRPC_543.csv', 'PCaN_5934.csv', 'PCaN_6102.csv', 'PCaP_17163.csv']
des_list = ['a', 'b', 'c', 'd', 'e', 'f']
for idx, (f_in, des) in enumerate(zip(file_list, des_list)):
    with open(f_in) as des:
        cread = list(csv.reader(des, delimiter = '\t'))
        sample_genes = [i for i, j in (sorted([x for x in {i: float(j) 
                                        for i, j in cread}.items()], key = lambda v: v[1]))]        
        sample_values = [j for i, j in (sorted([x for x in {i: float(j) 
                            for i, j in cread}.items()], key = lambda v: v[1]))]

# Compute row means.
mean_values = [((a + b + c + d + e + f)/len(file_list)) for i, (a, b, c, d, e, f) in enumerate(zip(sample_1_values, sample_2_values, sample_3_values, sample_4_values, sample_5_values, sample_6_values))]

# Provide proper gene names for mean values and replace original data values by corresponding means.
sample_genes_list = [i for i in sample_1_genes, sample_2_genes, sample_3_genes, sample_4_genes, sample_5_genes, sample_6_genes]

sample_final_list = [sorted(zip(sg, mean_values)) for sg in sample_genes_list]

The new code below:

# Parse csv files for samples, creating lists of gene names and expression values.
file_list =  ['CRPC_278.csv', 'PCaP_470.csv', 'CRPC_543.csv', 'PCaN_5934.csv', 'PCaN_6102.csv', 'PCaP_17163.csv']
full_dict = {}
for path in file_list:
    with open(path) as stream:
        data = list(csv.reader(stream, delimiter = '\t'))
    data = sorted([(i, float(j)) for i, j in data], key = lambda v: v[1])
    sample_genes = [i for i, j in data]
    sample_values = [j for i, j in data]
    full_dict[path] = (sample_genes, sample_values)

Unpacking the dictionaries within the dictionary shows some deeply nested structure:

for key in full_dict:
    value = full_dict[key]
    for key in full_dict[key]:
        for idx, items in enumerate(key):
            print idx

The first problem is the des variable name: you use it twice in the loop scope, first to unpack the zipped list and then for the opened file object. Also, don't use file as a variable name, since it is a built-in name in Python. And you also have a syntax problem: {i: float(j)... What are the curly bracket and the : for? What are you trying to do?
– Peter Varo, Commented May 26, 2013 at 0:50
@PeterVaro: {a:b for ...} is a dictionary comprehension, e.g., {i:i for i in range(3)} produces {0: 0, 1: 1, 2: 2}. Of course the use in the OP's question is pretty odd....
– torek, Commented May 26, 2013 at 0:58
That's a dictionary comprehension nested inside list comprehensions. It has been verified to give the correct output, so that code is correct. I'm modifying the file object name above.
– user2277435, Commented May 26, 2013 at 1:00
It's necessary to sort the values first by name, among a number of other reasons; it's just a combination of previously separate list comprehensions.
– user2277435, Commented May 26, 2013 at 1:01
They're in there, just hidden. The construct is [x for x in dictcomp.items()] where dictcomp is {i: float(j) for i, j in ...}.
– torek, Commented May 26, 2013 at 1:01

torek · Accepted Answer · 2013-05-26 01:39:36Z

3

I don't know for sure what's in your csv files but you're doing some redundant work, and some pointless work. Let's break these down a bit:

for idx, (f_in, des) in enumerate(zip(file_list, des_list)):

idx never appears in the body of the loop at all, so you don't need the enumerate.

des does appear in the body of the loop, but its first occurrence is in the construct:

with open(f_in) as des:

so that the one inside the loop is a different des, being the stream from opening the file. So presumably you don't need the zip either. Dropping both, you could just do:

for f_in in file_list:

Next, you read the file once (list(csv.reader(...))), which is fine. The result is saved under the name cread.

Then you have these two list comprehensions run over the result of sorted, which itself is given the result of a list comprehension run over the result of applying .items() to a dictionary comprehension. The point of the outer list comprehensions is to extract one or the other item from the list: first i, then j, from [... for i, j in ...].

That might be appropriate depending on what's happening inside the sorted, so let's take a look at that:

sorted(..., key = lambda v: v[1])

This means the list-elements must themselves be index-able and you're sorting by the second item (the first being v[0] of course).

When you're sorting by the second item, then taking the first item and discarding the second, it's at least not redundant. But if you're sorting by the second item, then taking the second item and discarding the first, you could simply take the second items, and then sort and be done. (But let's check one more thing before we go that far. :-) )
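A small check of that point, using made-up pairs (the names pairs, via_pairs, direct and names_by_value are illustrative only, not from the question):

```python
# Hypothetical (name, value) pairs standing in for parsed CSV rows.
pairs = [('a', 2.71828), ('b', 3.14159), ('c', 0.0)]

# Sort the pairs by value, then keep only the values...
via_pairs = [value for name, value in sorted(pairs, key=lambda v: v[1])]

# ...which gives the same list as extracting the values first and sorting them.
direct = sorted(value for name, value in pairs)

# The names, by contrast, really do need the sort-by-value step.
names_by_value = [name for name, value in sorted(pairs, key=lambda v: v[1])]
```

So the sort is only doing real work for the names; for the values it could be replaced by a plain sorted over the values alone.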

Last, let's look at the dictionary comprehensions and the .items() invocation. The dictcomp itself is, in both cases:

{i: float(j) for i, j in cread}

Presumably your CSV files must give you pairs, and whatever is in the first part can be used as a key while whatever is in the second part is convertible to float. So let's take a simple dictionary with two key-value pairs that are, say, string-and-float:

{'a': 2.71828, 'b': 3.14159}

and apply .items():

>>> {'a': 2.71828, 'b': 3.14159}.items()
[('a', 2.71828), ('b', 3.14159)]

Instead of making up a dictionary and collapsing it back down to a list of 2-element tuples, you could just use a list comprehension to make the two-element tuples. Let's test that out:

>>> cread = [['a', '2.71828'], ['b', '3.14159']]
>>> [(i, float(j)) for i, j in cread]
[('a', 2.71828), ('b', 3.14159)]

Now we can sort this thing once, by its second element. We can use sorted, or make a list and sort it in place, but once we're done let's save it. Before we start: I've picked a bad set of values, as they're already sorted, so let's add a pair to cread that sorts differently:

>>> cread.append(['c', '0']); print cread
[['a', '2.71828'], ['b', '3.14159'], ['c', '0']]
>>> by_second = sorted([(i, float(j)) for i, j in cread], key = lambda v: v[1])
>>> by_second
[('c', 0.0), ('a', 2.71828), ('b', 3.14159)]

Having saved this sorted thing, we can now get the sample_genes and sample_values lists via the original list-comprehension-to-pick-item. I'll change a few names too:

for path in file_list:
    with open(path) as stream:
        data = list(csv.reader(stream, delimiter = '\t'))
    data = sorted([(i, float(j)) for i, j in data], key = lambda v: v[1])
    sample_genes = [i for i, j in data]
    sample_values = [j for i, j in data]

The next step is of course to save these samples somehow. Presumably you were going to use either idx or des_list to name them, but it seems more direct to index them by csv-path-name:

    somedict[path] = (sample_genes, sample_values)

where somedict is initially an empty dictionary (created before entering the for loop). At some point here it's reasonable to start thinking about proper data structures and create a class, though.
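Putting the pieces together, a minimal self-contained sketch might look like the following. It is written for Python 3, the in-memory fake_files stand in for the real CSV files (their contents are invented), and Sample and load_sample are hypothetical names; a namedtuple is just one light-weight step toward a class, not the only reasonable choice:

```python
import csv
import io
from collections import namedtuple

# Hypothetical container; a light-weight step toward a proper class.
Sample = namedtuple('Sample', ['genes', 'values'])

def load_sample(stream):
    """Read tab-separated (gene, value) rows and return them sorted by value."""
    rows = [(gene, float(value))
            for gene, value in csv.reader(stream, delimiter='\t')]
    rows.sort(key=lambda v: v[1])
    return Sample(genes=[g for g, _ in rows], values=[v for _, v in rows])

# Invented data standing in for the real files on disk.
fake_files = {
    'CRPC_278.csv': 'geneA\t2.5\ngeneB\t0.5\n',
    'PCaP_470.csv': 'geneA\t1.0\ngeneB\t3.0\n',
}

somedict = {}
for path, text in fake_files.items():
    # With real files this would be: with open(path) as stream: ...
    somedict[path] = load_sample(io.StringIO(text))
```

Afterwards somedict['CRPC_278.csv'].genes and somedict['CRPC_278.csv'].values give the two lists for that file.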

answered May 26, 2013 at 1:39

torek

3 Comments

user2277435 Over a year ago

Looks quite good so far. I've modified the code above. The issue now is with the dictionaries within the dictionary: there are 6 top-level entries, one for each file, which is good. But underneath each one is a strange number of subindices and, more startling, not the same number for each. Something is still incorrect, since every csv file contains the same 29,000 names, each with a value. In any case, much appreciated; I'll mark this as correct. Any further input on the remaining issue would be appreciated.

torek Over a year ago

enumerate on a dictionary iterates over the keys: for i, item in enumerate({'key1': 1.0, 'key2': 1.5}): print i, item prints 0 key2 and then 1 key1. You can access the values in the dictionary with: for key in adict: value = adict[key]; ... or one of the various dictionary accessors (.items, .keys, .values, etc).

user2277435 Over a year ago

'print full_dict' does, however, seem to give the full correct output; I just have to understand how the dict is structured to call the values from the right dataset for the next step.
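For reference, a sketch of how a dictionary shaped like full_dict can be walked, and how the row means from the question could then be computed from it. It is written for Python 3, the data is invented, and the mean_values step assumes every file has the same number of rows:

```python
# Invented stand-in with the same shape as full_dict:
# {path: (sample_genes, sample_values)}
full_dict = {
    'CRPC_278.csv': (['geneB', 'geneA'], [0.5, 2.5]),
    'PCaP_470.csv': (['geneA', 'geneB'], [1.0, 3.0]),
}

# Each value is a 2-tuple, so unpack it directly instead of enumerating it.
for path, (sample_genes, sample_values) in full_dict.items():
    print(path, len(sample_genes), len(sample_values))

# Row means across files (rank-wise, as in the question):
# zip pairs up the i-th sorted value from every file.
all_values = [values for genes, values in full_dict.values()]
mean_values = [sum(row) / float(len(row)) for row in zip(*all_values)]
```

Enumerating the tuple itself yields only the indices 0 and 1 (the gene list and the value list), which may explain the unexpected nesting seen when printing idx.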

Sanjay Manohar · Answer · 2013-05-26 01:29:31Z

0

Not sure if I see the problem. Can't you just do

sample_genes[idx]  = [i for i, j in (....
sample_values[idx] = [j for i, j in (....

or sample_genes[des] if you prefer named properties?
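Note that this only works if sample_genes and sample_values already exist as containers that accept the key. A sketch of the two obvious options, with invented data (results stands in for the output of the list comprehensions, and sample_genes_list is an illustrative name): a dict keyed by des, or a list grown with append. Assigning to a missing index of an empty list raises IndexError:

```python
des_list = ['a', 'b', 'c']
# Invented per-file results standing in for the list comprehensions above.
results = [['g1', 'g2'], ['g3'], ['g4', 'g5']]

# Option 1: a dictionary accepts any new key.
sample_genes = {}
for des, genes in zip(des_list, results):
    sample_genes[des] = genes

# Option 2: a list must be grown with append (or preallocated).
sample_genes_list = []
for genes in results:
    sample_genes_list.append(genes)

# Assigning to a missing index of a list fails.
try:
    empty = []
    empty[0] = results[0]
    raised = False
except IndexError:
    raised = True
```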

answered May 26, 2013 at 1:29

Sanjay Manohar

1 Comment

user2277435 Over a year ago

It looks like it would work, but it throws an 'IndexError: list assignment index out of range' error. I suspect that's because of the huge number of indices.
