
I have the following space-separated data (mydata.txt):

sample1 probe1 gene1 3.23
sample1 probe1 gene2 1.20
sample2 probe1 gene1 2.20
sample2 probe2 gene1 0.12

What I want to do is to create a data frame that looks like this:

probe   gene    sample1 sample2
probe1  gene1   3.23     2.20
probe1  gene2   1.20     NA
probe2  gene1   NA       0.12

However, instead of transforming the data right after reading the CSV (e.g. via pandas.DataFrame.from_csv), I'd like to construct that data frame inside a for loop. I tried this, but it failed:

#!/usr/bin/env python
import pandas as pd
import csv

infile = "mydata.txt"

alltups = []
with open(infile, 'r') as tsvfile:
    tabreader = csv.reader(tsvfile, delimiter=' ')
    for row in tabreader:
        sample, probe, gene, foldchange = row 
        tup = (sample, [probe,gene,foldchange])
        alltups.append(tup)

df = pd.DataFrame.from_items(alltups)
print df

Which produces:

  sample1 sample1 sample2 sample2
0  probe1  probe1  probe1  probe2
1   gene1   gene2   gene1   gene1
2    3.23    1.20    2.20    0.12

What's the right way to do it?

2 Answers

1

You can create temp with a for loop:

alltups = []
with open(infile, 'r') as tsvfile:
    # The sample file is space-separated, so use ' ' rather than '\t'.
    tabreader = csv.reader(tsvfile, delimiter=' ')
    for row in tabreader:
        alltups.append(row)

In [1280]: temp = pd.DataFrame(alltups).rename(columns={0:'Sample',1:'Probe',2:'Gene',3:'Value'}); temp
Out[1280]: 
    Sample   Probe   Gene Value
0  sample1  probe1  gene1  3.23
1  sample1  probe1  gene2  1.20
2  sample2  probe1  gene1  2.20
3  sample2  probe2  gene1  0.12

In [1287]: temp['Value'] = temp['Value'].astype(float)

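Alternatively, if you are OK with not using the for loop, build temp directly with temp = pd.read_csv('mydata.txt', sep=' ', names=['Sample', 'Probe', 'Gene', 'Value']); names is needed because the file has no header row. Either way, the desired layout comes from a simple pivot of temp: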

In [1239]: temp.pivot_table(index=['Probe','Gene'], columns='Sample',values='Value')
Out[1239]: 
Sample        sample1  sample2
Probe  Gene                   
probe1 gene1     3.23     2.20
       gene2     1.20      NaN
probe2 gene1      NaN     0.12
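
For reference, the loop-then-pivot approach above can be put together as one script. This is only a minimal sketch: it reads the four sample rows from an in-memory string (an assumption, so that it runs without mydata.txt on disk), and the names rows and wide are made up for illustration.

```python
import csv
import io

import pandas as pd

# Stand-in for mydata.txt (assumption: space-separated, no header row).
raw = """sample1 probe1 gene1 3.23
sample1 probe1 gene2 1.20
sample2 probe1 gene1 2.20
sample2 probe2 gene1 0.12
"""

rows = []
for sample, probe, gene, foldchange in csv.reader(io.StringIO(raw), delimiter=' '):
    # Convert the value while looping, so no astype(float) is needed afterwards.
    rows.append((sample, probe, gene, float(foldchange)))

temp = pd.DataFrame(rows, columns=['Sample', 'Probe', 'Gene', 'Value'])

# One row per (Probe, Gene) pair, one column per Sample;
# combinations that never occur in the input come out as NaN.
wide = temp.pivot_table(index=['Probe', 'Gene'], columns='Sample', values='Value')
print(wide)
```

Note that pivot_table aggregates duplicates (by mean, by default), so if the same sample/probe/gene combination could appear more than once, the result would be an average rather than an error.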

2 Comments

No, I mean temp in your code temp.pivot_table. Finally, I'd like the CSV file to be written as in the OP, so the probe column needs to have no 'holes'. How can I achieve that?
The holes in the probe column don't appear in the csv when you do a to_csv; they are fleshed out properly.
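
To illustrate the point about to_csv, here is a small sketch. It rebuilds the pivoted frame from hard-coded rows and writes it to an in-memory buffer instead of a file (both are assumptions made to keep the example self-contained); the tab separator and na_rep='NA' are choices made to resemble the layout in the question, not requirements.

```python
import io

import pandas as pd

temp = pd.DataFrame(
    [('sample1', 'probe1', 'gene1', 3.23),
     ('sample1', 'probe1', 'gene2', 1.20),
     ('sample2', 'probe1', 'gene1', 2.20),
     ('sample2', 'probe2', 'gene1', 0.12)],
    columns=['Sample', 'Probe', 'Gene', 'Value'])
wide = temp.pivot_table(index=['Probe', 'Gene'], columns='Sample', values='Value')

# The blank cells under probe1 are only how a MultiIndex is printed;
# to_csv writes every index label on every row.
buf = io.StringIO()
wide.to_csv(buf, sep='\t', na_rep='NA')
csv_text = buf.getvalue()
print(csv_text)

# If a flat frame is wanted in memory as well, move the index back into columns.
flat = wide.reset_index()
flat.columns.name = None  # drop the leftover 'Sample' label on the column axis
print(flat)
```

Passing a file name instead of buf to to_csv writes the same text to disk.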
0

I have no idea why you'd like to use a for loop. Isn't this a much simpler solution?

df = pd.read_csv('mydata.txt', 
                 sep=" ", 
                 index_col=[1, 2, 0], 
                 names=['sample', 'probe', 'gene', 'value']).unstack()

>>> df
               value        
sample       sample1 sample2
probe  gene                 
probe1 gene1    3.23    2.20
       gene2    1.20     NaN
probe2 gene1     NaN    0.12

2 Comments

Because in my actual case the contents of mydata.txt are a data structure produced by another process.
Can you first build a DataFrame and then transform/reshape it? If so, this approach will still work (you just don't need to read_csv).
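
As a rough sketch of that suggestion, assuming the other process hands over something like a list of (sample, probe, gene, value) tuples (the records list below is hypothetical), the same reshape can be done without read_csv:

```python
import pandas as pd

# Hypothetical in-memory records, standing in for the other process's output.
records = [
    ('sample1', 'probe1', 'gene1', 3.23),
    ('sample1', 'probe1', 'gene2', 1.20),
    ('sample2', 'probe1', 'gene1', 2.20),
    ('sample2', 'probe2', 'gene1', 0.12),
]

df = pd.DataFrame.from_records(records, columns=['sample', 'probe', 'gene', 'value'])

# Same idea as the read_csv version: index on (probe, gene, sample),
# then move the innermost level (sample) into the columns.
# Selecting the 'value' Series first avoids an extra 'value' level on the columns.
wide = df.set_index(['probe', 'gene', 'sample'])['value'].unstack()
print(wide)
```

Unlike pivot_table, unstack raises an error if the same (probe, gene, sample) combination appears twice, which may or may not be the behaviour you want.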
