
I have a txt file like this:

`Empty DataFrame 
 Columns: [0, 1, 2, 3, 4]
 Index: []
 Empty DataFrame
 Columns: [0, 1, 2, 3, 4]
 Index: []
                       0                         1                           2  \
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34         RNA/4v6p.csv,46WW_cis   
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30         RNA/4v6p.csv,46WW_cis   
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
50   RNA/4v6p.csv,46AA/U/1199   RNA/4v6p.csv,46AA/G/1058       RNA/4v6p.csv,46WW_cis   

     3   4  
46 NaN NaN  
47 NaN NaN  
48 NaN NaN  
49 NaN NaN  
50 NaN NaN`

I want to read it into an array with 3 columns. So far I have tried pd.read_csv(self.filename, delim_whitespace=True), but that raises a lot of errors when it reaches the Empty DataFrame parts. How can I make the program ignore those parts?

Edit: The optimal solution would be for my file to contain no empty DataFrames at all. The file is the result of searching through many files, some of which are empty. I thought I had filtered out the empty files by catching an exception, so that the results of searching empty files would not be stored. I suppose I did it the wrong way. Can somebody please correct me?

import pandas as pd
from numpy import mean as nm

def find_same_direction_chain(self, results):
    # split values such as 'AA/U/551' on '/' into separate columns
    separation = lambda x: pd.Series(x.split('/'))
    left_chain = self.data[0].apply(separation)
    right_chain = self.data[1].apply(separation)
    i = 1
    try:
        while i < len(self.data) - 5:
            if (nm(left_chain[2][i:i+3]) >= nm(left_chain[2][i+2:i+5])
                    and nm(right_chain[2][i:i+3]) >= nm(right_chain[2][i+2:i+5])
                    and len(self.data) > 0):
                if (nm(left_chain[2][i+2:i+5]) >= nm(left_chain[2][i+4:i+7])
                        and nm(right_chain[2][i+2:i+5]) >= nm(right_chain[2][i+4:i+7])):
                    results.chains.append(str(self.filename + ", " + str(i) + self.data[0:3][i:i+5]))
            i += 1
    except ValueError:
        # files that could not be processed are collected separately
        results.bin.append(self.filename)
    except TypeError:
        results.data_structure_error.append(self.filename)

1 Answer

You can use:

import pandas as pd
import io

temp=u"""Empty DataFrame 
 Columns: [0, 1, 2, 3, 4]
 Index: []
 Empty DataFrame
 Columns: [0, 1, 2, 3, 4]
 Index: []
                       0                         1                           2  \
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34         RNA/4v6p.csv,46WW_cis   
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30         RNA/4v6p.csv,46WW_cis   
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
50   RNA/4v6p.csv,46AA/U/1199   RNA/4v6p.csv,46AA/G/1058       RNA/4v6p.csv,46WW_cis   

     3   4  
46 NaN NaN  
47 NaN NaN  
48 NaN NaN  
49 NaN NaN  
50 NaN NaN"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), delim_whitespace=True, names=range(7))

#remove rows with NaN in columns 0 - 3
df = df.dropna(subset=[0,1,2,3])

#remove rows where first column contains text 'Columns'
df = df[~df.iloc[:,0].str.contains('Columns')] 

#shift first row
df.iloc[0,:] = df.iloc[0,:].shift(-3)

#set first column to index
df = df.set_index(df.iloc[:,0])
#remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print(df)
                           1                         2                      3
0                                                                            
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34  RNA/4v6p.csv,46WW_cis
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30  RNA/4v6p.csv,46WW_cis
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
50  RNA/4v6p.csv,46AA/U/1199  RNA/4v6p.csv,46AA/G/1058  RNA/4v6p.csv,46WW_cis

Or a solution using skiprows in read_csv:

#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), delim_whitespace=True, names=range(7), skiprows=6)

#remove rows with NaN
df = df.dropna(subset=[0,1,2,3])

#shift first row
df.iloc[0,:] = df.iloc[0,:].shift(-3)

#set first column to index
df = df.set_index(df.iloc[:,0])
#remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print(df)
                           1                         2                      3
0                                                                            
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34  RNA/4v6p.csv,46WW_cis
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30  RNA/4v6p.csv,46WW_cis
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
50  RNA/4v6p.csv,46AA/U/1199  RNA/4v6p.csv,46AA/G/1058  RNA/4v6p.csv,46WW_cis
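
If the Empty DataFrame blocks appear at irregular positions, a fixed skiprows count will not match them. One possible alternative, shown here only as a minimal sketch, is to drop those three-line blocks from the text before parsing. The helper name drop_empty_frame_blocks is hypothetical, and the sample is a shortened, rearranged excerpt of the file from the question, used purely for illustration:

```python
def drop_empty_frame_blocks(text):
    """Remove the three-line printed form of an empty DataFrame from `text`."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # An empty DataFrame prints as "Empty DataFrame", "Columns: [...]"
        # and "Index: []"; skip each of those lines wherever it occurs.
        if stripped == "Empty DataFrame":
            continue
        if stripped.startswith("Columns: [") or stripped == "Index: []":
            continue
        kept.append(line)
    return "\n".join(kept)


# Shortened sample (assumption: blocks interleaved with data rows).
sample = """Empty DataFrame
 Columns: [0, 1, 2, 3, 4]
 Index: []
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33
 Empty DataFrame
 Columns: [0, 1, 2, 3, 4]
 Index: []
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34"""

cleaned = drop_empty_frame_blocks(sample)
print(cleaned)
```

The cleaned text can then be wrapped in io.StringIO and passed to pd.read_csv in the same way as temp. This sketch does not remove the column-header lines of the printed frames, so those still need to be handled separately.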

EDIT:

You can try changing (I have no sample data, so this is untested):

results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))

to:

if len(self.data[0:3][i:i+5]) > 0:                      
    results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))
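
The same guard can be pulled into a small helper so that nothing empty ever reaches the results list or the output file. This is only a sketch under assumptions: append_if_nonempty is a hypothetical name, plain lists stand in for DataFrame slices (both support len(), and len(df) is the number of rows), RNA/empty.csv is a made-up filename, and the output formatting is simplified compared to the original code:

```python
import io


def append_if_nonempty(target, label, chunk):
    """Append `label` plus the text of `chunk` to `target`, skipping empty chunks.

    `chunk` only needs to support len(); a DataFrame slice does.
    Returns True if something was appended, False otherwise.
    """
    if len(chunk) > 0:
        target.append("{}, {}".format(label, chunk))
        return True
    return False


chains = []
# Plain lists stand in for DataFrame slices here (an assumption for the demo).
append_if_nonempty(chains, "RNA/4v6p.csv, 46", ["AA/U/551", "AA/A/33"])
append_if_nonempty(chains, "RNA/empty.csv, 1", [])  # skipped: nothing to store

# Mirror the "\n".join(...) write from the question, using an in-memory file.
buffer = io.StringIO()
buffer.write(u"\n".join(chains))
print(buffer.getvalue())
```

Because the check happens before the append, the text written later contains only entries that had at least one row.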

7 Comments

I suppose I can't use skiprows, because the Empty DataFrame parts are placed irregularly in my file.
OK, try the first solution, without skiprows.
But maybe it is better to filter out empty DataFrames before writing to the file, e.g. print [df for df in dfs if len(df) > 0] (dfs is a list of DataFrames).
This is probably what I need. However, when certain elements in the DataFrame meet the conditions, I save them into a list like: results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5])), and then I save this list to a file with: with open("chains.txt","a+") as f: f.write("\n".join(self.result.chains)). So I wonder why there are empty DataFrames in my file. How did they get there?
I think it is really hard to help you, because the code is incomplete (nm is not defined) and test data are missing. But if results is a list of DataFrames, try checking the code around append, where the empty DataFrames are created, and add if len(self.data[0:3][i:i+5]) > 0: results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5])). Without data it cannot be tested, though.