Ignore Empty DataFrame while reading a file with pandas in Python

Question

I have a txt file like this:

`Empty DataFrame 
 Columns: [0, 1, 2, 3, 4]
 Index: []
 Empty DataFrame
 Columns: [0, 1, 2, 3, 4]
 Index: []
                       0                         1                           2  \
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34         RNA/4v6p.csv,46WW_cis   
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30         RNA/4v6p.csv,46WW_cis   
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
50   RNA/4v6p.csv,46AA/U/1199   RNA/4v6p.csv,46AA/G/1058       RNA/4v6p.csv,46WW_cis   

     3   4  
46 NaN NaN  
47 NaN NaN  
48 NaN NaN  
49 NaN NaN  
50 NaN NaN`

I want to read it into an array with 3 columns. So far I have tried pd.read_csv(self.filename, delim_whitespace=True), but that raises many errors when it reaches the Empty DataFrame parts. How can I make the program ignore those parts?

Edit: The optimal solution would be to have no Empty DataFrames in my file at all. The file is the result of searching through many files, some of which are empty. I thought I had filtered out the empty files by catching an exception, so that the results of searching empty files would not be stored. I suppose I did it the wrong way. Can somebody please correct me?

import pandas as pd
from numpy import mean as nm
def find_same_direction_chain(self, results):
         separation= lambda x: pd.Series([i for i in x.split('/')])
         left_chain=self.data[0].apply(separation)
         right_chain=self.data[1].apply(separation)
         i=1
         try:
            while i<len(self.data[:])-5:
                if nm(left_chain[2][i:i+3])>=nm(left_chain[2][i+2:i+5])  and nm(right_chain[2][i:i+3])>=nm(right_chain[2][i+2:i+5]) and len(self.data[:])>0:   
                    if nm(left_chain[2][i+2:i+5])>=nm(left_chain[2][i+4:i+7])  and nm(right_chain[2][i+2:i+5])>=nm(right_chain[2][i+4:i+7]):   
                        results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))

                else: pass
                i+=1
         except ValueError:
                    results.bin.append(self.filename)
         except TypeError:
                    results.data_structure_error.append(self.filename)

jezrael · Accepted Answer · 2016-03-21 13:28:44Z

You can use:

import pandas as pd
import io

temp=u"""Empty DataFrame 
 Columns: [0, 1, 2, 3, 4]
 Index: []
 Empty DataFrame
 Columns: [0, 1, 2, 3, 4]
 Index: []
                       0                         1                           2  \
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34         RNA/4v6p.csv,46WW_cis   
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30         RNA/4v6p.csv,46WW_cis   
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis   
50   RNA/4v6p.csv,46AA/U/1199   RNA/4v6p.csv,46AA/G/1058       RNA/4v6p.csv,46WW_cis   

     3   4  
46 NaN NaN  
47 NaN NaN  
48 NaN NaN  
49 NaN NaN  
50 NaN NaN"""

#after testing, replace io.StringIO(temp) with the filename
#sep=r'\s+' splits on whitespace, like the deprecated delim_whitespace=True
df = pd.read_csv(io.StringIO(temp), sep=r'\s+', names=range(7))

#remove rows with NaN in columns 0 - 3
df = df.dropna(subset=[0,1,2,3])

#remove rows where first column contains text 'Columns'
df = df[~df.iloc[:,0].str.contains('Columns')] 

#shift first row
df.iloc[0,:] = df.iloc[0,:].shift(-3)

#set first column to index
df = df.set_index(df.iloc[:,0])
#remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print(df)
                           1                         2                      3
0                                                                            
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34  RNA/4v6p.csv,46WW_cis
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30  RNA/4v6p.csv,46WW_cis
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
50  RNA/4v6p.csv,46AA/U/1199  RNA/4v6p.csv,46AA/G/1058  RNA/4v6p.csv,46WW_cis

Or a solution using skiprows in read_csv:

#after testing, replace io.StringIO(temp) with the filename
#sep=r'\s+' splits on whitespace, like the deprecated delim_whitespace=True
df = pd.read_csv(io.StringIO(temp), sep=r'\s+', names=range(7), skiprows=6)

#remove rows with NaN
df = df.dropna(subset=[0,1,2,3])

#shift first row
df.iloc[0,:] = df.iloc[0,:].shift(-3)

#set first column to index
df = df.set_index(df.iloc[:,0])
#remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print(df)
                           1                         2                      3
0                                                                            
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34  RNA/4v6p.csv,46WW_cis
48   RNA/4v6p.csv,46AA/A/553    RNA/4v6p.csv,46AA/U/30  RNA/4v6p.csv,46WW_cis
49   RNA/4v6p.csv,46AA/U/552    RNA/4v6p.csv,46AA/A/33  RNA/4v6p.csv,46WW_cis
50  RNA/4v6p.csv,46AA/U/1199  RNA/4v6p.csv,46AA/G/1058  RNA/4v6p.csv,46WW_cis

EDIT:

You can try changing (untested, since I have no sample data):

results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))

to:

if len(self.data[0:3][i:i+5]) > 0:                      
    results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))

edited Mar 21, 2016 at 13:28

answered Mar 21, 2016 at 11:48

jezrael


7 Comments

Leukonoe Over a year ago

I suppose I can't use skiprows, because in my file the Empty DataFrame parts appear at irregular positions.

jezrael Over a year ago

OK, try the first solution, without skiprows.
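
When the Empty DataFrame blocks show up at irregular positions, another option is to filter the lines before building the DataFrame. The following is only a sketch under assumptions: the sample text is made up from the question's data, the helper read_data_rows and the column names row, left, right and kind are hypothetical, and a data row is assumed to be an integer label followed by exactly three whitespace-separated fields.

```python
import io
import pandas as pd

# Hypothetical sample: "Empty DataFrame" blocks at irregular positions.
SAMPLE = r"""Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
                       0                         1                           2  \
46   RNA/4v6p.csv,46AA/U/551    RNA/4v6p.csv,46AA/A/33         RNA/4v6p.csv,46WW_cis
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
47   RNA/4v6p.csv,46AA/G/550    RNA/4v6p.csv,46AA/C/34         RNA/4v6p.csv,46WW_cis

     3   4
46 NaN NaN
47 NaN NaN
"""


def read_data_rows(handle):
    """Keep only lines that look like: <integer label> <field> <field> <field>."""
    rows = []
    for line in handle:
        parts = line.split()
        is_data_row = (
            len(parts) == 4
            and parts[0].isdigit()
            and parts[-1] != "\\"  # skips the wrapped header line "0 1 2 \"
        )
        if is_data_row:
            rows.append(parts)
    # Column names are illustrative assumptions, not taken from the file.
    frame = pd.DataFrame(rows, columns=["row", "left", "right", "kind"])
    return frame.set_index("row")


df = read_data_rows(io.StringIO(SAMPLE))
print(df)
```

For a real file, something like `with open(filename) as handle: df = read_data_rows(handle)` should work the same way, since the function only iterates over lines.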

jezrael Over a year ago

But it may be better to filter out empty DataFrames before writing to the file, e.g. print([df for df in dfs if len(df) > 0]) (dfs is a list of DataFrames).
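
A minimal sketch of that filtering idea, assuming the results are held as a list of DataFrames before being written out. The list dfs and its contents are made-up sample data; DataFrame.empty is used here in place of the len(df) > 0 check.

```python
import pandas as pd

# Hypothetical list of search results; the second one is empty.
dfs = [
    pd.DataFrame({"left": ["A/U/551"], "right": ["A/A/33"]}),
    pd.DataFrame(columns=["left", "right"]),
    pd.DataFrame({"left": ["A/G/550"], "right": ["A/C/34"]}),
]

# DataFrame.empty is True when the frame has no rows (or no columns).
non_empty = [df for df in dfs if not df.empty]

# Only the non-empty frames are turned into text for the output file.
text = "\n".join(df.to_string() for df in non_empty)
print(text)
```

Filtering at this stage keeps the "Empty DataFrame" text out of the file, so nothing has to be cleaned up when reading it back.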

Leukonoe Over a year ago

This is probably what I need. However, when certain elements of the DataFrame meet my conditions, I save them into a list with results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5])), and then I write this list to a file with with open("chains.txt","a+") as f: f.write("\n".join(self.result.chains)). So I wonder why there are empty DataFrames in my file. How did they get there?

jezrael Over a year ago

I think it is really hard to help you, because this code is incomplete (nm, for example) and test data are missing. But if results is a list of DataFrames, check the code around append to see where empty DataFrames are created, and add

if len(self.data[0:3][i:i+5]) > 0:
    results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))

but without data it cannot be tested.
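
One possible explanation for how the Empty DataFrame text ends up in the file, offered as a sketch only: the frame below is made-up data, and it is an assumption that self.data is a pandas DataFrame and that self.data[0:3] was meant to select the first three columns. An integer slice inside [] selects rows by position, so slicing again with [i:i+5] leaves nothing once i is 3 or more, and str() of an empty frame is exactly the "Empty DataFrame" text.

```python
import pandas as pd

# Made-up stand-in for self.data: 10 rows, 5 columns labelled 0..4.
data = pd.DataFrame([[f"r{r}c{c}" for c in range(5)] for r in range(10)])

# An integer slice in [] selects ROWS by position, not columns.
first_rows = data[0:3]          # rows 0, 1, 2 with all five columns
i = 4
window = first_rows[i:i + 5]    # positions 4..8 of a 3-row frame: no rows left

print(window.empty)
print(str(window))              # the text that would be appended to the results

# If the intent was rows i..i+4 of the first three columns, iloc states it directly.
cols_window = data.iloc[i:i + 5, 0:3]
print(cols_window)

# A guard like the one suggested above then skips empty slices.
chains = []
if not window.empty:
    chains.append(str(window))
```

Under these assumptions the empty frames come from the double slice rather than from empty input files, which would explain why the exception handlers never filter them.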
