1

I'm new to Python, coming from the Java world.

  1. I'm trying to write a simple Python function that prints only the data rows of a CSV or ARFF file. Non-data rows begin with one of these three patterns: @ , [@ , [% ; such rows should not be printed.

  2. Example data file snippet:

    % 1. Title: Iris Plants Database
    % 
    % 2. Sources:
    
    %      (a) Creator: R.A. Fisher
    %      (b) Donor: Michael Marshall (MARSHALL%[email protected])
    %      (c) Date: July, 1988
    
    @RELATION iris
    
    @ATTRIBUTE sepallength  REAL
    @ATTRIBUTE sepalwidth   REAL
    @ATTRIBUTE petallength  REAL
    @ATTRIBUTE petalwidth   REAL
    @ATTRIBUTE class    {Iris-setosa,Iris-versicolor,Iris-virginica}
    
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    

Python script:

import csv
def loadCSVfile (path):
    csvData = open(path, 'rb') 
    spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
    for row in spamreader:
        if row.__len__ > 0:
            #search the string from index 0 to 2 and if these substrings(@ ,'[\'%' , '[\'@') are not found, than print the row
            if (str(row).find('@',0,1) & str(row).find('[\'%',0,2) & str(row).find('[\'@',0,2) != 1):
                print str(row)

loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')

actual output:

['% 1. Title: Iris Plants Database']
['% ']
['% 2. Sources:']
['%      (a) Creator: R.A. Fisher']
['%      (b) Donor: Michael Marshall (MARSHALL%[email protected])']
['%      (c) Date: July', ' 1988']
['% ']
[]
['@RELATION iris']
[]
['@ATTRIBUTE sepallength\tREAL']
['@ATTRIBUTE sepalwidth \tREAL']
['@ATTRIBUTE petallength \tREAL']
['@ATTRIBUTE petalwidth\tREAL']
['@ATTRIBUTE class \t{Iris-setosa', 'Iris-versicolor', 'Iris-virginica}']
[]
['@DATA']
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']

Desired output:

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']

2 Answers

2

To test whether a row is empty, just use it in a boolean context; empty lists are false.

To test if a string starts with some specific characters, use str.startswith(), which can take either a single string or a tuple of strings:

import csv
def loadCSVfile (path):
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and not row[0].startswith(('%', '@')):
                print row
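
As a small self-contained illustration of the two checks above, the following sketch applies them to a few hard-coded rows instead of a file. It is written for Python 3 (unlike the Python 2 code in this answer), and the helper name is_data_row and the sample rows are made up for the example.

```python
# Rows shaped like what csv.reader yields for the question's file
# (sample values chosen for illustration only).
rows = [
    ['% 1. Title: Iris Plants Database'],
    [],
    ['@RELATION iris'],
    ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
]


def is_data_row(row):
    """Return True for non-empty rows whose first field is not a header/comment."""
    # An empty list is falsy, so the check short-circuits before row[0] is touched.
    return bool(row) and not row[0].startswith(('%', '@'))


data_rows = [row for row in rows if is_data_row(row)]
for row in data_rows:
    print(row)
```

Passing a tuple to startswith() keeps the test to a single call, however many prefixes need to be excluded.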

Because you are really testing for a fixed-width prefix (a single character), you can also just slice the first column and test it with in against a collection; a set would be most efficient:

def loadCSVfile (path):
    ignore = {'@', '%'}
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and row[0][:1] not in ignore:
                print row

Here the [:1] slice notation returns the first character of the row[0] column (or an empty string if that first column is empty).
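
The difference between slicing and indexing on an empty field can be checked directly. This is a minimal sketch in Python 3 syntax; the variable names are illustrative only.

```python
empty_field = ''
data_field = '5.1'

# Slicing never raises: it returns as many characters as are available.
empty_prefix = empty_field[:1]
data_prefix = data_field[:1]

# Indexing, by contrast, raises IndexError on an empty string.
try:
    empty_field[0]
    index_failed = False
except IndexError:
    index_failed = True

print(repr(empty_prefix), repr(data_prefix), index_failed)
```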

I used the open file object as a context manager (with ... as ...) so that Python automatically closes the file for us when the code block is done (or an exception is raised).
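
To see the closing behaviour in isolation, the sketch below (Python 3) writes a throwaway file, raises an error inside the with block, and then checks the file object's closed attribute. The temporary file and the simulated ValueError are assumptions made purely for the demonstration.

```python
import os
import tempfile

# Write a tiny throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.arff')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('@DATA\n5.1,3.5,1.4,0.2,Iris-setosa\n')

try:
    with open(path) as data_file:
        data_file.readline()
        # Simulate a failure in the middle of processing.
        raise ValueError('simulated failure while reading')
except ValueError:
    pass

# The with statement closed the file even though the block raised.
closed_after_error = data_file.closed
os.remove(path)
print(closed_after_error)
```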

You should never call double-underscore methods ("dunder" methods, or special methods) directly; the proper API call would be len(row) instead.
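
In the question's code, row.__len__ > 0 also lacks the call parentheses, so it compares the method object itself with 0 rather than the row's length. Here is a short sketch of the difference, written for Python 3, where such a comparison raises TypeError instead of silently evaluating as it does in Python 2:

```python
row = []

# Without parentheses this is a bound method object, not a number.
method_is_callable = callable(row.__len__)

# len() is the supported way to ask for the length.
length = len(row)

# Comparing the method object to an int is an error in Python 3.
try:
    row.__len__ > 0
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(method_is_callable, length, comparison_failed)
```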

Demo:

>>> loadCSVfile('/tmp/iris.arff')
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']

0

I would take advantage of the in operator and of a Python list comprehension.

Here is what I mean:

import csv

def loadCSVfile (path):
    exclusions = ['@', '%', '\n', '[@' , '[%']
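    # Open the file in a with block so it is closed once the rows are read
    with open(path, 'r') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')

        # Keep non-empty rows whose first field does not start with an excluded prefix
        lines = [line for line in spamreader if ( line and line[0][0:1] not in exclusions and line[0][0:2] not in exclusions )]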

    for line in lines:
        print(line)


loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')
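
For experimenting without a file on disk, the same kind of filter can be run over an in-memory string. The sketch below assumes Python 3; the function name data_rows and the sample text are hypothetical, and it uses str.startswith() rather than the slice comparisons above.

```python
import csv
import io

# A shortened stand-in for the question's iris.arff file.
SAMPLE = """% 1. Title: Iris Plants Database

@RELATION iris
@ATTRIBUTE sepallength  REAL

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""


def data_rows(lines):
    """Return the parsed rows that are neither empty nor header/comment rows."""
    reader = csv.reader(lines, delimiter=',', quotechar='|')
    return [row for row in reader if row and not row[0].startswith(('@', '%'))]


rows = data_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

Returning the list instead of printing inside the function makes the filter easier to reuse and to check.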
