Text Parser in Python

Question

I need to write code that reads data from a text file with a specific format. It resembles a comma-separated values (CSV) file that stores tabular data, and I must be able to perform calculations on the data in that file.

Here is the format specification for the file:

A dataset has to start with a declaration of its name:

@relation name

followed by a list of all the attributes in the dataset

@attribute attribute_name specification

If an attribute is nominal, specification contains a list of the possible attribute values in curly brackets:

@attribute nominal_attribute {first_value, second_value, third_value}

If an attribute is numeric, specification is replaced by the keyword

@attribute numeric_attribute numeric

After the attribute declarations, the actual data is introduced by a

@data

tag, which is followed by a list of all the instances. The instances are listed in comma-separated format, with a question mark representing a missing value.

Comments are lines starting with % and are ignored.

I must be able to perform calculations on this comma-separated data, and I must know which value belongs to which attribute.

Example dataset files:
1: https://drive.google.com/open?id=0By6GDPYLwp2cSkd5M0J0ZjczVW8
2: https://drive.google.com/open?id=0By6GDPYLwp2cejB5SVlhTFdubnM

I have no experience with parsing and very little experience with Python, so I thought I would ask the experts for an easy way to do it.

Thanks

Have you tried anything at all? The main point of Stack Overflow is to help developers overcome specific technical problems. I appreciate you probably don't feel like you know where to start, but try opening the file and iterating over the lines. See if you can formulate a scheme for converting the file data to an in-memory representation. You will learn much more by getting your hands dirty like this.
– Paul Rooney, Commented Sep 27, 2016 at 1:48

Sreejith Menon · Accepted Answer · 2016-09-27 04:56:28Z

Here is a simple solution that I came up with:

The idea is to read the file line by line and apply rules depending on the type of line encountered.

As the sample input shows, there are broadly four types of line you may encounter:

A comment, which starts with '%' -> no action is needed here.
A blank line, i.e. '\n' -> no action is needed here.
A line that starts with '@', which is the relation name, an attribute declaration, or the @data tag.
Anything else, which is the data itself.

The code follows simple if-else logic, taking an action at each step based on the above four rules.

with open("../Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line.strip() for line in dataFl]  # drop the trailing newline and surrounding whitespace

relationName = None
attribute = []
data = []
for line in lines:
    if not line or line.startswith("%"):  # blank line or comment: nothing to do
        continue
    elif line.startswith("@"):
        keyword = line.split()[0].lower()
        if keyword == "@relation":
            relationName = line.split()[1]
        elif keyword == "@attribute":
            attribute.append(line.split()[1])
        # "@data" only marks the start of the instances, so it needs no action
    else:
        data.append(list(map(lambda x: x.strip(), line.split(","))))

print("Relation Name is : %s" % relationName)
print("Attributes are " + ','.join(attribute))
print(data)

If you want to see which value belongs to which attribute, here is essentially the same solution with a minor tweak. The only issue with the solution above is that its output is a list of lists, which makes it hard to tell which attribute is which. A better approach is to annotate each data element with the corresponding attribute name. Each row of the output then has the form: {'distance': '45', 'temperature': '75', 'BusArrival': 'on_time', 'Students': '25'}

with open("/Users/sreejithmenon/Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line.strip() for line in dataFl]  # drop the trailing newline and surrounding whitespace

relationName = None
attribute = []
data = []
for line in lines:
    if not line or line.startswith("%"):  # blank line or comment: nothing to do
        continue
    elif line.startswith("@"):
        keyword = line.split()[0].lower()
        if keyword == "@relation":
            relationName = line.split()[1]
        elif keyword == "@attribute":
            attribute.append(line.split()[1])
        # "@data" only marks the start of the instances, so it needs no action
    else:
        dataLine = list(map(lambda x: x.strip(), line.split(",")))
        dataDict = {attribute[i]: dataLine[i] for i in range(len(attribute))}  # each line of data is now a dictionary
        data.append(dataDict)

print("Relation Name is : %s" % relationName)
print("Attributes are " + ','.join(attribute))
print(data)
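
The code above keeps every value as a string and does not use the attribute specification. As a rough sketch only, assuming the format is exactly as described in the question, the parser could also record whether each attribute is numeric or nominal, convert numeric values to float, and turn the '?' marker into None. The function name parse_dataset and the sample lines are hypothetical, made up for illustration; they are not the asker's real file.

```python
def parse_dataset(lines):
    """Illustrative sketch: parse lines in the format described in the question."""
    relation = None
    attributes = []  # (name, spec) pairs; spec is "numeric" or a list of nominal values
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or comment
        if line.startswith("@"):
            parts = line.split(None, 2)
            keyword = parts[0].lower()
            if keyword == "@relation":
                relation = parts[1]
            elif keyword == "@attribute":
                name, spec = parts[1], parts[2]
                if spec.startswith("{"):
                    # nominal attribute: keep the list of allowed values
                    spec = [v.strip() for v in spec.strip("{}").split(",")]
                else:
                    spec = spec.lower()
                attributes.append((name, spec))
            continue  # "@data" itself needs no action
        values = [v.strip() for v in line.split(",")]
        row = {}
        for (name, spec), value in zip(attributes, values):
            if value == "?":
                row[name] = None          # missing value
            elif spec == "numeric":
                row[name] = float(value)  # numeric attribute
            else:
                row[name] = value         # nominal attribute stays a string
        rows.append(row)
    return relation, attributes, rows


# Hypothetical sample input, for illustration only.
sample = """% a comment line
@relation bus
@attribute distance numeric
@attribute BusArrival {on_time, before, after}
@attribute Students numeric

@data
45, on_time, 25
40, before, ?
""".splitlines()

relation, attributes, rows = parse_dataset(sample)
print(relation)
print(rows)
```

Returning None for missing values is one possible choice; it forces later calculations to decide explicitly how to treat gaps instead of silently treating '?' as text.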

You could use pandas DataFrames for further analysis, slicing, querying, etc. This link should help you get started: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Edit: Explanation in response to the comments

Meaning of the line: dataLine = list(map(lambda x : x.strip(),line.split(",")))

The split(<delimiter>) method splits a string into pieces wherever the delimiter occurs and returns the pieces as a list.

For instance, "hello, world".split(",") will return ['hello', ' world']. Notice the space in front of "world".

map is a function that applies a function (its first argument) to each element of an iterable (its second argument). It is generally used as shorthand for applying a transformation to every element of the iterable. strip() removes any leading or trailing whitespace. A lambda expression is an anonymous function, and here it simply calls strip() on its argument. map() passes each element of the iterable to the lambda function and yields the returned values, which list() then collects into the final list. Please read more about the map function online. Prerequisite: lambda expressions.
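
To make the explanation concrete, here is a small sketch that runs each step on one line of text. The sample line is hypothetical, shaped like one instance of the dataset, and the two alternative spellings are shown only to illustrate that they give the same result.

```python
# Hypothetical sample line, shaped like one instance of the dataset.
line = "45, 75 ,on_time, 25\n"

# Step 1: split on commas; whitespace and the newline are still attached.
pieces = line.split(",")

# Step 2: strip every piece, as in the answer's code.
data_line = list(map(lambda x: x.strip(), pieces))

# Equivalent spellings of step 2.
with_comprehension = [piece.strip() for piece in pieces]
with_method_reference = list(map(str.strip, pieces))

print(pieces)
print(data_line)
```

The list comprehension is often considered the more readable of the three, but all of them produce the same list.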

Part II in the comment: And when I am typing 'print(data[0])' all the data along with their attributes is printed. What if I want to print only the no. of students of the 5th row? What if I want to multiply every no. of students by the corresponding temperature and store it in a new column with the corresponding index?

print(data[0]) should give you the first row as is, together with its attribute names, and should look something like this.

data[0]
Out[63]: 
{'BusArrival': 'on_time',
 'Students': '25',
 'distance': '45',
 'temperature': '75'}
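
Both questions can also be answered directly on the list of dictionaries, without any extra library. The sketch below assumes the rows have the shape shown above and reuses the first five rows that appear in this answer. The key name studentTempProd is just a label chosen for the new value.

```python
# Rows in the shape produced by the parser; the values are still strings.
data = [
    {"BusArrival": "on_time", "Students": "25", "distance": "45", "temperature": "75"},
    {"BusArrival": "before", "Students": "12", "distance": "40", "temperature": "70"},
    {"BusArrival": "after", "Students": "49", "distance": "50", "temperature": "80"},
    {"BusArrival": "on_time", "Students": "24", "distance": "44", "temperature": "74"},
    {"BusArrival": "before", "Students": "15", "distance": "38", "temperature": "75"},
]

# Lists are zero-based, so the 5th row is data[4].
fifth_row_students = int(data[4]["Students"])
print(fifth_row_students)

# Add a new "column" to every row: students multiplied by temperature.
for row in data:
    row["studentTempProd"] = int(row["Students"]) * int(row["temperature"])

print([row["studentTempProd"] for row in data])
```

The int() conversions are needed because the parser stores every value as a string.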

I suggest using a pandas DataFrame for quick manipulation of the data.

import pandas as pd
df = pd.DataFrame(data)
df
Out[69]: 
  BusArrival Students distance temperature
0     on_time       25       45          75
1      before       12       40          70
2       after       49       50          80
3     on_time       24       44          74
4      before       15       38          75
    # and so on

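To extract a single row, use iloc. Note that it is zero-based: df.iloc[5] below returns the row labelled 5, which is the sixth row, so the fifth row would be df.iloc[4].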

df.iloc[5]
Out[73]: 
BusArrival     after
Students          45
distance          49
temperature       85
Name: 5, dtype: object

The product of students and temperature is now simply:

df['Students'] = df['Students'].astype('int') # making sure they are not strings
df['temperature'] = df['temperature'].astype('int') 
df['studentTempProd'] = df['Students'] * df['temperature']

df
Out[82]: 
   BusArrival  Students distance  temperature  studentTempProd
0     on_time        25       45           75             1875
1      before        12       40           70              840
2       after        49       50           80             3920
3     on_time        24       44           74             1776
4      before        15       38           75             1125

There is a lot more you can do with pandas, such as extracting only the 'on_time' bus arrivals.
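
As a minimal sketch of that kind of filtering, assuming pandas is installed, a boolean mask selects the matching rows. The DataFrame here is rebuilt by hand from the five rows shown above so that the example is self-contained; in practice you would reuse the df built from the parsed data.

```python
import pandas as pd

# Rows copied from the DataFrame output shown above.
df = pd.DataFrame({
    "BusArrival": ["on_time", "before", "after", "on_time", "before"],
    "Students": [25, 12, 49, 24, 15],
    "temperature": [75, 70, 80, 74, 75],
})

# Boolean mask: True for the rows where the bus arrived on time.
on_time = df[df["BusArrival"] == "on_time"]
print(on_time)

# Any column of the filtered frame can then be summarised.
print(on_time["Students"].mean())
```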

edited Sep 27, 2016 at 4:56

answered Sep 27, 2016 at 2:02

Sreejith Menon


8 Comments

Stewart Smith Over a year ago

This doesn't make any attempt to explain the solution to the problem. The user will be no better off for the next issue they come across.

Sreejith Menon Over a year ago

Please be a little patient sir. I am still editing the solution. Had I not posted a partial answer someone might have answered it anyway. I understand stack overflow is a strict community but give beginners a chance to come up the reputation ladder.

Stewart Smith Over a year ago

I understand your concern. In the future please post full solutions to problems, or they will be judged in parts. I will change my voting based on the completed solution.

user6867490 Over a year ago

Hi @SreejithMenon, it works perfectly. You are a genius! Could you please explain the following lines of the code a little: 'dataLine = list(map(lambda x : x.strip(),line.split(","))) dataDict = {attribute[i] : dataLine[i] for i in range(len(attribute))}' And when I am typing 'print(data[0])' all the data along with their attributes is printed. What if I want to print only the no. of students of the 5th row? What if I want to multiply every no. of students by the corresponding temperature and store it in a new column with the corresponding index? Thanks

Sreejith Menon Over a year ago

@GT96: I have addressed all queries in your comment in the answer itself. Hope it helps. Thanks!
