Here is a simple solution that I came up with:
The idea is to read the file line by line and apply rules depending on the type of line encountered.
As you can see in the sample input, there are broadly 4 types of lines you may encounter:
1. A comment, which starts with '%' -> no action is needed here.
2. A blank line, i.e. '\n' -> no action is needed here.
3. A line that starts with '@', which is either an attribute, the name of the relation, or the marker for the start of the data.
4. If it is none of these, then it is the data itself.
The code follows simple if-else logic, taking an action at every step based on the above 4 rules.
with open("../Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line for line in dataFl]

attribute = []
data = []
for line in lines:
    # Skip comments, blank lines and the "@data" marker line.
    if line.startswith("%") or line.startswith("@data") or not line.strip():
        pass
    elif line.startswith("@"):
        if line.startswith("@relation"):
            relationName = line.split()[1]  # split() also drops the trailing newline
        elif line.startswith("@attribute"):
            attribute.append(line.split()[1])
    else:
        # Anything else is a data row: split on commas and strip whitespace.
        data.append(list(map(lambda x: x.strip(), line.split(","))))

print("Relation Name is : %s" % relationName)
print("Attributes are " + ','.join(attribute))
print(data)
If you want to see which attribute is which, here is essentially the same solution with a minor tweak. The only issue with the solution above is that the output is a list of lists, so it is hard to tell which value belongs to which attribute. A better approach is to annotate each data element with the corresponding attribute name. Each row will then be of the form:
{'distance': '45', 'temperature': '75', 'BusArrival': 'on_time', 'Students': '25'}
with open("/Users/sreejithmenon/Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line for line in dataFl]

attribute = []
data = []
for line in lines:
    # Skip comments, blank lines and the "@data" marker line.
    if line.startswith("%") or line.startswith("@data") or not line.strip():
        pass
    elif line.startswith("@"):
        if line.startswith("@relation"):
            relationName = line.split()[1]  # split() also drops the trailing newline
        elif line.startswith("@attribute"):
            attribute.append(line.split()[1])
    else:
        dataLine = list(map(lambda x : x.strip(),line.split(",")))
        dataDict = {attribute[i] : dataLine[i] for i in range(len(attribute))}  # each line of data is now a dictionary.
        data.append(dataDict)

print("Relation Name is : %s" % relationName)
print("Attributes are " + ','.join(attribute))
print(data)
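For a version you can run without the file, here is a minimal sketch of the same approach. The sample text is made up for illustration (only its data values are taken from the outputs shown in this answer; the attribute type declarations are assumptions), `parse_lines` is a hypothetical helper name, and `dict(zip(...))` is swapped in for the index-based dictionary comprehension.

```python
import io

# Hypothetical sample in the same format; this is NOT the original file.
SAMPLE = """% bus arrival data
@relation bus

@attribute distance numeric
@attribute temperature numeric
@attribute BusArrival {on_time,before,after}
@attribute Students numeric

@data
45, 75, on_time, 25
40, 70, before, 12
"""


def parse_lines(lines):
    """Return (relation name, attribute names, list of row dictionaries)."""
    relation_name = None
    attributes = []
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("%") or stripped.startswith("@data"):
            continue  # blank line, comment, or the @data marker
        if stripped.startswith("@relation"):
            relation_name = stripped.split()[1]
        elif stripped.startswith("@attribute"):
            attributes.append(stripped.split()[1])
        elif not stripped.startswith("@"):
            values = [v.strip() for v in stripped.split(",")]
            rows.append(dict(zip(attributes, values)))  # pair each name with its value
    return relation_name, attributes, rows


relation_name, attributes, rows = parse_lines(io.StringIO(SAMPLE))
print(relation_name)
print(attributes)
print(rows[0])
```

Unlike indexing with `range(len(attribute))`, `zip` stops at the shorter of the two sequences, so a row with a missing value yields a shorter dictionary instead of raising an IndexError.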
You could use pandas DataFrames for further analysis, slicing, querying, etc. This link should help you get started: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
Edit: Answers to the questions in the comments
Meaning of the line: dataLine = list(map(lambda x : x.strip(),line.split(",")))
split(<delimiter>) splits a string wherever the delimiter occurs and returns a list of the pieces.
For instance,
"hello, world".split(",") returns ['hello', ' world']. Notice the space in front of "world".
map applies a function (its first argument) to each element of an iterable (its second argument). It is commonly used as shorthand for transforming every element. strip() removes any leading or trailing whitespace. The lambda expression is an anonymous function that here simply calls strip() on its argument. In Python 3, map() returns a lazy iterator, which is why the result is wrapped in list() to collect the stripped values into a list. It is worth reading more about the map function online; lambda expressions are a prerequisite.
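To make this concrete, here is a small sketch that applies split, strip and map to one data line. The sample line is an assumption modelled on the first row of the output above; the list comprehension and `str.strip` variants are equivalent alternatives, not part of the original code.

```python
line = "45, 75, on_time, 25\n"

# A plain split keeps the surrounding whitespace and the trailing newline.
pieces = line.split(",")
print(pieces)  # ['45', ' 75', ' on_time', ' 25\n']

# Three equivalent ways to strip every piece.
with_map = list(map(lambda x: x.strip(), line.split(",")))
with_comprehension = [x.strip() for x in line.split(",")]
with_method = list(map(str.strip, line.split(",")))

print(with_map)  # ['45', '75', 'on_time', '25']
```

All three produce the same list; the comprehension is often considered the more readable form.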
Part II of the comment: "And when I am typing 'print(data[0])' all the data along with their attribute is printed. What if I want to print only the no. of students of the 5th row? What if I want to multiply all no. of students with the corresponding temperature and store it in a new column with the corresponding index?"
When you print(data[0]) it should give you the first row as is, with the related attributes and should look something like this.
data[0]
Out[63]:
{'BusArrival': 'on_time',
'Students': '25',
'distance': '45',
'temperature': '75'}
I suggest you use pandas dataframe for quick manipulations of the data.
import pandas as pd
df = pd.DataFrame(data)
df
Out[69]:
  BusArrival Students distance temperature
0    on_time       25       45          75
1     before       12       40          70
2      after       49       50          80
3    on_time       24       44          74
4     before       15       38          75
# and so on
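Now, to extract a single row, use iloc. Note that iloc is zero-based: df.iloc[5] below returns the row at index 5, which is the sixth row. The fifth row is df.iloc[4], and df.iloc[4]['Students'] gives its number of students.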
df.iloc[5]
Out[73]:
BusArrival     after
Students          45
distance          49
temperature       85
Name: 5, dtype: object
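To answer the "students of the 5th row" question directly, here is a minimal sketch. The rows are copied from the outputs shown in this answer and kept as strings, as the parser produces them. Because `iloc` is position-based and starts at 0, the fifth row sits at position 4.

```python
import pandas as pd

# Rows copied from the outputs shown in this answer (values are still strings).
data = [
    {"distance": "45", "temperature": "75", "BusArrival": "on_time", "Students": "25"},
    {"distance": "40", "temperature": "70", "BusArrival": "before", "Students": "12"},
    {"distance": "50", "temperature": "80", "BusArrival": "after", "Students": "49"},
    {"distance": "44", "temperature": "74", "BusArrival": "on_time", "Students": "24"},
    {"distance": "38", "temperature": "75", "BusArrival": "before", "Students": "15"},
    {"distance": "49", "temperature": "85", "BusArrival": "after", "Students": "45"},
]
df = pd.DataFrame(data)

fifth_row = df.iloc[4]                    # position 4 is the fifth row
students_fifth = fifth_row["Students"]    # still a string at this point
print(students_fifth)

# Label-based access gives the same value here because the index is 0, 1, 2, ...
students_by_label = df.loc[4, "Students"]
print(students_by_label)
```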
The product of students and temperature is then simply:
df['Students'] = df['Students'].astype('int') # making sure they are not strings
df['temperature'] = df['temperature'].astype('int')
df['studentTempProd'] = df['Students'] * df['temperature']
df
Out[82]:
  BusArrival Students distance temperature studentTempProd
0    on_time       25       45          75            1875
1     before       12       40          70             840
2      after       49       50          80            3920
3    on_time       24       44          74            1776
4     before       15       38          75            1125
There is a lot more you can do with pandas, such as extracting only the 'on_time' bus arrivals.
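As a sketch of that last point, rows can be filtered with a boolean mask. The DataFrame below is rebuilt from the five rows shown above so the snippet runs on its own; the variable name `on_time` is just illustrative.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand so the snippet is self-contained.
df = pd.DataFrame({
    "BusArrival": ["on_time", "before", "after", "on_time", "before"],
    "Students": ["25", "12", "49", "24", "15"],
    "distance": ["45", "40", "50", "44", "38"],
    "temperature": ["75", "70", "80", "74", "75"],
})
df["Students"] = df["Students"].astype(int)  # strings -> integers

# The comparison yields a True/False Series; indexing with it keeps the True rows.
on_time = df[df["BusArrival"] == "on_time"]
print(on_time)
print(on_time["Students"].sum())
```

The filtered frame keeps the original index labels (0 and 3 here); call `reset_index(drop=True)` on it if you want them renumbered.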