I am a beginner in the programming world and would like some tips on how to solve a challenge. Right now I have ~10,000 .dat files, each with a single line following this structure:

Attribute1=Value&Attribute2=Value&Attribute3=Value...AttributeN=Value

I have been trying to use Python and the csv library to convert these .dat files into a single .csv file.

So far I have been able to write something that reads all the files, stores the contents of each file on a new line, and replaces each "&" with ",". But since Attribute1, Attribute2...AttributeN are exactly the same for every file, I would like to turn them into column headers and remove them from every other line.

Any tips on how to go about that?

Thank you!

3 Answers

Since you are a beginner, I prepared some code that works, and is at the same time very easy to understand.

I assume that you have all the files in a folder called 'input'. The code below should be saved in a script file next to that folder.

Keep in mind that this code should be used to understand how a problem like this can be solved. Optimisations and sanity checks have been left out intentionally.

You might also want to check what happens when a value or an attribute is missing from a line, when the input is corrupted, and so on. :)

Good luck!

import os

# this function splits the attribute=value pairs into two lists
# the first list holds all the attributes
# the second list holds all the values
def getAttributesAndValues(line):
    attributes = []
    values = []

    # first we drop the trailing newline and split the input over the &
    attributeValues = line.strip().split('&')
    for attrVal in attributeValues:
        # we split the attribute=value over the first '=' sign only,
        # so a value that itself contains '=' stays in one piece
        # the left part goes to split[0], the value goes to split[1]
        split = attrVal.split('=', 1)
        attributes.append(split[0])
        values.append(split[1])

    # return the attributes list and values list
    return attributes, values

# test the function using the lines below so you understand how it works
# line = "Attribute1=Value&Attribute2=Value&Attribute3=Value&AttributeN=Value"
# print(getAttributesAndValues(line))

# this function appends the contents of a single file to an output file
# (delete outFile.csv before re-running the script, or rows will be duplicated)
def writeToCsv(inFile, wfile="outFile.csv", delim=","):
    # the header is needed only if the output file is missing or still empty
    needHeader = not os.path.exists(wfile) or os.path.getsize(wfile) == 0

    # 'with' closes both files, even if an error occurs
    with open(inFile, 'r') as f_in, open(wfile, 'a') as f_out:
        # loop through every line in the file and write its values
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines

            header, values = getAttributesAndValues(line)

            # we write the header only once, at the top of the output file
            if needHeader:
                f_out.write(delim.join(header) + "\n")
                needHeader = False

            # we write the values (join adds no trailing delimiter)
            f_out.write(delim.join(values) + "\n")

# Read all the .dat files in the 'input' folder
# (os.listdir never returns '.' or '..', so no entry has to be skipped)
allInputFiles = sorted(f for f in os.listdir('input') if f.endswith('.dat'))

# loop through all the files and write values to the csv file
for singleFile in allInputFiles:
    writeToCsv(os.path.join('input', singleFile))
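
The sanity checks mentioned above (missing values, missing attributes, corrupted input) could be sketched roughly as follows. This is only an illustration, not part of the original script: the names parse_line and write_rows are hypothetical, and it assumes you know the expected attribute names in advance. It uses csv.DictWriter so that a missing attribute becomes an empty cell instead of shifting the columns.

```python
import csv
import io

def parse_line(line):
    """Turn 'A=1&B=2' into {'A': '1', 'B': '2'}; raise ValueError on a malformed pair."""
    row = {}
    for pair in line.strip().split('&'):
        if '=' not in pair:
            raise ValueError('no "=" in %r' % pair)
        key, value = pair.split('=', 1)  # split once, so values may contain '='
        row[key.strip()] = value.strip()
    return row

def write_rows(lines, out, fieldnames):
    """Write one CSV row per input line; return the lines that could not be parsed."""
    # restval fills in attributes a line lacks; extrasaction='ignore' drops unknown ones
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='', extrasaction='ignore')
    writer.writeheader()
    rejected = []
    for line in lines:
        try:
            writer.writerow(parse_line(line))
        except ValueError:
            rejected.append(line)
    return rejected

# in-memory demo: the second line lacks Attribute2, the third one is corrupted
sample = ['Attribute1=a&Attribute2=b', 'Attribute1=c', 'garbage']
out = io.StringIO()
rejected = write_rows(sample, out, ['Attribute1', 'Attribute2'])
print(out.getvalue())
print('rejected:', rejected)
```

To use this on real files, you would pass an open output file (opened with newline='') instead of the io.StringIO object, and feed it the first line of each .dat file.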

1 Comment

Thank you very much! As you have intended, this code helped me solve my problem and gave me a little something to study.

but since the Attribute1,Attribute2...AttributeN are exactly the same for every file, I would like to make them into column headers and remove them from every other line.

input = 'Attribute1=Value1&Attribute2=Value2&Attribute3=Value3'

once, for the first file:

','.join(k for (k,v) in map(lambda s: s.split('='), input.split('&')))

for each file's content:

','.join(v for (k,v) in map(lambda s: s.split('='), input.split('&')))

You may also need to strip whitespace from the strings, depending on how clean your input is.
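
Put together, the two expressions plus the trimming could look something like this. It is a rough sketch: pairs and to_csv_lines are names made up for the example, contents stands in for the lines read from your files, and it assumes every item really contains an '=' and no value contains a comma (a plain join does no quoting).

```python
def pairs(line):
    """Split 'A=1&B=2' into [('A', '1'), ('B', '2')], stripping stray whitespace."""
    return [tuple(part.strip() for part in item.split('=', 1))
            for item in line.strip().split('&')]

def to_csv_lines(lines):
    """Yield the header once (taken from the first line), then one row of values per line."""
    for i, line in enumerate(lines):
        kv = pairs(line)
        if i == 0:
            yield ','.join(k for k, v in kv)
        yield ','.join(v for k, v in kv)

# stand-in for the single line read from each .dat file
contents = [' Attribute1=Value1&Attribute2=Value2\n', 'Attribute1=x & Attribute2=y\n']
result = list(to_csv_lines(contents))
print('\n'.join(result))
```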

1 Comment

Ok this is an interesting method! I'll try it out and let you know what happens. thank you!

Put the dat files in a folder called myDats. Put this script next to the myDats folder; it creates output.csv in the same place. [That is, you will have output.csv, myDats, and mergeDats.py in the same folder]

mergeDats.py

import csv
import os

rows = []
for name in sorted(os.listdir('myDats')):
    # each .dat file holds a single line of Attribute=Value pairs
    with open(os.path.join('myDats', name), 'r') as f:
        line = f.readline().strip()
    if not line:
        continue  # skip empty files
    # build one dict per file: {attribute: value}
    row = {}
    for pair in line.split('&'):
        key, _, value = pair.partition('=')
        row[key] = value
    rows.append(row)

# newline='' is what the csv module expects in Python 3 (use mode 'wb' in Python 2.x)
with open('output.csv', 'w', newline='') as output:
    if rows:
        # the attributes are the same in every file, so the first row supplies the header
        w = csv.DictWriter(output, fieldnames=list(rows[0].keys()))
        w.writeheader()
        w.writerows(rows)
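
As a side note, each line has the same shape as a URL query string, so the standard library's urllib.parse.parse_qsl can do the splitting for you. The sketch below is only a suggestion under an assumption: parse_qsl also decodes percent-escapes and turns '+' into a space, so it fits only if your values are URL-encoded or contain neither character. The helper name line_to_row and the sample lines are invented for the example.

```python
import csv
import io
from urllib.parse import parse_qsl

def line_to_row(line):
    # keep_blank_values=True keeps attributes whose value is empty
    return dict(parse_qsl(line.strip(), keep_blank_values=True))

# stand-in for the single line read from each .dat file
lines = ['Attribute1=Value1&Attribute2=', 'Attribute1=a%20b&Attribute2=c']
rows = [line_to_row(l) for l in lines]

out = io.StringIO()  # replace with open('output.csv', 'w', newline='') for a real file
w = csv.DictWriter(out, fieldnames=list(rows[0]))
w.writeheader()
w.writerows(rows)
print(out.getvalue())
```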

2 Comments

Thanks! Running this, I get: 'IOError: [Errno 2] No such file or directory: '1.dat''
that should fix it, try it again
