Using Python to Merge Single Line .dat Files into one .csv file

Question

I am a beginner in the programming world and would like some tips on how to solve a challenge. Right now I have ~10,000 .dat files, each with a single line following this structure:

Attribute1=Value&Attribute2=Value&Attribute3=Value...AttributeN=Value

I have been trying to use python and the CSV library to convert these .dat files into a single .csv file.

So far I have been able to write something that reads all the files, stores the contents of each file on a new line, and replaces each "&" with ",". But since Attribute1, Attribute2...AttributeN are exactly the same for every file, I would like to make them into column headers and remove them from every other line.

Any tips on how to go about that?

Thank you!

afabijan · Accepted Answer · 2015-10-31 17:30:03Z

Since you are a beginner, I prepared some code that works, and is at the same time very easy to understand.

I assume that you have all the files in the folder called 'input'. The code beneath should be in a script file next to the folder.

Keep in mind that this code should be used to understand how a problem like this can be solved. Optimisations and sanity checks have been left out intentionally.

You might also want to check what happens when a value is missing in some line, when an attribute is missing, when the input is corrupted, and so on. :)

Good luck!

import os

# this function splits the attribute=value into two lists
# the first list are all the attributes
# the second list are all the values
def getAttributesAndValues(line):
    attributes = []
    values = []

    # first we remove the trailing newline and split the input over the &
    attributeValues = line.strip().split('&')
    for attrVal in attributeValues:
        # we split the attribute=value over the first '=' sign only,
        # so a value that itself contains '=' stays in one piece
        # the left part goes to split[0], the value goes to split[1]
        split = attrVal.split('=', 1)
        attributes.append(split[0])
        values.append(split[1])

    # return the attributes list and values list
    return attributes, values

# test the function using the lines beneath so you understand how it works
# line = "Attribute1=Value&Attribute2=Value&Attribute3=Value&AttributeN=Value"
# print(getAttributesAndValues(line))

# this function appends the contents of a single input file to the output file
def writeToCsv(inFile, wfile="outFile.csv", delim=","):
    # the header is needed only if the output file does not exist yet or is empty
    needHeader = not os.path.exists(wfile) or os.path.getsize(wfile) == 0

    # open the input for reading and the output for appending;
    # the with statement closes both files when the block ends
    with open(inFile, 'r') as f_in, open(wfile, 'a') as f_out:
        # loop through every line in the file and write its values
        for line in f_in:
            line = line.strip()
            if not line:
                continue    # skip blank lines
            header, values = getAttributesAndValues(line)

            # we write the header only once, at the top of the output file
            if needHeader:
                f_out.write(delim.join(header) + "\n")
                needHeader = False

            # we write the values
            f_out.write(delim.join(values) + "\n")

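# Read the names of all the .dat files in the input folder
# (sorted, so the rows are written in a predictable order)
allInputFiles = sorted(f for f in os.listdir('input/') if f.endswith('.dat'))

# loop through all the files and write values to the csv file
# note: rows are appended, so delete outFile.csv before running the script again
for singleFile in allInputFiles:
    writeToCsv(os.path.join('input', singleFile))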
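
The sanity checks mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the name parseLine is hypothetical, and treating a pair without '=' as an attribute with an empty value is just one possible policy. It returns a dictionary rather than two lists, so a missing attribute can be detected by looking up its name.

```python
# hypothetical helper: a more forgiving variant of getAttributesAndValues
def parseLine(line):
    result = {}
    # strip the trailing newline and surrounding whitespace first
    for pair in line.strip().split('&'):
        if not pair:
            continue    # ignore empty pieces such as the gap in "a=1&&b=2"
        # partition splits at the first '=' only, so values may contain '='
        attribute, sep, value = pair.partition('=')
        # assumption: a pair without '=' is kept with an empty value
        result[attribute.strip()] = value.strip()
    return result


print(parseLine("Attribute1=Value1&Attribute2=&Attribute3\n"))
```

With a dictionary per file, a check such as "Attribute2" in parseLine(line) tells you whether an attribute is present before you write the row.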

Thank you very much! As you have intended, this code helped me solve my problem and gave me a little something to study.

decltype_auto · Accepted Answer · 2015-10-31 17:48:05Z

but since the Attribute1,Attribute2...AttributeN are exactly the same for every file, I would like to make them into column headers and remove them from every other line.

line = 'Attribute1=Value1&Attribute2=Value2&Attribute3=Value3'

Once, for the first file, build the header row:

','.join(k for (k, v) in map(lambda s: s.split('=', 1), line.split('&')))

Then, for each file's content, build the value row:

','.join(v for (k, v) in map(lambda s: s.split('=', 1), line.split('&')))

You may also need to strip whitespace (such as a trailing newline) from the strings; I don't know how clean your input is.
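
Put together, the two expressions might be used like this. It is a minimal sketch, not a definitive implementation: the function names split_pairs and merge_dat_files and the in_dir and out_path parameters are placeholders introduced here, and it assumes every pair contains an '=' and that no value contains a comma.

```python
import glob
import os


def split_pairs(line):
    # strip the newline, then split into [attribute, value] pairs
    return [s.split('=', 1) for s in line.strip().split('&')]


def merge_dat_files(in_dir, out_path):
    # in_dir and out_path are placeholders for your own paths
    paths = sorted(glob.glob(os.path.join(in_dir, '*.dat')))
    with open(out_path, 'w') as out:
        for i, path in enumerate(paths):
            with open(path) as f:
                pairs = split_pairs(f.readline())
            if i == 0:
                # header row, written once from the first file
                out.write(','.join(k.strip() for (k, v) in pairs) + '\n')
            # value row, written for every file
            out.write(','.join(v.strip() for (k, v) in pairs) + '\n')
    return len(paths)
```

Calling merge_dat_files('input', 'outFile.csv') would then write one header row followed by one row per .dat file, and return the number of files it read.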

edited Oct 31, 2015 at 17:48

answered Oct 31, 2015 at 16:40

1 Comment

brenogil Over a year ago

Ok this is an interesting method! I'll try it out and let you know what happens. thank you!

AMACB · Accepted Answer · 2015-11-01 20:51:43Z

Put the .dat files in a folder called myDats, and put this script next to the myDats folder. The script writes output.csv to the same location. [That is, you will end up with output.csv, myDats, and mergeDats.py in the same folder]

mergeDats.py

import csv
import os

# collect one dictionary (attribute -> value) per .dat file
rows = []
for name in sorted(os.listdir('myDats')):
    # each file holds a single line: Attribute1=Value&Attribute2=Value...
    with open(os.path.join('myDats', name), 'r') as f:
        line = f.readline().strip()
    if not line:
        continue  # skip empty files
    # split into attribute=value pairs, then split each pair at the first '='
    rows.append(dict(pair.split('=', 1) for pair in line.split('&')))

# this is the Python 3 form; in Python 2.x open the file with 'wb' and no newline argument
with open('output.csv', 'w', newline='') as output:
    # the attributes of the first file become the column headers
    w = csv.DictWriter(output, fieldnames=list(rows[0].keys()))
    w.writeheader()
    w.writerows(rows)
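
If some files turn out to be missing an attribute, csv.DictWriter can still keep the columns aligned through its restval parameter. The following is a rough sketch with made-up sample rows; io.StringIO stands in for the real output file only so that the example is self-contained.

```python
import csv
import io

# hypothetical sample rows; the second one lacks Attribute2
rows = [
    {'Attribute1': 'a', 'Attribute2': 'b'},
    {'Attribute1': 'c'},
]

# collect every attribute seen, in first-seen order
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# io.StringIO stands in for the output file in this sketch
buffer = io.StringIO()
# restval is the value written for any column a row does not have
w = csv.DictWriter(buffer, fieldnames=fieldnames, restval='', lineterminator='\n')
w.writeheader()
w.writerows(rows)
print(buffer.getvalue())
```

Building the header from all rows, rather than from the first file only, means an attribute that first appears in a later file still gets its own column.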

edited Nov 1, 2015 at 20:51

answered Oct 31, 2015 at 16:33

2 Comments

brenogil Over a year ago

Thanks! Running this, I get: 'IOError: [Errno 2] No such file or directory: '1.dat''

AMACB Over a year ago

that should fix it, try it again
