
I have information distributed over multiple large CSV files. I want to combine all the files into one new file such that the first row of the first file is joined to the first row of every other file, and so on.

file1.csv

A,B
A,C
A,D

file2.csv

F,G
H,I
J,K

expected result:

output.csv

A,B,F,G
A,C,H,I
A,D,J,K

So suppose I have a list ['file1.csv', 'file2.csv', ...]. How do I go from here?

I tried loading each file into memory and combining them with np.column_stack, but my files are too large to fit in memory.

  • I'm not going to write your code for you, but I would suggest iterating through the files line by line and using str.join(',', (file1line, file2line)) to build each output line. You might also have to strip newlines from the input lines. Commented Dec 14, 2015 at 12:44
  • @SiHa Thank you for your comment. However, my problem is that I have about 50 files. How can I iterate through all the files in parallel? Commented Dec 14, 2015 at 12:50
  • 50 files is a bit more tricky :) See answer below. Commented Dec 14, 2015 at 13:22

2 Answers


Not pretty code, but this should work.

I'm not using with open('filename', 'r') as myfile for the inputs. It could get a bit messy with 50 files, so these are opened and closed explicitly.

It opens each file and places the handle in a list. The first handle is taken as the master file; we then iterate through it line by line, each time reading one line from every other open file, joining the lines with ',', and writing the result to the output file.

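Note that if the other files have more lines than the master, the extra lines won't be included. If any have fewer lines, readline() returns an empty string once that file is exhausted, so the missing fields are silently written as empty rather than raising an exception. I'll leave it to you to deal with these situations gracefully.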

Note also that you can use glob to create filelist if the names follow a logical pattern (thanks to N. Wouda, below)

filelist = ['book1.csv', 'book2.csv', 'book3.csv', 'book4.csv']

# Open every input in text mode so that lines are str on both Python 2 and 3
openfiles = [open(filename, 'r') for filename in filelist]

# Use the first file in the list as the master.
# Every other file must have at least as many lines as the master.
masterfile = openfiles.pop(0)

with open('output.csv', 'w') as outputfile:
    for line in masterfile:
        outputlist = [line.strip()]
        for openfile in openfiles:
            # readline() returns '' once a file is exhausted
            outputlist.append(openfile.readline().strip())
        outputfile.write(','.join(outputlist) + '\n')

masterfile.close()
for openfile in openfiles:
    openfile.close()

Input Files

a   b   c   d   e   f
1   2   3   4   5   6
7   8   9   10  11  12
13  14  15  16  17  18

Output

a   b   c   d   e   f   a   b   c   d   e   f   a   b   c   d   e   f   a   b   c   d   e   f
1   2   3   4   5   6   1   2   3   4   5   6   1   2   3   4   5   6   1   2   3   4   5   6
7   8   9   10  11  12  7   8   9   10  11  12  7   8   9   10  11  12  7   8   9   10  11  12
13  14  15  16  17  18  13  14  15  16  17  18  13  14  15  16  17  18  13  14  15  16  17  18
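
One way to deal with both caveats from this answer (many files to open, and inputs of unequal length) is sketched below. It assumes Python 3.3 or later; merge_columns and _MISSING are names made up for this sketch and are not part of the answer's code. contextlib.ExitStack closes every file registered with it even if an error occurs, and itertools.zip_longest with a sentinel lets the loop notice when one file ends before the others. Raising ValueError on a mismatch is only one possible policy; padding with empty fields would be another.

```python
import contextlib
import itertools

_MISSING = object()  # hypothetical sentinel: marks a file that ran out of lines


def merge_columns(input_paths, output_path):
    """Join line i of every input file into line i of the output file."""
    with contextlib.ExitStack() as stack:
        # ExitStack closes every registered file on exit, even on error.
        inputs = [stack.enter_context(open(path)) for path in input_paths]
        out = stack.enter_context(open(output_path, 'w'))

        # zip_longest reads one line per file per step, so memory use stays
        # proportional to the number of files, not to their size.
        rows = itertools.zip_longest(*inputs, fillvalue=_MISSING)
        for line_number, lines in enumerate(rows, start=1):
            if any(line is _MISSING for line in lines):
                raise ValueError(
                    'input files differ in length at line %d' % line_number)
            out.write(','.join(line.rstrip('\r\n') for line in lines) + '\n')
```

Called as merge_columns(filelist, 'output.csv'), this should produce the same output as the loop above when all files have the same number of lines.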

2 Comments

Note that you can avoid manually listing all files in filelist if they share a logical structure (like file1.csv, file2.csv, etc.). Simply do this: from glob import glob and then obtain the files like so filelist = glob('file*.csv')
@N.Wouda: Thanks, added your suggestion to the answer.
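
One thing to watch with the glob suggestion: glob returns names in no guaranteed order, and a plain sort puts file10.csv before file2.csv. Since the column order of the output follows the order of filelist, it may be worth sorting explicitly. A minimal sketch, assuming the names contain a number as in file1.csv, file2.csv; numeric_key and list_input_files are hypothetical helper names:

```python
import glob
import re


def numeric_key(path):
    """Sort key that compares embedded integers by value (file2 < file10)."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', path)]


def list_input_files(pattern):
    # glob makes no promise about ordering, so sort explicitly.
    return sorted(glob.glob(pattern), key=numeric_key)
```

For example, filelist = list_input_files('file*.csv') could replace the hard-coded list.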

Instead of reading the files completely into memory, you can iterate over them line by line.

from itertools import izip  # Python 2: like zip, but returns an iterator

with open('file1.csv') as f1, open('file2.csv') as f2, open('output.csv', 'w') as out:
    for f1line, f2line in izip(f1, f2):
        # Strip both lines so the output does not depend on file2's line endings
        out.write('{},{}\n'.format(f1line.strip(), f2line.strip()))

Demo:

$ cat file1.csv 
A,B
A,C
A,D
$ cat file2.csv 
F,G
H,I
J,K
$ python2.7 merge.py
$ cat output.csv 
A,B,F,G
A,C,H,I
A,D,J,K

1 Comment

For completeness, in Python 3 the built-in zip also produces an iterator, so izip is not needed there.
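
To illustrate that comment, a Python 3 version of the same two-file merge might look like the sketch below. merge_two is a made-up function name, and the file names in the usage note are the ones from the question. Like izip, zip stops at the end of the shorter file, so extra lines in the longer one are dropped without warning.

```python
def merge_two(path1, path2, output_path):
    """Write line i of path1 and line i of path2 as one comma-joined line."""
    with open(path1) as f1, open(path2) as f2, open(output_path, 'w') as out:
        # In Python 3, zip is lazy: it pulls one line from each file per step.
        for line1, line2 in zip(f1, f2):
            out.write('{},{}\n'.format(line1.rstrip('\r\n'),
                                       line2.rstrip('\r\n')))
```

Usage would be merge_two('file1.csv', 'file2.csv', 'output.csv').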
