splitting values based on values in a specific column

Question

I have a file that I would like to break up into multiple files, each with unique values in the first column. For example, here is a file:

fileA.txt

1    Cat
1    Dog
1    Frog
2    Boy
2    Girl
3    Tree
3    Leaf
3    Branch
3    Trunk

I would like my output to look something like this:

file1.txt

1    Cat
2    Boy
3    Tree

file2.txt

1    Dog
2    Girl
3    Leaf

file3.txt

1    Frog
3    Branch

file4.txt

3    Trunk

If a value does not exist, I want it to be skipped. I have searched for similar situations, but I've come up short. Does anyone have an idea of how to do this?

Theoretically, this awk command should work: awk '{print > "file" ++a[$1] ".txt"}' input. However, I can't get it to work properly, most likely because I work on a Mac. Does anyone know of an alternative way?

@EdMorton that was the problem. Thank you!

– interstellar, Commented Feb 2, 2016 at 19:20

Ed Morton · Accepted Answer · 2016-02-02 23:30:36Z

3

An unparenthesized expression on the right side of output redirection is undefined behavior. Try awk '{print > ("file" ++a[$1] ".txt")}' input.

If having too many files open concurrently is an issue, then get GNU awk. If you can't, append to each file and close it after every write:

$ ls
fileA.txt

$ awk '{f="file" ++a[$1] ".txt"; print >> f; close(f)}' fileA.txt

$ ls
file1.txt  file2.txt  file3.txt  file4.txt  fileA.txt

$ cat file1.txt
1    Cat
2    Boy
3    Tree

edited Feb 2, 2016 at 23:30

answered Feb 2, 2016 at 19:20

Ed Morton

3 Comments

interstellar Over a year ago

I've noticed that I get an error when I use this, though: awk: file18.txt makes too many open files input record number 19, file input.txt source line number 1. Do you know if there is a way to close the files as I read through my actual file? Maybe awk '{print >> ("file" ++a[$1] ".txt")}' input?

e0k Over a year ago

@interstellar In AWK, > and >> behave slightly differently than in shells such as bash. Both print > and print >> append to the file every time that line is executed. The difference is what happens the first time the line is executed. The first time print > is executed, it truncates the file to zero length, then appends to it (as it does every subsequent time). print >> appends without truncating the first time. You want AWK to create fresh output and not keep whatever garbage you had from before, so > is appropriate in @EdMorton's answer.

Ed Morton Over a year ago

@interstellar Of course, just call close(). If you use GNU awk you won't have this problem, as it manages the file statuses internally. I edited my answer to show how to add a close().
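
The two points raised in these comments (truncate a file the first time it is written, append afterwards, and close it after every write) can be combined. The following is only a minimal sketch in Python, not code from the thread; the function name split_by_occurrence and the out_dir parameter are hypothetical, and it assumes whitespace-separated columns as in the question.

```python
import os
import tempfile
from collections import Counter


def split_by_occurrence(src_path, out_dir):
    """Write the N-th line seen for each first-column value to file<N>.txt."""
    seen = Counter()   # first-column value -> occurrences so far
    started = set()    # output paths already truncated during this run
    with open(src_path) as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            seen[fields[0]] += 1
            path = os.path.join(out_dir, "file{}.txt".format(seen[fields[0]]))
            # Truncate on the first write of this run, append afterwards,
            # so stale output from an earlier run is not kept.
            mode = "a" if path in started else "w"
            started.add(path)
            with open(path, mode) as out:  # closed again as the block exits
                out.write(line)
    return sorted(started)


if __name__ == "__main__":
    # Demo on the question's sample data in a throwaway directory.
    demo_dir = tempfile.mkdtemp()
    sample = os.path.join(demo_dir, "fileA.txt")
    with open(sample, "w") as f:
        f.write("1 Cat\n1 Dog\n1 Frog\n2 Boy\n2 Girl\n"
                "3 Tree\n3 Leaf\n3 Branch\n3 Trunk\n")
    for path in split_by_occurrence(sample, demo_dir):
        with open(path) as f:
            print(os.path.basename(path), f.read().split())
```

Only one output file is open at any moment, at the cost of reopening a file for every input line, which may be slow on very large inputs.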

John Conley · Answer · 2016-02-02 19:42:22Z

2

Here's a solution in Python:

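from collections import Counter

fd_dict = {}             # occurrence number -> open output file
ind_counter = Counter()  # first-column value -> times seen so far

with open('fileA.txt') as inf:
    for line in inf:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ind = fields[0]
        ind_counter[ind] += 1
        file_ind = ind_counter[ind]
        # Open file<N>.txt the first time an N-th occurrence is seen.
        if file_ind not in fd_dict:
            fd_dict[file_ind] = open('file{}.txt'.format(file_ind), 'w')
        fd_dict[file_ind].write(line)

# values() works on Python 2 and 3; itervalues() exists only on Python 2.
for fd in fd_dict.values():
    fd.close()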
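
The script above keeps one handle open per output file, so it can hit the same open-file limit mentioned in the comments on the awk answer. One possible workaround, assuming the input fits in memory, is to group the lines first and then write each file in turn. This is a sketch rather than a drop-in replacement; group_by_occurrence and write_groups are hypothetical names.

```python
from collections import defaultdict


def group_by_occurrence(lines):
    """Map N -> the lines that are the N-th occurrence of their first column."""
    counts = defaultdict(int)    # first-column value -> times seen so far
    groups = defaultdict(list)   # occurrence number -> lines for file<N>.txt
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] += 1
        groups[counts[fields[0]]].append(line)
    return dict(groups)


def write_groups(groups, pattern="file{}.txt"):
    """Write each group to its own file, one open file at a time."""
    for n, lines in groups.items():
        with open(pattern.format(n), "w") as out:
            out.writelines(lines)


if __name__ == "__main__":
    sample = ["1 Cat\n", "1 Dog\n", "1 Frog\n", "2 Boy\n", "2 Girl\n",
              "3 Tree\n", "3 Leaf\n", "3 Branch\n", "3 Trunk\n"]
    for n, lines in sorted(group_by_occurrence(sample).items()):
        print(n, [line.strip() for line in lines])
```

With a real file, something like write_groups(group_by_occurrence(open('fileA.txt'))) would produce the output files in the current directory.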

answered Feb 2, 2016 at 19:42

John Conley
