
I have a file that I would like to split into multiple files, each containing at most one line per unique value in the first column. For example, here is a file:

fileA.txt

1    Cat
1    Dog
1    Frog
2    Boy
2    Girl
3    Tree
3    Leaf
3    Branch
3    Trunk

I would like my output to look something like this:

file1.txt

1    Cat
2    Boy
3    Tree

file2.txt

1    Dog
2    Girl
3    Leaf

file3.txt

1    Frog
3    Branch

file4.txt

3    Trunk

If a value does not exist, I want it to be skipped. I have searched for similar situations, but I've come up short. Does anyone have an idea of how to do this?

In theory, this awk command should work: awk '{print > "file" ++a[$1] ".txt"}' input. However, I can't get it to work properly (most likely because I work on a Mac). Does anyone know of an alternative way?

@EdMorton that was the problem. Thank you! Commented Feb 2, 2016 at 19:20

2 Answers


An unparenthesized expression on the right side of output redirection is undefined behavior. Try awk '{print > ("file" ++a[$1] ".txt")}' input.

If having too many files open concurrently is an issue, then get GNU awk. If you can't, close each file after writing to it:

$ ls
fileA.txt

$ awk '{f="file" ++a[$1] ".txt"; print >> f; close(f)}' fileA.txt

$ ls
file1.txt  file2.txt  file3.txt  file4.txt  fileA.txt

$ cat file1.txt
1    Cat
2    Boy
3    Tree

3 Comments

I've noticed that I get an error when I use this, though: awk: file18.txt makes too many open files input record number 19, file input.txt source line number 1. Do you know if there is a way to close the files as I read through my actual file? Maybe awk '{print >> ("file" ++a[$1] ".txt")}' input?
@interstellar In AWK, > and >> are slightly different from shells such as bash. Both print > and print >> will append to the file every time that line is executed. The difference is what happens the first time the line is executed. The first time print > is executed, it will truncate the file to zero length, then append to it (and every subsequent time). print >> will append without truncating the first time. You want AWK to create fresh output and not keep whatever garbage you had from before, so > is appropriate in @EdMorton's answer.
@interstellar of course, just call close(). If you use GNU awk you won't have this problem, as it manages the file statuses internally. I edited my answer to show how to add a close().

Here's a solution in Python:

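from collections import Counter

fd_dict = {}             # occurrence number -> open output file
ind_counter = Counter()  # first-column value -> times seen so far

with open('fileA.txt') as inf:
    for line in inf:
        if not line.strip():
            continue  # skip blank lines
        ind = line.split()[0]
        ind_counter[ind] += 1
        file_ind = ind_counter[ind]
        # Open each output file once, the first time it is needed.
        if file_ind not in fd_dict:
            fd_dict[file_ind] = open('file{}.txt'.format(file_ind), 'w')
        fd_dict[file_ind].write(line)

# Close every output file.
for fd in fd_dict.values():
    fd.close()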
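
The script above keeps every output file open until the end, so with many repeats of one key it can run into the same open-file limit that the awk comments describe. Here is a minimal sketch of a variant that opens and closes the output file for each line, in the spirit of the awk close() version. The function name split_by_occurrence and its prefix and suffix parameters are made up for illustration, and the sketch assumes whitespace-separated columns as in fileA.txt.

```python
from collections import Counter


def split_by_occurrence(src_path, prefix="file", suffix=".txt"):
    """Write the n-th line seen for each first-column value to <prefix>n<suffix>.

    Only one output file is open at a time. Returns the output paths in the
    order they were first written.
    """
    seen = Counter()  # first-column value -> times seen so far
    created = {}      # output paths already written during this run

    with open(src_path) as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            seen[fields[0]] += 1
            out_path = "{}{}{}".format(prefix, seen[fields[0]], suffix)
            # Truncate a file the first time it is used in this run so stale
            # output from an earlier run is discarded; append after that.
            mode = "a" if out_path in created else "w"
            created[out_path] = True
            with open(out_path, mode) as out:
                out.write(line)

    return list(created)
```

Calling split_by_occurrence('fileA.txt') on the sample input should produce file1.txt through file4.txt in the current directory. Reopening a file for every line is slower than keeping handles open, so this trade-off is probably only worth making when the number of output files is large.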
