
I have a tab-delimited file like the one shown below and would like to merge rows that share a value in any column. There are usually 2 columns, but in some cases there can be 3.

input:

AMAZON NILE 
ALASKA NILE
HELLO MY
MANGROVE AMAZON
MY NAME
IS NAME

desired output:

AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS

How could one go about this with awk?

Will this also work for the file below?

input:

apple_bin2file       strawberry_24files
mango2files      strawberry_39files
apple_bin8file       strawberry_39files
dastool_bin6files  strawberry_40files
apple_bin6file       strawberry_40files
orange_bin004file      dastool_bin004files
orange_bin005file      dastool_bin005files
apple_bin3file       dastool_bin3files
apple_bin5file       dastool_bin5files
apple_bin6file       dastool_bin6files
apple_bin7file       dastool_bin7files
apple_bin8file       mango2files

expected output in tab-delimited format:

apple_bin2file strawberry_24files
mango2files strawberry_39files apple_bin8file
dastool_bin6files strawberry_40files apple_bin6file
orange_bin004file dastool_bin004files
orange_bin005file dastool_bin005files
apple_bin3file dastool_bin3files
apple_bin5file dastool_bin5files
apple_bin7file dastool_bin7files

Sorry to those who answered, I updated the input files!

  • Will you always have exactly two words on each line? Commented Jan 14, 2020 at 13:07
  • Not always, it could also be 3 in some cases. Commented Jan 14, 2020 at 14:36
  • Then please edit your question and add that. Commented Jan 14, 2020 at 14:37
  • @terdon: Question updated to reflect another file input as well. Commented Jan 14, 2020 at 14:39
  • Your new file still only has two words per line. And you don't explain what output you expect from it. Commented Jan 14, 2020 at 14:51

2 Answers


Using GNU awk

gawk '
    {
        grp = 0
        # see if any of these words already have a group
        for (i=1; i<=NF; i++) {
            if (group[$i]) {
                grp = group[$i]
                break
            }
        }
        # no words have been seen before: new group
        if (!grp) {
            grp = ++n
        }
        # if we have not seen this word, add it to the output
        for (i=1; i<=NF; i++) {
            if (!group[$i]) {
                line[grp] = line[grp] $i OFS
            }
            group[$i] = grp
        }
    }
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (n in line) {
            print line[n]
        }
    }
' input.file

With the first input:

AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS

With the second input (piping the output to column -t):

apple_bin2file     strawberry_24files
mango2files        strawberry_39files   apple_bin8file
dastool_bin6files  strawberry_40files   apple_bin6file
orange_bin004file  dastool_bin004files
orange_bin005file  dastool_bin005files
apple_bin3file     dastool_bin3files
apple_bin5file     dastool_bin5files
apple_bin7file     dastool_bin7files
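
One caveat: the script above attaches a row to the first existing group it finds, so a row whose words already belong to two different groups does not merge those groups. For example, the rows `A B`, `C D`, `B C` still come out as two lines. If that can happen in your data, a union-find approach is one option. The following is only a minimal sketch in POSIX awk; the function name `merge_rows` and the hard-coded tab output separator are assumptions for illustration, not part of the answer above.

```shell
#!/bin/sh
# merge_rows: hypothetical helper name (not from the original answer).
# Groups words that are connected directly or through other rows and
# prints one tab-separated line per group, in order of first appearance.
merge_rows() {
    awk '
    # find(w): return the representative word of the group containing w
    function find(w) {
        while (parent[w] != w) {
            parent[w] = parent[parent[w]]   # path halving
            w = parent[w]
        }
        return w
    }
    {
        for (i = 1; i <= NF; i++) {
            if (!($i in parent)) {          # first time this word is seen
                parent[$i] = $i
                word[++n] = $i              # remember input order
            }
            if (i > 1) {                    # join word i with word 1 of the row
                a = find($1)
                b = find($i)
                if (a != b) parent[b] = a
            }
        }
    }
    END {
        for (k = 1; k <= n; k++) {
            r = find(word[k])
            if (r in line) {
                line[r] = line[r] "\t" word[k]
            } else {
                order[++g] = r              # new group: remember its position
                line[r] = word[k]
            }
        }
        for (k = 1; k <= g; k++) print line[order[k]]
    }' "$@"
}
```

It can be used as `merge_rows input.file` or at the end of a pipeline. Words are joined with a tab and no trailing separator is printed.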
  • Awesome, thanks a lot @glenn jackman, this worked perfectly. Commented Jan 14, 2020 at 15:07

For exactly your given example, try

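awk '
    {
        for (j = 1; j <= MX; j++) {
            # stop at the first word of this row found in output line j
            for (i = 1; i <= NF && !(m = match(LN[j], $i)); i++) ;
            if (m) {
                $i = ""     # drop the word that is already present
                break
            }
        }
        # append what is left of the row (j is MX+1 when nothing matched)
        LN[j] = LN[j] $0 " "
        if (j > MX) MX = j
    }
    END {
        for (l in LN) print LN[l]
    }
' file3

Output:

AMAZON NILE  ALASKA  MANGROVE  
HELLO MY  NAME IS  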

EDIT: with the new data, this should work:

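awk '
    {
        # find the first output line that already contains a word of this row
        for (j = 1; j <= MX; j++) {
            m = 0
            for (i = 1; i <= NF; i++) {
                if (match(LN[j], $i)) {
                    $i = ""     # drop words that are already present
                    m = 1
                }
            }
            if (m) break
        }
        # append the remaining words (j is MX+1 when nothing matched)
        LN[j] = LN[j] $0 OFS
        if (j > MX) MX = j
    }
    END {
        for (l in LN) {
            gsub(/ +/, OFS, LN[l])      # turn runs of spaces into OFS
            gsub(OFS "+", OFS, LN[l])   # collapse repeated separators
            print LN[l]
        }
    }
' OFS="\t" file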
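
One thing to be aware of: match() treats its second argument as a regular expression and matches anywhere in the string, so a word that is a substring of another word, or that contains characters such as `.` or `+`, can match unintentionally. A whole-field comparison with index() avoids that. The snippet below is only a sketch of such a test; `has_word` is a hypothetical helper name, and it assumes the list is tab-separated as in the script above.

```shell
#!/bin/sh
# has_word LIST WORD: hypothetical helper (not from the original answer).
# Exits 0 if WORD is a whole tab-separated field of LIST, 1 otherwise.
has_word() {
    awk -v list="$1" -v w="$2" 'BEGIN {
        # wrap both in separators so only whole fields can match;
        # index() compares literal text, so "." or "+" are not special
        exit (index("\t" list "\t", "\t" w "\t") ? 0 : 1)
    }'
}
```

For example, `has_word "$list" apple_bin` fails for a list that only contains `apple_bin6file`, whereas an unanchored match() on the same strings would succeed.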
  • Thanks a lot for the quick response. I tried it with a different file (below the first input in the question), but the first column is being repeated. Commented Jan 14, 2020 at 14:35
  • This works too, aside from the repeat! Commented Jan 14, 2020 at 15:07
  • Thanks a lot, the updated version works. The tab-separation though is missing in the output file. Thanks for the help! Commented Jan 14, 2020 at 15:36
  • Removal of <TAB>s removed... Commented Jan 14, 2020 at 15:56
