Merge rows using common values in any column

Question

I have a tab-delimited file like the one shown below, and would like to merge rows that share a value in any column. There are usually 2 columns, but in some cases there could be 3.

input:

AMAZON NILE 
ALASKA NILE
HELLO MY
MANGROVE AMAZON
MY NAME
IS NAME

desired output:

AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS

How could one go about this with awk?

Will this also work for the file below?

input:

apple_bin2file       strawberry_24files
mango2files      strawberry_39files
apple_bin8file       strawberry_39files
dastool_bin6files  strawberry_40files
apple_bin6file       strawberry_40files
orange_bin004file      dastool_bin004files
orange_bin005file      dastool_bin005files
apple_bin3file       dastool_bin3files
apple_bin5file       dastool_bin5files
apple_bin6file       dastool_bin6files
apple_bin7file       dastool_bin7files
apple_bin8file       mango2files

expected output in tab-delimited format:

apple_bin2file strawberry_24files
mango2files strawberry_39files apple_bin8file
dastool_bin6files strawberry_40files apple_bin6file
orange_bin004file dastool_bin004files
orange_bin005file dastool_bin005files
apple_bin3file dastool_bin3files
apple_bin5file dastool_bin5files
apple_bin7file dastool_bin7files

Sorry to those who answered, I updated the input files!

@terdon: Question updated to reflect another file input as well.
– Susheel Busi, Commented Jan 14, 2020 at 14:39
Your new file still only has two words per line. And you don't explain what output you expect from it.
– terdon ♦, Commented Jan 14, 2020 at 14:51

glenn jackman · Accepted Answer · 2020-01-14 14:50:43Z

Using GNU awk

gawk '
    {
        grp = 0
        # see if any of these words already have a group
        for (i=1; i<=NF; i++) {
            if (group[$i]) {
                grp = group[$i]
                break
            }
        }
        # no words have been seen before: new group
        if (!grp) {
            grp = ++n
        }
        # if we have not seen this word, add it to the output
        for (i=1; i<=NF; i++) {
            if (!group[$i]) {
                line[grp] = line[grp] $i OFS
            }
            group[$i] = grp
        }
    }
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (n in line) {
            print line[n]
        }
    }
' input.file

With the first input:

AMAZON NILE ALASKA MANGROVE
HELLO MY NAME IS

With the second input (piping the output to column -t):

apple_bin2file     strawberry_24files
mango2files        strawberry_39files   apple_bin8file
dastool_bin6files  strawberry_40files   apple_bin6file
orange_bin004file  dastool_bin004files
orange_bin005file  dastool_bin005files
apple_bin3file     dastool_bin3files
apple_bin5file     dastool_bin5files
apple_bin7file     dastool_bin7files
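
One caveat worth keeping in mind: the script above puts each row into the group of the first word it has already seen, so a later row that links two groups which both exist already (say `A B`, then `C D`, then `B C`) would not join them. Neither sample input contains such a row, so this does not affect the outputs shown. If your data can contain one, a union-find style variant might help. The following is only a minimal sketch in POSIX awk, not the answer's method; the wrapper name `merge_rows` and the `find` helper are made up for illustration, and it assumes whitespace-separated words read from standard input and tab-delimited output.

```sh
#!/bin/sh
# merge_rows: hypothetical helper, reads rows on stdin and prints merged groups
merge_rows() {
    awk '
    # follow parent links until we reach the representative word of the group
    function find(w) {
        while (parent[w] != w)
            w = parent[w]
        return w
    }
    {
        for (i = 1; i <= NF; i++) {
            if (!($i in parent)) {      # first time this word appears
                parent[$i] = $i
                order[++n] = $i         # remember first-seen order
            }
            # join the group of this word with the group of the first word
            a = find($1)
            b = find($i)
            if (a != b)
                parent[b] = a
        }
    }
    END {
        # build one tab-delimited line per group, words in first-seen order
        for (k = 1; k <= n; k++) {
            r = find(order[k])
            if (r in line) {
                line[r] = line[r] "\t" order[k]
            } else {
                roots[++m] = r
                line[r] = order[k]
            }
        }
        for (k = 1; k <= m; k++)
            print line[roots[k]]
    }'
}

# Demo with the first input from the question; prints (tab-delimited):
#   AMAZON  NILE    ALASKA  MANGROVE
#   HELLO   MY      NAME    IS
printf 'AMAZON NILE\nALASKA NILE\nHELLO MY\nMANGROVE AMAZON\nMY NAME\nIS NAME\n' | merge_rows
```

For a real file you would call it as `merge_rows < input.file`. Because groups are only printed in the `END` block, the order of rows in the input does not change which words end up together.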

Awesome, thanks a lot @glenn jackman, this worked perfectly.
– Susheel Busi, Commented Jan 14, 2020 at 15:07

RudiC · Answer · 2020-01-14 15:55:48Z

For exactly your given example, try

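awk '
    {
        # look for an output line that already contains one of the fields
        for (j = 1; j <= MX; j++) {
            for (i = 1; i <= NF && !(m = match(LN[j], $i)); i++)
                ;
            if (m) {
                $i = ""     # drop the field that is already present
                break
            }
        }
        # append to the matching line, or start a new one at index MX+1
        LN[j] = LN[j] $0 " "
        if (j > MX) MX = j
    }
    END {
        for (l in LN) print LN[l]
    }
' file3

Output:

AMAZON NILE  ALASKA  MANGROVE  
HELLO MY  NAME IS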

EDIT: with the new data, this should work:

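awk '
    {
        for (j = 1; j <= MX; j++) {
            m = 0
            # blank out every field already present on output line j
            for (i = 1; i <= NF; i++) {
                if (match(LN[j], $i)) {
                    $i = ""
                    m = 1
                }
            }
            if (m) break
        }
        # append to the matching line, or start a new one at index MX+1
        LN[j] = LN[j] $0 OFS
        if (j > MX) MX = j
    }
    END {
        for (l in LN) {
            # collapse runs of spaces and of separators into a single OFS
            gsub(/ +/, OFS, LN[l])
            gsub(OFS "+", OFS, LN[l])
            print LN[l]
        }
    }
' OFS="\t" file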
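
A note on the `match()` test used above: `match()` treats its second argument as a regular expression and reports a hit anywhere inside the string, so a word that happens to be a substring of another word (or that contains regex metacharacters) could be taken as already present. That does not happen with the sample data, but it might with other files. One possible way to check for a whole word instead is `index()` on space-padded strings. The snippet below is just an illustrative sketch; `contains_word` is a made-up helper name, and it assumes plain words separated by single spaces.

```sh
#!/bin/sh
# match() finds NAME inside SURNAME: this prints 10, the position of the hit
awk 'BEGIN { print match("HELLO SURNAME", "NAME") }'

# contains_word LINE WORD: hypothetical helper, succeeds only if WORD is a
# complete space-separated word of LINE (plain string comparison, no regex)
contains_word() {
    awk -v line="$1" -v word="$2" 'BEGIN {
        # pad both sides with a space so only whole words can match
        exit !index(" " line " ", " " word " ")
    }'
}

contains_word "HELLO SURNAME" "NAME" || echo "NAME is not a whole word here"
contains_word "HELLO NAME" "NAME" && echo "NAME is a whole word here"
```

Inside an awk script the same idea would be a test along the lines of `index(" " line " ", " " word " ")` in place of the `match()` call, provided the stored line uses the same separator as the padding.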

edited Jan 14, 2020 at 15:55

answered Jan 14, 2020 at 14:25

Thanks a lot for the quick response. I tried it with a different file (below the first input in the question), but the first column is being repeated.
– Susheel Busi, Commented Jan 14, 2020 at 14:35
This works too, aside from the repeat!
– Susheel Busi, Commented Jan 14, 2020 at 15:07
Thanks a lot, the updated version works. The tab-separation though is missing in the output file. Thanks for the help!
– Susheel Busi, Commented Jan 14, 2020 at 15:36
Removal of <TAB>s removed...
– RudiC, Commented Jan 14, 2020 at 15:56
