How can you remove specific duplicate strings in a file in Linux?

Question

I have a list of data paired with IP addresses. I want each IP address to appear only once, and I don't want to change the order of the lines.

192.168.0.100    fred is happy
192.168.0.100    fred likes pie
192.168.0.100    pie is good
192.168.0.110    tom like cake
192.168.0.110    cake is good
192.168.0.110    pie is better
192.168.0.112    bill like lettuce
192.168.0.112    lettuce is good for you
192.168.0.112    cake and pie are better tasting than lettuce

What I want to do is remove the duplicate IP addresses but leave everything else exactly the same.

I want to make it look like this:

192.168.0.100    fred is happy
                 fred likes pie
                 pie is good
192.168.0.110    tom like cake
                 cake is good
                 pie is better
192.168.0.112    bill like lettuce
                 lettuce is good for you
                 cake and pie are better tasting than lettuce

I don't want to touch any of the duplicate words, and I can't change the order.

Thank you if you can help

Ed Morton · Accepted Answer · 2013-09-14 14:45:54Z

2

This will work no matter what kind of spacing and/or RE metacharacters are in the file:

$ awk '
{ key = $1 }
key == prev { sub(/[^[:space:]]+/,sprintf("%*s",length(key),"")) }
{ prev = key; print }
' file
192.168.0.100    fred is happy
                 fred likes pie
                 pie is good
192.168.0.110    tom like cake
                 cake is good
                 pie is better
192.168.0.112    bill like lettuce
                 lettuce is good for you
                 cake and pie are better tasting than lettuce

Beware of solutions that use $1 in an RE context: the "."s in an IP address are RE metacharacters that mean "any character", so such solutions might work for some sample data but could produce false matches given other input.
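To make the warning concrete, here is a minimal sketch comparing a dynamic-regex match with a literal string comparison. The look-alike sample line, the sample function, and the variable names are made up for illustration; they are not from the question's data.

```shell
key='192.168.0.100'

# Two sample lines: the real address, and a look-alike that differs
# wherever the address has a dot.
sample() {
    printf '%s\n' '192.168.0.100 real' '192x168y0z100 bogus'
}

# Dynamic regex: each "." in key matches any character, so both lines match.
regex_hits=$(sample | awk -v key="$key" '$1 ~ key { n++ } END { print n + 0 }')

# Literal comparison: only the exact string matches.
literal_hits=$(sample | awk -v key="$key" '$1 == key { n++ } END { print n + 0 }')

echo "regex: $regex_hits, literal: $literal_hits"
# prints: regex: 2, literal: 1
```

Comparing with == (or locating text with index()) sidesteps the problem because neither treats the key as a pattern.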


Kent · Accepted Answer · 2013-09-13 20:01:18Z

1

I guess the separator between the IP and the text is a tab; if so, this one-liner should work for you:

awk -F'\t' -v OFS='\t' 'a[$1]{gsub(/./," ",$1);print;next}{a[$1]=1}7' file

Test with your file:

kent$  awk -F'\t' -v OFS='\t' 'a[$1]{gsub(/./," ",$1);print;next}{a[$1]=1}7' f
192.168.0.100   fred is happy
                fred likes pie
                pie is good
192.168.0.110   tom like cake
                cake is good
                pie is better
192.168.0.112   bill like lettuce
                lettuce is good for you
                cake and pie are better tasting than lettuce


1 Comment

Steve Byrum Over a year ago

I was wrong: I did not get it to work, because the separator is spaces.

konsolebox · Accepted Answer · 2013-09-13 20:04:54Z

1

Using awk:

awk 'BEGIN{FS=OFS="    "}{t=$1;if(t in a){gsub(/./," ",$1);a[t]=a[t]RS$0}else{a[t]=$0}}END{for(i in a)print a[i]}' file

Output:

192.168.0.100    fred is happy
                 fred likes pie
                 pie is good
192.168.0.110    tom like cake
                 cake is good
                 pie is better
192.168.0.112    bill like lettuce
                 lettuce is good for you
                 cake and pie are better tasting than lettuce


8 Comments

Steve Byrum Over a year ago

Thanks konsolebox, I had to make a minor adjustment, but I got to where I needed with your example.

Ed Morton Over a year ago

That could completely re-order the output courtesy of the in operator: output will be in the order of traversal of that array's hash map, which may not be the order of input.

konsolebox Over a year ago

@EdMorton I'm actually assuming that Gawk sets it in order always unless something is deleted, but is that incorrect? Imagining an implementation of awk, the new key would always be appended at the end of the list anyway.

Ed Morton Over a year ago

Yes, that is incorrect. There are ways you can specify ordering using PROCINFO[] but by default you need to assume any order of traversal is fine.

Ed Morton Over a year ago

@konsolebox - arrays are not stored as lists; they are stored as hash tables for fast access. Also, imagine a[x]=3; a[y]=4; a[x]=2. When printing the array a, should a[x] be printed before a[y] because it was created first, after a[y] because a[x] received its final value after a[y], first because it comes first alphabetically, or something else? The point is there's no obvious order that's more likely to be right than any other for any given application, so it makes sense to leave it to users to manage the order if it matters.
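One way to manage the order explicitly is to record each key the first time it appears and walk that record in the END block. This is a minimal sketch of that idea, not the original answer's code: the array names order and lines, the counter n, and the sample input are illustrative, and it assumes the key is the first whitespace-separated field. Like the answer above, it groups all lines for a key together, which leaves the line order unchanged only when equal keys are already adjacent, as in the question.

```shell
out=$(printf '%s\n' \
  '192.168.0.110    tom like cake' \
  '192.168.0.110    cake is good' \
  '192.168.0.100    fred is happy' \
  '192.168.0.100    fred likes pie' |
awk '
{
    key = $1
    if (key in lines) {
        # Repeated key: overwrite it with spaces of the same width.
        # The pattern is a fixed regex, so the key is never used as one.
        pad = sprintf("%" length(key) "s", "")
        sub(/[^ \t]+/, pad)
        lines[key] = lines[key] RS $0
    } else {
        order[++n] = key    # remember the order of first appearance
        lines[key] = $0
    }
}
END {
    # Walk keys by numeric index instead of for (k in lines).
    for (i = 1; i <= n; i++) print lines[order[i]]
}')
printf '%s\n' "$out"
```

With this input the 192.168.0.110 group is printed before the 192.168.0.100 group, matching the input, regardless of how the awk implementation traverses its arrays.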


Scrutinizer · Accepted Answer · 2013-09-14 15:38:04Z

1

One more:

awk 'A[$1]++{s=$1; gsub(/./,FS,s); sub($1,s)}1' file


2 Comments

Ed Morton Over a year ago

Nice! It took a bit of thought to convince myself that the final sub($1,s) wouldn't have problems with the "."s in $1, but I don't think it will: the initial A[$1]++ guarantees the line starts with exactly the same $1 you're using in the sub(), so the "."s will line up.

Scrutinizer Over a year ago

Thanks @EdMorton, indeed the ERE dots will always match the literal ones here :-) ..

potong · Accepted Answer · 2013-09-14 10:45:26Z

0

This might work for you (GNU sed):

sed -r '1{:a;p;h;s/\s.*//;s/./ /g;H;d};G;s/^(\S+)(\s.*)\n\1.*\n(.*)/\3\2/;t;s/\n.*//;ba' file

Print the first record, and each later record where the key changes, and store it in the hold space together with a run of spaces as long as its key. For each subsequent record, compare its key with the stored one: if they match, replace the current key with the stored spaces; if they do not, discard the stored data and repeat from the beginning with the current record as the new key.
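The same script is easier to follow when spread over several lines with comments. The following is a sketch of that layout, using the same commands as the one-liner and still assuming GNU sed (for -r, \s and \S); the three input lines are a shortened sample taken from the question.

```shell
out=$(printf '%s\n' \
  '192.168.0.100    fred is happy' \
  '192.168.0.100    fred likes pie' \
  '192.168.0.110    tom like cake' |
sed -r '
1 {
  :a
  # A line with a new key: print it unchanged and save it in the hold space.
  p
  h
  # Reduce the pattern space to the key, then turn each character into a space.
  s/\s.*//
  s/./ /g
  # Append the run of spaces to the hold space and start the next cycle.
  H
  d
}
# Any later line: append the saved line and the run of spaces.
G
# Same key as the saved line: swap the key for the spaces, then print.
s/^(\S+)(\s.*)\n\1.*\n(.*)/\3\2/
t
# Different key: drop the appended text and handle this line as a new key.
s/\n.*//
ba
')
printf '%s\n' "$out"
```

The second line comes out with its address replaced by spaces, while the first and third lines are printed unchanged.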

