Generate unique IDs for a list of strings with duplicates

Question

I want to generate IDs for strings that are being read from a text file. If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters. I'm having trouble with the logic. Here's what I've done so far:

from itertools import groupby
import uuid
f = open('test.txt', 'r')
addresses = f.readlines()

list_of_addresses = ['Address']
list_of_ids = ['ID']


for x in addresses:
    list_of_addresses.append(x)


def find_duplicates():

    for x, y in groupby(sorted(list_of_addresses)):
        id = str(uuid.uuid4().get_hex().upper()[0:6])
        j = len(list(y))
        if j > 1:
            print str(j) + " instances of " + x
            list_of_ids.append(id)
        print list_of_ids

find_duplicates()

How should I approach this?

Edit: here's the contents of test.txt:

123 Test
123 Test
123 Test
321 Test
567 Test
567 Test

And the output:

3 occurences of 123 Test

['ID', 'C10DD8']
['ID', 'C10DD8']
2 occurences of 567 Test

['ID', 'C10DD8', '595C5E']
['ID', 'C10DD8', '595C5E']

And by repeated "strings" do you mean repeated lines or repeated words in a line?
– pylang, Commented Feb 5, 2018 at 19:50
@pylang Sorry, added input/output. And I mean duplicate text entries.
– rustyshackleford, Commented Feb 5, 2018 at 19:53
Take a look at your output again. You are missing 321 and your ids are the same for your duplicates. You mentioned adding two more characters.
– pylang, Commented Feb 5, 2018 at 20:04

pylang · Accepted Answer · 2018-02-07 18:16:39Z

If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters.

Try using a collections.defaultdict.

Given

import ctypes
import collections as ct


filename = "test.txt"


def read_file(fname):
    """Read lines from a file."""
    with open(fname, "r") as f:
        for line in f:
            yield line.strip()

Code

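dd = ct.defaultdict(list)
for x in read_file(filename):
    # Reinterpret the (possibly negative) hash as an unsigned integer, then stringify it
    key = str(ctypes.c_size_t(hash(x)).value)
    if key[:6] not in dd:
        dd[key[:6]].append(x)      # first occurrence: 6-character key
    else:
        dd[key[:8]].append(x)      # any repeat: the same key extended to 8 characters

dd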

Output

defaultdict(list,
            {'133259': ['123 Test'],
             '13325942': ['123 Test', '123 Test'],
             '210763': ['567 Test'],
             '21076377': ['567 Test'],
             '240895': ['321 Test']})

The resulting dictionary has a 6-character key for the first occurrence of each unique line. Every later duplicate of that line is stored under a single 8-character key, which is the same hash string sliced two characters further.

You can implement the keys however you wish. In this case, we used hash() to correlate the key to each unique line. We then sliced the desired sequence from the key. See also a post on making positive hash values from ctypes.
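One caveat with hash(): in Python 3, string hashes are randomized per process unless PYTHONHASHSEED is set, so the keys above can change from run to run. If the IDs need to be reproducible, a digest from hashlib is one option. The following is a minimal sketch, not the method used above; the names stable_key and assign_ids are hypothetical, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib


def stable_key(text, length=6):
    """Return the first `length` hex characters of the SHA-256 digest of `text`."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:length].upper()


def assign_ids(lines):
    """Pair each line with an ID: 6 characters for a first occurrence, 8 for a repeat."""
    seen = set()
    result = []
    for line in lines:
        if line in seen:
            # Repeat: same digest, sliced two characters further
            result.append((line, stable_key(line, 8)))
        else:
            seen.add(line)
            result.append((line, stable_key(line, 6)))
    return result


lines = ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]
for line, id_ in assign_ids(lines):
    print(id_, line)
```

Because repeats are detected by comparing the lines themselves rather than the 6-character keys, two different lines that happen to share a key prefix are not mistaken for duplicates. They would still receive the same ID, so a longer slice may be needed for large inputs.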

To inspect your results, create the appropriate lookup dictionaries from the defaultdict.

# Lookups 
occurrences = ct.defaultdict(int)
ids = ct.defaultdict(list)

for k, v in dd.items():
    key = v[0]
    occurrences[key] += len(v)
    ids[key].append(k)

# View data
for k, v in occurrences.items():
    print("{} instances of {}".format(v, k))
    print("IDs:", ids[k])
    print()

Output

1 instances of 321 Test
IDs: ['240895']

2 instances of 567 Test
IDs: ['21076377', '210763']

3 instances of 123 Test
IDs: ['13325942', '133259']
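As a cross-check on the counts above, the occurrences can also be tallied straight from the lines with collections.Counter, independently of the keys. This is only a sketch: the lines list is a hard-coded stand-in for the contents of test.txt.

```python
from collections import Counter

# Stand-in for the stripped lines of test.txt (assumption for this example)
lines = ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]

# Counter maps each distinct line to the number of times it appears
counts = Counter(lines)

for text, n in counts.items():
    print("{} instances of {}".format(n, text))
```

If these totals disagree with the ones derived from the defaultdict, two different lines have probably collided on the same 6-character key.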

Aaditya Ura · Answer · 2018-02-06 09:57:17Z

Your question is a little confusing, and I don't see what the criteria for generating an ID are. Here I am only showing the logic, not an exact solution; you can adapt it to your needs.

track = {}
with open('test.txt') as f:
    for line_no, line in enumerate(f):
        key = line.split()[0]    # first whitespace-separated token, e.g. '123'
        if key not in track:
            track[key] = [['ID', 'your_unique_id']]
        else:
            # put your own logic here for the ID of a duplicate
            track[key].append(['ID', 'duplicate_id' + str(line_no)])

print(track)

output:

{'123': [['ID', 'your_unique_id'], ['ID', 'duplicate_id1'], ['ID', 'duplicate_id2']], '321': [['ID', 'your_unique_id']], '567': [['ID', 'your_unique_id'], ['ID', 'duplicate_id5']]}
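One way to fill in the placeholders, sketched here under assumptions, is to draw the IDs from uuid as the question attempted. In Python 3 the hex string is the attribute uuid.uuid4().hex (get_hex() is Python 2 only). The helper names new_id and tag_lines are hypothetical, and this version keys on the whole line rather than its first token.

```python
import uuid


def new_id(length):
    """Return `length` random uppercase hex characters taken from a UUID4."""
    return uuid.uuid4().hex[:length].upper()


def tag_lines(lines):
    """Return (line, id) pairs: a 6-character ID for a first occurrence,
    and that ID plus 2 extra characters for each repeat."""
    base_ids = {}
    tagged = []
    for line in lines:
        if line not in base_ids:
            base_ids[line] = new_id(6)
            tagged.append((line, base_ids[line]))
        else:
            # Repeat: reuse the original ID and append two more characters
            tagged.append((line, base_ids[line] + new_id(2)))
    return tagged


lines = ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]
for line, id_ in tag_lines(lines):
    print(id_, line)
```

The two-character suffix is random, so two repeats of the same line can occasionally receive the same suffix. If every ID must be distinct, a per-line counter formatted as two digits would be a safer suffix.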
