0

I am trying to generate a unique ID for each word in a list. I want these numbers to be globally unique and stable: if another list appears, the same word should get the same ID. For example, the ID for "density" might be 151111911, and it should still be 151111911 if "density" occurs in a different list.

As you can see, my current method, which uses id and intern, is not working: the ID for rrb is exactly the same as the ID for lrb.

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

featureVector = mydefaultdict(mydouble)

for featureID,featureVal in enumerate(featureList):
        print "featureID is",featureID
        print "featureVal is ",featureVal
        print "Encoded feature value is", id(intern(str(featureVal.encode("utf-8"))))
        featureVector[featureID] = featureVal


featureID is 0
featureVal is  guinea
Encoded feature value is 4569583120.0
featureID is 1
featureVal is  bissau
Encoded feature value is 4569581632.0
featureID is 2
featureVal is  compared
Encoded feature value is 4569583120.0
featureID is 3
featureVal is  countriesthe
Encoded feature value is 4567944360.0
featureID is 4
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 5
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 6
featureVal is  guinea
Encoded feature value is 4569581632.0
featureID is 7
featureVal is  bissau
Encoded feature value is 4569583120.0
featureID is 8
featureVal is  similar
Encoded feature value is 4496118144.0
featureID is 9
featureVal is  iran
Encoded feature value is 4569583120.0
featureID is 10
featureVal is  afghanistan
Encoded feature value is 4569581632.0
featureID is 11
featureVal is  cameroon
Encoded feature value is 4569583120.0
featureID is 12
featureVal is  panama
Encoded feature value is 4569581632.0
featureID is 13
featureVal is  montenegro
Encoded feature value is 4569583120.0
featureID is 14
featureVal is  guinea
Encoded feature value is 4569581632.0
featureID is 15
featureVal is  belarus
Encoded feature value is 4569583120.0
featureID is 16
featureVal is  palau
Encoded feature value is 4569581632.0
featureID is 17
featureVal is  location_slot
Encoded feature value is 4567944360.0
featureID is 18
featureVal is  south
Encoded feature value is 4569583120.0
featureID is 19
featureVal is  africa
Encoded feature value is 4569581632.0
featureID is 20
featureVal is  respective
Encoded feature value is 4569583120.0
featureID is 21
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 22
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 23
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 24
featureVal is  capita
Encoded feature value is 4569581632.0
featureID is 25
featureVal is  per
Encoded feature value is 4455914152.0
featureID is 26
featureVal is  square
Encoded feature value is 4347127296.0
featureID is 27
featureVal is  kilometer
Encoded feature value is 4569581632.0
featureID is 28
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 29
featureVal is  global
Encoded feature value is 4346597072.0
featureID is 30
featureVal is  rank
Encoded feature value is 4346629984.0
featureID is 31
featureVal is  number_slot
Encoded feature value is 4569583120.0
featureID is 32
featureVal is  years
Encoded feature value is 4569581632.0
featureID is 33
featureVal is  growthguinea
Encoded feature value is 4567944360.0
featureID is 34
featureVal is  bissau
Encoded feature value is 4569583120.0
featureID is 35
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 36
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 37
featureVal is  positive
Encoded feature value is 4514096160.0
featureID is 38
featureVal is  growth
Encoded feature value is 4569583120.0
featureID is 39
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 40
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 41
featureVal is  last
Encoded feature value is 4346568112.0
featureID is 42
featureVal is  years
Encoded feature value is 4569583120.0
featureID is 43
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 44
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 45
featureVal is  LOCATION_SLOT~-appos+LOCATION~-prep_of
Encoded feature value is 4538026784.0
featureID is 46
featureVal is  LOCATION~-prep_of+that~-prep_to
Encoded feature value is 6043251168.0
featureID is 47
featureVal is  that~-prep_to+similar~prep_with
Encoded feature value is 6043251168.0
featureID is 48
featureVal is  similar~prep_with+density~prep_of
Encoded feature value is 6043251168.0
featureID is 49
featureVal is  density~prep_of+NUMBER~appos
Encoded feature value is 6043251168.0
featureID is 50
featureVal is  NUMBER~appos+NUMBER~amod
Encoded feature value is 6043247024.0
featureID is 51
featureVal is  NUMBER~amod+NUMBER_SLOT
Encoded feature value is 6043247024.0

What am I doing wrong? The reason I need to convert these into floats or numbers is that the above sentence would go into a classifier that needs to use numerical/vectorized features.

6
  • I do not understand how it should work, but you might be looking for the uuid module. Commented Sep 4, 2016 at 12:32
  • Doesn't seem to convert unicode strings to unique IDs? Only 16 digit hex? docs.python.org/2/library/uuid.html#module-uuid. Commented Sep 4, 2016 at 12:37
  • Python's id is not meant to be used like that. In CPython the value returned is simply the memory address of the underlying object, and Python is free to reuse that memory, which means the same id can be associated with different objects during the lifetime of the program. I'd simply keep an itertools.count() object to generate ids, a list of the objects that we are keeping track of, and a dict that maps the objects to their ids (this way you have both mappings: from objects to ids through the dict, and from ids to objects by simply indexing that list). Commented Sep 4, 2016 at 13:10
  • @Bakuriu I just used this - stackoverflow.com/questions/39316897/… Commented Sep 4, 2016 at 13:12
  • @DhruvGhulati if you're doing work on features/classification etc., are you using any libraries like numpy, scipy or pandas? Commented Sep 4, 2016 at 13:14
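
The counter, list and dict suggested in the comments above could be sketched roughly as follows. This is only an illustration, not code from the commenter: the class name FeatureIndex and its method get_id are made up for this example, and the IDs are only stable for as long as the same index object is kept around (or saved and reloaded).

```python
from itertools import count


class FeatureIndex(object):
    """Hypothetical helper: hands out consecutive integer IDs for words."""

    def __init__(self):
        self._counter = count()  # yields 0, 1, 2, ...
        self.words = []          # ID -> word, by list position
        self.ids = {}            # word -> ID

    def get_id(self, word):
        # Assign a new ID only the first time a word is seen.
        if word not in self.ids:
            self.ids[word] = next(self._counter)
            self.words.append(word)
        return self.ids[word]


index = FeatureIndex()
ids = [index.get_id(w) for w in [u'lrb', u'capita', u'rrb', u'lrb']]
print(ids)
```

Because lrb and rrb are different dict keys, they get different IDs, and the second lrb gets the same ID as the first. index.words[some_id] gives the word back.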

4 Answers 4

2

From the docs

Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.

At the time the next string is interned the previous strings may be deleted, and the new one may occasionally get the same id. So keep the references in a container. I'll use a dict:

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

# dict of id:featureVal pairs 
seen = {}

for featureID,featureVal in enumerate(featureList):
    print "featureID is",featureID
    print "featureVal is ",featureVal
    interned = intern(str(featureVal.encode("utf-8")))
    interned_id = id(interned)

    # ensure that no other string with the same id has been seen
    assert interned_id not in seen or seen[interned_id] == featureVal

    # change this to seen[interned_id] = None and you'll (probably) get AssertionError
    # from the line above
    seen[interned_id] = interned

    print "Encoded feature value is", interned_id

2 Comments

Where is val defined?
Yep this works! Maybe to improve the answer do explain what the assert step does?
1

You could use the words themselves, a hash of the words, or can even convert the string into a number.

2 Comments

How do I convert the string into a number? By the characters? Could you post some code I can test pls?
So what you want to do is take the binary representation of each character as a string, with format(ord(c), 'b').zfill(8), and concatenate them all together. You can do this with ''.join(format(ord(c), 'b').zfill(8) for c in string). Then convert the result to an integer with int(''.join(format(ord(c), 'b').zfill(8) for c in string), 2). The second argument tells int to read the bit string as base 2; a leading 0b is accepted but not required. Without the base argument, int would read the digits as a decimal number.
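
A minimal sketch of that bit-string conversion, under a few assumptions of my own: the helper name word_to_int is made up, the word is encoded as UTF-8 first so that non-ASCII characters still map to 8-bit chunks, and the word is assumed to be non-empty (int('', 2) raises ValueError).

```python
def word_to_int(word):
    """Hypothetical helper: map a non-empty unicode string to an integer."""
    # bytearray yields integers 0..255 on both Python 2 and Python 3.
    data = bytearray(word.encode("utf-8"))
    # Eight bits per byte, zero-padded, concatenated into one bit string.
    bits = ''.join(format(byte, '08b') for byte in data)
    # Base 2 is given explicitly, so no '0b' prefix is needed.
    return int(bits, 2)


for word in [u'lrb', u'rrb', u'density']:
    print("%s -> %d" % (word, word_to_int(word)))
```

The same word always gives the same number, in any list and in any session, and lrb and rrb give different numbers. The integer grows with the length of the word, though: a word longer than six bytes can exceed 2**53 and then no longer fits exactly in a float, which may matter if the classifier expects floats.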
1

Perhaps the easiest way is to use a defaultdict with an itertools.count that has a float as its starting position, e.g.:

from collections import defaultdict
from itertools import count

# Start from 1.0 and increment by one - can change to start from any value or even add a step
# eg: `count(716345.0, 9)` will start at 716345.0 and increment by 9 for new keys
unique_id = defaultdict(lambda c=count(1.0): next(c))
featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot']
for feature in featureList:
    print(feature, unique_id[feature])

This prints:

guinea 1.0
bissau 2.0
compared 3.0
countriesthe 4.0
population 5.0
density 6.0
guinea 1.0
bissau 2.0
similar 7.0
iran 8.0
afghanistan 9.0
cameroon 10.0
panama 11.0
montenegro 12.0
guinea 1.0
belarus 13.0
palau 14.0
location_slot 15.0

We can do a couple of other checks:

unique_id['cameroon'] 
# 10.0
unique_id['this is new']
# 16.0

4 Comments

How does this work for lrb and rrb which appear twice in my example sentence? Can you include in your answer rather than only including pre-unique words in your test ? :)
@DhruvGhulati the same way it works for guinea and bissau... look at the unique keys, then the example of looking up cameroon again... Swap in your full feature list and try running it - this'll get a very large answer otherwise :)
@DhruvGhulati also - a possible advantage of this approach is that you can persist it to disk and resume where you left off more easily than relying on ids (which can differ by Python implementation) across sessions... but whatever works...
This is a great answer and I want to extend it to basically create my own global vocabulary. How do I use your defaultdict to allow the most common 5000 words only to appear in a dict of their ids, if I was to extend to lots of the lists like the featureList given, looping through?
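
For the last comment, one possible way to build such a vocabulary is to count the words over all lists first and only then hand out IDs. This is a sketch under my own assumptions, not part of the answer above: the function name build_vocabulary and the parameter max_size are made up, IDs are plain integers starting at 1, and words outside the top max_size simply get no ID.

```python
from collections import Counter


def build_vocabulary(feature_lists, max_size=5000):
    """Hypothetical helper: map the most common words to IDs 1..max_size."""
    counts = Counter()
    for features in feature_lists:
        counts.update(features)  # add this list's word counts
    # most_common returns (word, count) pairs, most frequent first.
    ranked = counts.most_common(max_size)
    return dict((word, rank) for rank, (word, _) in enumerate(ranked, 1))


lists = [
    [u'guinea', u'bissau', u'lrb', u'rrb'],
    [u'guinea', u'lrb', u'guinea'],
]
vocab = build_vocabulary(lists, max_size=2)
print(sorted(vocab.items()))
```

With max_size=2 only guinea (3 occurrences) and lrb (2 occurrences) are kept. vocab.get(word) returns None for anything outside the vocabulary, which can then be skipped or mapped to a shared "unknown" ID.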
-1

You could directly use the hash() function in Python. It returns an integer hash that can be used as an ID for any given string, as in your case, but different strings can occasionally share the same hash, and the value may differ across platforms (32-bit/64-bit, OS, Python version).

hash("answer")
-8597262460139880008

If you want the hashes to be the same everywhere, you can use Python's hashlib module, but that won't give you numbers directly. It returns the hash as a hex string.

import hashlib
test = hashlib.sha224()
test.update(b"HI How are you")
test.hexdigest()
'3284ec5f391e0c6b4f974d3bc317a77bb50875081d2bcb2436fc2001'

You can choose from various algorithms

 hashlib.algorithms
 ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
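
If a number is needed rather than a hex string, the digest can be converted with int(..., 16). A minimal sketch, assuming SHA-256 and keeping only the first 8 bytes of the digest; the helper name stable_id and the parameter digest_bytes are made up for this example, and truncating the digest makes collisions more likely than with the full hash.

```python
import hashlib


def stable_id(word, digest_bytes=8):
    """Hypothetical helper: derive an integer ID from a SHA-256 digest."""
    # Hash the UTF-8 bytes so the result does not depend on platform or session.
    hex_digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    # Two hex characters per byte; keep the leading digest_bytes bytes.
    return int(hex_digest[:digest_bytes * 2], 16)


for word in [u'lrb', u'rrb', u'density']:
    print("%s -> %d" % (word, stable_id(word)))
```

The same word gives the same ID on every machine and in every run, without storing any mapping. As pointed out in the comments, a hash is still not guaranteed to be unique, so two different words could in principle end up with the same ID.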

5 Comments

Can I store all the separate words using this? Or does it modify inline? pls could you show how I could apply to my list?
You need to process every word using either of the two approaches. Replace the part of your code which generates featureVal with above code and it will do the job. you can store it in whatever way you want.
No, hash function don't return unique ids. They return hashes. It is possible to have two different strings with the same hash value. Even with sha224.
Yes that's true and it's also true that a perfect hash algorithm is not known. Conflicts can arrive in any of them. That's accepted knowledge. I am unsure why that should be a criticism for the answer
Because it doesn't achieve the aims of my question I guess :)
