I have this list:

t=[['universitario de deportes'],['lancaster'],['universitario de'],['juan aurich'],['muni'],['juan']]

I want to reorder the list according to the Jaccard distance. After reordering t, the expected output should be:

[['universitario de deportes'],['universitario de'],['lancaster'],['juan aurich'],['juan'],['muni']]

The code for the Jaccard distance works fine, but the rest of the code doesn't give the expected output. The code is below:

def jack(a,b):
    x=a.split()
    y=b.split()
    k=float(len(set(x)&set(y)))/float(len((set(x) | set(y))))
    return k
t=[['universitario de deportes'],['lancaster'],['universitario de'],['juan aurich'],['muni'],['juan']]

import copy as cp


b=cp.deepcopy(t)

c=[]

while (len(b)>0):
    c.append(b[0][0])
    d=b[0][0]
    del b[0]
    for m in range (0 , len(b)+1):
        if m > len(b):
            break
            if jack(d,b[m][0])>0.3:
                c.append(b[m][0])
                del b[m]

Unfortunately, the actual output is the same as the input list:

print c
['universitario de deportes', 'lancaster', 'universitario de', 'juan aurich', 'muni', 'juan']

EDIT:

I tried to correct my code. It still doesn't work, but it gets a little closer to the expected output:

t=[['universitario de deportes'],['lancaster'],['universitario de'],['juan aurich'],['muni'],['juan']]

import copy as cp


b=cp.deepcopy(t)

c=[]

while (len(b)>0):
    c.append(b[0][0])
    d=b[0][0]
    del b[0]
    for m in range(0,len(b)-1):
        if jack(d,b[m][0])>0.3:
            c.append(b[m][0])
            del b[m]

The "close" output is:

['universitario de deportes', 'universitario de', 'lancaster', 'juan aurich', 'muni', 'juan']

Second edit:

Finally, I came up with a solution that runs quite fast. I'll use it to order 60 thousand names. The code is below:

t=['universitario de deportes','lancaster','lancaste','juan aurich','lancaster','juan','universitario','juan franco']

import copy as cp


b=cp.deepcopy(t)

c=[]

while (len(b)>0):
    c.append(b[0])
    e=b[0]
    del b[0]
    for val in b:
        if jack(e,val)>0.3:
            c.append(val)
            b.remove(val)

print c
['universitario de deportes', 'universitario', 'lancaster', 'lancaster', 'lancaste', 'juan aurich', 'juan', 'juan franco']
  • Why does t contain single-item lists? Running jack on your values, only two entries have non-zero values, so the sorting won't do much. Commented Apr 6, 2014 at 16:55
  • According to t, there are two pairs with a Jaccard index larger than 0.3 that should be together in the output, but they aren't. Commented Apr 6, 2014 at 16:59
  • "I got a little closer to the expected output" is extremely unhelpful. Please provide inputs and expected and actual outputs. It would be useful if you tried to describe in words what the sorting algorithm should do, too. Also, review your variable names - they are currently pretty bad. Commented Apr 6, 2014 at 18:42
  • range(0,len(b)-1): should be range(len(b)) - range goes up to but doesn't include the stop parameter. Better yet, adopt the enumerate approach my answer suggests. Commented Apr 6, 2014 at 18:57

1 Answer

Firstly, not sure why you've got everything in single-item lists, so I suggest flattening it out first:

t = [l[0] for l in t]

This gets rid of the extra zero indices everywhere, and means you only need shallow copies (as strings are immutable).
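As a minimal sketch of that point (the names `flat` and `remaining` are illustrative, not from the question), flattening followed by a plain `list(...)` copy is enough, so `copy.deepcopy` is not needed:

```python
# Nested single-item lists, as in the question.
t = [['universitario de deportes'], ['lancaster'], ['universitario de'],
     ['juan aurich'], ['muni'], ['juan']]

# Flatten: take the only element of each inner list.
flat = [item[0] for item in t]

# A shallow copy is enough because the elements are immutable strings.
remaining = list(flat)
del remaining[0]  # changing the copy leaves `flat` untouched

print(len(flat), len(remaining))
```
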

Secondly, the last three lines of your code never run:

if m > len(b):
    break # nothing after this will happen
    if jack(d,b[m][0])>0.3:
       c.append(b[m][0])
       del b[m]
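A small sketch of why this happens, using an illustrative list `b` and a counter `ran` that are not part of the original code: anything indented under the guard and placed after `break` sits in the same block as the `break`, so it can never execute.

```python
b = ['lancaster', 'universitario de']
ran = 0  # counts how often the statement after `break` executes

for m in range(0, len(b) + 1):
    if m > len(b):
        break
        ran += 1  # same block as `break`, placed after it: unreachable

print(ran)
```

In this sketch the guard itself is never true either, since `m` never exceeds `len(b)` inside that range.
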

I think what you want is:

out = [] # this will be the sorted list
for index, val1 in enumerate(t): # work through each item in the original list
    if val1 not in out: # if we haven't already put this item in the new list
        out.append(val1) # put this item in the new list
    for val2 in t[index+1:]: # search the rest of the list
        if val2 not in out: # if we haven't already put this item in the new list
            if jack(val1, val2) > 0.3: # and the new item is close to the current item
                out.append(val2) # add the new item too

This gives me

out == ['universitario de deportes', 'universitario de', 
      'lancaster', 'juan aurich', 'juan', 'muni']
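For reference, here is a self-contained sketch of the same approach wrapped in a function. The name `group_similar` and the `threshold` parameter are my own additions for illustration, and `jack` is rewritten slightly so that the division behaves the same on Python 2 and 3. It assumes the names are non-empty strings.

```python
def jack(a, b):
    """Jaccard index of the word sets of two strings."""
    x, y = set(a.split()), set(b.split())
    return len(x & y) / float(len(x | y))


def group_similar(names, threshold=0.3):
    """Reorder names so each one is followed by later names similar to it."""
    out = []
    for index, val1 in enumerate(names):
        if val1 not in out:  # not placed yet
            out.append(val1)
        for val2 in names[index + 1:]:  # search the rest of the list
            if val2 not in out and jack(val1, val2) > threshold:
                out.append(val2)
    return out


t = ['universitario de deportes', 'lancaster', 'universitario de',
     'juan aurich', 'muni', 'juan']
result = group_similar(t)
print(result)
```

Because membership in `out` is checked by value, a name that occurs more than once in the input appears only once in the result.
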

I would generally recommend using better variable names than a, b, c, etc.


7 Comments

Your code doesn't work for the case: t=["cala","cala lima","uni","ali","uni le","ali po", "tr", "wq","tr uni"]
Edited - is that better? It would be helpful if you provided the answer you were expecting, rather than just "doesn't work".
["cala","cala lima","ali","ali po","uni","uni le","tr uni","tr","wq"]
Check my edit; I got a little closer to the expected output, maybe you can correct me.
Your code is nice and very close to the expected output, but I don't want duplicates, so I choose the first time the Jaccard similarity is larger than 0.3.