
I have categorical attributes that contain string values. Three of them contain a day name (Mon-Sun), a month name, and a time interval (morning, afternoon, evening); two others, as mentioned before, hold district and street names; the remaining ones are gender, role, comments (a predefined fixed field with values such as "good", "bad", "strongly agree", etc.), surname, and first name. My intention is to cluster these records and visualize the result. I applied k-means clustering using WEKA but it did not work. Now I wish to apply hierarchical clustering instead. I found this code:

import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

X = np.random.randn(100, 2)                        # 100 two-dimensional observations
d = pdist(X)                                       # condensed vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')              # complete-linkage hierarchical clustering
ind = sch.fcluster(L, 0.5 * d.max(), 'distance')   # flat cluster labels, cut at half the maximum distance

However, X in the above code is numeric; I have categorical data. Is there some way I can use a NumPy array of categorical data to compute the distances? In other words, can I use categorical data with string values to find the distances, and then pass that distance vector to sch.linkage(d, method='complete')?

3 Comments
  • How do you plan to define the distance between strings -- or is that part of your question? Commented May 31, 2017 at 23:08
  • That is the question. My understanding is that the distance calculation method can be specified in pdist. I intend to use the cosine function, though I am not sure it is the right way to find the distance, so my first problem is how to define the variable X in the above code for categorical variables. Commented Jun 1, 2017 at 9:03
  • The basic question, I guess, is how to represent categorical variables having multiple values. I understand the k-modes method is used for categorical variables, but I intend to use hierarchical clustering. Commented Jun 1, 2017 at 14:15

2 Answers


I think we've identified the problem, then: you leave the X values as they are, string data. You can pass those to pdist, but you also have to supply a 2-arity function (2 inputs, numeric output) for the distance metric.

The simplest one would be that equal classifications have 0 distance; everything else is 1. You can do this with

d = pdist(X, lambda u, v: float(np.any(u != v)))   # 0 if the two rows match exactly, 1 otherwise

If you have other class discrimination in mind, just code the logic to return the desired distance, wrap it in a function, and pass that function to pdist. We can't help with that, because you've told us nothing about your classes or the model semantics.

Does that get you moving?
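
A minimal sketch of what such a wrapped metric could look like, assuming the string attributes are first integer-coded with pandas (the column names and values below are illustrative, not the asker's actual data):

import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Toy categorical table; real data would have the day/month/district/etc. columns.
df = pd.DataFrame({
    'day':      ['mon', 'tue', 'mon', 'sun'],
    'interval': ['morning', 'evening', 'morning', 'afternoon'],
    'district': ['north', 'north', 'south', 'east'],
})

# Encode each column as integer codes; the codes are only labels, not magnitudes.
codes = np.column_stack([pd.factorize(df[c])[0] for c in df.columns]).astype(float)

# Simple matching: distance = fraction of attributes on which two records disagree.
d = pdist(codes, lambda u, v: np.mean(u != v))

L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5 * d.max(), 'distance')

With equal weight per attribute, this lambda reproduces the normalized Hamming distance mentioned in the other answer; a hand-written metric only becomes necessary when some attributes (say, day names) should contribute a graded rather than 0/1 difference.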


4 Comments

Thanks. My data has more than 10 attributes, and two of them contain the names of districts and streets in a city. Each of those could have many distinct values, maybe over 20. I am not sure the above technique is applicable to this kind of categorical values. Please advise.
I advise you to specify the problem. "I am not sure ..." is not a specification. I've answered your given question: how to represent the categorical data (just as you're already doing) and how to handle a distance function. I gave you a trivial sample. There is no way I can handle a follow-up question when you continue to avoid any productive comment on what you do need for a distance metric.
OK, here is my specification: I have categorical attributes that contain string values. Three of them contain a day name (Mon-Sun), a month name, and a time interval (morning, afternoon, evening); the other two, as I mentioned before, hold district and street names; followed by gender, role, comments (a predefined fixed field with values such as "good", "bad", "strongly agree", etc.), surname, and first name. My intention is to cluster them and visualize the result. I applied k-means clustering using WEKA but it did not work. I hope I have specified the problem now.
Please edit this into the question's main body, for all to see. Also discuss what you imagine as a useful distance function. For instance, is the distance Sun-to-Tue twice as far as Sun-to-Mon, or the same?
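
A minimal sketch of one possible answer to that last question, assuming day names are mapped to 0-6 and treated cyclically so that Sun-to-Mon counts as closer than Sun-to-Wed (the mapping and the cyclic choice are assumptions, not anything specified in the question):

# Illustrative only: a cyclic distance for the day-of-week attribute.
DAYS = {'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4, 'sat': 5, 'sun': 6}

def day_distance(a, b):
    # Shortest way around the 7-day cycle between two day names.
    diff = abs(DAYS[a] - DAYS[b])
    return min(diff, 7 - diff)

day_distance('sun', 'mon')   # 1
day_distance('sun', 'tue')   # 2

A per-attribute helper like this can then be combined with plain 0/1 comparisons for the other attributes inside the custom function passed to pdist.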

Another possibility is the use of the Hamming distance.

Y = pdist(X, 'hamming')

Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean.

If your categorical data is represented by a single character per attribute, e.g. "m"/"f", it could be what you are looking for.

https://en.wikipedia.org/wiki/Hamming_distance

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist
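
A minimal sketch of this route, assuming the string columns are one-hot encoded so that X is boolean (column names and values are illustrative):

import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

df = pd.DataFrame({
    'gender': ['m', 'f', 'f', 'm'],
    'day':    ['mon', 'mon', 'sun', 'tue'],
})

# One-hot encode: every category becomes its own boolean indicator column.
X_bool = pd.get_dummies(df).to_numpy(dtype=bool)

Y = pdist(X_bool, 'hamming')             # proportion of indicator columns that disagree
L = sch.linkage(Y, method='complete')

Note that with one-hot encoding each mismatching attribute flips exactly two indicator columns, so the resulting distances are the simple-matching distances rescaled by a constant factor and the linkage tree is unchanged.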
