I have many arrays of the same dimension, such as

x = np.array([3, 2, 0, 4, 5, 2, 1, ...])  # the dimension of the vectors is above 50,000
y = np.array([1, 3, 4, 2, 4, 1, 4, ...])

What I want to do is use feature hashing to reduce the dimensionality of these vectors (even though there will be collisions). The lower-dimensional vectors can then be fed to classifiers.

What I have tried is

from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher()
hash_vector = hasher.transform(x)

However, it seems that FeatureHasher cannot be used directly like this, and it raises AttributeError: 'matrix' object has no attribute 'items'.

Therefore, in order to do feature hashing smoothly, what should I do next? Can anyone let me know if I am missing something? Or is there another way to do feature hashing more effectively?

1 Answer

The argument to the transform method must be an iterable of samples, not a single sample -- see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html .

But, there are more issues with your code: you're not passing input_type to build the hasher, so it's defaulting to dict -- "dictionaries over (feature_name, value)" (whence the need for items:-).

And anyway, no input type will make the hasher accept the "unnamed" features you seem to want to pass to transform ... that's just not how feature hashing works.
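To make this concrete: one way to give your positional features names is to use each index as the feature name, with one dict per sample. This is a minimal sketch, assuming the array from the question; the n_features value of 1024 is just an illustrative target dimension, not a recommendation.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

x = np.array([3, 2, 0, 4, 5, 2, 1])  # stand-in for the real 50,000-dim vector

# Default input_type='dict': each sample is a dict of (feature_name, value).
# Using the index as the feature name gives the hasher something to hash.
hasher = FeatureHasher(n_features=1024)  # 1024 is an arbitrary example size
samples = [{str(i): float(v) for i, v in enumerate(x)}]  # iterable of samples
X = hasher.transform(samples)

print(X.shape)  # (1, 1024) -- one sample, hashed into 1024 columns
```

Note that transform receives a list (an iterable of samples), even when hashing a single vector.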

You might consider different approaches to dimensionality reduction, such as http://scipy-lectures.github.io/advanced/scikit-learn/#dimension-reduction-with-principal-component-analysis ...
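For instance, a PCA-based reduction along the lines of that tutorial might look like the sketch below; the data is made up (200 samples, 500 features) and n_components=50 is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data standing in for the real vectors: 200 samples, 500 features
rng = np.random.RandomState(0)
X = rng.rand(200, 500)

pca = PCA(n_components=50)  # 50 is an illustrative target dimension
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (200, 50)
```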


5 Comments

Thanks for your answer. You mean that if I want to use feature hashing, I must pass feature names to the transform method, right? So are the feature names necessary?
Yes, that's what gets hashed in feature hashing -- no feature name, no feature hashing! See more details at en.wikipedia.org/wiki/Feature_hashing
Hi teacher, thank you very much for your kind answer. I have another question. In scikit-learn, it says feature hashing can be used in text classification. But if we use feature hashing to reduce the dimension of the training vectors, it will worsen the accuracy, right? I really need your help.
@wanglan8498, in theory it might, but in practice with natural-language text the saving in space and time make it work well -- see en.wikipedia.org/wiki/Feature_hashing and the links at the end of that article, esp. Ganchev and Dredze's.
Thank you very much, sir. After reading the links you mentioned, I now see that feature hashing is an effective method to save memory, though it may cost a little accuracy. By the way, if I want to focus on improving the accuracy of text classification, what kind of tricks can I use? Could you give me some ideas? I need your help.
