I have many arrays of the same dimension, such as

x = np.array([3, 2, 0, 4, 5, 2, 1, ...])  # the dimension of the vectors is above 50,000
y = np.array([1, 3, 4, 2, 4, 1, 4, ...])

What I want to do is use feature hashing to reduce the dimensionality of these vectors (even though there will be collisions). The lower-dimensional vectors can then be fed to classifiers.

What I have tried is

from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher()
hash_vector = hasher.transform(x)

However, it seems that FeatureHasher cannot be used directly like this, and it raises AttributeError: 'matrix' object has no attribute 'items'.

Therefore, in order to do feature hashing smoothly, what should I do next? Can anyone let me know if I am missing something? Or is there another way to do feature hashing more effectively?

1 Answer

The argument to the transform method must be an iterable of samples, not a single sample -- see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html .

But, there are more issues with your code: you're not passing input_type to build the hasher, so it's defaulting to dict -- "dictionaries over (feature_name, value)" (whence the need for items:-).

And anyway, no input type will make the hasher accept the "unnamed" features you seem to want to pass to transform ... that's just not how feature hashing works.
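To make this concrete: one way to give your positional features names is to use each index as the feature name, with one dict per sample. This is a minimal sketch, assuming the array from the question; the n_features value of 1024 is just an illustrative target dimension, not a recommendation.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

x = np.array([3, 2, 0, 4, 5, 2, 1])  # stand-in for the real 50,000-dim vector

# Default input_type='dict': each sample is a dict of (feature_name, value).
# Using the index as the feature name gives the hasher something to hash.
hasher = FeatureHasher(n_features=1024)  # 1024 is an arbitrary example size
samples = [{str(i): float(v) for i, v in enumerate(x)}]  # iterable of samples
X = hasher.transform(samples)

print(X.shape)  # (1, 1024) -- one sample, hashed into 1024 columns
```

Note that transform receives a list (an iterable of samples), even when hashing a single vector.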

You might consider different approaches to dimensionality reduction, such as http://scipy-lectures.github.io/advanced/scikit-learn/#dimension-reduction-with-principal-component-analysis ...
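For instance, a PCA-based reduction along the lines of that tutorial might look like the sketch below; the data is made up (200 samples, 500 features) and n_components=50 is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data standing in for the real vectors: 200 samples, 500 features
rng = np.random.RandomState(0)
X = rng.rand(200, 500)

pca = PCA(n_components=50)  # 50 is an illustrative target dimension
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (200, 50)
```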


5 Comments

Thanks for your answer. You mean that if I want to use feature hashing, I must pass feature names to the transform method, right? So are the feature names necessary?
Yes, that's what gets hashed in feature hashing -- no feature name, no feature hashing! See more details at en.wikipedia.org/wiki/Feature_hashing
Hi teacher, thank you very much for your kind answer. I have another question. In scikit-learn, it says feature hashing can be used in text classification. But if we use feature hashing to reduce the dimension of the training vectors, it will worsen the accuracy, right? I really need your help.
@wanglan8498, in theory it might, but in practice with natural-language text the saving in space and time make it work well -- see en.wikipedia.org/wiki/Feature_hashing and the links at the end of that article, esp. Ganchev and Dredze's.
Thank you very much, sir. After reading the links you mentioned, I now see that feature hashing is an effective method to save memory, though it may cost a little accuracy. By the way, if I want to focus on improving the accuracy of text classification, what kind of tricks can I use? Could you give me some ideas? I need your help.
