I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression (i.e. continuous valued response values) dataset in svmlight format.
I am working with scikit-learn version 0.11.
I have tried two approaches - the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.
I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.
The two approaches mentioned above:
- I tried using sklearn.feature_selection.f_regression(X,Y).
This failed with the following error message: "TypeError: copy() takes exactly 1 argument (2 given)"
- I tried using chi2(X,Y). This "worked" but I suspect this is because the two response values 0.1 and 1.8 in my toy dataset were being treated as class labels? Presumably, this would not yield a meaningful chi-squared statistic for a real dataset for which there would be a large number of possible response values and the number in each cell [with a particular response value and value for the attribute being tested] would be low?
Please find my toy dataset pasted into the end of this message.
The following code snippet should give the results I describe above.
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func="one of the two functions I refer to above",k=2) #sorry, I hope this message is clear
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))] #This should print the indices of the top 2 features
Thanks in advance.
Richard
Contents of my contrived svmlight file - with additional blank lines inserted for clarity:
1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1.8 1:1.000000 2:1.000000#mB
0.1 5:1.000000#mC
1.8 1:1.000000 2:1.000000#mD
0.1 3:1.000000 4:1.000000#mE
0.1 3:1.000000#mF
1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
1.8 2:1.000000#mH
chi2is for classification only. To make it work in a regression setting, you'll have to bin your Y values.