1

I have a dataframe that I converted to arrays so that I can model the data with a regression algorithm. I used the following code:

X=df.iloc[:, 0:345].values
Y=df.iloc[:,345].values

Hence X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train/test split:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler

X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)

Now that I have built the model and made predictions, I want to get back the values of my categorical variables (X and Y were created after generating dummy variables for all categorical variables). For this, I am trying to convert X_test back to a dataframe with the column names from the original dataframe df. I tried the following code:

dff=df.iloc[:, 0:345]

The above statement selects the first 345 columns of the dataframe.

Then,

pd.DataFrame(X_test, index=dff.index, columns=dff.columns)

I get the following error

ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)

I don't understand why the number of rows matters. I have fewer rows because the data was split 75%/25% into train and test, and I performed the split after converting the data to an array. How do I now convert the array data into a dataframe with the column names from the dff dataframe?

2
  • Do you need to reset_index()? Just a hunch here, might not be the issue Commented Aug 16, 2018 at 19:01
  • No. I tried that too earlier! Commented Aug 16, 2018 at 19:33

2 Answers

1
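The failing statement was:

pd.DataFrame(X_test, index=dff.index, columns=dff.columns)

X_test is a numpy.ndarray that holds only the 25,000 test rows, while dff.index has 100,000 entries, so passing index=dff.index produces the shape mismatch in the error.

I modified the above statement to just this: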

df_new=pd.DataFrame(X_test)
df_new.columns=list(dff.columns)

The new dataframe contains the X_test data, with the column names taken from the dff dataframe.
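
For reference, here is a minimal sketch of the same fix on a small synthetic frame. The column names (feat_a, feat_b, feat_c, target) and the 8-row size are made up for illustration and are not from the question; the import uses sklearn.model_selection, on the assumption that a recent scikit-learn is installed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the question's df: 3 feature columns plus a target.
df = pd.DataFrame(
    np.arange(32).reshape(8, 4),
    columns=["feat_a", "feat_b", "feat_c", "target"],
)

dff = df.iloc[:, 0:3]          # feature columns only, as in the question
X = dff.values                 # plain arrays, so row labels are lost here
Y = df.iloc[:, 3].values

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# X_test holds only the test rows, so dff.index (all rows) cannot be reused.
# Pass the column names only; pandas assigns a fresh 0..n-1 index.
df_new = pd.DataFrame(X_test, columns=dff.columns)
```

Note that df_new gets a new default index, so its row labels no longer match the rows of df.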


0

I would recommend passing the DataFrame itself to train_test_split, and then converting to arrays with numpy only when you call your algorithm:

my_algorithm(np.asarray(X_train), np.asarray(y_train))

This way you can inspect your data the same way you would any df, but still run the model on arrays. I'm not sure which library you are using, but I'm fairly sure some can now take DataFrames directly for modeling.
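
A rough sketch of this approach, again on made-up data (the column names and sizes are illustrative assumptions, and the model call is left out because the question does not say which library is used):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the question's df: 3 feature columns plus a target.
df = pd.DataFrame(
    np.arange(32).reshape(8, 4),
    columns=["feat_a", "feat_b", "feat_c", "target"],
)

X_df = df.iloc[:, 0:3]   # keep features as a DataFrame
y_sr = df.iloc[:, 3]     # keep target as a Series

# Splitting pandas objects returns pandas objects, so the column names
# and the original row labels survive the split.
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_sr, test_size=0.25, random_state=0
)

# Convert to arrays only at the point where the model needs them.
X_test_arr = np.asarray(X_test)

# The preserved index lets you look up the matching rows in the original df.
original_rows = df.loc[X_test.index]
```

Because X_test keeps the original index, no reconstruction step is needed after prediction.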
