I have a DataFrame that I converted to arrays so that I can model the data with a regression algorithm. I used the following code to do it:
X=df.iloc[:, 0:345].values
Y=df.iloc[:,345].values
X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train and test split:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
Now that I have built the model and made predictions, I want to recover the values of my categorical variables (X and Y were created after generating dummy variables for all categorical variables). To do this, I am trying to convert X_test back to a DataFrame with the column names from the original DataFrame df. I tried the following code:
dff=df.iloc[:, 0:345]
The statement above selects the first 345 columns of the DataFrame.
Then,
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
I get the following error
ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)
I don't understand why the number of rows matters. I have fewer rows because the data was split 75%/25% into train and test sets, and I perform the split after the data is converted to arrays. How do I now convert the array back into a DataFrame with the column names from dff?
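For reference, here is a minimal sketch of one possible way around the mismatch, using a small made-up frame (the column names a, b, c and target are hypothetical, not from my real data). The error message suggests that dff.index still carries all 100000 row labels while X_test only holds the 25000 test rows, so the index passed to pd.DataFrame has to be the one belonging to the test rows. Assuming a scikit-learn version that provides sklearn.model_selection, splitting the DataFrame itself keeps that index available:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 3 feature columns plus a target column.
df = pd.DataFrame(
    np.arange(32, dtype=float).reshape(8, 4),
    columns=["a", "b", "c", "target"],
)
dff = df.iloc[:, 0:3]
Y = df.iloc[:, 3].values

# Split the DataFrame (not the array) so the test rows keep their index labels.
X_train_df, X_test_df, y_train, y_test = train_test_split(
    dff, Y, test_size=0.25, random_state=0
)

# Plain array for the model, as in the original workflow.
X_test = X_test_df.values

# Rebuild with the index of the test rows only, not dff.index.
restored = pd.DataFrame(X_test, index=X_test_df.index, columns=dff.columns)
```

If the original row labels are not needed, simply leaving out the index argument, as in pd.DataFrame(X_test, columns=dff.columns), should also avoid the shape error, at the cost of a fresh 0..n-1 index.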
reset_index()? Just a hunch here, might not be the issue