Converting an array structure to a dataframe to get the column names

Question

I have a dataframe that I converted to arrays so I can model the data with a regression algorithm. I used the following code to do it:

X=df.iloc[:, 0:345].values
Y=df.iloc[:,345].values

X and Y are now arrays. There are many columns because the categorical variables have been expanded into dummy variables. Next, I create the train and test split:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)

Now that I have built the model and made predictions, I want to get back the values of my categorical variables (X and Y were created after generating dummy variables for all categorical variables). To do this, I am trying to convert X_test back to a dataframe with the column names of the original dataframe df. I tried the following code:

dff=df.iloc[:, 0:345]

The above statement is to get the first 345 columns (of the data frame).

Then,

pd.DataFrame(X_test, index=dff.index, columns=dff.columns)

I get the following error

ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)

I don't understand why the number of rows matters. I have fewer rows because the data was split 75%/25% into train and test, and I performed the split after converting the data to arrays. How do I now convert the array into a dataframe with the column names from dff?

Do you need to reset_index()? Just a hunch here, might not be the issue.
– rahlf23, Commented Aug 16, 2018 at 19:01

Harikrishna · Accepted Answer · 2018-08-16 19:55:42Z

1

pd.DataFrame(X_test, index=dff.index, columns=dff.columns)

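Here X_test is a numpy.ndarray holding only the 25,000 test rows, while dff.index still carries all 100,000 row labels, so the index argument does not fit the data.

I changed the statement above to just this: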

df_new=pd.DataFrame(X_test)
df_new.columns=list(dff.columns)

The new dataframe contains the X_test data, with the column names taken from dff. Its index is a default 0 to n-1 range, not the row labels of the original df.
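
To illustrate why the original call failed and why dropping the index argument fixes it, here is a minimal sketch on a tiny made-up dataframe. The column names (age, city_A, city_B, target) and the hand-picked test rows are assumptions for illustration only; they stand in for the 345 feature columns and the real split in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: 8 rows, 3 features + 1 target.
df = pd.DataFrame(
    np.arange(32).reshape(8, 4),
    columns=["age", "city_A", "city_B", "target"],
)
dff = df.iloc[:, 0:3]   # feature columns only, like df.iloc[:, 0:345]
X = dff.values          # plain ndarray: row labels and column names are gone

# Pretend these two rows ended up in the test set.
X_test = X[[5, 1]]

# Passing index=dff.index fails: 8 row labels are offered for a 2-row array.
raised = False
try:
    pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Supplying only the column names works; pandas builds a fresh default index.
df_new = pd.DataFrame(X_test, columns=dff.columns)
print(df_new)
```

Passing columns= in the constructor is equivalent to assigning df_new.columns afterwards. Either way, the rows of df_new are numbered from 0, so they can no longer be matched to rows of df by label.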

Nick Tallant · Accepted Answer · 2018-08-16 19:07:00Z

0

I would recommend passing the DataFrame itself to train_test_split, and then converting to arrays with numpy only when you call your algorithm:

my_algorithm(np.asarray(X_train), np.asarray(y_train))

This way you can inspect your data the same way you would any df, but still run the model on arrays. I'm not sure which library you are using, but I'm pretty sure some of them accept DataFrames directly for modeling now.
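
A minimal sketch of that approach, assuming scikit-learn's train_test_split from sklearn.model_selection and the same kind of made-up dataframe (the column names are hypothetical). Splitting the DataFrame directly keeps both the column names and the original row labels on X_test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the question's data: 8 rows, 3 features + 1 target.
df = pd.DataFrame(
    np.arange(32).reshape(8, 4),
    columns=["age", "city_A", "city_B", "target"],
)
X_df = df.iloc[:, 0:3]   # still a DataFrame (no .values)
y = df.iloc[:, 3]        # still a Series

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.25, random_state=0
)

# The split pieces keep their column names and original row labels.
print(X_test.columns.tolist())
print(X_test.index.tolist())

# Convert to arrays only at the point where the model needs them.
X_train_arr = np.asarray(X_train)
y_train_arr = np.asarray(y_train)
print(X_train_arr.shape, y_train_arr.shape)
```

Because X_test keeps the row labels of df, something like df.loc[X_test.index] looks up the matching original rows, including any columns that were not passed to the model.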
