1

I would like to create a dataframe without reading it from CSV.

For example, I would like to create the columns and one record. Please assume something like this:

   Feature1  Feature2  Feature3  ...  Feature n
1        20     False       3.2  ...       True

I built a classifier and would like to make a prediction with it: classifier.predict(dataframe)

I receive the record as a string with "," between the features, and I use split to extract the list of features:

record_features = "16,713,Danny, ..."
features = record_features.split(',')

After that I convert the list into a Series:

series = pd.Series(features)

And after that I would like to create a dataframe:

column_names = ['feature1', 'feature2', ..., 'feature102']
df = pd.DataFrame(series, columns=column_names)

I got an error:

ValueError: Shape of passed values is (1, 102), indices imply (102, 102)

I really do have 102 features, and I would like to create a dataframe with those columns and a single record.

Any suggestions?

2 Answers

3

You can wrap the list in [] so that it is passed as a single row:

import pandas as pd

column_names = ['Feature1','Feature2','Feature102']
record_features = "16,713,Danny"
features = record_features.split(',')

# [features] is a list containing one row, so the result has shape (1, 3)
df = pd.DataFrame([features], columns=column_names)
print(df)
  Feature1 Feature2 Feature102
0       16      713      Danny
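
Since the goal is to call classifier.predict on the result, it may be worth noting that split returns strings, so every column above holds text. The following is a minimal sketch of converting selected columns afterwards; which columns are numeric is an assumption made for this example, not something stated in the question.

```python
import pandas as pd

column_names = ["Feature1", "Feature2", "Feature102"]
record_features = "16,713,Danny"
features = record_features.split(",")

# A list holding one list is interpreted as one row with three columns.
df = pd.DataFrame([features], columns=column_names)

# split() yields strings only; convert the columns assumed to be numeric.
# The choice of Feature1 and Feature2 here is hypothetical.
df = df.astype({"Feature1": int, "Feature2": int})

print(df.shape)
print(df.dtypes)
```

The same astype call accepts any mapping of column name to type, so it can be extended to all 102 columns if their types are known.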

Another numpy solution with reshape:

import numpy as np

# reshape the flat array into (number of rows, number of columns)
df = pd.DataFrame(np.array(features)
                    .reshape(len(features) // len(column_names), len(column_names)),
                 columns=column_names)
print(df)
  Feature1 Feature2 Feature102
0       16      713      Danny
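
When there is always exactly one record, the row count does not need to be computed. A possible shorter variant of the reshape approach, sketched here with the same example values, asks NumPy for one row and lets it infer the number of columns with -1:

```python
import numpy as np
import pandas as pd

column_names = ["Feature1", "Feature2", "Feature102"]
features = "16,713,Danny".split(",")

# reshape(1, -1) means: one row, as many columns as there are values.
row = np.array(features).reshape(1, -1)
df = pd.DataFrame(row, columns=column_names)

print(row.shape)
print(df)
```

If the number of values does not match len(column_names), the DataFrame constructor raises an error, which makes a malformed record easy to spot.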

Timings:

column_names = ['Feature' + str(x) for x in range(102)]
record_features = "16,713,Danny"
features = record_features.split(',')
features = features * 34

In [222]: %timeit pd.DataFrame([features], columns=column_names)
100 loops, best of 3: 5.94 ms per loop

In [223]: %timeit pd.DataFrame(dict(zip(column_names, features)), index=[0], columns=column_names)
The slowest run took 4.48 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 5.25 ms per loop

In [224]: %timeit pd.DataFrame(np.array(features).reshape(len(features) // len(column_names), len(column_names)), columns=column_names)
The slowest run took 5.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 206 µs per loop

0

You can pass in a dictionary to the DataFrame constructor:

import pandas as pd

column_names = ['Feature1','Feature2','Feature102']
record_features = "16",713,"Danny"

# index=[0] is needed because all dictionary values are scalars
print(pd.DataFrame(dict(zip(column_names, record_features)), index=[0], columns=column_names))

  Feature1  Feature2 Feature102
0       16       713      Danny
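
A related variant, shown here only as a sketch, is to wrap the dictionary in a list. Each dictionary in the list is then treated as one row, so index=[0] is not required. The name record_values is introduced for this example.

```python
import pandas as pd

column_names = ["Feature1", "Feature2", "Feature102"]
record_values = ("16", 713, "Danny")  # example values from the answer above

# Pair each column name with its value, then pass a one-element list of rows.
record = dict(zip(column_names, record_values))
df = pd.DataFrame([record], columns=column_names)

print(df)
```

Passing columns=column_names keeps the column order explicit and independent of the dictionary's order.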
