I am importing a CSV file using pandas.

The CSV column headers are Year, Model, Trim, Result.

The values coming in from the CSV file are as follows:

Year  |  Model  | Trim  | Result

2012  | Camry   | SR5   | 1
2014  | Tacoma  | SR5   | 1
2014  | Camry   | XLE   | 0
etc..

There are 2500+ rows in the data set containing over 200 unique models.

All values are then converted to numerical values for analysis.

The inputs are the first three columns of the CSV file, and the output is the fourth column, Result.

Here is my script:

import pandas as pd
import numpy as np

c1 = []
c2 = []
c3 = []
input = []
output = []

# read in the csv file containing 4 columns
df = pd.read_csv('success.csv')
df.convert_objects(convert_numeric=True)
df.fillna(0, inplace=True)

# convert string values to numerical values
def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x+=1

            df[column] = list(map(convert_to_int, df[column]))

    return df

df = handle_non_numerical_data(df)

# extract each column to insert into input array later
c1.append(df['Year'])
c2.append(df['Model'])
c3.append(df['Trim'])

# create input array containing the first 3 columns of the csv file
input = np.column_stack((c1, c2, c3))
output.append(df['Result'])

This works fine, except that append only accepts one value. Should I use extend instead? It seems that would attach the values to the end of the array.

UPDATE

Essentially all of this works great; my problem is creating the input array. I would like each row of the array to consist of three columns: Year, Model, Trim.

input = ([['Year'], ['Model'], ['Trim']],[['Year'], ['Model'], ['Trim']]...)

I can only seem to stack one column on top of the other rather than having them side by side in sequence.

What I get now -

input = ([['Year'], ['Year'], ['Year']].., [['Model'], ['Model'], ['Model']]..[['Trim'], ['Trim'], ['Trim']]...) 

  • I'm struggling to understand the problem. Can you please rephrase, or perhaps add an example of current and expected behavior? Commented Feb 21, 2017 at 1:47
  • It is unclear what you are doing exactly, since we do not know anything about your csv. You should try to give an example of input and expected output. In this case, namely, why the result of pd.read_csv is not acceptable. I suspect that whatever you are trying to accomplish can be done in a much more straightforward manner. Commented Feb 21, 2017 at 1:52
  • Sorry, I tried to update the question to better explain my problem. Basically, I can't sequence the three arrays into one array without stacking them. Commented Feb 21, 2017 at 2:41
  • @RyanD You need to explain what your input data looks like, i.e. the csv, and exactly what you want as an output. Your function, handle_non_numerical_data, is probably not the best way to convert your values to integers; that can be handled much more easily and efficiently using built-in pandas/numpy functions. Also, it is not clear why you are putting all the columns in a list instead of using df.values. I will repeat, I suspect that whatever you are trying to accomplish can be done in a much more straightforward manner. Commented Feb 21, 2017 at 3:50
  • Thanks @juanpa.arrivillaga I updated the question with the csv values to help better explain.. Commented Feb 21, 2017 at 4:03

1 Answer

To elaborate on my comment, suppose you have some DataFrame consisting of non-integer values:

>>> df = pd.DataFrame([[np.random.choice(list('abcdefghijklmnop')) for _ in range(3)] for _ in range(10)])
>>> df
   0  1  2
0  j  p  j
1  d  g  b
2  n  m  f
3  o  b  j
4  h  c  a
5  p  m  n
6  c  c  l
7  o  d  e
8  b  g  h
9  h  o  k

And there is also an output:

>>> df['output'] = np.random.randint(0,2,10)
>>> df
   0  1  2  output
0  j  p  j       0
1  d  g  b       0
2  n  m  f       1
3  o  b  j       1
4  h  c  a       1
5  p  m  n       0
6  c  c  l       1
7  o  d  e       0
8  b  g  h       1
9  h  o  k       0

To convert all the string values to integers, use np.unique with return_inverse=True. The inverse is the array you need; just keep in mind that you need to reshape it back to the original rows and columns, because np.unique works on the flattened values:

>>> unique, inverse  = np.unique(df.iloc[:,:3].values, return_inverse=True)
>>> unique
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
       'o', 'p'], dtype=object)
>>> inverse
array([ 8, 14,  8,  3,  6,  1, 12, 11,  5, 13,  1,  8,  7,  2,  0, 14, 11,
       12,  2,  2, 10, 13,  3,  4,  1,  6,  7,  7, 13,  9])
>>> input = inverse.reshape(df.shape[0], df.shape[1] - 1)
>>> input
array([[ 8, 14,  8],
       [ 3,  6,  1],
       [12, 11,  5],
       [13,  1,  8],
       [ 7,  2,  0],
       [14, 11, 12],
       [ 2,  2, 10],
       [13,  3,  4],
       [ 1,  6,  7],
       [ 7, 13,  9]])

And you can always go back:

>>> unique[input]
array([['j', 'p', 'j'],
       ['d', 'g', 'b'],
       ['n', 'm', 'f'],
       ['o', 'b', 'j'],
       ['h', 'c', 'a'],
       ['p', 'm', 'n'],
       ['c', 'c', 'l'],
       ['o', 'd', 'e'],
       ['b', 'g', 'h'],
       ['h', 'o', 'k']], dtype=object)
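
One thing to be aware of: np.unique as used above builds a single lookup table shared by all the columns, so the same string gets the same code wherever it appears. If you would rather number each column independently, one possible alternative (not what the steps above use) is pd.factorize applied column by column. This is only a minimal sketch; the sample rows are taken from the question, and the names codes, lookups, encoded and decoded_models are made up for illustration:

```python
import numpy as np
import pandas as pd

# Small made-up frame using the Model/Trim values from the question.
df = pd.DataFrame({
    "Model": ["Camry", "Tacoma", "Camry"],
    "Trim": ["SR5", "SR5", "XLE"],
})

codes = {}    # column name -> integer codes, one per row
lookups = {}  # column name -> unique values, indexed by code

for col in df.columns:
    # factorize numbers values in order of first appearance
    codes[col], lookups[col] = pd.factorize(df[col])

# Put the per-column codes side by side: one row per record.
encoded = np.column_stack([codes[col] for col in df.columns])

# Going back works per column, using that column's own lookup.
decoded_models = lookups["Model"].take(codes["Model"])
```

Here each column has its own code space, so code 0 in Model and code 0 in Trim refer to different strings. Whether that or a shared table is preferable depends on the analysis you have in mind.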

To get an array for the output, again, you simply take the appropriate column of the df and use its .values, since the columns are already backed by numpy arrays:

>>> output = df['output'].values
>>> output
array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

You might want to reshape it, depending on what libraries you are going to use for analysis (sklearn, scipy, etc):

>>> output.reshape(output.size, 1)
array([[0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0]])
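
Putting the pieces together for a CSV shaped like the one in the question, a rough sketch might look like the following. It assumes that Year and Result are already numeric and that only Model and Trim need encoding; the CSV text is inlined from the three sample rows so the snippet runs without success.csv, and the names text_cols, codes, X, y and y_column are illustrative only:

```python
import io

import numpy as np
import pandas as pd

# Stand-in for pd.read_csv('success.csv'), using the question's sample rows.
csv_text = """Year,Model,Trim,Result
2012,Camry,SR5,1
2014,Tacoma,SR5,1
2014,Camry,XLE,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Encode only the text columns; Year is assumed to be numeric already.
text_cols = df[["Model", "Trim"]].to_numpy().astype(str)
unique, inverse = np.unique(text_cols, return_inverse=True)

# Reshape to (rows, 2) so each row keeps its own Model and Trim codes.
codes = inverse.reshape(text_cols.shape)

# Columns side by side: Year, Model code, Trim code.
X = np.column_stack([df["Year"].to_numpy(), codes])

# Output column as a flat array, plus a column-vector view if a library wants one.
y = df["Result"].to_numpy()
y_column = y.reshape(-1, 1)
```

X then has one row per CSV row and three columns, which is the layout the question asks for. No append or extend is needed, because the columns are never collected into separate Python lists in the first place.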

8 Comments

Thank you for the great explanation! I apologize, I forgot to mention that I have 2500+ rows in the data set containing over 200 unique models. Would that be too many unique models?
@RyanD No, it shouldn't be a problem at all.
Ok cool I will give this a go first thing in the morning and report back, Thanks a Mill!
Thanks again for the awesome explanation but I am still struggling with this, I am new to numpy.. Where is the input being populated? I see the output column and can get that but am unclear on how you are structuring the input array
@RyanD Ah, there was a typo. I named the input output accidentally so that may make things confusing... I edited it to make more sense.