How to populate an array with multiple rows from a CSV file using Python pandas

Question

I am importing a CSV file using pandas.

CSV column headers: Year, Model, Trim, Result

The values coming in from the CSV file are as follows:

Year  |  Model  | Trim  | Result

2012  | Camry   | SR5   | 1
2014  | Tacoma  | SR5   | 1
2014  | Camry   | XLE   | 0
etc..

There are 2500+ rows in the data set containing over 200 unique models.

All values are then converted to numerical values for analysis purposes.

The inputs are the first 3 columns of the CSV file and the output is the fourth column, Result.

Here is my script:

import pandas as pd
import numpy as np

c1 = []
c2 = []
c3 = []
input = []
output = []

# read in the csv file containing 4 columns
df = pd.read_csv('success.csv')
df.convert_objects(convert_numeric=True)
df.fillna(0, inplace=True)

# convert string values to numerical values
def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x+=1

            df[column] = list(map(convert_to_int, df[column]))

    return df

df = handle_non_numerical_data(df)

# extract each column to insert into input array later
c1.append(df['Year'])
c2.append(df['Model'])
c3.append(df['Trim'])

# create input array containing the first 3 columns of the csv file
input = np.column_stack((c1, c2, c3))
output.append(df['Result'])

This works fine, except that append only accepts one value. Should I use extend instead, since it seems that would attach the values to the end of the array?

UPDATE

Essentially all of this works; my problem is creating the input array. I would like the array to consist of 3 columns: Year, Model, Trim.

input = ([['Year'], ['Model'], ['Trim']],[['Year'], ['Model'], ['Trim']]...)

I can only seem to stack one column on top of the other rather than having them in sequence.

What I get now:

input = ([['Year'], ['Year'], ['Year']]..., [['Model'], ['Model'], ['Model']]..., [['Trim'], ['Trim'], ['Trim']]...)

I'm struggling to understand the problem. Can you please rephrase, or perhaps add an example of current and expected behavior?
– Marat, Commented Feb 21, 2017 at 1:47
It is unclear what you are doing exactly, since we do not know anything about your csv. You should try to give an example of input and expected output. In particular, explain why the result of pd.read_csv is not acceptable. I suspect that whatever you are trying to accomplish can be done in a much more straightforward manner.
– juanpa.arrivillaga, Commented Feb 21, 2017 at 1:52
Sorry, I tried updating the question to better explain my problem. Basically, I can't sequence the 3 arrays into one array without stacking them.
– Ryan D, Commented Feb 21, 2017 at 2:41
@RyanD You need to explain what your input data looks like, i.e. the csv, and exactly what you want as an output. Your function, handle_non_numerical_data, is probably not the best way to convert your values to integers; that can be handled much more easily and efficiently using built-in pandas/numpy functions. Also, it is not clear why you are putting all the columns in a list instead of using df.values. I will repeat: I suspect that whatever you are trying to accomplish can be done in a much more straightforward manner.
– juanpa.arrivillaga, Commented Feb 21, 2017 at 3:50
Thanks @juanpa.arrivillaga, I updated the question with the csv values to help better explain.
– Ryan D, Commented Feb 21, 2017 at 4:03

juanpa.arrivillaga · Accepted Answer · 2017-02-21 18:01:29Z

2

To elaborate on my comment, suppose you have some DataFrame consisting of non-integer values:

>>> df = pd.DataFrame([[np.random.choice(list('abcdefghijklmnop')) for _ in range(3)] for _ in range(10)])
>>> df
   0  1  2
0  j  p  j
1  d  g  b
2  n  m  f
3  o  b  j
4  h  c  a
5  p  m  n
6  c  c  l
7  o  d  e
8  b  g  h
9  h  o  k

And there is also an output:

>>> df['output'] = np.random.randint(0,2,10)
>>> df
   0  1  2  output
0  j  p  j       0
1  d  g  b       0
2  n  m  f       1
3  o  b  j       1
4  h  c  a       1
5  p  m  n       0
6  c  c  l       1
7  o  d  e       0
8  b  g  h       1
9  h  o  k       0

To convert all the string values to integers, use np.unique with return_inverse=True. The inverse is the array you need; just keep in mind that you need to reshape it (because np.unique will have flattened it):

>>> unique, inverse  = np.unique(df.iloc[:,:3].values, return_inverse=True)
>>> unique
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
       'o', 'p'], dtype=object)
>>> inverse
array([ 8, 14,  8,  3,  6,  1, 12, 11,  5, 13,  1,  8,  7,  2,  0, 14, 11,
       12,  2,  2, 10, 13,  3,  4,  1,  6,  7,  7, 13,  9])
>>> input = inverse.reshape(df.shape[0], df.shape[1] - 1)
>>> input
array([[ 8, 14,  8],
       [ 3,  6,  1],
       [12, 11,  5],
       [13,  1,  8],
       [ 7,  2,  0],
       [14, 11, 12],
       [ 2,  2, 10],
       [13,  3,  4],
       [ 1,  6,  7],
       [ 7, 13,  9]])
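
For the columns in the question, the same steps might look like the sketch below. The three rows are the sample rows from the question; the names feature_cols, values and features are chosen here for illustration. One assumption worth flagging: Year holds integers while Model and Trim hold strings, and np.unique sorts its input, so the sketch casts everything to str first to keep the values comparable. Reshaping to values.shape should work whether or not the installed NumPy returns the inverse flattened.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; a real run would use pd.read_csv instead.
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": ["Camry", "Tacoma", "Camry"],
    "Trim": ["SR5", "SR5", "XLE"],
    "Result": [1, 1, 0],
})

feature_cols = ["Year", "Model", "Trim"]  # illustrative name

# Cast to str so integers and strings can be sorted together by np.unique.
values = df[feature_cols].astype(str).to_numpy()

# unique: sorted distinct values; inverse: index of each cell into unique.
unique, inverse = np.unique(values, return_inverse=True)

# One row per CSV row, one column per feature.
features = inverse.reshape(values.shape)

print(unique)
print(features)
```

Because every code indexes into the one shared unique array, unique[features] recovers the original (string) values for all three columns at once.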

And you can always go back:

>>> unique[input]
array([['j', 'p', 'j'],
       ['d', 'g', 'b'],
       ['n', 'm', 'f'],
       ['o', 'b', 'j'],
       ['h', 'c', 'a'],
       ['p', 'm', 'n'],
       ['c', 'c', 'l'],
       ['o', 'd', 'e'],
       ['b', 'g', 'h'],
       ['h', 'o', 'k']], dtype=object)
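
If you would rather have a separate code table per column (so that Year stays a plain number and Model and Trim each start at 0), pd.factorize is one option. This is an alternative to the np.unique approach above, not a restatement of it. The names encoded and mappings are illustrative, and the rows are the sample rows from the question.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": ["Camry", "Tacoma", "Camry"],
    "Trim": ["SR5", "SR5", "XLE"],
    "Result": [1, 1, 0],
})

encoded = df[["Year", "Model", "Trim"]].copy()
mappings = {}  # column name -> distinct labels, in order of first appearance

for col in ["Model", "Trim"]:
    # codes: integer label per row; uniques: the labels those integers stand for.
    codes, uniques = pd.factorize(encoded[col])
    encoded[col] = codes
    mappings[col] = uniques

features = encoded.to_numpy()
print(features)

# Going back: look the codes up in the per-column table.
models = mappings["Model"].take(features[:, 1])
print(list(models))
```

The trade-off is that the same integer now means different things in different columns, so each column needs its own lookup table when decoding.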

To get an array for the output, again, you simply use the .values of the df taking the appropriate column -- since these are already numpy arrays!

>>> output = df['output'].values
>>> output
array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
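
The same applies to the inputs once the columns are numeric: selecting the three columns and taking the underlying array gives one row per CSV row, with no lists or append needed. The sketch below assumes the question's np.stack_column was meant to be np.column_stack, and it uses already-encoded Model and Trim codes that are made up for illustration. to_numpy() is the newer spelling of .values.

```python
import numpy as np
import pandas as pd

# Year values are from the question; the Model and Trim codes are hypothetical.
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": [0, 1, 0],
    "Trim": [0, 0, 1],
    "Result": [1, 1, 0],
})

# Roughly what the question does: each list holds a single whole column,
# so every piece is a 1 x n block and the blocks end up side by side.
c1, c2, c3 = [], [], []
c1.append(df["Year"])
c2.append(df["Model"])
c3.append(df["Trim"])
stacked = np.column_stack((c1, c2, c3))
print(stacked.shape)

# Selecting the columns directly keeps one row per record.
inputs = df[["Year", "Model", "Trim"]].to_numpy()
output = df["Result"].to_numpy()
print(inputs.shape)
print(inputs)
```

The first result has all the years, then all the models, then all the trims in a single row, which matches the layout described in the question; the second has the Year, Model, Trim triple on each row.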

You might want to reshape the output into a single column, depending on which libraries you are going to use for analysis (sklearn, scipy, etc.):

>>> output.reshape(output.size, 1)
array([[0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0]])
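
A shorter spelling of the same reshape is reshape(-1, 1), where -1 lets NumPy infer the number of rows. The sketch below reuses the output values shown above and also shows ravel() for going back to one dimension. Which of the two shapes a given library wants for its target is something to check in that library's documentation rather than assume.

```python
import numpy as np

# Same values as the output array shown above.
output = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

# -1 means "work out this dimension from the array's size".
column = output.reshape(-1, 1)
print(column.shape)

# ravel() flattens the column back to one dimension.
flat = column.ravel()
print(flat.shape)
```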

edited Feb 21, 2017 at 18:01

answered Feb 21, 2017 at 3:59

juanpa.arrivillaga

8 Comments

Ryan D Over a year ago

Thank you for the great explanation! I apologize, I forgot to mention that I have 2500+ rows in the data set containing over 200 unique models. Would that be too many unique models?

juanpa.arrivillaga Over a year ago

@RyanD No, it shouldn't be a problem at all.

Ryan D Over a year ago

OK, cool. I will give this a go first thing in the morning and report back. Thanks a mill!

Ryan D Over a year ago

Thanks again for the awesome explanation, but I am still struggling with this; I am new to numpy. Where is the input being populated? I see the output column and can get that, but I am unclear on how you are structuring the input array.

juanpa.arrivillaga Over a year ago

@RyanD Ah, there was a typo. I named the input output accidentally so that may make things confusing... I edited it to make more sense.
