Creating new array in for loop (Python)

Question

I'm preparing a data set for statistical analysis with rpy (an interface for running R from Python). It looks like this:

data = [[0, 1, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 1, 0, 0, 0, 0], 
[0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 1, 1, , 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1, 0, 0, 0, 0], 
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0]]

To use this data, I need to separate the dependent variable (y) from the independent ones (x), creating a new list for each column like this:

y = data[:,9]
x1 = data[:,0]
x2 = data[:,1]
x3 = data[:,2]
x4 = data[:,3]
x5 = data[:,4]
x6 = data[:,5]
x7 = data[:,6]
x8 = data[:,7]
x9 = data[:,8]
x10 = data[:,9]

Suppose my data has 67 columns. Is there a way to loop through all the columns and create each one automatically without having to type out all of them? I do not want to hard code all the arrays up to 67.

Something along the lines of this, but it doesn't work:

i=0
for d in data:
    "x%d"%i = data[:,i-1]
    i+=1

This is the rest of the code:

rpy.set_default_mode(rpy.NO_CONVERSION)
linear_model = rpy.r.lm(rpy.r("y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10"), data = rpy.r.data_frame(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5,x6=x6,x7=x7,x8=x8,x9=x9,x10=x10,y=y))
rpy.set_default_mode(rpy.BASIC_CONVERSION)
print linear_model.as_py()['coefficients']
summary = rpy.r.summary(linear_model)

What is the output you are expecting? The question was hard to follow.
– Sibi, Commented Jan 14, 2013 at 22:08
I want to automatically create x1=data[:,1], x2=data[:,2]... without having to hard-code it all the way up to x67=data[:,67].
– ono, Commented Jan 14, 2013 at 22:12
Are you sure that you want to include x10 as an independent variable when your dependent variable y is created as y = x10?
– lgautier, Commented Jan 15, 2013 at 9:42
Sorry I didn't clarify: y is my last column, so it would come right after x67.
– ono, Commented Jan 15, 2013 at 15:51

Master_Yoda · Accepted Answer · 2013-01-14 22:54:57Z

12

Why not try something like this to transpose the columns:

x = []

# collect each of the 67 columns as its own array (assumes data is a NumPy array)
for d in xrange(67):
    x.append(data[:,d])

That is, unless it's absolutely essential to have a separate variable for each column, although I don't know why you would need separate data structures...

EDIT: If you do need separate variables, here's something that should work precisely the way you described:

for d in xrange(1,68):
    exec 'x%s = data[:,%s]' %(d,d-1)
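
A dictionary keyed by column name is a common alternative to generating variables with exec. The following is a minimal sketch, not part of the original answer. It assumes the data is a plain list of rows (as shown in the question) rather than a NumPy array, and that the last column holds y. The names `sample`, `split_columns`, and `columns` are illustrative. `zip(*rows)` transposes the rows into columns, so no `[:, i]` slicing is needed.

```python
# Small stand-in for the real data set: 3 rows, 4 columns,
# where the last column is assumed to be the dependent variable y.
sample = [
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]

def split_columns(rows):
    """Return a dict mapping 'x1'..'xN' and 'y' to column lists.

    Assumes every row has the same length and that the last
    column holds the dependent variable.
    """
    # zip(*rows) yields one tuple per column
    transposed = [list(col) for col in zip(*rows)]
    columns = {}
    for i, col in enumerate(transposed[:-1], start=1):
        columns['x%d' % i] = col
    columns['y'] = transposed[-1]
    return columns

columns = split_columns(sample)
print(sorted(columns))   # ['x1', 'x2', 'x3', 'y']
print(columns['x2'])     # [1, 0, 1]
print(columns['y'])      # [1, 0, 1]
```

Because the columns live in one dictionary, they can be unpacked as keyword arguments with `**columns`, which should be equivalent to spelling out `x1=x1, x2=x2, ...` in a call such as the `rpy.r.data_frame(...)` one in the question.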

edited Jan 14, 2013 at 22:54

answered Jan 14, 2013 at 22:18


lgautier · Accepted Answer · 2013-01-15 09:54:35Z

0

Since you show a little bit of the rpy code, I thought I could show what it would look like with rpy2.

# build a DataFrame (assumes data is a NumPy array, as in the question)
from rpy2.robjects.vectors import DataFrame, IntVector
# column 9 is the dependent variable y; every other column becomes an x variable
d = dict(('x%i' % (i+1), IntVector(data[:, i])) for i in range(68) if i != 9)
d['y'] = IntVector(data[:, 9])
dataf = DataFrame(d)
del d # dictionary no longer needed

# import R's stats package
from rpy2.robjects.packages import importr
stats = importr('stats')

# fit model
dep_var = 'y'
formula = '%s ~ %s ' % (dep_var, '+'.join(x for x in dataf.names if x != dep_var))
linear_model = stats.lm(formula, data = dataf)
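
To make the formula step easier to follow, here is a small standalone sketch of the kind of string the `join` expression produces. It uses a hypothetical list of column names in place of `dataf.names`, the helper name `build_formula` is illustrative, and it does not call rpy2 at all.

```python
def build_formula(names, dep_var='y'):
    """Build an R-style formula string such as 'y ~ x1 + x2'."""
    # every name except the dependent variable becomes a predictor
    predictors = [name for name in names if name != dep_var]
    return '%s ~ %s' % (dep_var, ' + '.join(predictors))

# Hypothetical column names standing in for dataf.names.
names = ['x%d' % i for i in range(1, 5)] + ['y']
formula = build_formula(names)
print(formula)  # y ~ x1 + x2 + x3 + x4
```

Building the formula from the names keeps it in step with the data frame, so adding or dropping a column does not require editing a long hand-written string.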

answered Jan 15, 2013 at 9:54


3 Comments

ono Over a year ago

This works. My goal is to get the coefficients in the end result. However, many of the ones I get (out of 67) are NaN. How do I interpret that? Is there a problem with my data?

lgautier Over a year ago

This is because some of the coefficients are not estimable, maybe because some of the independent variables are linear combinations of other independent variables. Without knowing more about the exact dataset, I'd say this looks like a lot of independent variables for fitting a linear model anyway. Are you sure you need all of them? Did you try variable selection? This is becoming more of a question for the Cross Validated Stack Exchange site...

ono Over a year ago

That makes sense then. The NaN variables are less significant. I've noticed that the bigger the data set, the fewer NaN coefficients I receive. My goal is to find out which variables are most important, so I need to provide the entire set. I currently have the data fitting the curve y = Ax1 + Bx2 + Cx3... I suppose a different equation, such as a logistic one, could work better. I'm not sure.
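
The point about independent variables that are linear combinations of others can be checked directly by computing the rank of the predictor matrix. The sketch below is an illustration in plain Python with made-up data, not something from the thread. The name `column_rank` is hypothetical, and the sketch ignores the intercept column that R's lm adds by default. A rank smaller than the number of columns means at least one coefficient cannot be estimated. A matrix with fewer rows than columns can never reach full column rank, which may be why more data led to fewer NaN coefficients.

```python
from fractions import Fraction

def column_rank(rows):
    """Rank of a matrix given as a list of rows, by Gaussian elimination.

    Fractions keep the arithmetic exact, so no tolerance is needed.
    """
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

# Illustrative predictors: the third column equals the sum of the first two.
X = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 2],
    [0, 0, 0],
]
print(column_rank(X), "of", len(X[0]))  # 2 of 3
```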
