I'm preparing a data set for statistical analysis with rpy (a Python interface to R). It looks like this:

data = [[0, 1, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 1, 0, 0, 0, 0], 
[0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 1, 1, , 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1, 0, 0, 0, 0], 
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0]]   

To use this data, I need to separate the dependent variable (y) from the independent ones (x). I need to create a new list for each column, like this:

y = data[:,9]
x1 = data[:,0]
x2 = data[:,1]
x3 = data[:,2]
x4 = data[:,3]
x5 = data[:,4]
x6 = data[:,5]
x7 = data[:,6]
x8 = data[:,7]
x9 = data[:,8]
x10 = data[:,9]

Suppose my data has 67 columns. Is there a way to loop through all the columns and create each one automatically without having to type out all of them? I do not want to hard code all the arrays up to 67.

Something along the lines of this, but it doesn't work:

i=0
for d in data:
    "x%d"%i = data[:,i-1]
    i+=1

This is the rest of the code:

rpy.set_default_mode(rpy.NO_CONVERSION)
linear_model = rpy.r.lm(rpy.r("y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10"), data = rpy.r.data_frame(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5,x6=x6,x7=x7,x8=x8,x9=x9,x10=x10,y=y))
rpy.set_default_mode(rpy.BASIC_CONVERSION)
print linear_model.as_py()['coefficients']
summary = rpy.r.summary(linear_model)
  • What is the output you are expecting? Question was hard to follow. Commented Jan 14, 2013 at 22:08
  • I want to automatically create x1=data[:,1], x2=data[:,2].... not having to hard code it in up to x67=data[:,67]. Commented Jan 14, 2013 at 22:12
  • Are you sure that you want to include x10 as an independent variable when your dependent variable y is created as y = x10 ? Commented Jan 15, 2013 at 9:42
  • Sorry I didn't clarify: Y is my last column, so it would be right after x67. Commented Jan 15, 2013 at 15:51

2 Answers

12

Why not try something like this to transpose the columns:

x = []

for d in xrange(0,67):  # 67 columns, indices 0 through 66
    x.append(data[:,d])

That works unless it's absolutely essential to have a separate variable for each column, although I don't know why you would need separate data structures...

EDIT: If that doesn't suit you, here's something that should work precisely the way you described:

for d in xrange(1,68):
    exec 'x%s = data[:,%s]' %(d,d-1)
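
A dictionary keyed by column name is one way to get the same result without exec and without typing 67 assignments. The following is only a minimal sketch: it assumes data is a plain list of equal-length rows with the dependent variable in the last column, and the helper name split_columns and the small sample are made up for illustration.

```python
def split_columns(rows):
    """Return (xs, y): xs maps 'x1', 'x2', ... to column lists; y is the last column."""
    # zip(*rows) transposes the list of rows into a sequence of columns
    columns = [list(col) for col in zip(*rows)]
    y = columns[-1]
    xs = {"x%d" % (i + 1): col for i, col in enumerate(columns[:-1])}
    return xs, y


# Hypothetical 3-row sample, not the data from the question.
sample = [[0, 1, 0, 1],
          [1, 0, 0, 0],
          [0, 1, 1, 1]]
xs, y = split_columns(sample)
```

Individual columns are then available as xs["x1"], xs["x2"], and so on, and the whole dictionary can be handed to whatever builds the data frame.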

0

Since you show a little bit of the rpy code, I thought I could show what it would look like with rpy2.

# build a DataFrame
from rpy2.robjects.vectors import IntVector
# column 9 holds the dependent variable, so it is skipped here
d = dict(('x%i' % (i+1), IntVector(data[:, i])) for i in range(68) if i != 9)
d['y'] = IntVector(data[:, 9])
from rpy2.robjects.vectors import DataFrame
dataf = DataFrame(d)
del(d) # dictionary no longer needed

# import R's stats package
from rpy2.robjects.packages import importr
stats = importr('stats')

# fit model
dep_var = 'y'
formula = '%s ~ %s ' % (dep_var, '+'.join(x for x in dataf.names if x != dep_var))
linear_model = stats.lm(formula, data = dataf) 

3 Comments

This works. My goal is to get the coefficients in the end. However, many of the ones I get (out of 67) are NaN. How do I interpret that? Is there a problem with my data?
This is because some of the coefficients are not estimable, maybe because some of the independent variables are linear combinations of other independent variables. Without knowing more about the exact dataset, I'd think that this looks like a lot of independent variables to fit a linear model anyway. Are you sure you need all of them? Did you try variable selection? This is becoming more of a question for the "Cross Validated" Stack Exchange site...
That makes sense then. The NaN variables are less significant. I've noticed the bigger the data set, the fewer NaN coefficients I'm receiving. The goal for me is to find out which variables are most important, so I need to provide the entire set. I currently have the data fitting the curve y = Ax1 + Bx2 + Cx3.... I suppose a different equation, such as a logistic one, could work better. I'm not sure.
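
One way to check the "linear combinations" explanation from the comment above is to compare the rank of the predictor matrix with its number of columns. The following is a rough sketch in plain Python using exact fractions; the function name column_rank and the small matrix X are invented for illustration and are not part of rpy2 or of the original data.

```python
from fractions import Fraction


def column_rank(rows):
    """Rank of a matrix given as a list of rows, via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # look for a row at or below position 'rank' with a nonzero entry in this column
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column adds no new independent direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Hypothetical predictors: the third column is the sum of the first two.
X = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 2],
     [0, 0, 0]]
rank = column_rank(X)
n_predictors = len(X[0])
```

If rank is smaller than n_predictors, at least one predictor is a linear combination of the others and its coefficient cannot be estimated separately. The rank can never exceed the number of rows, so a data set with fewer observations than predictors always has this problem. Note that lm also adds an intercept column of ones, which this sketch leaves out.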
