14

I have a dataframe in pandas that I'm using to produce a scatterplot, and want to include a regression line for the plot. Right now I'm trying to do this with polyfit.

Here's my code:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from numpy import *

table1 = pd.DataFrame.from_csv('upregulated_genes.txt', sep='\t', header=0, index_col=0)
table2 = pd.DataFrame.from_csv('misson_genes.txt', sep='\t', header=0, index_col=0)
table1 = table1.join(table2, how='outer')

table1 = table1.dropna(how='any')
table1 = table1.replace('#DIV/0!', 0)

# scatterplot
plt.scatter(table1['log2 fold change misson'], table1['log2 fold change'])
plt.ylabel('log2 expression fold change')
plt.xlabel('log2 expression fold change Misson et al. 2005')
plt.title('Root Early Upregulated Genes')
plt.axis([0,12,-5,12])

# this is the part I'm unsure about
regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1)

plt.show()

But I get the following error:

TypeError: cannot concatenate 'str' and 'float' objects

Does anyone know where I'm going wrong here? I'm also unsure how to add the regression line to my plot. Any other general comments on my code would also be hugely appreciated, I'm still a beginner.

6
  • at what line do you get the error? Commented Oct 15, 2013 at 11:18
  • @usethedeathstar regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1) Commented Oct 15, 2013 at 11:38
  • sure there are no NaN values in the tables? because pylab.scatter just does not plot the x,y points where x or y are NaN (meaning that it doesnt give errors either), but maybe polyfit doesnt know that? (just guessing at where the problem might be here- how are not-a-number values stored in your csv-file?) Commented Oct 15, 2013 at 11:48
  • Nope no NaN values. The only not-a-number values were '#DIV/0!', which I removed Commented Oct 15, 2013 at 11:58
  • what is the type of table1['log2 fold change misson'] and table1['log2 fold change'] ? (as far as i know they should be numpy.array with float as dtype (and both should have the same shape)) Commented Oct 15, 2013 at 12:01

1 Answer 1

28

Instead of replacing '#DIV/0!' by hand, force the data to be numeric. This does two things at once: it ensures that the result is numeric type (not str), and it substitutes NaN for any entries that cannot be parsed as a number. Example:

In [5]: Series([1, 2, 'blah', '#DIV/0!']).convert_objects(convert_numeric=True)
Out[5]: 
0     1
1     2
2   NaN
3   NaN
dtype: float64

This should fix your error. But, on the general subject of fitting a line to data, I keep handy two ways of doing this that I like better than polyfit. The second of the two is more robust (and can potentially return much more detailed information about the statistics) but it requires statsmodels.

from scipy.stats import linregress
def fit_line1(x, y):
    """Return slope, intercept of best fit line."""
    # Remove entries where either x or y is NaN.
    clean_data = pd.concat([x, y], 1).dropna(0) # row-wise
    (_, x), (_, y) = clean_data.iteritems()
    slope, intercept, r, p, stderr = linregress(x, y)
    return slope, intercept # could also return stderr

import statsmodels.api as sm
def fit_line2(x, y):
    """Return slope, intercept of best fit line."""
    X = sm.add_constant(x)
    model = sm.OLS(y, X, missing='drop') # ignores entires where x or y is NaN
    fit = model.fit()
    return fit.params[1], fit.params[0] # could also return stderr in each via fit.bse

To plot it, do something like

m, b = fit_line2(x, y)
N = 100 # could be just 2 if you are only drawing a straight line...
points = np.linspace(x.min(), x.max(), N)
plt.plot(points, m*points + b)
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks! Forcing the data to be numeric has fixed the error I was getting, but I'm getting NaN output from polyfit and the code you suggested... Any idea why this could be?
Some NaNs, or all NaNs? Can you reproduce the problem with a small subset of your data, and share that here?
Sorry that was just a mistake by me, it's working now. Do you know how I would add this as a line to my scatterplot?
See the bottom of my answer. If it's not showing up in the same plot, try adding the keyword argument ax=plt.gca() in plot.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.