Implementation of Stochastic Gradient Descent in Python

Question

I am attempting to implement a basic Stochastic Gradient Descent algorithm for a 2-d linear regression in Python. I was given some boilerplate code for vanilla GD, and I have attempted to convert it to work for SGD.

Specifically -- I am a little unsure as to if I correctly implemented the loss function and partial derivatives, since I am new to regressions in general.

I do see that the errors tend to "zig zag" as expected. Does the following look like a correct implementation or have I made any mistakes?

    #sample data
    data  = [(1,1),(2,3),(4,3),(3,2),(5,5)]    
    
    def compute_error_for_line_given_points(b, m, points):
        totalError = 0
        x = points[0]
        y = points[1]
        return float(totalError + (y - (m * x + b)) ** 2)
    
    def step_gradient(b_current, m_current, points, learningRate):
        N = float(1)
        for i in range(0, 1):
            x = points[0]
            y = points[1]
            b_gradient = -(2/N) * (y - ((m_current * x) + b_current)) #this is the part I am unsure
            m_gradient = -(2/N) * x * (y - ((m_current * x) + b_current)) #here as well
        new_b = b_current - (learningRate * b_gradient)
        new_m = m_current - (learningRate * m_gradient)
        return [new_b, new_m]
    
    err_log = []
    coef_log = []
    b = 0 #initial intercept
    m = 0 #initial slope
    
    iterations = 4
    for i in range(iterations): #epochs
        for point in data: #one point at a time for SGD
            err = compute_error_for_line_given_points(b,m, point)
            err_log.append(err)
            b,m = step_gradient(b,m,point,.01)
            coef_log.append((b,m))
```

JahKnows · Accepted Answer · 2018-04-25 02:30:05Z

There is only one small difference between gradient descent and stochastic gradient descent. Gradient descent calculates the gradient based on the loss function calculated across all training instances, whereas stochastic gradient descent calculates the gradient based on the loss in batches. Both of these techniques are used to find optimal parameters for a model.

Let us try to implement SGD on this 2D dataset.

The algorithm

The dataset has 2 features, however we will want to add a bias term so we append a column of ones to the end of the data matrix.

shape = x.shape 
x = np.insert(x, 0, 1, axis=1)

Then we initialize our weights, there are many strategies to do this. For simplicity I will set them all to 1 however setting the initial weights randomly is probably better in order to be able to use multiple restarts.

w = np.ones((shape[1]+1,))

Our initial line looks like this

Now we will iteratively update the weights of the model if it mistakenly classifies an example.

for ix, i in enumerate(x):
   pred = np.dot(i,w)
   if pred > 0: pred = 1
   elif pred < 0: pred = -1
   if pred != y[ix]:
      w = w - learning_rate * pred * i

This line is the weight update w = w - learning_rate * pred * i.

We can see that doing this process continuously will lead to convergence.

After 10 epochs

After 20 epochs

After 50 epochs

After 100 epochs

And finally,

The code

The dataset for this code can be found here.

The function which will train the weights takes in the feature matrix $x$ and the targets $y$. It returns the trained weights $w$ and a list of historical weights encountered throughout the training process.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def get_weights(x, y, verbose = 0):
    shape = x.shape
    x = np.insert(x, 0, 1, axis=1)
    w = np.ones((shape[1]+1,))
    weights = []

    learning_rate = 10
    iteration = 0
    loss = None
    while iteration <= 1000 and loss != 0:
        for ix, i in enumerate(x):
            pred = np.dot(i,w)
            if pred > 0: pred = 1
            elif pred < 0: pred = -1
            if pred != y[ix]:
                w = w - learning_rate * pred * i
            weights.append(w)    
            if verbose == 1:
                print('X_i = ', i, '    y = ', y[ix])
                print('Pred: ', pred )
                print('Weights', w)
                print('------------------------------------------')


        loss = np.dot(x, w)
        loss[loss<0] = -1
        loss[loss>0] = 1
        loss = np.sum(loss - y )

        if verbose == 1:
            print('------------------------------------------')
            print(np.sum(loss - y ))
            print('------------------------------------------')
        if iteration%10 == 0: learning_rate = learning_rate / 2
        iteration += 1    
    print('Weights: ', w)
    print('Loss: ', loss)
    return w, weights

We will apply this SGD to our data in perceptron.csv.

df = np.loadtxt("perceptron.csv", delimiter = ',')
x = df[:,0:-1]
y = df[:,-1]

print('Dataset')
print(df, '\n')

w, all_weights = get_weights(x, y)
x = np.insert(x, 0, 1, axis=1)

pred = np.dot(x, w)
pred[pred > 0] =  1
pred[pred < 0] = -1
print('Predictions', pred)

Let's plot the decision boundary

x1 = np.linspace(np.amin(x[:,1]),np.amax(x[:,2]),2)
x2 = np.zeros((2,))
for ix, i in enumerate(x1):
    x2[ix] = (-w[0] - w[1]*i) / w[2]

plt.scatter(x[y>0][:,1], x[y>0][:,2], marker = 'x')
plt.scatter(x[y<0][:,1], x[y<0][:,2], marker = 'o')
plt.plot(x1,x2)
plt.title('Perceptron Seperator', fontsize=20)
plt.xlabel('Feature 1 ($x_1$)', fontsize=16)
plt.ylabel('Feature 2 ($x_2$)', fontsize=16)
plt.show()

To see the training process you can print the weights as they changed through the epochs.

for ix, w in enumerate(all_weights):
    if ix % 10 == 0:
        print('Weights:', w)
        x1 = np.linspace(np.amin(x[:,1]),np.amax(x[:,2]),2)
        x2 = np.zeros((2,))
        for ix, i in enumerate(x1):
            x2[ix] = (-w[0] - w[1]*i) / w[2]
        print('$0 = ' + str(-w[0]) + ' - ' + str(w[1]) + 'x_1'+ ' - ' + str(w[2]) + 'x_2$')

        plt.scatter(x[y>0][:,1], x[y>0][:,2], marker = 'x')
        plt.scatter(x[y<0][:,1], x[y<0][:,2], marker = 'o')
        plt.plot(x1,x2)
        plt.title('Perceptron Seperator', fontsize=20)
        plt.xlabel('Feature 1 ($x_1$)', fontsize=16)
        plt.ylabel('Feature 2 ($x_2$)', fontsize=16)
        plt.show()

Use verbose option if you want a full readout to see what the algorithm is doing throughout the iterations. — JahKnows
– JahKnows, Commented Apr 25, 2018 at 2:31

Stack Exchange Network

Implementation of Stochastic Gradient Descent in Python

1 Answer 1

The algorithm

The code

Your Answer

Linked

Hot Network Questions

Implementation of Stochastic Gradient Descent in Python

1 Answer 1

The algorithm

The code

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions