
I'm building a neural network from scratch using only Python and NumPy, meant for classifying the MNIST dataset. Everything runs, but the network isn't really learning: at epoch 0 its accuracy is about 12%, after 20 epochs it climbs to about 14%, but then it gradually drops back to around 12% by epoch 40. So it's clear that something is wrong with my backpropagation (and yes, I tried increasing the epochs to 150, but I still get the same results).

I actually followed this video, but I handled the dimensions differently, which led to different code: he lays the data out so that the rows are the features and the columns are the samples, while I did the opposite. So while backpropagating I had to transpose some arrays to make his algorithm fit my layout (I think this might be the reason it's not working); a rough shape comparison of the two layouts is sketched below.
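
Just to illustrate the two layouts (a shape sketch with made-up variable names, not his exact code):

import numpy as np

m = 5                                   # tiny made-up batch
X_rows = np.zeros((m, 784))             # my layout: samples are rows
X_cols = X_rows.T                       # his layout: samples are columns

W1_rows = np.zeros((784, 10))           # mine: Z1 = X_rows @ W1_rows -> (m, 10)
W1_cols = W1_rows.T                     # his:  Z1 = W1_cols @ X_cols -> (10, m)

print((X_rows @ W1_rows).shape)         # (5, 10)
print((W1_cols @ X_cols).shape)         # (10, 5) -- same numbers, just transposed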

Loading the data:

import numpy as np
from keras.datasets import mnist  # assumption: the Keras MNIST loader, which matches this load_data() call

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train, x_test = x_train / 255, x_test / 255
x_train, x_test = x_train.reshape(len(x_train), 28 * 28), x_test.reshape(len(x_test), 28 * 28)
print(x_train.shape) # (60000, 784)
print(x_test.shape) # (10000, 784)

Here's the meat of the model:

W1 = np.random.randn(784, 10)
b1 = np.random.randn(10)
W2 = np.random.randn(10, 10)
b2 = np.random.randn(10)

def relu(x, dir=False):
    if dir: return x > 0
    return np.maximum(x, 0)

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=1, keepdims = True)

def one_hot_encode(y):
    y_hot = np.zeros(shape=(len(y), 10))
    for i in range(len(y)):
        y_hot[i][y[i]] = 1
    return y_hot

def loss_function(predictions, true):
    return predictions - true

def predict(x):
    Z1 = x.dot(W1) + b1
    A1 = relu(Z1)
    Z2 = A1.dot(W2) + b2
    A2 = softmax(Z2)
    # The final prediction is A2 at index 3 or -1:
    return Z1, A1, Z2, A2

def get_accuracy(predictions, Y):
    guesses = predictions.argmax(axis=1)
    average = 0
    i = 0
    while i < len(guesses):
        if guesses[i] == Y[i]:
            average += 1
        i += 1
    percent = (average / len(guesses)) * 100
    return percent
    

def train(data, labels, epochs=40, learning_rate=0.1):
    for i in range(epochs):
        labels_one_hot = one_hot_encode(labels)

        # Forward:
        m = len(labels_one_hot)
        Z1, A1, Z2, A2 = predict(data)
        
        # I think the error is in this chunk:
        # backwards pass: 
        dZ2 = A2 - labels_one_hot
        dW2 = 1 / m * dZ2.T.dot(A1)
        db2 = 1 / m * np.sum(dZ2, axis=1)
        dZ1 = W2.dot(dZ2.T).T * relu(Z1, dir=True)
        dW1 = 1 / m * dZ1.T.dot(data)
        db1 = 1 / m * np.sum(dZ1)

        # Update parameters:
        update(learning_rate, dW1, db1, dW2, db2)

        print("Iteration: ", i + 1)
        predictions = predict(data)[-1] # item at -1 is the final prediction.
        print(get_accuracy(predictions, labels))

def update(learning_rate, dW1, db1, dW2, db2):
    global W1, b1, W2, b2
    W1 = W1 - learning_rate * dW1.T # I have to transpose it here.
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2

train(x_train, y_train)

predictions = predict(x_test)[-1]
print(get_accuracy(predictions, y_test)) # The result is about 11.5% accuracy.

1 Answer


Your dW*/db* just have the wrong axes. Because of that, the two bias gradients end up with the wrong shape and your updates trash the weights every step, so the net hovers at chance (≈ 10%).

m = x.shape[0]                 # samples in a batch

# ---------- forward ----------
Z1 = x @ W1 + b1               # (m,784)·(784,10) = (m,10)
A1 = np.maximum(Z1, 0)
Z2 = A1 @ W2 + b2              # (m,10)
A2 = softmax(Z2)               # (m,10)

# ---------- backward ----------
dZ2 = A2 - y_hot               # (m,10)

dW2 = A1.T @ dZ2 / m           # (10,10)
db2 = dZ2.sum(0) / m           # (10,)

dZ1 = (dZ2 @ W2.T) * (Z1 > 0)  # (m,10)

dW1 = x.T @ dZ1 / m            # (784,10)
db1 = dZ1.sum(0) / m           # (10,)

# ---------- SGD step ----------
W2 -= lr * dW2;  b2 -= lr * db2
W1 -= lr * dW1;  b1 -= lr * db1

(Notice the .T is always on the left matrix in each product, so no
extra transposes are needed in the update.)
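
One cheap way to catch this whole class of bug is to assert every gradient's shape against its parameter right before the update; here is a small debugging sketch (assuming the variable names above):

# Debugging sketch: each gradient must have exactly its parameter's shape,
# otherwise broadcasting can silently mangle the update.
for name, param, grad in [("W1", W1, dW1), ("b1", b1, db1),
                          ("W2", W2, dW2), ("b2", b2, db2)]:
    assert grad.shape == param.shape, f"{name}: grad {grad.shape} vs param {param.shape}"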

A numerically safe softmax, subtracting the per-row max before exponentiating, helps too:

def softmax(z):
    z = z - z.max(1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)
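
A quick sanity check with made-up logits: every row of the output should sum to 1, even with extreme values that would overflow a naive np.exp:

z = np.array([[1000.0, 0.0, -1000.0],   # would overflow without the max subtraction
              [-5.0, 2.0, 3.0]])
p = softmax(z)
print(p.sum(axis=1))                    # -> [1. 1.]
print(p.argmax(axis=1))                 # -> [0 2]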

With these fixes (plus e.g. He initialisation and a smaller learning
rate like 0.01) the same two-layer net reaches ~93 % on MNIST in 15–20
epochs.
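
For reference, He (Kaiming) initialisation for this 784-10-10 layout could look like the sketch below; the sqrt(2 / fan_in) scale is the standard choice for ReLU layers, and the particular generator/seed is just an assumption for reproducibility:

rng = np.random.default_rng(0)                            # any seed works
W1 = rng.standard_normal((784, 10)) * np.sqrt(2 / 784)    # sqrt(2 / fan_in) for a ReLU layer
b1 = np.zeros(10)
W2 = rng.standard_normal((10, 10)) * np.sqrt(2 / 10)
b2 = np.zeros(10)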


2 Comments

Thank you so much, it works perfectly now. After playing around with the learning rate and epochs, I managed to get 60% accuracy on both the training and test data with a learning rate of 0.5 over 300 epochs. I will research Kaiming initialization and implement it in my NN to increase accuracy.
Update: I implemented He weight initialization and managed to get 91% accuracy with a learning rate of 0.25 over 200 epochs!
