
Neural Networks: From Perceptrons to Backpropagation

Understanding the building blocks of deep learning: perceptrons, activation functions, backpropagation, and gradient descent.

2025-01-15 · AI-synthesized from personal notes

Tags: AI/ML · Neural Networks · Deep Learning

Terminology

| Term | Definition |
| --- | --- |
| Perceptron | The simplest neural network unit; a linear classifier that outputs 1 or 0 based on a weighted sum of inputs. |
| Neuron / Unit | A computational node that applies a weighted sum followed by an activation function. |
| Weight | A learnable parameter that scales an input; the network learns these during training. |
| Bias | A learnable offset added to the weighted sum; allows the network to shift decision boundaries. |
| Activation Function | A non-linear function applied after the weighted sum; enables the network to learn non-linear patterns. |
| Forward Pass | Computing the output of the network given an input by propagating data through layers. |
| Backpropagation | The algorithm for computing gradients of the loss with respect to all weights and biases using the chain rule. |
| Gradient | The partial derivative of the loss with respect to a parameter; points in the direction of steepest increase. |
| Gradient Descent | An optimization algorithm that updates weights by moving in the negative gradient direction to minimize loss. |
| Learning Rate | A hyperparameter controlling the step size in gradient descent; balances convergence speed and stability. |
| Loss Function | A function measuring the difference between predicted and actual outputs; the network minimizes this during training. |

What & Why

Neural networks are computational models inspired by biological neurons that form the foundation of modern deep learning. They enable machines to learn complex patterns from data by mimicking how the brain processes information through interconnected neurons.

A feedforward neural network is a directed acyclic graph of neurons connected by weighted edges. Each neuron computes a weighted sum of its inputs, applies a non-linear activation function, and passes the result to the next layer. This simple computation, when repeated across many layers and neurons, creates a powerful universal function approximator.

The key insight is that stacking multiple layers allows the network to learn hierarchical representations. Early layers learn simple features like edges and textures, middle layers combine them into more complex patterns like shapes and objects, and later layers make high-level decisions for classification or regression.

Neural networks are powerful because they can approximate any continuous function given enough neurons and layers (universal approximation theorem). This universality, combined with efficient training algorithms like backpropagation and gradient descent, makes them the dominant approach in machine learning today. They excel at tasks where traditional algorithms struggle: image recognition, natural language processing, speech synthesis, and game playing.

How It Works

The Perceptron: The Simplest Neural Network

A perceptron is a single neuron with binary output. Given inputs $x_1, x_2, ..., x_n$ and weights $w_1, w_2, ..., w_n$, it computes:

$$z = w_1x_1 + w_2x_2 + ... + w_nx_n + b$$ $$\text{output} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$

The perceptron can learn any linearly separable pattern. However, it cannot learn XOR (exclusive or): the XOR classes are not linearly separable, so solving it requires a hidden layer with a non-linear activation.

Figure: a perceptron can separate a linearly separable problem such as AND, but not the XOR problem, which is not linearly separable and needs non-linearity.
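
To make the threshold computation concrete, here is a minimal NumPy sketch of a perceptron evaluated on AND; the helper name and hand-picked weights are illustrative, not from the original notes:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Weighted sum followed by a hard threshold: 1 if z > 0, else 0."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# AND is linearly separable, so a fixed weight vector suffices:
# fire only when both inputs are 1.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x, dtype=float), w, b))  # 0, 0, 0, 1

# No single (w, b) reproduces XOR: any line separating (0,1) and (1,0)
# from (0,0) also misclassifies (1,1), which is why a hidden layer is needed.
```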
Activation Functions: Introducing Non-Linearity

Without activation functions, stacking layers would just be matrix multiplication, which remains linear. Activation functions introduce non-linearity, allowing networks to learn complex patterns.

Common activation functions:

Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

  • Output range: (0, 1)
  • Smooth gradient, but suffers from vanishing gradients for large |z|

Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

  • Output range: (-1, 1)
  • Zero-centered, stronger gradients than sigmoid

ReLU: $\text{ReLU}(z) = \max(0, z)$

  • Output range: [0, ∞)
  • Computationally efficient, helps mitigate vanishing gradients
  • Problem: "dying ReLU" neurons that output 0 for every input once their pre-activation is always negative, and therefore stop learning

Leaky ReLU: $\text{LeakyReLU}(z) = \max(\alpha z, z)$ where $\alpha$ is small (e.g., 0.01)

  • Allows small negative values, preventing dead neurons

Figure: comparison of sigmoid (range (0, 1)), tanh (range (-1, 1)), ReLU (range [0, ∞)), and Leaky ReLU (range (-∞, ∞)); each activation function trades off smoothness, computational cost, and gradient flow.
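
For reference, a minimal NumPy sketch of the four activation functions above (the helper names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # range (0, 1)

def tanh(z):
    return np.tanh(z)                       # range (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)               # range [0, ∞), cheap to compute

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small negative slope avoids dead neurons

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z))
print(relu(z))
```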
Multi-Layer Networks

A multi-layer network stacks neurons into layers:

$$\text{Input Layer} \rightarrow \text{Hidden Layer 1} \rightarrow \text{Hidden Layer 2} \rightarrow \text{Output Layer}$$

Each layer transforms its input through weighted sums and activation functions. The output of one layer becomes the input to the next.

Figure: a small feedforward network with inputs x₁, x₂, x₃, hidden units h₁, h₂, and output y.

Forward Pass: Computing the Output

Given an input $\mathbf{x}$, the forward pass computes the output layer by layer:

$$\text{Layer 1: } \mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$$ $$\text{Layer 2: } \mathbf{z}^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}, \quad \mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)})$$ $$\vdots$$ $$\text{Output: } \hat{\mathbf{y}} = \mathbf{a}^{(n)}$$

Where $\mathbf{W}$ are weight matrices, $\mathbf{b}$ are bias vectors, and $\sigma$ is the activation function.
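
A minimal NumPy sketch of this layer-by-layer computation, using tanh everywhere; the 3-2-1 layer sizes and random parameters are illustrative and mirror the small diagram above:

```python
import numpy as np

def forward(x, weights, biases, activation):
    """Apply z = W·a + b followed by the activation, layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = activation(z)
    return a

# Toy 3-2-1 network with random parameters (same shape as the diagram above).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
biases = [np.zeros(2), np.zeros(1)]
y_hat = forward(np.array([0.5, -1.0, 2.0]), weights, biases, np.tanh)
print(y_hat)
```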

Loss Function: Measuring Error

The loss function quantifies how far the prediction is from the true value. Common choices:

Mean Squared Error (MSE) for regression: $$L = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Cross-Entropy for classification: $$L = -\frac{1}{m} \sum_{i=1}^{m} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$

Where $m$ is the batch size, $\hat{y}$ is the prediction, and $y$ is the true label.
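
Both losses translate directly to NumPy. The eps clipping in the cross-entropy is a common numerical safeguard I've added; it is not part of the formula above:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error over a batch of m predictions."""
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy; eps clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(mse(y_pred, y_true))                   # ≈ 0.07
print(binary_cross_entropy(y_pred, y_true))  # ≈ 0.28
```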

Backpropagation: Computing Gradients

Backpropagation is the algorithm for computing the gradient of the loss with respect to each weight and bias. It uses the chain rule to propagate errors backward through the network.

For a weight $w$ in layer $l$, the gradient is: $$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a_{\text{out}}} \times \frac{\partial a_{\text{out}}}{\partial z_{\text{out}}} \times \cdots \times \frac{\partial z_l}{\partial w}$$

This is computed efficiently by storing intermediate values during the forward pass, then reusing them during the backward pass.
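
To see the chain rule spelled out on the smallest possible case, here is a hand-worked single-neuron example (sigmoid activation, squared error); the specific numbers are arbitrary:

```python
import numpy as np

# Single sigmoid neuron, squared-error loss: L = (a - y)^2 with a = sigmoid(w*x + b).
x, y = 2.0, 1.0
w, b = 0.5, 0.1

z = w * x + b                       # forward: weighted sum
a = 1.0 / (1.0 + np.exp(-z))        # forward: activation (cached for the backward pass)
L = (a - y) ** 2                    # ≈ 0.062

dL_da = 2.0 * (a - y)               # ∂L/∂a
da_dz = a * (1.0 - a)               # ∂a/∂z for the sigmoid
dz_dw = x                           # ∂z/∂w
dL_dw = dL_da * da_dz * dz_dw       # chain rule: the same product as the formula above
print(L, dL_dw)                     # dL_dw ≈ -0.187
```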

Figure: the forward pass runs from the input through Layer 1 and Layer 2 to the output and the loss; the backward pass propagates ∂L/∂w back through the same layers.

Gradient Descent: Updating Weights

Once gradients are computed, gradient descent updates each weight in the direction that reduces the loss:

$$w_{\text{new}} = w_{\text{old}} - \alpha \times \frac{\partial L}{\partial w}$$ $$b_{\text{new}} = b_{\text{old}} - \alpha \times \frac{\partial L}{\partial b}$$

Where $\alpha$ is the learning rate. Smaller learning rates converge slowly but safely; larger rates converge faster but may overshoot the minimum.
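
The update itself is one line per parameter. A minimal sketch, assuming the gradients have already been computed; the function name and toy values are illustrative:

```python
import numpy as np

def sgd_step(weights, biases, grads_w, grads_b, alpha=0.01):
    """In-place gradient-descent update: move each parameter against its gradient."""
    for i in range(len(weights)):
        weights[i] -= alpha * grads_w[i]
        biases[i] -= alpha * grads_b[i]

# Toy usage: one weight matrix, one bias vector, and their gradients.
W = [np.array([[1.0, -2.0]])]
b = [np.array([0.5])]
sgd_step(W, b, grads_w=[np.array([[0.2, 0.1]])], grads_b=[np.array([0.3])], alpha=0.1)
print(W[0], b[0])   # [[0.98, -2.01]] and [0.47]
```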

Complexity Analysis

| Operation | Time Complexity | Space Complexity |
| --- | --- | --- |
| Forward pass (single sample) | $O(L \times n^2)$ | $O(L \times n)$ |
| Backward pass (single sample) | $O(L \times n^2)$ | $O(L \times n)$ |
| Gradient descent update | $O(L \times n^2)$ | $O(1)$ |
| Training ($m$ samples, $e$ epochs) | $O(e \times m \times L \times n^2)$ | $O(L \times n^2)$ |

Where $L$ is the number of layers, $n$ is the average number of neurons per layer, $m$ is the number of training samples, and $e$ is the number of epochs.

Memory usage: A network with $L$ layers and $n$ neurons per layer requires $O(L \times n^2)$ memory to store weights and $O(L \times n)$ memory for activations during the forward pass.

Computational cost: The dominant cost is matrix multiplication between layers. For a layer with input size $n_{in}$ and output size $n_{out}$, each forward pass requires $O(n_{in} \times n_{out})$ operations. Modern GPUs accelerate this via highly optimized BLAS libraries (cuBLAS, cuDNN).

Gradient computation: Backpropagation has the same computational complexity as the forward pass because it reuses the same matrix operations in reverse order. The chain rule ensures that each gradient computation is proportional to the size of the corresponding weight matrix.

Scaling considerations: Training time scales linearly with the number of training samples and epochs, but quadratically with network width. Deep networks (large $L$) are often more parameter-efficient than wide networks (large $n$) for the same representational capacity.
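
One quick way to feel the quadratic-in-width scaling is to count parameters for a couple of hypothetical fully connected architectures (the layer sizes are made up for illustration):

```python
def parameter_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# A deeper, narrower network can use fewer parameters than a shallower, wider one:
print(parameter_count([784, 128, 128, 10]))   # 118,282
print(parameter_count([784, 256, 10]))        # 203,530
```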

Implementation

ALGORITHM ForwardPass(x, weights, biases, activations)
INPUT: x (input vector), weights (list of weight matrices),
       biases (list of bias vectors), activations (list of activation functions)
OUTPUT: output (final layer output)

BEGIN
  a ← x
  FOR EACH layer i FROM 1 TO num_layers DO
    z ← weights[i] · a + biases[i]
    a ← activations[i](z)
  END FOR
  RETURN a
END

ALGORITHM BackpropagationPass(x, y, weights, biases, activations, loss_function)
INPUT: x (input), y (true label), weights, biases, activations, loss_function
OUTPUT: gradients (∂L/∂w and ∂L/∂b for each layer)

BEGIN
  // Forward pass: cache pre-activations (z) and activations (a) for reuse
  activations_cache ← [x]    // activations_cache[0] is the network input
  z_cache ← empty list
  a ← x
  FOR EACH layer i FROM 1 TO num_layers DO
    z ← weights[i] · a + biases[i]
    z_cache.append(z)
    a ← activations[i](z)
    activations_cache.append(a)
  END FOR

  // Output layer error: ∂L/∂a at the output, times the output activation derivative
  output ← a
  dL_doutput ← ∂loss_function(output, y) / ∂output
  delta ← dL_doutput ⊙ activation_derivative[num_layers](z_cache[num_layers])

  // Backward pass: propagate the error layer by layer
  gradients ← empty list

  FOR EACH layer i FROM num_layers DOWN TO 1 DO
    // Gradient w.r.t. weights and biases
    dL_dw ← delta · activations_cache[i-1]ᵀ
    dL_db ← delta
    gradients.prepend((dL_dw, dL_db))

    // Propagate to previous layer (if not input layer)
    IF i > 1 THEN
      delta ← (weights[i]ᵀ · delta) ⊙ activation_derivative[i-1](z_cache[i-1])
    END IF
  END FOR

  RETURN gradients
END

ALGORITHM GradientDescentUpdate(weights, biases, gradients, learning_rate)
INPUT: weights, biases, gradients, learning_rate
OUTPUT: updated weights and biases

BEGIN
  FOR EACH layer i FROM 1 TO num_layers DO
    weights[i] ← weights[i] - learning_rate × gradients[i].dL_dw
    biases[i] ← biases[i] - learning_rate × gradients[i].dL_db
  END FOR
  RETURN (weights, biases)
END

ALGORITHM TrainNeuralNetwork(training_data, num_epochs, learning_rate, activations, loss_function)
INPUT: training_data (list of (x, y) pairs), num_epochs, learning_rate, activations, loss_function
OUTPUT: trained weights and biases

BEGIN
  weights ← initialize_randomly_xavier()
  biases ← initialize_to_zero()

  FOR EACH epoch FROM 1 TO num_epochs DO
    total_loss ← 0
    FOR EACH (x, y) IN training_data DO
      // Forward pass
      output ← ForwardPass(x, weights, biases, activations)
      loss ← loss_function(output, y)
      total_loss ← total_loss + loss

      // Backward pass
      gradients ← BackpropagationPass(x, y, weights, biases, activations, loss_function)

      // Update weights
      (weights, biases) ← GradientDescentUpdate(weights, biases, gradients, learning_rate)
    END FOR

    PRINT "Epoch", epoch, "Loss:", total_loss / LENGTH(training_data)
  END FOR

  RETURN (weights, biases)
END
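
Putting the pieces together, below is a minimal runnable NumPy translation of the pseudocode above, trained on XOR, the problem a single perceptron cannot solve. It is a sketch under simplifying assumptions (sigmoid activations everywhere, per-sample updates, squared-error loss, hand-picked layer sizes and learning rate), and convergence can vary with the random initialization:

```python
import numpy as np

# A 2-4-1 network with sigmoid activations and squared-error loss, trained on XOR.
rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

layer_sizes = [2, 4, 1]
weights = [rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))   # Xavier-style init
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

learning_rate = 0.5
for epoch in range(10000):
    for x_row, y_row in zip(X, Y):
        x, y = x_row.reshape(-1, 1), y_row.reshape(-1, 1)

        # Forward pass: cache pre-activations (zs) and activations (acts).
        acts, zs, a = [x], [], x
        for W, b in zip(weights, biases):
            z = W @ a + b
            zs.append(z)
            a = sigmoid(z)
            acts.append(a)

        # Backward pass: output error, then propagate with the chain rule.
        delta = 2.0 * (a - y) * sigmoid_derivative(zs[-1])
        grads_w, grads_b = [None] * len(weights), [None] * len(weights)
        for i in reversed(range(len(weights))):
            grads_w[i] = delta @ acts[i].T
            grads_b[i] = delta
            if i > 0:
                delta = (weights[i].T @ delta) * sigmoid_derivative(zs[i - 1])

        # Gradient-descent update.
        for i in range(len(weights)):
            weights[i] -= learning_rate * grads_w[i]
            biases[i] -= learning_rate * grads_b[i]

for x_row in X:
    a = x_row.reshape(-1, 1)
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    print(x_row, round(a.item(), 3))   # should approach 0, 1, 1, 0
```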

Real-World Applications

Image Classification: Convolutional neural networks (CNNs) classify images by learning hierarchical features. ResNet, VGG, and EfficientNet power computer vision applications from medical imaging to autonomous vehicles. They achieve superhuman accuracy on tasks like skin cancer detection and diabetic retinopathy screening.

Natural Language Processing: Recurrent neural networks (RNNs) and Transformers process sequences of text. They power machine translation (Google Translate), sentiment analysis (social media monitoring), and large language models like GPT and Claude that can write, reason, and code.

Speech Recognition: Neural networks convert audio spectrograms into text. Systems like Whisper use deep networks to achieve near-human accuracy across languages, enabling voice assistants, transcription services, and accessibility tools for the hearing impaired.

Recommendation Systems: Neural networks learn user preferences from interaction history. Netflix uses them to recommend movies, YouTube suggests videos, and Spotify creates personalized playlists. These systems process billions of user interactions to predict what content users will enjoy.

Game Playing: Deep Q-Networks (DQN) and AlphaGo use neural networks to learn game strategies, achieving superhuman performance in chess, Go, and video games. AlphaFold uses neural networks to predict protein structures, revolutionizing biology and drug discovery.

Time Series Forecasting: LSTMs and Transformers predict stock prices, weather patterns, and system load by learning temporal dependencies. Financial institutions use them for algorithmic trading, while energy companies optimize power grid management.

Autonomous Vehicles: Neural networks process camera, lidar, and radar data to detect objects, predict trajectories, and make driving decisions. Tesla's Full Self-Driving and Waymo's autonomous taxis rely on deep neural networks trained on millions of miles of driving data.

Medical Diagnosis: Neural networks analyze medical images, predict disease progression, and assist in drug discovery. They can detect tumors in CT scans, predict heart attacks from ECGs, and identify potential drug compounds faster than traditional methods.

Key Takeaways

  • Neural networks are universal function approximators: given enough neurons and layers, they can learn any continuous function, making them incredibly versatile for pattern recognition tasks.

  • Activation functions introduce non-linearity, enabling networks to learn complex patterns that linear models cannot. ReLU has become the standard due to its computational efficiency and ability to mitigate vanishing gradients.

  • Backpropagation efficiently computes gradients by reusing intermediate values from the forward pass and applying the chain rule in reverse order through the network.

  • Gradient descent iteratively updates weights to minimize the loss function. The learning rate is critical: too large causes divergence, too small causes slow convergence.

  • Deep networks can suffer from vanishing gradients in early layers, which modern activation functions (ReLU, Leaky ReLU) and initialization schemes (Xavier, He) help mitigate.

  • Training neural networks requires careful consideration of architecture design, hyperparameter tuning, and regularization techniques to achieve good generalization and avoid overfitting.

  • The computational complexity scales quadratically with network width but linearly with depth, making deep networks often more efficient than wide networks for complex tasks.