Building an Intuition for Deep Learning, Historically
Deep neural networks are loosely based on real brains and capable of learning nearly
anything, which makes them effective at driving cars, identifying images, and even holding humanlike conversations.
It's fascinating to me that a single architecture can appear to exhibit such high levels of intelligence across so many different
applications, and it left me curious as to why it works.
This site aims to build an intuition behind neural networks by following their progression through time,
beginning with the first attempt to model the brain mathematically nearly 100 years ago.
I started this project in the hope that answering the "How?" and "Why?" of neural nets cumulatively would prove satisfying, following
several brilliant researchers' contributions to machine learning as they progress toward a modern feedforward neural
network.
Table of Contents
- Introduction
- Warren McCulloch and Walter Pitts: The McCulloch-Pitts Neuron - 1943
- Frank Rosenblatt: The Perceptron - 1957
- Bernard Widrow and Ted Hoff: ADALINE - 1960
- Bernard Widrow and Ted Hoff: MADALINE - 1961
- Rumelhart, Hinton, and Williams: The Neural Network - 1986
1943 - The McCulloch-Pitts Neuron
In 1943, Warren McCulloch and Walter Pitts established a new field of research by developing the first mathematical
representation of neurons in the brain, published in their paper "A Logical Calculus of the Ideas Immanent in Nervous Activity".
The unlikely duo of college professor and homeless teenager sought to understand thought by modeling neurons and synapses as logical expressions, an approach
that was unheard of at the time. The fact that their paper cited just three sources, all foundational texts in the field of logic, speaks to the
novelty of their proposal.
When McCulloch and Pitts published, it had been known for decades that the brain consisted of neurons that were connected by synapses, and that
information collected by these synapses caused the neuron to turn on or off, sending an "on" signal out to other neurons accordingly. What
wasn't known, and still isn't, is exactly what each neuron does with its input signals to turn on or off. McCulloch and Pitts theorized that the
decisions could be represented by logical expressions. Then, as an example of its ability to model something as abstract as a
thought, they demonstrated that the perception of heat and cold could be modeled by a series of connected logical "neurons". The paper visualized this
logic using connected neurons, and their visualization is shown below.

In this example,

The pair sought to understand thought more than to recreate it, but other academics soon realized that machines would be capable of implementing their model, simulating intelligence. Here we'll follow the learning-focused interpretations of the McCulloch-Pitts model, but its contribution to conventional computing is worth noting as part of its legacy. John Von Neumann proposed the Von Neumann architecture, ubiquitous in modern computers, in "First Draft of a Report on the EDVAC" in 1945. The draft was never completed, but Von Neumann did manage to cite one paper: McCulloch and Pitts'. His implementation of computer memory was inspired by a circular group of neurons passing the same piece of information around indefinitely.
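To make the idea of neurons as logical expressions concrete, here is a minimal sketch of a McCulloch-Pitts style unit in code. The thresholds and the AND/OR/AND-NOT wiring below are illustrative choices of mine, not circuits from the 1943 paper.

```python
# A McCulloch-Pitts style unit: inputs are 0 or 1, excitatory inputs are summed,
# any active inhibitory input vetoes firing, and the unit fires (outputs 1)
# only if the excitatory sum reaches a fixed threshold.
def mcp_neuron(excitatory, inhibitory, threshold):
    if any(inhibitory):          # absolute inhibition: one inhibitory signal silences the neuron
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# Logical operations fall out of the same unit with different thresholds:
AND = lambda a, b: mcp_neuron([a, b], [], threshold=2)
OR  = lambda a, b: mcp_neuron([a, b], [], threshold=1)
AND_NOT = lambda a, b: mcp_neuron([a], [b], threshold=1)   # a AND (NOT b), b wired as inhibitory

print(AND(1, 1), OR(0, 1), AND_NOT(1, 0))   # -> 1 1 1
```

Chaining units like these is all it takes to build the kinds of networks the paper describes.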
McCulloch and Pitts avoided the issue of learning in their model because it would greatly complicate how logic flows through a network while its logical components are being modified. The McCulloch-Pitts neuron's successors would define and improve upon learning algorithms, but those algorithms were designed more through intuition than through neuroscience. In fact, nearly all of the attributes a modern neural network shares with the brain can already be found in the McCulloch-Pitts neuron, which is why the McCulloch-Pitts model can be considered the source of the artificial neural network's name.
1957 - Rosenblatt's Perceptron
In 1957, Frank Rosenblatt, a psychology professor at Cornell, implemented the first machine learning algorithm for a linear classifier, known
as the perceptron. Very likely inspired by McCulloch and Pitts' mathematical model of neurons, Rosenblatt's
perceptron is capable of learning to determine a decision boundary for any linearly separable dataset.
However, many researchers in Rosenblatt's time were concerned
by the perceptron's limitations and didn't see it as the future of machine learning. Professor Rosenblatt
passionately, and correctly, defended its potential. Sadly, he passed away more than a decade before the first
modern neural network was implemented. Although this algorithm is quite simple and limited by modern standards, all subsequent advancements in neural networks were built on this breakthrough, and the fundamentals from the perceptron are still present in every modern neural network.
The essence of a perceptron is that it calculates the weighted sum of its inputs (so that some inputs
can be assigned greater importance than others) and compares that sum to a threshold: If the weighted
sum is above the threshold, then the perceptron outputs "True", or a 1. If the weighted sum is less than
the threshold, then the perceptron outputs a "False", or a 0. While a historic breakthrough in machine
learning, it's limited in that it's only effective for linearly separable data and binary outputs.
We'll look at it in closer detail now. Rosenblatt implemented this algorithm in hardware, but we'll use
software and visualize it geometrically.
Let's first consider a perceptron with just one input, making it one-dimensional. This model accepts one input value (x1) and has two parameters, a weight (w1) and a threshold (θ) that the product of x1 and w1 must cross to be classified as positive. Since our input is one dimensional, our x vector only contains x1 and our weight vector only contains w1.

From this visualization, we can see that our decision boundary occurs where w1x1 = θ, that is, at x1 = θ/w1.
Red dots indicate inputs in the "0" class (w1x1 below the threshold θ).
Green dots indicate inputs in the "1" class (w1x1 above θ).
Note that adjusting θ moves our decision boundary horizontally.
Our threshold is more commonly represented with bias "b", where b = -θ. Making this substitution yields a decision boundary equation that's easier to use in practice: w1x1 + b = 0, with the perceptron outputting a 1 whenever w1x1 + b > 0.
It's common practice to refer to the expression on the left as z.

Now, let's give this perceptron multiple input dimensions. When we add another input (making it two
dimensional), nothing else in the structure of the perceptron changes. This second input is called x2, and it gets its own weight, w2.
This notation becomes inconvenient if there are many inputs, which is one reason why input and weight values are commonly held in an input vector (x) and a weight vector (w):

As mentioned earlier, this dot product can be thought of as a weighted sum of the inputs, so that the "importance" of each input (its weight) is factored into the calculation of z. Visually, this can be represented as:

The step function classifies the input as a 1 if z > 0, or in other words, if the point lies above the decision boundary.
In the case that z < 0 (the point lies below the decision boundary), the step function classifies the input as a 0.
In practice, it's quite rare for z to equal exactly 0, because that would mean the input sits exactly on the decision boundary. How we classify that case isn't significant, but it's included for the sake of completeness; in the following definition I'll group it with the "0" classification. The output of the perceptron (ŷ) is equal to the output of the step function F:


Inputs:
Model Parameters:
It can be insightful to represent this model geometrically. The decision boundary is the plane separating the red and green points. Green points (inputs) are instances where I do eat a meal (y=1), and red points are inputs where I don't eat a meal (y=0). The perceptron classifies any new input point on the green side of the decision boundary as a 1 (ŷ = 1), and it classifies any input on the red side as a 0 (ŷ = 0).
Link to this graph

In this example, the weight vector is represented geometrically by the blue vector. The weight vector of any perceptron is always perpendicular to the decision boundary (which is possible because the boundary is always linear). Since each element of the weight vector can take on any value, its geometric representation can point in any direction. It follows that weights determine the angle and direction of the decision boundary.
One way to think about the weight vector is as a ratio of dimensions, similarly to how the slope of a line is a ratio of dimensions (rise/run). The bias moves the decision boundary along that weight vector. Note that although the bias is included with weights and inputs in the calculation of z, it does not contribute a dimension. Rather, it moves the decision boundary according to the ratio of dimensions provided by w. Three dimensions/inputs are the highest that are accessible for us to visualize, but these principles hold true for hyperplane decision boundaries in any number of dimensions. Feel free to experiment with different weights and biases yourself.
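Putting the pieces so far into code, a sketch of the perceptron's forward pass might look like the following. The function names and the weights in the example call are made up for illustration.

```python
import numpy as np

def step(z):
    # F(z): classify as 1 if z > 0, otherwise 0 (z = 0 is grouped with the 0 class)
    return 1 if z > 0 else 0

def perceptron_predict(x, w, b):
    # z is the weighted sum of the inputs plus the bias
    z = np.dot(w, x) + b
    return step(z)

# Example: two inputs with hand-picked (illustrative) weights and bias
print(perceptron_predict(np.array([2.0, 1.0]), w=np.array([0.5, -1.0]), b=0.3))  # -> 1
```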
Training
Perceptron Code
What made the perceptron groundbreaking was its ability to learn. Given enough labeled examples to train on, the perceptron decided the optimal values of the weights and bias completely on its own (this assumes that the data is linearly separable, which we'll go into detail on later). The perceptron modeled above learned to place the decision plane using just the points on the graph. We'll follow the perceptron through this learning process.
The learning rate (denoted by α) is a value, usually between 0 and 1, that controls how fast
the perceptron learns. If it's 0, the perceptron won't learn at all; if it's 1, the perceptron
learns rather quickly. We'll see later what happens if the perceptron learns too quickly,
but for now, we'll use an appropriate setting to see how it works.
The perceptron updates its own weights and biases according to this learning rule:
Let's apply the learning rule to my meal prediction perceptron, using the points shown on the graph as training data. Since this took 580 iterations to train, we will only look at a handful of examples. From now on, I will demonstrate calculations using vector notation, but explicit notation is included below for reference.
Explicit notation: wj = wj + α(y - ŷ)xj for each weight wj, and b = b + α(y - ŷ)
Vector notation: w = w + α(y - ŷ)x, and b = b + α(y - ŷ)
Before a perceptron begins training, we initialize the weight vector and bias of the perceptron to 0 (Sometimes they are initialized to small random values, but either method works). This means that z will always be zero, so F(z) will always be zero. For any input vector, this untrained perceptron will always output "0".
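Combining the zero initialization with the learning rule gives the whole training loop. The sketch below is a simplified stand-in for the linked Perceptron Code, with names of my own choosing:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=100):
    # X: (n_samples, n_features) inputs; y: labels of 0 or 1
    w = np.zeros(X.shape[1])   # weights initialized to 0
    b = 0.0                    # bias initialized to 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else 0
            error = y_i - y_hat          # 0 when the point is already classified correctly
            w += alpha * error * x_i     # wj = wj + alpha * error * xj
            b += alpha * error           # b  = b  + alpha * error
    return w, b
```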
Input vectors:
Training Point 1:
The first input the perceptron will train on is the point x = (x1, x2, x3) = (3, 1, 1). Its corresponding y value is 0, so this point is labeled with red on the graph. We already know that this perceptron will always output zero, but we'll follow its calculations for clarity:
Model's Prediction:
Perceptron's output: 0
Parameter update:
error = (0 - 0) = 0
w1 = 0 + (0.1)(0)(3) = 0
w2 = 0 + (0.1)(0)(1) = 0
w3 = 0 + (0.1)(0)(1) = 0
b = 0 + (0.1)(0) = 0
This result demonstrates that the perceptron does not update itself in every training example. The perceptron only looks at one point at a time, and it classified this point correctly (error = 0), so it doesn't update parameters. In the second training example, the perceptron will misclassify a point and we'll see it update its parameters.
Training Point 2:
The second point the perceptron will train on is x = (7, 3, 3). y = 1 for this point, so it's labeled with green on the graph. Its calculations are as follows:
Model's Prediction:
Perceptron's output: 0
Parameter update:
error = (1 - 0) = 1
w1 = 0 + (0.1)(1)(7) = 0.7
w2 = 0 + (0.1)(1)(3) = 0.3
w3 = 0 + (0.1)(1)(3) = 0.3
b = 0 + (0.1)(1) = 0.1
This updated perceptron is represented geometrically below. Notice that the weights already orient the decision boundary in vaguely the right direction, while the bias actually updated in the wrong direction in this instance (it hasn't yet seen a point that requires moving the decision boundary further into the first quadrant; that happens when it predicts a "1" for a "0"). The weights and bias will both improve as the model sees more training examples.

Training Point 3:
The third point the perceptron will train on is
Model's Prediction:
Perceptron's output: 1
Parameter update:
error = (0 - 1) = -1
This updated perceptron is again represented geometrically below. Notice that it begins to separate the red and green points. Our weights are slightly more accurate, and the bias has moved the decision boundary back towards the first quadrant as desired.


If the training data for a perceptron is not linearly separable, it will never converge because it will keep moving the decision boundary indefinitely. This limitation in machine learning is addressed by the perceptron's successor only a handful of years later.
Another way to think about the perceptron
We can get a deeper geometric understanding of the perceptron by treating each x value as a vector rather than a point: each x vector starts at the origin and ends where its corresponding point is. Again, the dot product of vectors determines z. In this view (ignoring the bias), the dot product w · x is positive exactly when the angle between w and x is less than 90 degrees, and negative when the angle is more than 90 degrees. The perceptron therefore updates its weights when a "1" x vector sits more than 90 degrees from the weight vector (its dot product is negative, so it's misclassified), or when a "0" x vector sits less than 90 degrees from the weight vector (its dot product is positive). Each update, w + αx or w - αx, rotates the weight vector toward the "1" vectors it's missing and away from the "0" vectors it's wrongly capturing.
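As a quick numeric check of the claim about angles, using made-up vectors:

```python
import numpy as np

def angle_deg(u, v):
    # angle between two vectors, in degrees
    return np.degrees(np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))))

w = np.array([1.0, 0.0])
x_close = np.array([1.0, 2.0])    # angle under 90 degrees -> positive dot product
x_far = np.array([-1.0, 2.0])     # angle over 90 degrees  -> negative dot product

print(angle_deg(w, x_close), np.dot(w, x_close))   # ~63.4 degrees, 1.0
print(angle_deg(w, x_far), np.dot(w, x_far))       # ~116.6 degrees, -1.0
```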
1960 - ADALINE - Bernard Widrow and Ted Hoff
ADALINE (Adaptive Linear Element) was an early improvement on the Perceptron. Invented at Stanford University
by Bernard Widrow and his student Ted Hoff, ADALINE updated the Perceptron's learning rule so that
it was capable of finding the best decision boundary even for data that was not linearly separable. Still, the decision
boundary itself was constrained to be linear, and while it could find the best linear approximation of the data, it couldn't perfectly fit nonlinear data.
This constraint was removed only a year later by Widrow and Hoff with MADALINE, discussed later.
ADALINE utilizes the same structure as the perceptron, but with a different learning algorithm. The key difference
between the models is that the perceptron implements its learning algorithm using the result of the step
function, while ADALINE learns from the z value before it's passed through the step function.


Below is Widrow and Hoff's original diagram of ADALINE. As with the perceptron, ADALINE was originally implemented in hardware. For reference, the a's are weights, and a0 together with the constant "1" input makes up the bias (adding a "1" to the start of the input vector and a weight w0 at the start of the weight vector produces the same effect as adding a bias of value w0 to the dot product of inputs and weights). Any time the bias isn't mentioned in a calculation of z, it's safe to assume that the weight vector already includes it as described. A quick numeric check of this equivalence follows the diagram.

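Here is that check in code, reusing the weights, bias, and input from the perceptron example above (the variable names are mine):

```python
import numpy as np

x = np.array([3.0, 1.0, 1.0])
w = np.array([0.7, 0.3, 0.3])
b = 0.1

x_aug = np.concatenate(([1.0], x))     # prepend a constant 1 to the inputs
w_aug = np.concatenate(([b], w))       # prepend the bias as weight w0

print(np.dot(w, x) + b)        # 2.8
print(np.dot(w_aug, x_aug))    # 2.8 -- identical
```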
The gradient is a vector made up of derivatives, each of which can be thought of as the slope at a point with respect to one input dimension. The vector containing all of these derivatives therefore points in the combined direction of all the slopes. This property guarantees that it points in the direction of fastest increase at any point, so taking the negative of the gradient gives the direction of fastest decrease. Additionally, the length of the gradient vector corresponds to the steepness of the change at that point.
This is useful for our loss function: we can calculate its gradient to find the direction in which loss decreases fastest and update the weights in that direction, yielding a lower loss. Repeating this process will eventually cause the model to converge to the best solution. This fundamental machine learning process is known as gradient descent, and it's still used to train modern neural networks. An example of gradient descent applied to a simple function follows.
For the sake of an introductory example, imagine a simple ADALINE with one weight and a bias value. Suppose the loss of this fictitious model is given by
Function: L(w, b) = w² + b²
Gradient: ∇L = [ ∂L/∂w, ∂L/∂b ] = [ 2w, 2b ]
In this example, we'll evaluate the gradient at the starting point (w, b) = (2, 1):

∇L = [ 2(2), 2(1) ]
= [ 4, 2 ]
Our gradient vector is graphed below on our loss function. It's pointing in the right direction, but it's clear that it's far too long. This vector is overshooting the minimum, a problem solved in machine learning by using a learning rate. For the rest of this example, we'll scale our gradient vectors by a learning rate of 0.1 to avoid overshooting.

w = 2 - 4 = -2
b = 1 - 2 = -1
L(-2, -1) = (-2)² + (-1)² = 5
The unscaled vector yields no improvement in the loss. Let's recalculate using a learning rate of 0.1:
w = 2 + (0.1)(-4) = 1.6
b = 1 + (0.1)(-2) = 0.8
L(1.6, 0.8) = 3.2
Utilizing the scaled vector improves the loss significantly.

The gradient at our new point is ∇L = [ 2(1.6), 2(0.8) ] = [ 3.2, 1.6 ]. Taking the negative and scaling by the learning rate yields the vector we'll adjust the parameters with:
[ -0.32, -0.16 ]
Calculating new w and b values:
w = 1.6 - 0.32 = 1.28
b = 0.8 - 0.16 = 0.64
L(1.28, 0.64) = (1.28)² + (0.64)² ≈ 2.05
Again, loss decreases as desired. In practice, this process is repeated many, many times. A few more steps are visualized below.

The ball in the visualization also illustrates a weakness of gradient descent: it can get "stuck" in a region above the function's minimum if it has no downhill path to it. Consider this two-dimensional example, where the green ball iterates lower until it gets "stuck" in a local minimum:

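The worked example above can be reproduced in a few lines. This sketch simply repeats the update w ← w − α·∂L/∂w, b ← b − α·∂L/∂b on the toy loss L(w, b) = w² + b²:

```python
def grad_descent_demo(w=2.0, b=1.0, alpha=0.1, steps=5):
    for i in range(steps):
        grad_w, grad_b = 2 * w, 2 * b          # gradient of L(w, b) = w**2 + b**2
        w -= alpha * grad_w                    # step in the direction of fastest decrease
        b -= alpha * grad_b
        print(f"step {i + 1}: w = {w:.3f}, b = {b:.3f}, loss = {w**2 + b**2:.3f}")

grad_descent_demo()   # the first two steps reproduce (1.6, 0.8, 3.2) and (1.28, 0.64, 2.05)
```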
ADALINE's Loss Function
Although Widrow and Hoff proposed this formula as the mean of the squared errors over every point (as in gradient descent), the hardware limitations of their time forced them to implement it using the squared error of one point at a time. This method of calculating the gradient with respect to the error of one point at a time is called stochastic gradient descent. They proved that stochastic gradient descent (SGD) is a strong approximation of gradient descent (GD) and verified their results in practice. We'll look at the mean squared error over all points here; subsequent models will use stochastic gradient descent, and SGD actually has advantages of its own that emerge later in multilayer neural networks.
At each point i, evaluating the gradient of this function involves taking its derivative with respect to each weight wj and then its derivative with respect to the bias, arranging all of these into a vector of m weight derivatives plus the bias derivative:
Before calculating the gradient, it's important to recall the chain rule, which dictates that the derivative of a composite function is the outer derivative times the inner derivative; here, that means ∂L/∂wj = (∂L/∂z)(∂z/∂wj).
For each weight wj in the gradient vector described above, the gradient of this loss function is calculated as:
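One standard way to write this out, assuming the mean squared error over N training points as described above (the constant factor out front depends on the averaging convention and is usually absorbed into the learning rate):

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - z_i\right)^2, \qquad z_i = \mathbf{w}\cdot\mathbf{x}_i + b$$

$$\frac{\partial L}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N} 2\left(y_i - z_i\right)\frac{\partial\left(y_i - z_i\right)}{\partial w_j} = -\frac{2}{N}\sum_{i=1}^{N}\left(y_i - z_i\right)x_{ij}$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}\left(y_i - z_i\right)$$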
Training
Code for ADALINE (Gradient descent)
Let's train an ADALINE model on the same dataset as the perceptron:
If we train an ADALINE model on the dataset as is, it actually doesn't converge to a solution that separates the data points correctly. Instead, the weights explode: in just a handful of iterations, the model's MSE is on the order of thousands, and it continues to increase with more epochs.

The exploding weights issue can be remedied by lowering the learning rate, preventing overshooting. After retraining the same model with a learning rate of 0.01, the model converges with the decision boundary visualized below.


The solution to this issue is to calculate the error equally for a misclassified 0 or 1. An easy way to do this is to change every "0" value in the y vector to "-1". The corrected error calculations are shown below. Note that any

With a learning rate of 0.015, ADALINE easily outperforms the perceptron in terms of steps to a correct solution. This model learns a decision boundary that classifies every point in the dataset correctly in 227 steps, a 60% improvement over the perceptron. Unfortunately the loss function has too many dimensions to be visualized, but it's still easy to watch the decision boundary update as demonstrated in the next example.
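For reference, here is a condensed sketch of the batch-gradient-descent ADALINE training described here, assuming targets of -1/1 and folding the gradient's constant factor into the learning rate. The names are mine; the linked code is the authoritative version.

```python
import numpy as np

def train_adaline_gd(X, y, alpha=0.015, epochs=300):
    # X: (n_samples, n_features) inputs; y: targets of -1 or 1
    w, b = np.zeros(X.shape[1]), 0.0
    losses = []
    for _ in range(epochs):
        z = X @ w + b                        # ADALINE learns from z, not from the step output
        error = y - z
        w += alpha * (X.T @ error) / len(X)  # average gradient over the whole dataset
        b += alpha * error.mean()
        losses.append((error ** 2).mean())   # MSE per epoch, e.g. for an "elbow" graph
    return w, b, losses

def adaline_predict(X, w, b):
    return np.where(X @ w + b > 0, 1, -1)    # the step function is only used for prediction
```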
This ADALINE model begins with a mean squared error (MSE) of 1. This is because, with the initial weights and bias at zero, z starts out as 0 for every input point, while every target y is either 1 or -1, so every squared error starts at (±1 - 0)² = 1.
After the average gradient has been calculated, negated, and scaled according to the learning rate (as shown above), the parameters can be reassigned:
Link to this graph, consider following along by changing weights in decision boundary equation

Before the second parameter update, the model recalculates the average gradient, then negates and scales it. This updated gradient is added to the current parameters in the parameter update:


Normalizing the input data allows us to use the original learning rate of 0.1, and the model then perfectly separates the input data after only 29 iterations. We can use a higher learning rate here because the x values, now between 0 and 1 (rather than 0 and 10), produce much smaller z values, leading to smaller weight updates. Input normalization is so effective in this example that the model can handle a learning rate of up to 0.6, in which case it classifies the input data perfectly after only 3 updates.
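The normalization used here is just a rescaling of each input feature; one common choice is min-max scaling to the range [0, 1] (a sketch, and the exact scaling behind the graphs above may differ):

```python
import numpy as np

def min_max_normalize(X):
    # scale each input column (feature) to the range [0, 1]
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```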
The training of modern neural networks is often assessed using "elbow" graphs, such as the one shown below (this one was generated from the 0.1 learning rate ADALINE model). The idea is to train the model until the loss stops decreasing significantly.

Looking at the 0.1 learning rate ADALINE model, which classified perfectly after 29 updates:
It has 4 inputs. Multiplied by 5, this yields 20 training updates. This would be Widrow's calculation without adjustments for the different model used here.
Its learning, however, should be 10 times as slow because of its learning rate of 0.1. This yields 200 training updates. Recall that Professor Widrow designed this approximator for stochastic gradient descent, which in this case is less accurate than gradient descent. Since gradient descent allows ADALINE to look at all 20 input points before updating parameters, the most benefit it could offer is a factor of 20. Realistically, gradient descent isn't 20 times more efficient, so it would be within a range of 1-20 times more efficient. We obtain an approximation of between 10-200 training examples to achieve an acceptable error, which 29 easily fits into.
The loss chart of the same ADALINE model, this time implemented with stochastic gradient descent rather than gradient descent, is shown below:

It's clear that stochastic gradient descent (SGD) is noisier than gradient descent (GD). It's also less effective per update: after 20 iterations, SGD has an MSE of 0.65, while GD has an MSE of 0.54. Worse still, the lowest MSE that SGD reaches is about 0.20, while GD can reach 0.185.
However, SGD comes with advantages that have made it (along with its minibatch variant, below) the training method of choice from the 1960s to today. First, each update is faster to compute, since it only looks at one training point at a time. Second, it requires holding far less data in memory; in fact, many datasets are so large that it's not feasible to look at every training example at once. Third, SGD can mitigate the issue of getting stuck in local minima because its updates are noisy. The loss function for each individual training point is different, so with each update the model sees a slightly different loss surface. It has been shown that every input point's loss function is roughly similar to the loss function derived from all the points, except that its smaller features differ, so SGD avoids getting stuck because the local minima move with every update.
Many modern neural networks utilize a method called minibatch gradient descent, which aims to combine the accuracy of GD with the efficiency and flexibility of SGD. Minibatch gradient descent takes small groups (batches) of training points and updates the parameters using the gradient with respect to just that group, as sketched below.
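Using the same ADALINE-style update as before, the only thing that changes between GD, SGD, and minibatch gradient descent is how many points contribute to each parameter update: a batch size of 1 gives SGD, and a batch size equal to the whole dataset gives GD. The helper below is an illustrative sketch, not the article's linked code.

```python
import numpy as np

def train_adaline_minibatch(X, y, alpha=0.1, epochs=100, batch_size=4):
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle so batches differ each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = yb - (Xb @ w + b)
            w += alpha * (Xb.T @ error) / len(batch)  # gradient over just this batch
            b += alpha * error.mean()
    return w, b
```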
1961 - MADALINE
MADALINE (Multiple ADALINE) was developed in 1961 by the same pair that invented ADALINE - Bernard Widrow and
Ted Hoff. As suggested in its name, it incorporates multiple ADALINE structures within its architecture. What makes MADALINE special
is that it was the first machine learning model built from linear classifiers to produce nonlinear decision boundaries, by utilizing a second "layer", described later. MADALINE's invention also marked a substantial
step forward in the intuition behind machine learning, one that still holds today: combining multiple, even many,
simple units leads to valuable emergent behavior.
Like ADALINE, MADALINE's architecture accommodates any number of inputs. It consists of two layers: a layer of at least two
ADALINEs arranged in parallel, which all connect to a second layer containing at least one classification unit,
which could be implemented as an ADALINE, a perceptron, or even logical operators such as OR, as demonstrated by Hoff.
The number of units in the second layer corresponds to the
number of output classes. Every first-layer ADALINE is "connected" to every input (where a connection means that the input is included in the calculation of that ADALINE's z value).
There are countless variations on MADALINE, including the number of ADALINEs, the training method, and the number and
type of second-layer/output units. For the sake of simplicity, a MADALINE comprised of two ADALINEs in its first layer,
and an ADALINE classifier in the second layer will be demonstrated. The weights and bias of the second layer will not
be trained. Widrow did begin to explore methods of training the second layer weights, but none were effective enough to be widely used.
The described MADALINE is visualized below:

• The three layers (0 : Input , 1 : Multiple ADALINES, 2 : The second-layer ADALINE) are denoted with superscript.
Labeling layers 0 & 1 would cause unnecessary clutter here and is omitted.
• The first (left) subscript in each weight corresponds to the input value it multiplies, from 1-n.
• The second (right) subscript corresponds to the ADALINE (its z value) that the weight feeds into, from 1-m.
• Each z is an ADALINE's pre-activation value, which is passed through the activation function F() to produce that ADALINE's output.
The XOR (Exclusive OR) problem is the classic demonstration of MADALINE because it showcases the nonlinear classification ability of MADALINE, while keeping the calculations relatively simple (and able to be visualized) by using only two input dimensions, two first-layer ADALINES, and one second-layer ADALINE.


For reference, the weights and biases of this network are as follows:
• Layer 1, First ADALINE:
-
-
-
• Layer 1, Second ADALINE:
-
-
-
• Layer 2, ADALINE:
-
-
-
This MADALINE model was trained using this code. Weights are randomly initialized in this program, so they won't match the ones in this demonstration. Recall that ADALINE's output is calculated according to F(w · x + b), where F returns 1 if its input is greater than zero and -1 otherwise.
Let's give this MADALINE the input
The ADALINES in the first layer are calculated as:
The ADALINE in the second layer is calculated using the outputs of the ADALINES in the first layer as input:
This MADALINE is visualized geometrically (red space represents -1 classification, green space represents 1 classification) below:

Each ADALINE in the first layer provides one of the sections of green space. All input points above the line
Since the activation function F returns 1 whenever its input is greater than zero, only one positive input is necessary for the second layer's ADALINE to return a 1. Thus, the model can output a 1 in two different areas.
The remaining area, where both layer-1 ADALINES return -1, is in turn classified as -1 by the second-layer ADALINE.
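Here is a minimal sketch of this two-layer forward pass in code. The weights are illustrative values I chose to produce the same two-stripe geometry described above; they are not the weights learned by the linked training code.

```python
def F(z):
    # ADALINE activation: 1 if z > 0, otherwise -1
    return 1 if z > 0 else -1

def madaline_xor(x1, x2):
    # Layer 1: two ADALINEs, each carving out one of the green stripes (illustrative weights)
    h1 = F(-1.0 * x1 + 1.0 * x2 - 0.5)   # fires for points well above the line x2 = x1
    h2 = F( 1.0 * x1 - 1.0 * x2 - 0.5)   # fires for points well below it
    # Layer 2: OR-like ADALINE -- one positive input is enough to output 1
    return F(1.0 * h1 + 1.0 * h2 + 1.0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, madaline_xor(*x))   # -> -1, 1, 1, -1: the XOR pattern
```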
While we've already seen how each ADALINE would learn, it's important to understand when each ADALINE learns. There's no "correct" output for each ADALINE, so what do we compare its output to in the loss function? Widrow answered this with a "job assigner", pictured below this paragraph. If MADALINE's output didn't match its target output, then as many ADALINES as necessary would be updated such that the lowest satisfactory number of ADALINES matched the output of the target class (For 5 first-layer ADALINEs, 3 would need to match). The ADALINEs to be updated were selected in order of how close their z value was to the desired value, with the closest ADALINEs updated first.


MADALINE's XOR decision boundary demonstrates the key component, nonlinearity. This is of course helpful for the XOR problem, but the significance here is that there could be more inputs and more ADALINES, leading to enormously complex, high-dimensional, nonlinear decision boundaries, capable of representing complex relationships between data.
The visualization of this MADALINE allows us to get to the bottom of exactly why MADALINE (and neural networks) work. MADALINE's capability stems from what you would see from the perspective of the second-layer ADALINE: the outputs of the two first-layer ADALINEs. To emphasize its importance, I'm leaving one thing out for now, but below is very close to the perspective of the layer-2 ADALINE. Layer 1's first and second ADALINEs' z values are represented by the x and y axes, respectively. The second-layer ADALINE's perspective below was generated by the first-layer ADALINEs as follows:
As a reminder, ADALINE computes its output using F(z), where z = w · x + b.
For the input
For input
Similarly,

After activation, the second-layer ADALINE sees this:


Modern neural networks are undoubtedly far more powerful than MADALINE, but they draw on those very same principles to make their predictions.
1986 - The Modern Neural Network (MLP)
Interest in neural networks died down in the decades following MADALINE's invention due to what researchers at the time termed "the credit assignment problem". Widrow's job assigner, as clumsy as it was in the simple case of two layers, could not be effectively extended to larger, more complex neural networks. With so many connected units, it became difficult or impossible, and certainly not efficient, to determine which weights should be updated to minimize the error.