Building an Intuition for Deep Learning, Historically
Deep neural networks are loosely based on real brains and capable of learning nearly
anything, which makes them effective at driving cars, identifying images, and even holding humanlike conversations.
It's fascinating to me that a single architecture can appear to exhibit such high levels of intelligence across so many different
applications, and it left me curious as to why it works.
This site aims to build an intuition behind neural networks by following their progression through time,
beginning with the first attempt to model the brain mathematically nearly 100 years ago.
I started this project in the hope that answering the "How?" and "Why?" of neural nets cumulatively would prove satisfying, following
several brilliant researchers' contributions to machine learning as they progress toward a modern feedforward neural
network.
Table of Contents
- Introduction
- Warren McCulloch and Walter Pitts: The McCulloch-Pitts Neuron - 1943
- Frank Rosenblatt: The Perceptron - 1957
- Bernard Widrow and Ted Hoff: ADALINE - 1960
- Bernard Widrow and Ted Hoff: MADALINE - 1961
- Rumelhart, Hinton, and Williams: The Neural Network - 1986
1943 - The McCulloch-Pitts Neuron
In 1943, Warren McCulloch and Walter Pitts established a new field of research by developing the first mathematical
representation of neurons in the brain, published in their paper "A Logical Calculus of the Ideas Immanent in Nervous Activity".
The unlikely duo of college professor and homeless teenager sought to understand thought by modeling neurons and synapses as logical expressions, an approach
that was unheard of at the time. The fact that their paper cited just three sources, all foundational texts in the field of logic, speaks to the
novelty of their proposal.
When McCulloch and Pitts published, it had been known for decades that the brain consisted of neurons that were connected by synapses, and that
information collected by these synapses caused the neuron to turn on or off, sending an "on" signal out to other neurons accordingly. What
wasn't known, and still isn't, is exactly what each neuron does with its input signals to turn on or off. McCulloch and Pitts theorized that the
decisions could be represented by logical expressions. Then, as an example of its ability to model something as abstract as a
thought, they demonstrated that the perception of heat and cold could be modeled by a series of connected logical "neurons". The paper visualized this
logic using connected neurons, and their visualization is shown below.

In this example,

The pair sought to understand thought more than to recreate it, but other academics soon realized that machines would be capable of implementing their model, simulating intelligence. Here we'll follow the learning-focused interpretations of the McCulloch-Pitts model, but its contribution to conventional computing is worth noting as part of its legacy. John Von Neumann proposed the Von Neumann architecture, ubiquitous in modern computers, in "First Draft of a Report on the EDVAC" in 1945. The draft was never completed, but Von Neumann did manage to cite one paper: McCulloch and Pitts'. His implementation of computer memory was inspired by a circular group of neurons passing the same piece of information around indefinitely.
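To make the idea of neurons as logical expressions concrete, here is a minimal sketch of a McCulloch-Pitts style unit in code. The thresholds and the AND/OR/AND-NOT wiring below are illustrative choices of mine, not circuits from the 1943 paper.

```python
# A McCulloch-Pitts style unit: inputs are 0 or 1, excitatory inputs are summed,
# any active inhibitory input vetoes firing, and the unit fires (outputs 1)
# only if the excitatory sum reaches a fixed threshold.
def mcp_neuron(excitatory, inhibitory, threshold):
    if any(inhibitory):          # absolute inhibition: one inhibitory signal silences the neuron
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# Logical operations fall out of the same unit with different thresholds:
AND = lambda a, b: mcp_neuron([a, b], [], threshold=2)
OR  = lambda a, b: mcp_neuron([a, b], [], threshold=1)
AND_NOT = lambda a, b: mcp_neuron([a], [b], threshold=1)   # a AND (NOT b), b wired as inhibitory

print(AND(1, 1), OR(0, 1), AND_NOT(1, 0))   # -> 1 1 1
```

Chaining units like these is all it takes to build the kinds of networks the paper describes.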
McCulloch and Pitts avoided the issue of learning in their model because it would greatly complicate how logic flows through a network while its logical components are being modified. The McCulloch-Pitts neuron's successors would define and improve upon learning algorithms, but those algorithms were designed more through intuition than through neuroscience. In fact, nearly all of the attributes a modern neural network shares with the brain can already be found in the McCulloch-Pitts neuron, which is why the McCulloch-Pitts model can be considered the source of the artificial neural network's name.
1957 - Rosenblatt's Perceptron
In 1957, Frank Rosenblatt, a psychology professor at Cornell, implemented the first machine learning algorithm for a linear classifier, known
as the perceptron. Very likely inspired by McCulloch and Pitts' mathematical model of neurons, Rosenblatt's
perceptron is capable of learning to determine a decision boundary for any linearly separable dataset.
However, many researchers in Rosenblatt's time were concerned
by the perceptron's limitations and didn't see it as the future of machine learning. Professor Rosenblatt
passionately, and correctly, defended its potential. Sadly, he passed away more than a decade before the first
modern neural network was implemented. Although this algorithm is quite simple and limited by modern standards, all subsequent advancements in neural networks were built on this breakthrough, and the fundamentals from the perceptron are still present in every modern neural network.
The essence of a perceptron is that it calculates the weighted sum of its inputs (so that some inputs
can be assigned greater importance than others) and compares that sum to a threshold: If the weighted
sum is above the threshold, then the perceptron outputs "True", or a 1. If the weighted sum is less than
the threshold, then the perceptron outputs a "False", or a 0. While a historic breakthrough in machine
learning, it's limited in that it's only effective for linearly separable data and binary outputs.
We'll look at it in closer detail now. Rosenblatt implemented this algorithm in hardware, but we'll use
software and visualize it geometrically.
Let's first consider a perceptron with just one input, making it one-dimensional. This model accepts one input value (x1) and has two parameters, a weight (w1) and a threshold (θ) that the product of x1 and w1 must cross to be classified as positive. Since our input is one dimensional, our x vector only contains x1 and our weight vector only contains w1.

From this visualization, we can see that our decision boundary occurs where w1x1 = θ, that is, at x1 = θ/w1.
Red dots indicate inputs in the "0" class (w1x1 below the threshold θ).
Green dots indicate inputs in the "1" class (w1x1 above θ).
Note that adjusting θ moves our decision boundary horizontally.
Our threshold is more commonly represented with bias "b", where b = -θ. Making this substitution yields a decision boundary equation that's easier to use in practice: w1x1 + b = 0, with the perceptron outputting a 1 whenever w1x1 + b > 0.
It's common practice to refer to the expression on the left as z.

Now, let's give this perceptron multiple input dimensions. When we add another input (making it two
dimensional), nothing else in the structure of the perceptron changes. This second input is called x2, and it gets its own weight, w2.
This notation becomes inconvenient if there are many inputs, which is one reason why input and weight values are commonly held in an input vector (x) and a weight vector (w):

As mentioned earlier, this dot product can be thought of as a weighted sum of the inputs, so that the "importance" of each input (its weight) is factored into the calculation of z. Visually, this can be represented as:

The step function classifies the input as a 1 if z > 0, or in other words, if the point lies above the decision boundary.
In the case that z < 0 (the point lies below the decision boundary), the step function classifies the input as a 0.
In practice, it's quite rare for z to equal exactly 0, because that would mean the input sits exactly on the decision boundary. How we classify that case isn't significant, but it's included for the sake of completeness; in the following definition I'll group it with the "0" classification. The output of the perceptron (ŷ) is equal to the output of the step function F:


Inputs:
Model Parameters:
It can be insightful to represent this model geometrically. The decision boundary is the plane separating the red and green points. Green points (inputs) are instances where I do eat a meal (y=1), and red points are inputs where I don't eat a meal (y=0). The perceptron classifies any new input point on the green side of the decision boundary as a 1 (ŷ = 1), and it classifies any input on the red side as a 0 (ŷ = 0).
Link to this graph

In this example, the weight vector is represented geometrically by the blue vector. The weight vector of any perceptron is always perpendicular to the decision boundary (which is possible because the boundary is always linear). Since each element of the weight vector can take on any value, its geometric representation can point in any direction. It follows that weights determine the angle and direction of the decision boundary.
One way to think about the weight vector is as a ratio of dimensions, similarly to how the slope of a line is a ratio of dimensions (rise/run). The bias moves the decision boundary along that weight vector. Note that although the bias is included with weights and inputs in the calculation of z, it does not contribute a dimension. Rather, it moves the decision boundary according to the ratio of dimensions provided by w. Three dimensions/inputs are the highest that are accessible for us to visualize, but these principles hold true for hyperplane decision boundaries in any number of dimensions. Feel free to experiment with different weights and biases yourself.
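Putting the pieces so far into code, a sketch of the perceptron's forward pass might look like the following. The function names and the weights in the example call are made up for illustration.

```python
import numpy as np

def step(z):
    # F(z): classify as 1 if z > 0, otherwise 0 (z = 0 is grouped with the 0 class)
    return 1 if z > 0 else 0

def perceptron_predict(x, w, b):
    # z is the weighted sum of the inputs plus the bias
    z = np.dot(w, x) + b
    return step(z)

# Example: two inputs with hand-picked (illustrative) weights and bias
print(perceptron_predict(np.array([2.0, 1.0]), w=np.array([0.5, -1.0]), b=0.3))  # -> 1
```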
Training
Perceptron Code
What made the perceptron groundbreaking was its ability to learn. Given enough labeled examples to train on, the perceptron decided the optimal values of the weights and bias completely on its own (this assumes that the data is linearly separable, which we'll go into detail on later). The perceptron modeled above learned to place the decision plane using just the points on the graph. We'll follow the perceptron through this learning process.
The learning rate (denoted by α) is a value, usually between 0 and 1, that controls how fast
the perceptron learns. If it's 0, the perceptron won't learn at all; if it's 1, the perceptron
learns rather quickly. We'll see later what happens if the perceptron learns too quickly,
but for now, we'll use an appropriate setting to see how it works.
The perceptron updates its own weights and biases according to this learning rule:
Let's apply the learning rule to my meal prediction perceptron, using the points shown on the graph as training data. Since this took 580 iterations to train, we will only look at a handful of examples. From now on, I will demonstrate calculations using vector notation, but explicit notation is included below for reference.
Explicit notation: wj = wj + α(y - ŷ)xj for each weight wj, and b = b + α(y - ŷ)
Vector notation: w = w + α(y - ŷ)x, and b = b + α(y - ŷ)
Before a perceptron begins training, we initialize the weight vector and bias of the perceptron to 0 (Sometimes they are initialized to small random values, but either method works). This means that z will always be zero, so F(z) will always be zero. For any input vector, this untrained perceptron will always output "0".
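Combining the zero initialization with the learning rule gives the whole training loop. The sketch below is a simplified stand-in for the linked Perceptron Code, with names of my own choosing:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=100):
    # X: (n_samples, n_features) inputs; y: labels of 0 or 1
    w = np.zeros(X.shape[1])   # weights initialized to 0
    b = 0.0                    # bias initialized to 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else 0
            error = y_i - y_hat          # 0 when the point is already classified correctly
            w += alpha * error * x_i     # wj = wj + alpha * error * xj
            b += alpha * error           # b  = b  + alpha * error
    return w, b
```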
Input vectors:
Training Point 1:
The first input the perceptron will train on is the point x = (x1, x2, x3) = (3, 1, 1). Its corresponding y value is 0, so this point is labeled with red on the graph. We already know that this perceptron will always output zero, but we'll follow its calculations for clarity:
Model's Prediction:
Perceptron's output: 0
Parameter update:
error = (0 - 0) = 0
w1 = 0 + (0.1)(0)(3) = 0
w2 = 0 + (0.1)(0)(1) = 0
w3 = 0 + (0.1)(0)(1) = 0
b = 0 + (0.1)(0) = 0
This result demonstrates that the perceptron does not update itself in every training example. The perceptron only looks at one point at a time, and it classified this point correctly (error = 0), so it doesn't update parameters. In the second training example, the perceptron will misclassify a point and we'll see it update its parameters.
Training Point 2:
The second point the perceptron will train on is x = (7, 3, 3). y = 1 for this point, so it's labeled with green on the graph. Its calculations are as follows:
Model's Prediction:
Perceptron's output: 0
Parameter update:
error = (1 - 0) = 1
w1 = 0 + (0.1)(1)(7) = 0.7
w2 = 0 + (0.1)(1)(3) = 0.3
w3 = 0 + (0.1)(1)(3) = 0.3
b = 0 + (0.1)(1) = 0.1
This updated perceptron is represented geometrically below. Notice that the weights already orient the decision boundary in vaguely the right direction, while the bias actually updated in the wrong direction in this instance (it hasn't yet seen a point that requires moving the decision boundary further into the first quadrant; that happens when it predicts a "1" for a "0"). The weights and bias will both improve as the model sees more training examples.

Training Point 3:
The third point the perceptron will train on is
Model's Prediction:
Perceptron's output: 1
Parameter update:
error = (0 - 1) = -1
This updated perceptron is again represented geometrically below. Notice that it begins to separate the red and green points. Our weights are slightly more accurate, and the bias has moved the decision boundary back towards the first quadrant as desired.


If the training data for a perceptron is not linearly separable, it will never converge because it will keep moving the decision boundary indefinitely. This limitation in machine learning is addressed by the perceptron's successor only a handful of years later.
Another way to think about the perceptron
We can get a deeper geometric understanding of the perceptron by treating each x value as a vector rather than a point: each x vector starts at the origin and ends where its corresponding point is. Again, the dot product of vectors determines z. In this view (ignoring the bias), the dot product w · x is positive exactly when the angle between w and x is less than 90 degrees, and negative when the angle is more than 90 degrees. The perceptron therefore updates its weights when a "1" x vector sits more than 90 degrees from the weight vector (its dot product is negative, so it's misclassified), or when a "0" x vector sits less than 90 degrees from the weight vector (its dot product is positive). Each update, w + αx or w - αx, rotates the weight vector toward the "1" vectors it's missing and away from the "0" vectors it's wrongly capturing.
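As a quick numeric check of the claim about angles, using made-up vectors:

```python
import numpy as np

def angle_deg(u, v):
    # angle between two vectors, in degrees
    return np.degrees(np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))))

w = np.array([1.0, 0.0])
x_close = np.array([1.0, 2.0])    # angle under 90 degrees -> positive dot product
x_far = np.array([-1.0, 2.0])     # angle over 90 degrees  -> negative dot product

print(angle_deg(w, x_close), np.dot(w, x_close))   # ~63.4 degrees, 1.0
print(angle_deg(w, x_far), np.dot(w, x_far))       # ~116.6 degrees, -1.0
```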
1960 - ADALINE - Bernard Widrow and Ted Hoff
ADALINE (Adaptive Linear Element) was an early improvement on the Perceptron. Invented at Stanford University
by Bernard Widrow and his student Ted Hoff, ADALINE updated the Perceptron's learning rule so that
it was capable of finding the best decision boundary even for data that was not linearly separable. Still, the decision
boundary itself was constrained to be linear, and while it could find the best linear approximation of the data, it couldn't perfectly fit nonlinear data.
This constraint was removed only a year later by Widrow and Hoff with MADALINE, discussed later.
ADALINE utilizes the same structure as the perceptron, but with a different learning algorithm. The key difference
between the models is that the perceptron implements its learning algorithm using the result of the step
function, while ADALINE learns from the z value before it's passed through the step function.


Below is Widrow and Hoff's original diagram of ADALINE. As with the perceptron, ADALINE was originally implemented in hardware. For reference, the a's are weights, and a0 together with the constant "1" input makes up the bias (adding a "1" to the start of the input vector and a weight w0 at the start of the weight vector produces the same effect as adding a bias of value w0 to the dot product of inputs and weights). Any time the bias isn't mentioned in a calculation of z, it's safe to assume that the weight vector already includes it as described. A quick numeric check of this equivalence follows the diagram.

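Here is that check in code, reusing the weights, bias, and input from the perceptron example above (the variable names are mine):

```python
import numpy as np

x = np.array([3.0, 1.0, 1.0])
w = np.array([0.7, 0.3, 0.3])
b = 0.1

x_aug = np.concatenate(([1.0], x))     # prepend a constant 1 to the inputs
w_aug = np.concatenate(([b], w))       # prepend the bias as weight w0

print(np.dot(w, x) + b)        # 2.8
print(np.dot(w_aug, x_aug))    # 2.8 -- identical
```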
The gradient is a vector made up of derivatives, each of which can be thought of as the slope at a point with respect to one input dimension. The vector containing all of these derivatives therefore points in the combined direction of all the slopes. This property guarantees that it points in the direction of fastest increase at any point, so taking the negative of the gradient gives the direction of fastest decrease. Additionally, the length of the gradient vector corresponds to the steepness of the change at that point.
This is useful for our loss function: we can calculate its gradient to find the direction in which loss decreases fastest and update the weights in that direction, yielding a lower loss. Repeating this process will eventually cause the model to converge to the best solution. This fundamental machine learning process is known as gradient descent, and it's still used to train modern neural networks. An example of gradient descent applied to a simple function follows.
For the sake of an introductory example, imagine a simple ADALINE with one weight and a bias value. Suppose the loss of this fictitious model is given by
Function: L(w, b) = w² + b²
Gradient: ∇L = [ ∂L/∂w, ∂L/∂b ] = [ 2w, 2b ]
In this example, we'll evaluate the gradient at the starting point (w, b) = (2, 1):

∇L = [ 2(2), 2(1) ]
= [ 4, 2 ]
Our gradient vector is graphed below on our loss function. It's pointing in the right direction, but it's clear that it's far too long. This vector is overshooting the minimum, a problem solved in machine learning by using a learning rate. For the rest of this example, we'll scale our gradient vectors by a learning rate of 0.1 to avoid overshooting.

w = 2 - 4 = -2
b = 1 - 2 = -1
L(-2, -1) = (-2)² + (-1)² = 5
The unscaled vector yields no improvement in the loss. Let's recalculate using a learning rate of 0.1:
w = 2 + (0.1)(-4) = 1.6
b = 1 + (0.1)(-2) = 0.8
L(1.6, 0.8) = 3.2
Utilizing the scaled vector improves the loss significantly.

The gradient at our new point is ∇L = [ 2(1.6), 2(0.8) ] = [ 3.2, 1.6 ]. Taking the negative and scaling by the learning rate yields the vector we'll adjust the parameters with:
[ -0.32, -0.16 ]
Calculating new w and b values:
w = 1.6 - 0.32 = 1.28
b = 0.8 - 0.16 = 0.64
L(1.28, 0.64) = (1.28)² + (0.64)² ≈ 2.05
Again, loss decreases as desired. In practice, this process is repeated many, many times. A few more steps are visualized below.

The ball in the visualization also illustrates a weakness of gradient descent: it can get "stuck" in a region above the function's minimum if it has no downhill path to it. Consider this two-dimensional example, where the green ball iterates lower until it gets "stuck" in a local minimum:

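The worked example above can be reproduced in a few lines. This sketch simply repeats the update w ← w − α·∂L/∂w, b ← b − α·∂L/∂b on the toy loss L(w, b) = w² + b²:

```python
def grad_descent_demo(w=2.0, b=1.0, alpha=0.1, steps=5):
    for i in range(steps):
        grad_w, grad_b = 2 * w, 2 * b          # gradient of L(w, b) = w**2 + b**2
        w -= alpha * grad_w                    # step in the direction of fastest decrease
        b -= alpha * grad_b
        print(f"step {i + 1}: w = {w:.3f}, b = {b:.3f}, loss = {w**2 + b**2:.3f}")

grad_descent_demo()   # the first two steps reproduce (1.6, 0.8, 3.2) and (1.28, 0.64, 2.05)
```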
ADALINE's Loss Function
Although Widrow and Hoff proposed this formula as the mean of the squared errors over every point (as in gradient descent), the hardware limitations of their time forced them to implement it using the squared error of one point at a time. This method of calculating the gradient with respect to the error of one point at a time is called stochastic gradient descent. They proved that stochastic gradient descent (SGD) is a strong approximation of gradient descent (GD) and verified their results in practice. We'll look at the mean squared error over all points here; subsequent models will use stochastic gradient descent, and SGD actually has advantages of its own that emerge later in multilayer neural networks.
At each point i, evaluating the gradient of this function involves taking its derivative with respect to each weight wj and then its derivative with respect to the bias, arranging all of these into a vector of m weight derivatives plus the bias derivative:
Before calculating the gradient, it's important to recall the chain rule, which dictates that the derivative of a composite function is the outer derivative times the inner derivative; here, that means ∂L/∂wj = (∂L/∂z)(∂z/∂wj).
For each weight wj in the gradient vector described above, the gradient of this loss function is calculated as:
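One standard way to write this out, assuming the mean squared error over N training points as described above (the constant factor out front depends on the averaging convention and is usually absorbed into the learning rate):

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - z_i\right)^2, \qquad z_i = \mathbf{w}\cdot\mathbf{x}_i + b$$

$$\frac{\partial L}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N} 2\left(y_i - z_i\right)\frac{\partial\left(y_i - z_i\right)}{\partial w_j} = -\frac{2}{N}\sum_{i=1}^{N}\left(y_i - z_i\right)x_{ij}$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}\left(y_i - z_i\right)$$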
Training
Code for ADALINE (Gradient descent)
Let's train an ADALINE model on the same dataset as the perceptron:
If we train an ADALINE model on the dataset as is, it actually doesn't converge to a solution that separates the data points correctly. Instead, the weights explode: in just a handful of iterations, the model's MSE is on the order of thousands, and it continues to increase with more epochs.

The exploding weights issue can be remedied by lowering the learning rate, preventing overshooting. After retraining the same model with a learning rate of 0.01, the model converges with the decision boundary visualized below.


The solution to this issue is to calculate the error equally for a misclassified 0 or 1. An easy way to do this is to change every "0" value in the y vector to "-1". The corrected error calculations are shown below. Note that any

With a learning rate of 0.015, ADALINE easily outperforms the perceptron in terms of steps to a correct solution. This model learns a decision boundary that classifies every point in the dataset correctly in 227 steps, a 60% improvement over the perceptron. Unfortunately the loss function has too many dimensions to be visualized, but it's still easy to watch the decision boundary update as demonstrated in the next example.
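For reference, here is a condensed sketch of the batch-gradient-descent ADALINE training described here, assuming targets of -1/1 and folding the gradient's constant factor into the learning rate. The names are mine; the linked code is the authoritative version.

```python
import numpy as np

def train_adaline_gd(X, y, alpha=0.015, epochs=300):
    # X: (n_samples, n_features) inputs; y: targets of -1 or 1
    w, b = np.zeros(X.shape[1]), 0.0
    losses = []
    for _ in range(epochs):
        z = X @ w + b                        # ADALINE learns from z, not from the step output
        error = y - z
        w += alpha * (X.T @ error) / len(X)  # average gradient over the whole dataset
        b += alpha * error.mean()
        losses.append((error ** 2).mean())   # MSE per epoch, e.g. for an "elbow" graph
    return w, b, losses

def adaline_predict(X, w, b):
    return np.where(X @ w + b > 0, 1, -1)    # the step function is only used for prediction
```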
This ADALINE model begins with a mean squared error (MSE) of 1. This is because, with the initial weights and bias at zero, z starts out as 0 for every input point, while every target y is either 1 or -1, so every squared error starts at (±1 - 0)² = 1.
After the average gradient has been calculated, negated, and scaled according to the learning rate (as shown above), the parameters can be reassigned:
Link to this graph, consider following along by changing weights in decision boundary equation

Before the second parameter update, the model recalculates the average gradient, then negates and scales it. This updated gradient is added to the current parameters in the parameter update:


Normalizing the input data allows us to use the original learning rate of 0.1, and the model then perfectly separates the input data after only 29 iterations. We can use a higher learning rate here because the x values, now between 0 and 1 (rather than 0 and 10), produce much smaller z values, leading to smaller weight updates. Input normalization is so effective in this example that the model can handle a learning rate of up to 0.6, in which case it classifies the input data perfectly after only 3 updates.
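The normalization used here is just a rescaling of each input feature; one common choice is min-max scaling to the range [0, 1] (a sketch, and the exact scaling behind the graphs above may differ):

```python
import numpy as np

def min_max_normalize(X):
    # scale each input column (feature) to the range [0, 1]
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```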
The training of modern neural networks is often assessed using "elbow" graphs, such as the one shown below (this one was generated from the 0.1 learning rate ADALINE model). The idea is to train the model until the loss stops decreasing significantly.

Looking at the 0.1 learning rate ADALINE model, which classified perfectly after 29 updates:
It has 4 inputs. Multiplied by 5, this yields 20 training updates. This would be Widrow's calculation without adjustments for the different model used here.
Its learning, however, should be 10 times as slow because of its learning rate of 0.1. This yields 200 training updates. Recall that Professor Widrow designed this approximator for stochastic gradient descent, which in this case is less accurate than gradient descent. Since gradient descent allows ADALINE to look at all 20 input points before updating parameters, the most benefit it could offer is a factor of 20. Realistically, gradient descent isn't 20 times more efficient, so it would be within a range of 1-20 times more efficient. We obtain an approximation of between 10-200 training examples to achieve an acceptable error, which 29 easily fits into.
The loss chart of the same ADALINE model, this time implemented with stochastic gradient descent rather than gradient descent, is shown below:

It's clear that stochastic gradient descent (SGD) is noisier than gradient descent (GD). It's also less effective per update: after 20 iterations, SGD has an MSE of 0.65, while GD has an MSE of 0.54. Worse still, the lowest MSE that SGD reaches is about 0.20, while GD can reach 0.185.
However, SGD comes with advantages that have made it (along with its minibatch variant, below) the training method of choice from the 1960s to today. First, each update is faster to compute, since it only looks at one training point at a time. Second, it requires holding far less data in memory; in fact, many datasets are so large that it's not feasible to look at every training example at once. Third, SGD can mitigate the issue of getting stuck in local minima because its updates are noisy. The loss function for each individual training point is different, so with each update the model sees a slightly different loss surface. It has been shown that every input point's loss function is roughly similar to the loss function derived from all the points, except that its smaller features differ, so SGD avoids getting stuck because the local minima move with every update.
Many modern neural networks utilize a method called minibatch gradient descent, which aims to combine the accuracy of GD with the efficiency and flexibility of SGD. Minibatch gradient descent takes small groups (batches) of training points and updates the parameters using the gradient with respect to just that group, as sketched below.
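Using the same ADALINE-style update as before, the only thing that changes between GD, SGD, and minibatch gradient descent is how many points contribute to each parameter update: a batch size of 1 gives SGD, and a batch size equal to the whole dataset gives GD. The helper below is an illustrative sketch, not the article's linked code.

```python
import numpy as np

def train_adaline_minibatch(X, y, alpha=0.1, epochs=100, batch_size=4):
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle so batches differ each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = yb - (Xb @ w + b)
            w += alpha * (Xb.T @ error) / len(batch)  # gradient over just this batch
            b += alpha * error.mean()
    return w, b
```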
1961 - MADALINE
MADALINE (Multiple ADALINE) was developed in 1961 by the same pair that invented ADALINE - Bernard Widrow and
Ted Hoff. As suggested in its name, it incorporates multiple ADALINE structures within its architecture. What makes MADALINE special
is that it was the first machine learning model built from linear classifiers to produce nonlinear decision boundaries, by utilizing a second "layer", described later. MADALINE's invention also marked a substantial
step forward in the intuition behind machine learning, one that still holds today: combining multiple, even many,
simple units leads to valuable emergent behavior.
Like ADALINE, MADALINE's architecture accommodates any number of inputs. It consists of two layers: a layer of at least two
ADALINEs arranged in parallel, which all connect to a second layer containing at least one classification unit,
which could be implemented as an ADALINE, a perceptron, or even logical operators such as OR, as demonstrated by Hoff.
The number of units in the second layer corresponds to the
number of output classes. Every first-layer ADALINE is "connected" to every input (where a connection means that the input is included in the calculation of that ADALINE's z value).
There are countless variations on MADALINE, including the number of ADALINEs, the training method, and the number and
type of second-layer/output units. For the sake of simplicity, a MADALINE comprised of two ADALINEs in its first layer,
and an ADALINE classifier in the second layer will be demonstrated. The weights and bias of the second layer will not
be trained. Widrow did begin to explore methods of training the second layer weights, but none were effective enough to be widely used.
The described MADALINE is visualized below:

• The three layers (0 : Input , 1 : Multiple ADALINES, 2 : The second-layer ADALINE) are denoted with superscript.
Labeling layers 0 & 1 would cause unnecessary clutter here and is omitted.
• The first (left) subscript in each weight corresponds to the input value it multiplies, from 1-n.
• The second (right) subscript corresponds to the ADALINE (its z value) that the weight feeds into, from 1-m.
• Each z is an ADALINE's pre-activation value, which is passed through the activation function F() to produce that ADALINE's output.
The XOR (Exclusive OR) problem is the classic demonstration of MADALINE because it showcases the nonlinear classification ability of MADALINE, while keeping the calculations relatively simple (and able to be visualized) by using only two input dimensions, two first-layer ADALINES, and one second-layer ADALINE.


For reference, the weights and biases of this network are as follows:
• Layer 1, First ADALINE:
-
-
-
• Layer 1, Second ADALINE:
-
-
-
• Layer 2, ADALINE:
-
-
-
This MADALINE model was trained using this code. Weights are randomly initialized in this program, so they won't match the ones in this demonstration. Recall that ADALINE's output is calculated according to F(w · x + b), where F returns 1 if its input is greater than zero and -1 otherwise.
Let's give this MADALINE the input
The ADALINES in the first layer are calculated as:
The ADALINE in the second layer is calculated using the outputs of the ADALINES in the first layer as input:
This MADALINE is visualized geometrically (red space represents -1 classification, green space represents 1 classification) below:

Each ADALINE in the first layer provides one of the sections of green space. All input points above the line
Since the activation function F returns 1 whenever its input is greater than zero, only one positive input is necessary for the second layer's ADALINE to return a 1. Thus, the model can output a 1 in two different areas.
The remaining area, where both layer-1 ADALINES return -1, is in turn classified as -1 by the second-layer ADALINE.
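Here is a minimal sketch of this two-layer forward pass in code. The weights are illustrative values I chose to produce the same two-stripe geometry described above; they are not the weights learned by the linked training code.

```python
def F(z):
    # ADALINE activation: 1 if z > 0, otherwise -1
    return 1 if z > 0 else -1

def madaline_xor(x1, x2):
    # Layer 1: two ADALINEs, each carving out one of the green stripes (illustrative weights)
    h1 = F(-1.0 * x1 + 1.0 * x2 - 0.5)   # fires for points well above the line x2 = x1
    h2 = F( 1.0 * x1 - 1.0 * x2 - 0.5)   # fires for points well below it
    # Layer 2: OR-like ADALINE -- one positive input is enough to output 1
    return F(1.0 * h1 + 1.0 * h2 + 1.0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, madaline_xor(*x))   # -> -1, 1, 1, -1: the XOR pattern
```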
While we've already seen how each ADALINE would learn, it's important to understand when each ADALINE learns. There's no "correct" output for each ADALINE, so what do we compare its output to in the loss function? Widrow answered this with a "job assigner", pictured below this paragraph. If MADALINE's output didn't match its target output, then as many ADALINES as necessary would be updated such that the lowest satisfactory number of ADALINES matched the output of the target class (For 5 first-layer ADALINEs, 3 would need to match). The ADALINEs to be updated were selected in order of how close their z value was to the desired value, with the closest ADALINEs updated first.


MADALINE's XOR decision boundary demonstrates the key component, nonlinearity. This is of course helpful for the XOR problem, but the significance here is that there could be more inputs and more ADALINES, leading to enormously complex, high-dimensional, nonlinear decision boundaries, capable of representing complex relationships between data.
The visualization of this MADALINE allows us to get to the bottom of exactly why MADALINE (and neural networks) work. MADALINE's capability stems from what you would see from the perspective of the second-layer ADALINE: the outputs of the two first-layer ADALINEs. To emphasize its importance, I'm leaving one thing out for now, but below is very close to the perspective of the layer-2 ADALINE. Layer 1's first and second ADALINEs' z values are represented by the x and y axes, respectively. The second-layer ADALINE's perspective below was generated by the first-layer ADALINEs as follows:
As a reminder, ADALINE computes its output using F(z), where z = w · x + b.
For the input
For input
Similarly,

After activation, the second-layer ADALINE sees this:


Modern neural networks are undoubtedly far more powerful than MADALINE, but they draw on those very same principles to make their predictions.
1986 - The Modern Neural Network (MLP)
Interest in neural networks died down in the decades following MADALINE's invention due to what researchers at the time termed "the credit assignment problem". Widrow's job assigner, as clumsy as it was in the simple case of two layers, could not be effectively extended to larger, more complex neural networks. With so many connected units, it became difficult or impossible, and certainly not efficient, to determine which weights should be updated to minimize the error.