The appeal of modeling machine-learning architectures on the human brain is deeply rooted in historical and rhetorical inertia. Early artificial neural networks were built around a metaphor of interconnected biological neurons, and that metaphor shaped our intuition about what "intelligence" should look like. Over the decades, though, the field has repeatedly discovered that scale and simple, general-purpose methods tend to outperform hand-crafted, biologically inspired heuristics.
In his essay, Richard Sutton called this insight the Bitter Lesson: rather than embedding world knowledge or cognitive priors into architectures, it is more effective to lean on massive computation, search, and data. Recent empirical studies of two decades of computer vision research confirm strong adherence to these principles, showing a steady shift toward general-purpose learning methods and ever-larger computational budgets.
By thinking in metaphors about brains and neurons, ML risks conflating functional analogy with mechanistic equivalence. This conflation can mislead research priorities, suppressing exploration of algorithmic substrates that might be simpler, more efficient, or more robust.
Algorithmic Emergence as Substrate-Agnostic Dynamics
To see how much intellectual weight can be carried by purely algorithmic structures, consider the state-space interpretation of a sorting algorithm. This example shows emergence in its clearest form: global order arising from local rules without any biological inspiration.
Setting up the framework: Let a dataset be a sequence
X = [x_1, x_2, \dots, x_n]
where each x_i is some value we want to sort (like numbers or words).
Let \mathcal{S} be the set of all permutations of X. Think of each permutation \sigma as one possible arrangement of the elements. For example, if X = [3, 1, 2], then one permutation might be [1, 2, 3] (sorted), another might be [2, 3, 1] (unsorted), and so on.
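A tiny sketch using only the Python standard library makes the state space concrete for the running example X = [3, 1, 2]; the factorial growth of |\mathcal{S}| = n! is exactly why no algorithm can afford to inspect every arrangement:

```python
from itertools import permutations

X = [3, 1, 2]

# The state space S: every possible arrangement of the elements of X.
state_space = list(permutations(X))

print(f"|S| = {len(state_space)} states for n = {len(X)}")  # 3! = 6
for sigma in state_space:
    print(sigma)
```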
Visualizing the state space: The following interactive visualization demonstrates how bubble sort operates on this state space. You can see how the algorithm systematically explores different arrangements, gradually moving toward the sorted state through local comparisons and swaps:
[Interactive visualization: bubble sort cycling through the permutation state space]
The dynamical system view: A sorting procedure induces a discrete-time dynamical system via a transition operator
T : \mathcal{S} \to \mathcal{S}.
This operator T takes one arrangement and transforms it into another. Each "step" of the sorting algorithm corresponds to applying T once.
For deterministic algorithms like Mergesort, T is fully determined by the control flow of the algorithm—given any permutation, the next permutation is completely specified. For randomized algorithms like Quicksort, T becomes a Markov kernel
T(\sigma' \mid \sigma),
meaning the next permutation \sigma' is chosen randomly from a probability distribution that depends on the current permutation \sigma.
Either way, the system evolves through:
\sigma_{t+1} = T(\sigma_t).
This simply says: the arrangement at time step t+1 is obtained by applying the transition operator to the arrangement at time t.
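To make the operator view concrete, here is a minimal sketch in plain Python that treats one full bubble-sort pass as T and iterates \sigma_{t+1} = T(\sigma_t) until nothing changes; the choice of bubble sort, and of "one pass = one time step", is an illustrative assumption rather than the only possible discretization:

```python
def T(sigma):
    """One bubble-sort pass: a deterministic transition operator on permutations.

    Scans left to right and swaps any adjacent pair that is out of order,
    returning the next state sigma_{t+1}.
    """
    nxt = list(sigma)
    for i in range(len(nxt) - 1):
        if nxt[i] > nxt[i + 1]:
            nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
    return nxt

# Iterate sigma_{t+1} = T(sigma_t) until a fixed point is reached.
sigma = [5, 3, 4, 1, 2]
trajectory = [sigma]
while True:
    nxt = T(sigma)
    if nxt == sigma:        # fixed point: the sorted arrangement
        break
    sigma = nxt
    trajectory.append(sigma)

for t, state in enumerate(trajectory):
    print(f"t = {t}: {state}")
```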
The target state: The sorted list is an absorbing state \sigma^\star, satisfying T(\sigma^\star) = \sigma^\star. Once you reach the sorted arrangement, applying T again doesn't change anything—you stay sorted.
How emergence happens—the inversion count: Crucially,
global order emerges not because the algorithm “understands” the structure of the list, but because repeated local
operations (comparisons, swaps, recursive partitions) monotonically reduce an objective such as the inversion count:
\mathrm{Inv}(\sigma) = \left|\lbrace (i,j) \mid i < j \wedge \sigma(i) > \sigma(j) \rbrace\right|.
In plain language: an inversion is a pair of positions where a larger element appears before a smaller one. For example, [3, 1, 2] has two inversions—3 appears before 1 and 3 appears before 2—while 1 and 2 are already in the correct relative order. The sorted list [1, 2, 3] has zero inversions.
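A direct implementation of the inversion count (quadratic time, written for clarity rather than efficiency) might look like this sketch:

```python
def count_inversions(sigma):
    """Number of pairs (i, j) with i < j but sigma[i] > sigma[j]."""
    n = len(sigma)
    return sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])

print(count_inversions([3, 1, 2]))  # 2 -> (3,1) and (3,2)
print(count_inversions([1, 2, 3]))  # 0 -> already sorted
print(count_inversions([3, 2, 1]))  # 3 -> maximally disordered for n = 3
```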
How inversions shrink: Different sorting algorithms reduce inversions in characteristically different ways. Insertion sort gradually builds order from the left, while selection sort repeatedly finds the minimum remaining element and places it in its final position.
Why sorting works: Sorting is equivalent to driving \mathrm{Inv}(\sigma) to zero. Many algorithms satisfy:
\mathbb{E}[\mathrm{Inv}(\sigma_{t+1}) \mid \sigma_t] < \mathrm{Inv}(\sigma_t) \quad \text{for all } \sigma_t \neq \sigma^\star,
a stochastic Lyapunov condition that guarantees convergence to \sigma^\star. This is emergence in the strict mathematical sense: global structure arises from dynamics over simple local rules and a monotone descent metric. The algorithm never "sees" the whole list at once in any meaningful way—it just compares pairs and swaps them. Yet somehow, a fully sorted list emerges.
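The stochastic version of this condition is easy to observe with a toy randomized process (an illustrative assumption, not any specific production algorithm): at each step pick a random adjacent pair and swap it only if it is inverted. The inversion count never increases, and it strictly decreases in expectation whenever the state is unsorted, so the run-averaged trajectory drifts toward zero:

```python
import random

def count_inversions(sigma):
    n = len(sigma)
    return sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])

def random_local_step(sigma):
    """Markov kernel T(sigma' | sigma): pick a random adjacent pair,
    swap it only if it forms an inversion."""
    nxt = list(sigma)
    i = random.randrange(len(nxt) - 1)
    if nxt[i] > nxt[i + 1]:
        nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
    return nxt

random.seed(0)
n, steps, runs = 8, 60, 2000
start = list(range(n, 0, -1))        # maximally disordered starting state

# Average Inv(sigma_t) over many independent runs: an empirical E[Inv(sigma_t)].
avg_inv = [0.0] * (steps + 1)
for _ in range(runs):
    sigma = list(start)
    for t in range(steps + 1):
        avg_inv[t] += count_inversions(sigma) / runs
        sigma = random_local_step(sigma)

for t in range(0, steps + 1, 10):
    print(f"t = {t:3d}   E[Inv] ≈ {avg_inv[t]:.2f}")
```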
Adaptive sorting algorithms can even exhibit input-sensitive emergent behavior, with complexity bounds that depend on how "nearly sorted" the input already is, demonstrating sophisticated behavior without biological metaphors (see Petersson & Moffat, Adaptive Sorting, BRICS RS-04-27).
The key insight: Emergence is a property of dynamics, not a property of neurons or biological tissue. It's a mathematical phenomenon that appears whenever you have:
1. A state space (all possible arrangements)
2. A transition rule (how to move from one state to another)
3. A monotonically decreasing objective function (something that gets smaller with each step)
4. An absorbing target state (where you want to end up)
Adaptive sorting strengthens this point further. Bounds such as
O\!\left(n\left(1 + \log\left(1 + \frac{\mathrm{Inv}}{n}\right)\right)\right)
for certain Quicksort variants make the dependence on disorder explicit: even classical algorithms tailor their running time to the structure of the input, without invoking any biological metaphor.
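The quoted bound is specific to those Quicksort variants, but the same adaptivity is easy to observe with a simpler algorithm. The sketch below counts comparisons made by plain insertion sort, whose cost is roughly n + \mathrm{Inv} and therefore shrinks as the input becomes more nearly sorted; insertion sort is chosen only for brevity and is not the algorithm the cited bound refers to:

```python
import random

def insertion_sort_comparisons(values):
    """Sort a copy of `values`, returning the number of element comparisons.
    For insertion sort this is roughly n + Inv(values)."""
    a = list(values)
    comparisons = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i >= 0:
            comparisons += 1
            if a[i] > key:
                a[i + 1] = a[i]   # shift larger element right
                i -= 1
            else:
                break
        a[i + 1] = key
    return comparisons

random.seed(0)
n = 2000
nearly_sorted = list(range(n))
for _ in range(50):                   # a few adjacent swaps: at most 50 inversions
    i = random.randrange(n - 1)
    nearly_sorted[i], nearly_sorted[i + 1] = nearly_sorted[i + 1], nearly_sorted[i]
shuffled = random.sample(range(n), n)  # about n^2/4 inversions on average

print("nearly sorted:", insertion_sort_comparisons(nearly_sorted))
print("fully shuffled:", insertion_sort_comparisons(shuffled))
```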
Thus, emergence is a property of dynamics, not a property of neurons.
Biological Emergence Is Mechanistic, Not Abstract
Contrast this with self-organizing biological systems such as regenerating frog cells or planarian tissue. Their emergent behavior arises from morphogen gradients, ion-channel–mediated signaling, mechanical tension fields, and cell-adhesion differentials—ancient mechanisms by which cells share information through electrical signals and coordinate their activity during development and regeneration.
These systems operate through bioelectrical gradients that control everything from left-right asymmetry to limb regeneration, with specific voltage patterns encoding positional information that guides morphogenesis.
These systems can be modeled by reaction–diffusion equations in continuous space:
\frac{\partial c_i}{\partial t} = D_i \nabla^2 c_i + R_i(c_1, c_2, \dots),
where:
- c_i are morphogen concentrations (chemical signals that tell cells what to become)
- D_i are diffusion coefficients (how fast each morphogen spreads)
- \nabla^2 is the Laplacian operator (measures how concentration varies across space)
- R_i are nonlinear biochemical reaction terms (how morphogens interact and transform)
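For intuition about how such equations are stepped forward in practice, here is a minimal one-dimensional sketch using explicit Euler finite differences in NumPy. The reaction terms follow the Gray-Scott model purely as an illustration, and every parameter value is an assumption chosen for numerical stability, not a fit to any real biological system:

```python
import numpy as np

# 1-D reaction-diffusion: dc_i/dt = D_i * d2c_i/dx2 + R_i(c_1, c_2).
nx, dx, dt, steps = 200, 1.0, 0.2, 5000
D1, D2 = 0.16, 0.08          # diffusion coefficients D_i
F, k = 0.035, 0.065          # feed/kill rates inside the reaction terms R_i

c1 = np.ones(nx)             # morphogen 1 concentration
c2 = np.zeros(nx)            # morphogen 2 concentration
c2[nx // 2 - 5: nx // 2 + 5] = 0.5   # small localized perturbation in the middle

def laplacian(c):
    """Discrete Laplacian with periodic boundary conditions."""
    return (np.roll(c, 1) - 2 * c + np.roll(c, -1)) / dx**2

for _ in range(steps):
    r = c1 * c2**2                                   # nonlinear reaction term
    c1 += dt * (D1 * laplacian(c1) - r + F * (1 - c1))
    c2 += dt * (D2 * laplacian(c2) + r - (F + k) * c2)

# The localized perturbation has spread and reshaped both concentration fields.
print("c1 range:", float(c1.min()), float(c1.max()))
print("c2 range:", float(c2.min()), float(c2.max()))
```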
The crucial distinction: Although both sorting algorithms and biological tissues exhibit convergence toward organized states, the mechanisms are wholly unrelated. A discrete permutation Markov chain (sorting) shares no ontology with a biochemical reaction–diffusion field (morphogenesis). They're both examples of emergence, but they work through completely different physical and mathematical substrates.
Emergence is universal across systems, but substrates differ fundamentally. Analogy alone provides no warrant for importing biological intuitions into ML architecture design. ML often forgets this distinction.
Machine Learning Through the Lens of Dynamical Systems
Modern ML models—transformers, diffusion models, softmax-attention networks—are also discrete-time dynamical systems, but they operate in high-dimensional vector spaces rather than discrete permutation spaces.
Training as trajectory: Let the parameters be \theta \in \mathbb{R}^d, where d might be billions of dimensions. Training defines a trajectory through this space. The simplest version is gradient descent:
\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t),
where:
- \eta is the learning rate (step size)
- \nabla_\theta \mathcal{L} is the gradient of the loss function (direction of steepest ascent in loss)
- We subtract because we want to descend to lower loss
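As a sketch, here is plain gradient descent on a toy two-parameter quadratic loss; the loss function, learning rate, and iteration count are arbitrary illustrative choices, not anything from a real training run:

```python
import numpy as np

def loss(theta):
    # A toy quadratic "landscape" with its minimum at (3, -2).
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2

def grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 2.0)])

eta = 0.04                      # learning rate
theta = np.array([0.0, 0.0])    # initial parameters
for t in range(200):
    theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta, loss(theta))       # approaches (3, -2) with loss near 0
```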
With momentum (a common enhancement), we get:
\begin{aligned}
v_{t+1} &= \beta v_t + (1 - \beta)\nabla_\theta \mathcal{L}(\theta_t), \\
\theta_{t+1} &= \theta_t - \eta v_{t+1},
\end{aligned}
where v_t is a velocity term that accumulates gradients, and \beta \in (0,1) controls how much past gradients matter. This helps smooth out noisy updates and accelerate convergence.
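The same toy problem with the momentum update looks like the sketch below; \beta = 0.9 is a common default, and everything else remains an illustrative assumption:

```python
import numpy as np

def grad(theta):
    # Gradient of the same toy quadratic loss, minimum at (3, -2).
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 2.0)])

eta, beta = 0.04, 0.9
theta = np.array([0.0, 0.0])
v = np.zeros(2)                 # velocity: an exponential average of past gradients

for t in range(200):
    v = beta * v + (1.0 - beta) * grad(theta)   # v_{t+1}
    theta = theta - eta * v                     # theta_{t+1}

print(theta)   # also converges to (3, -2)
```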
The model as composition: The model itself is a parametric map f_\theta : \mathcal{X} \to \mathcal{Y} that, when unrolled over layers, becomes a composition:
f_\theta(x) = F_L(F_{L-1}(\dots F_1(x))),
where each F_\ell is one layer of the network (like an attention layer or feedforward layer in a transformer), and L is the total depth.
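A sketch of this unrolled composition, with each layer F_\ell as an ordinary function; the layer widths and the tanh nonlinearity are arbitrary assumptions for illustration, not a claim about any particular architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d_in, d_out):
    """One layer F_ell: an affine map followed by a nonlinearity."""
    W = rng.normal(scale=d_in ** -0.5, size=(d_out, d_in))
    b = np.zeros(d_out)
    return lambda h: np.tanh(W @ h + b)

layers = [make_layer(16, 32), make_layer(32, 32), make_layer(32, 4)]   # F_1, ..., F_L

def f_theta(x):
    """f_theta(x) = F_L(F_{L-1}(... F_1(x)))."""
    h = x
    for F in layers:
        h = F(h)
    return h

x = rng.normal(size=16)
print(f_theta(x).shape)   # (4,)
```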
Where intelligence emerges: ML's emergent properties—generalization (working on unseen data), invariances (recognizing rotated images), in-context learning (solving new tasks from examples), "reasoning" capabilities—follow not from neuron analogies but from:
- Optimization geometry: The landscape of the loss function and how gradient descent navigates it
- Overparameterization: Having more parameters than strictly necessary, which paradoxically improves generalization
- Implicit regularization: How the optimization process itself prefers simpler solutions
- Scale: More data, more parameters, more compute
Empirical scaling laws such as those documented by Kaplan et al. show that loss falls as a power law in model size, dataset size, and training compute, with trends spanning more than seven orders of magnitude. Schematically, these trends can be written as:
\mathcal{L}(N,D,C) \approx a N^{-\alpha} + b D^{-\beta} + c C^{-\gamma},
where:
- N is model size (number of parameters)
- D is dataset size (number of training examples)
- C is compute (total floating-point operations)
- \alpha, \beta, \gamma are empirically determined exponents (typically around 0.05-0.1)
- a, b, c are constants
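A purely numerical sketch of this functional form follows; every coefficient below is a made-up placeholder, and the exponents are simply taken from the commonly reported range rather than fitted to data. It shows the diminishing-returns behavior the power law implies:

```python
def scaling_loss(N, D, C, a=10.0, b=20.0, c=25.0,
                 alpha=0.076, beta=0.095, gamma=0.050):
    """Schematic L(N, D, C) ~ a*N^-alpha + b*D^-beta + c*C^-gamma.
    Coefficients are placeholders; exponents lie in the commonly reported range."""
    return a * N ** -alpha + b * D ** -beta + c * C ** -gamma

for scale in [1, 10, 100, 1000]:
    N = 1e8 * scale      # parameters
    D = 1e9 * scale      # training examples / tokens
    C = 1e18 * scale     # floating-point operations
    print(f"{scale:5d}x resources -> schematic loss ≈ {scaling_loss(N, D, C):.3f}")
```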
The key point: Intelligence here emerges from resource scaling, not cortical metaphors. The brain analogy contributed historically but isn't necessary for modern ML's success.
The Cultural Physics of Scientific Metaphor
A deeper issue, rarely addressed explicitly, is that scientific worldviews are strongly shaped by the technological era
in which they arise. Historians and philosophers of science often refer to this as a "metaphor regime", where dominant
technologies scaffold the intuitions scientists bring to natural phenomena.
When precision gears and automata dominated early physics, the universe was reimagined as a clockwork mechanism (see historical accounts of the "machine metaphor" in 17th-19th century science). Laplace's demon, Newtonian determinism, and even 19th-century thermodynamics were all framed in mechanistic metaphors borrowed from industrial machinery.
When information theory and computation surged in the mid-20th century, biology was reframed: DNA became "code", cells
became "machines", brains became "processors", and evolution was cast as an optimization algorithm. (See Evelyn Fox
Keller, Making Sense of Life.)
As video games, simulations, and virtual environments became culturally ubiquitous, a new metaphor emerged: the simulation hypothesis, the idea that the universe is a computational substrate running some higher-level program (see philosophical discussions in Bostrom's work and related papers on computational metaphysics). Nick Bostrom's famous argument is an explicit product of the information era, shaped by our experience with sophisticated computer simulations.
Machine learning itself is caught in this metaphor drift. Today's dominant cultural artifact is the artificial neural network, so it is unsurprising that cognitive scientists, philosophers, and ML researchers repeatedly fall back to neurocomputational metaphors (see studies on "Metaphors for designers working with AI" and "Artificial Intelligence and other Speculative Metaphors"). But this framing is contingent; it reflects the era's tools, not a universal insight about intelligence.
A Plea for Purity
The path forward is not to reject biological inspiration outright, but to treat it as one possible abstraction
layer, no more privileged than signal processing, control theory, dynamical systems, or algorithmic combinatorics.
The correct approach is methodological purity: use biology only when it yields clean mathematical primitives with
falsifiable predictions, and avoid it entirely when simpler algorithmic structures explain the same emergent phenomena.
Emergence is substrate-agnostic. Scale is substrate-agnostic. Optimization dynamics are substrate-agnostic.
Intelligence, at least in the engineered sense, may be substrate-agnostic as well.