We've looked at the concept of plasticity at the synaptic level, but when we look at neural populations instead, things get more complicated. Most machine learning models of neural networks use an "optimization" procedure: they usually either minimize an error term or maximize a likelihood term. In a biological network, how can we obtain regional or global information, such as a Hamiltonian or Lagrangian?

Generating Random Numbers

When engaging in modeling, or any kind of exploration with neural networks, there are some functions we need over and over again. One of the most important is generating random numbers according to a distribution. There are many useful ways of doing this, some of which rely on time-tested methods like the Metropolis-Hastings algorithm, and others that are highly optimized for performance. One of the important things to realize is that there are no "true" random numbers in the digital domain - the best we get from computers is a set of "pseudo"-random numbers derived from a seed. This is both helpful and unhelpful. It's helpful when we're trying to replicate an experiment exactly (we frequently have to do this during debugging!). It's unhelpful when we're trying to randomize our experiments to get as close to biological reality as possible. If we want "true" random numbers, we have to get them from quantum devices dedicated to this purpose. These are available on chips, but not commonly found in consumer products.

Energy Functions

The extent to which a local or regional energy function can be useful is determined by the network connectivity, especially the internal convergence and divergence between processing modules. In the dynamic assembly theory of cortical function, modules go into and out of functional assemblies by mutual coupling. There are dozens of ways in which such coupling can be achieved. But does this mean the energy functions need to be coupled too?
Or can each module maintain its own cost function while still participating in a larger set of network calculations?

Gradient Descent

The issues associated with gradient descent are well known and have been extensively studied. There is the issue of convergence, and there is also the issue of rate of convergence. If the algorithm gets stuck in a local minimum, there are ways of restarting it, and ways of avoiding local minima altogether. There is the issue of the vanishing gradient, and a number of ways to resolve or avoid it.

Fitting Parameters

In machine learning, the method of fitting parameters is chosen in advance according to the needs of the programmer. Some kinds of data work better with logistic regression, and other kinds work better with k-means clustering. The programmer usually explores the dataset before deciding which algorithms to try. Human beings don't always have this luxury. We're expected to respond to unknown datasets in real time, and do the best we can in terms of classification and optimization. Humans engage in meta-programming: that is to say, we do the same thing the programmer does, and select the algorithm best suited to the task at hand. Our brains have a library of algorithms to select from. Most of these revolve around statistics - we rarely engage in any geometry more complicated than simple matrix multiplication. If we require derivatives, we extract them up front and transmit them on separate channels.

Entropy

The flip side of generating random numbers is determining how much information there is in a signal. The foundation of this approach lies in information theory, using concepts like entropy. Sometimes we have to know things about the statistics up front to make meaningful calculations about the information - but in real life we often don't have much up-front input, so we have to adjust our models on the fly, based on successive tidbits of incoming information.
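As a minimal sketch of this on-the-fly adjustment, the snippet below maintains a running plug-in estimate of Shannon entropy as symbols arrive one at a time. The two-symbol stream and the plug-in estimator are illustrative assumptions, not anything prescribed here; real neural or experimental data would call for more careful estimators.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Plug-in Shannon entropy (in bits) of an empirical distribution
    given as a mapping from symbol to observed count."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total)
               for c in counts.values() if c > 0)

# Update the estimate as each new "tidbit" of data arrives.
counts = Counter()
for symbol in "ABABABAB":   # illustrative fair two-symbol source
    counts[symbol] += 1

print(entropy_bits(counts))  # 1.0 bit for a fair binary source
```

Each incoming symbol only increments one count, so the model adjusts incrementally without needing the statistics up front - exactly the situation described above.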
In the world of Bayesian models and Bayesian parameters, the number of parameters we use often has to be updated based on new information. An example was given earlier in relation to a coin flip that suddenly yields an unexpected result, like "5" when we're only expecting a 0 or 1. This forces us to update the number of parameters in our internal model, because now we know about three states instead of just two.

Non-Euclidean Manifolds

One of the things the brain does exceedingly well is coordinate transformations. Not just geometric transformations, but restructuring along lines that are completely different from the axes of the input signal. One such mapping is the projection of the input onto coordinates that represent information - in particular, we can use the Fisher information as a coordinate system and the Kullback-Leibler divergence as a metric. The resulting manifolds are the subject of information geometry. Such mappings are important everywhere in the brain, even at the most peripheral levels like the retina (Ding et al 2023).

Next: Optimization and Error Signals
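To make the Kullback-Leibler "metric" concrete, here is a small sketch for discrete distributions; the specific distributions are made-up illustrations. Note that D_KL(p‖q) is not symmetric in p and q, which is why information geometry treats it as a divergence rather than a true distance.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits, for two discrete distributions given as
    equal-length lists of probabilities (q must be nonzero wherever p is)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.5, 0.5]       # illustrative "expected" distribution
biased = [0.25, 0.75]     # illustrative "observed" distribution

print(kl_divergence_bits(fair, fair))    # 0.0 - identical distributions
print(kl_divergence_bits(fair, biased))  # positive, and differs from the reverse direction
```

The divergence vanishes only when the two distributions coincide, so locally it behaves like a squared distance on the manifold of distributions - the sense in which it serves as a metric above.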