Brian Castle
Visual Streams


As mentioned already, there are two "streams" of visual processing in the human brain: a ventral stream related to "what" an object is, and a dorsal stream related to "where" it is. In the context of a neural timeline mapping brain electrical activity, however, we can note something important about the concept of an "object": it is invariant with respect to time, whereas its attributes (position, orientation, velocity) may not be. The concept of an object as an invariant abstraction, one we can label and attach a name to, has important ramifications for a timeline mapping, because the invariances have to be specifically separated from the other information, and that requires specialized neural processing.

From a machine learning standpoint, we get some of this for free in convolutional networks, which are very good at extracting invariances, and indeed there are portions of the visual cortex that look very much like a convolutional network. However, as mentioned in the earlier section on neurons, artificial CNNs are for the most part highly non-biological, and do not exhibit the complex dynamics seen in the human brain.
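As an illustration of that invariance-for-free point, here is a minimal NumPy sketch (not from any particular library or model; all names are illustrative): a single convolutional filter followed by global max pooling responds identically to the same feature at two different positions.

```python
# Sketch of translation invariance from convolution + pooling.
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D cross-correlation."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge detector, loosely analogous to a V1 orientation filter.
kernel = np.array([[1., -1.],
                   [1., -1.]])

# The same small feature placed at two different positions.
img_a = np.zeros((8, 8)); img_a[2:4, 2] = 1.0
img_b = np.zeros((8, 8)); img_b[5:7, 5] = 1.0

# Global max pooling over the feature map discards position:
# both images produce the same pooled "what" response.
resp_a = conv2d_valid(img_a, kernel).max()
resp_b = conv2d_valid(img_b, kernel).max()
print(resp_a == resp_b)  # True: the pooled response is position-invariant
```

The pooled output says "a vertical edge is present" while throwing away where it is, which is exactly the separation of invariant identity from variable position discussed above.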

The architecture of the visual streams tells us other important things too. For one, it is important to understand the many ways they come together, and for what reasons. Our brains do not simply store all the information they receive; that would be far too many bits, even for trillions of synapses. Instead, our brains extract the relevant information and then reconstruct scenes in such a way that they contain only that information. This process is essential to short-term memory, and it is well worth studying the circuitry in and around the hippocampus to see how it works.


Anatomy of the Visual Streams

From the primary visual cortex, the visual signal feeds forward into V2, the secondary visual cortex. V2 is like a shell that surrounds V1. It has a very specific architecture that treats inputs from ocular dominance columns, orientation columns, and blobs differently.

From V2, the visual signal splits into multiple streams. There is a dorsal stream that goes up along the top of the brain into the parietal lobe and from there to the frontal eye fields, and there is a ventral stream that goes down into the temporal lobe and eventually feeds the scene mapping circuitry in the hippocampus. Along the way there are cortical visual areas specialized for color, motion, faces, object recognition, and 3D spatial reconstruction.




In addition to the red and blue arrows in the figure above, there is an area sandwiched between them that is shared by both streams. This area includes areas MT and MST, which handle motion perception, the face recognition area, and other shared functions that need to be "prepared" before being fed into the streams. The figure shows some of the identified areas of the visual cortex.




A flattened map of the human visual cortex is shown below, indicating many of the relevant areas.




Here is a side view:




And here is a broader view:





Temporal Lobe and the Ventral Stream

The ventral stream handles object recognition, which is a primary function of the inferior temporal cortex (area IT in the maps above, including PIT, CIT, and AIT). Objects have semantic meaning, and indeed there is a relationship between IT and Wernicke's area, its counterpart in the auditory realm. From a visual standpoint, objects have certain consistent characteristics: their colors are uniform and move together with the object; their corners mostly have internal angles and their surfaces mostly curve inwards; and the statistics of the object move along with it. For example, if there is internal motion (like on a wet ball, or if an insect is crawling on it), the motion will move with the ball.

In the machine learning world, object recognition is usually preceded by convolutional layers that accomplish feature extraction and spatial invariance. The same is probably true in the human brain. In the periphery, the first level of filtering involves contrast enhancement and gain control in the retina and LGN. In the primary visual cortex V1, lines and edges are represented in terms of orientation and spatial layout. Colors are extracted in V2 and passed to V4. In the areas around V4 are many small topographic visual maps specialized by function, like face recognition, which is important for social behavior. All these areas feed down into the inferior temporal lobe, which contains the infamous "grandmother cells" and "Jennifer Aniston cells". The figure shows a map of the human temporal lobe with the numbered Brodmann areas. Wernicke's area is closest to area 22, while the visual IT is closest to area 20, overlapping areas 21 and 37 and parts of area 19.
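The contrast gain control attributed here to the retina and LGN is commonly modeled as divisive normalization. Below is a minimal sketch of such a response curve; the exponent and semi-saturation constant are arbitrary illustrative choices, not values from the text.

```python
# Illustrative divisive-normalization (Naka-Rushton style) response curve,
# a standard model of early-visual contrast gain control. Parameters are
# invented for illustration.
def normalized_response(contrast, sigma=0.2, n=2.0, r_max=1.0):
    """Response saturates as contrast grows past the semi-saturation point."""
    return r_max * contrast**n / (contrast**n + sigma**n)

# Responses compress the input range: a contrast increase near saturation
# changes the output far less than the same increase at low contrast.
for c in (0.05, 0.2, 0.5, 1.0):
    print(f"contrast {c:4.2f} -> response {normalized_response(c):.3f}")
```

The compressive shape is what lets downstream areas like V1 work over a wide range of scene contrasts without saturating their own signals.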




Area IT is a major source of input to the scene mapping circuitry in the area around the hippocampus. IT sends axons into the perirhinal cortex and entorhinal cortex, both of which are crucial for scene mapping and navigation. This area around the hippocampus includes many groups of cells dedicated to specific functions needed for scene mapping, which requires knowledge of "which object is at which location". Area IT provides the object-related information, and the spatial layout and motion of objects is provided by the parietal lobe.


Parietal Lobe and the Dorsal Stream

Objects are invariant, but their properties are not. Real objects that appear in the visual field are "examples" of abstract objects that are stored in memory. While an object is present in the visual field, it is always the same object, but its properties change: its shape depends on the viewing angle, and its textures depend on lighting and shadows. The "where" stream tracks the variant properties of objects in the visual field. A lot of this tracking has to do with motion, keeping track of which parts of a moving image belong to an object. One of the important processing stations in the dorsal stream is area MT in the middle temporal area, which handles motion using the input passed up through the M channel in early visual processing. Area MT is heavily connected into the oculomotor system at the level of the superior colliculus, and it is able to target the eyes to objects moving in the visual field. From area MT, fibers enter the spatial processing areas in the parietal lobe, including area 7 in the superior parietal lobule and area LIP in the lateral bank of the intraparietal sulcus, bordering the inferior parietal lobule. LIP is also connected with the oculomotor system, and it is able to guide the eyes in depth as well as vertically and horizontally. Both of these areas (MT and LIP) are closely related to the dorsal attention system, and that is an interesting discussion because of its relationship to transformer machines (or lack of relationship, in some cases). The figure shows some of the areas hovering around the intraparietal area that are involved in the "where" stream.



(figure from Shimegi et al 2014)

The parietal lobe, in addition to being involved with the mapping of visual space, is also involved in the spatial orientation of the organism within its environment, and that includes the position of the body as well as its parts (head, neck, eyes, and so on). Orienting movements related to visual attention frequently involve neck, head, and whole-body movements. Area MT and the areas of the parietal lobe involved in the "where" stream organize visually guided action, things like reaching and orienting. In the parietal lobe, the first determination is made of the position of objects in space relative to the organism. This is done in an egocentric reference frame; that is to say, the information from the two eyes is combined on the basis of disparity to determine depth in the visual field, and the three-dimensional coordinates of objects and their motions are mapped, in a cyclopean manner, into a geometry where the organism is at the origin.
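As a toy version of the disparity computation just described, here is a sketch under an idealized pinhole model with parallel optical axes; the focal length, baseline, and function name are illustrative assumptions, not anything specified in the text.

```python
# Toy disparity-to-depth computation: map a binocular image match to
# 3-D egocentric (cyclopean) coordinates. Idealized pinhole geometry.
import numpy as np

def egocentric_position(x_left, x_right, y, focal=1.0, baseline=0.065):
    """
    x_left, x_right, y: image-plane coordinates of the same feature in
    each eye (same units as focal). baseline: interocular distance (m).
    """
    disparity = x_left - x_right          # larger for nearer objects
    if disparity <= 0:
        raise ValueError("point at or beyond infinity")
    z = focal * baseline / disparity      # depth from the cyclopean origin
    x_cyc = 0.5 * (x_left + x_right)      # average of the two half-images
    return np.array([x_cyc * z / focal,   # lateral position
                     y * z / focal,       # vertical position
                     z])                  # depth

# A feature straight ahead with disparity 0.1 lies at depth
# focal * baseline / 0.1 = 0.65 m in this toy geometry.
p = egocentric_position(x_left=0.05, x_right=-0.05, y=0.0)
print(p)
```

The organism-at-the-origin output vector is exactly the egocentric frame the text describes: the coordinates are meaningful only relative to the viewer's current position and gaze.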

The spatial processing centers in the parietal lobe that handle the bulk of the "where" information feed into the parahippocampal cortex (the structure of the hippocampal region is described below, and on the next page). The parahippocampal cortex contains the "parahippocampal place area" (Epstein and Kanwisher 1998), which responds preferentially to high spatial frequencies and details in the visual scene (Rajimehr et al 2011).


Combining the Streams

The dorsal and ventral streams come together in the area around the hippocampus, which handles scene mapping, navigation, and short-term episodic memory. This area transforms the egocentric reference frame in several ways. One of its basic functions is to generate an allocentric "place map" that is complementary to the egocentric frame. This place map likely engages the "parahippocampal place area". Another of its basic functions is to provide context for a scene; for example, the same object that is rewarding in one scene could be aversive in another. The figure shows the combining of the dorsal and ventral visual streams at the level of the entorhinal cortex. In this figure, HPC is the hippocampus.



(figure from Elward & Vargha-Khadem 2018 - CC 4.0)

The conversion between egocentric and allocentric reference frames seems to occur in and around the entorhinal cortex. This area contains the "grid cells" that map allocentric space, and the "ramp cells" that encode relationships in time. There are also cells that respond to the boundaries of scenes, things like walls, places beyond which the organism cannot see. By the time this information reaches the hippocampus, it is fully allocentric: there are "place cells" that respond only when the organism is at a specific location in the allocentric map.
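In its simplest 2-D form, the egocentric-to-allocentric conversion amounts to a rotation by the organism's heading plus a translation by its position on the place map. The function below is a hypothetical illustration of that geometry, not a model of the actual entorhinal computation.

```python
# Hypothetical 2-D egocentric-to-allocentric conversion: rotate a
# body-centered offset by the heading, then translate by map position.
import numpy as np

def ego_to_allo(ego_xy, heading, position):
    """
    ego_xy:   object's (right, forward) offset from the organism.
    heading:  allocentric facing direction, radians CCW from the map +x axis.
    position: organism's (x, y) on the allocentric place map.
    """
    x, y = ego_xy
    fwd = np.array([np.cos(heading), np.sin(heading)])     # "ahead" in map coords
    right = np.array([np.sin(heading), -np.cos(heading)])  # "rightward" in map coords
    return np.asarray(position) + x * right + y * fwd

# Facing heading 0 at map position (1, 1), an object 2 units straight
# ahead lands at a fixed map position, regardless of later turns.
print(ego_to_allo((0.0, 2.0), 0.0, (1.0, 1.0)))
```

The key property is that the output no longer depends on where the organism is looking: once placed on the map, the object's coordinates stay put while the egocentric view keeps changing, which is what a place cell needs.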

What is meant by "scene mapping" and "scene reconstruction"? The figure shows some of the considerations related to scenes.



(figure from Bakermans et al 2025)

To build a scene, the brain combines information from the "where" pathways and the "what" pathways. Relative to episodic memory, an episode can be considered as a collection of scenes. A scene could be something navigable, like a maze, where the objects all stay in one place but different views create different visual configurations. Or, it could be something relatively static like a room in one's own house, where one can sometimes find things without even looking. It could also be a collection of information around a memorable event, like "I was petting the cat and then it bit me". This figure shows some of the circuitry around "scene reconstruction" using visual information of various kinds.




The hippocampal formation is connected into adjoining and upstream brain areas in a wide variety of ways. The major pathway that interests us is the reciprocal connection with the prefrontal cortex, however there are also direct connections with the amygdala, nucleus accumbens, mammillary bodies, cingulate gyrus, and a host of other important brain structures. One of the primary features of hippocampal output is its "phase encoding" of information for these areas, which we'll look at in detail in the section on computation. This figure shows some of the fascinating anatomy associated with the hippocampal connections.




Up until the hippocampus, the requirements for network connectivity through the early visual layers seem to be pretty straightforward. The mapping of scenes from an egocentric viewpoint can be handled by a simple serial network, as shown in the figure.



(figure from Rolls 2025)

Even the hippocampus itself seems to be anatomically straightforward (we'll look at it in greater detail on the next page). The complexity so far seems to lie in the phase encoding mechanism, which is not yet fully understood. The functional connectivity of the visual pathways into the hippocampal area is shown in the figures below, as determined by magnetoencephalography and diffusion tractography.



(figure from Rolls et al 2024)


(figure from Rolls 2023)


Visual Attention

The overall pattern of visual cerebral organization is that anything that needs to be perceived is represented explicitly. There are more than 20 retinotopic maps in the human visual cortex; many of these are tiny and very specific. For example, the figures below show the "fusiform face area" for facial recognition.






Generally speaking, the dorsal visual stream handles depth perception, spatial orientation, and the location and movement of objects in space, whereas the ventral stream handles object recognition, and in that capacity is very closely associated with the act of reading and deciphering printed text. This is an interesting dichotomy, because reading involves a lot of motion; however, that motion is coordinated internally by eye movements, rather than externally by visual stimuli.

In the visual system, there is a separation of motion and position related information as early as the retina. The P, M, and K channels are maintained faithfully through the LGN and V1, and mapped in a regular manner into V2. V2 is where the visual streams begin to split up, and it is also the first visual cortical area responding heavily to visual attention. Responsiveness in the context of attention is usually measured by an increase in firing rate to the attended stimulus, a "gain" relative to the unattended case. Visual attention is related to (and possibly coordinated by) the pulvinar nucleus of the thalamus, with some participation from the mediodorsal nucleus (MD). Both of these thalamic nuclei project to the frontal eye fields, but the pulvinar is mainly related to visually guided attention, whereas MD has more to do with voluntary eye movements. The pulvinar is heavily connected with all of the higher visual cortical areas, including area MT, which processes motion. It interfaces between the dorsal and ventral attention streams, and portions of it connect to both the ventral visual stream and the dorsal visual stream.
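The firing-rate "gain" measurement mentioned above can be sketched with a toy contrast-response function whose output is multiplicatively scaled by attention; the tuning parameters and the gain factor of 1.4 are invented for illustration, not measured values.

```python
# Toy multiplicative-gain model of attentional modulation: attention
# scales a neuron's contrast-response function by a constant factor.
def firing_rate(contrast, attended=False, r_max=50.0, sigma=0.3, gain=1.4):
    """Contrast-response function (spikes/s) with optional attentional gain."""
    base = r_max * contrast**2 / (contrast**2 + sigma**2)
    return gain * base if attended else base

# The measured "gain" is the ratio of attended to unattended rates
# for the same stimulus.
c = 0.5
gain_observed = firing_rate(c, attended=True) / firing_rate(c, attended=False)
print(f"attentional gain at contrast {c}: {gain_observed:.2f}")
```

In a pure multiplicative-gain model the measured ratio is the same at every contrast; experimentally distinguishing this from contrast-gain or additive models is exactly why such ratios are recorded across the whole response curve.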

The dorsal visual stream goes into the parietal cortex, around an area called LIP in humans and primates, in the lateral bank of the intraparietal sulcus. Area LIP is modulated by visual attention, directing the eyes to move to objects of interest and areas of interest within objects. It is reciprocally connected with the pulvinar and sends fibers directly to the superior colliculus. "Objects of interest" in the visual field command attention, so something has to determine which objects are of interest. At one level, bright shiny objects are interesting, and dark shadows looming overhead are interesting. At another level, art is interesting, and nature is interesting, and a collection of faces is interesting. At yet another level, a glass of water is interesting when one is thirsty, and not so much when one isn't. The attention system exists at all these levels, and control is passed from one level to another by mostly unknown means.

This is an area where machine learning has been informing neuroscience, with architectures like the "transformer" that determine and extract the information required for contextual understanding. However, this meaning of "attention" is slightly different from the psychologist's version, because humans have the ability to voluntarily shut out stimuli at will, which is something quite different from extracting meaning from nearby words in a sentence. The figure shows the architecture of a transformer for a large language model (LLM). The "multi-head attention" blocks perform an important function related to activity along the timeline: they take portions of the preceding signals and derive context from them, essentially performing a linear version of what the rotations do in the compactified version.
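For concreteness, here is a compact NumPy sketch of the scaled dot-product attention performed inside the "multi-head attention" blocks of such a transformer; the head count, sequence length, and dimensions are arbitrary, and the causal mask mimics an LLM decoder deriving context only from preceding tokens.

```python
# Scaled dot-product attention (the core of "multi-head attention"),
# with a causal mask so each token sees only the tokens before it.
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (heads, T, T)
    if causal:
        T = scores.shape[-1]
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)           # no peeking ahead
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)           # softmax per query
    return weights @ V                                  # context vectors

rng = np.random.default_rng(0)
heads, T, d = 2, 4, 8                      # 2 heads, 4 tokens, width 8
Q, K, V = (rng.standard_normal((heads, T, d)) for _ in range(3))
ctx = scaled_dot_product_attention(Q, K, V)
print(ctx.shape)  # (2, 4, 8): one context vector per head per token
```

Each output row is a weighted mixture of the value vectors of earlier tokens, which is the "taking portions of the preceding signals and deriving context from them" described above, done entirely with linear maps and a softmax.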




Returning to the brain, there are massive projections from almost all areas of the visual cortex to the pulvinar area of the thalamus and to the frontal eye fields. The pulvinar is related to attention in various modalities, while the frontal eye fields initiate voluntary saccades even in the absence of visual stimuli. The pulvinar receives direct retinotopic projections from the superior colliculus, which we'll discuss in detail in the next section on eye movements, and it is reciprocally connected with all the cortical areas that feed into the superior colliculus, including area MT, area LIP, and the frontal eye fields.

Humans can "pay attention" to visual stimuli, or not. Sometimes, paying attention involves eye movements. There are visually driven eye movements that seem to be mostly organized in the parietal lobe, and voluntary movements that seem to be mostly organized in the frontal lobe. These two areas talk to each other; they have direct reciprocal connections, and they light up like Christmas trees in a functional MRI. The pulvinar portion of the attention system is the visually driven part, whereas the frontal lobe attention systems are considerably more complex, involving the anterior cingulate cortex in addition to the scene mapping circuitry around the hippocampus. Interestingly enough, changes in response latency related to stimulus intensity can be dissociated from the changes in latency associated with attention: the former originate in the retina, while the latter originate downstream from the superior colliculus.


Next: Visual Memory

Back to Embedding the Timeline

Back to the Home Page


(c) 2026 Brian Castle
All Rights Reserved
webmaster@briancastle.com