Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

Read original: arXiv:2407.10737 - Published 7/16/2024 by Rining Wu, Feixiang Zhou, Ziwei Yin, Jian K. Liu

Overview

This paper explores the alignment between neuronal coding of dynamic visual scenes and foundation vision models.
It investigates how well features from a modern vision model can help explain neuronal responses to dynamic visual stimuli.
The researchers propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior, and use it to model how retinal neuronal populations encode dynamic natural scenes.

Plain English Explanation

The human brain is remarkably adept at processing and understanding the visual world around us. Researchers have long been interested in how the brain's neurons, or nerve cells, respond to and encode dynamic visual information. "Neural Representations of Dynamic Visual Stimuli" and "Time Cell Inspired Temporal Codebook in Spiking Neural Networks for Enhanced Image Generation" are two examples of research on related topics.

In this paper, the authors wanted to see how well modern artificial intelligence (AI) models, known as "foundation vision models," can help explain how neurons encode dynamic visual scenes, such as videos. They built a model called Vi-ST on top of features from a self-supervised Vision Transformer (ViT) and used it to predict how populations of retinal neurons respond to dynamic natural scenes.

The key idea is that if the AI models can align well with the brain's neural coding of dynamic visual information, it would suggest that these models are effectively mimicking how the human visual system works. This could have important implications for understanding human vision, as well as developing more advanced computer vision systems.

Technical Explanation

According to the paper's abstract, the researchers propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior. The model is aimed at unraveling the temporal encoding patterns of retinal neuronal populations responding to dynamic natural scenes. This contrasts with most previous studies, which used static images or artificial videos derived from static images and so left complex temporal relationships unaddressed.
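To make the idea of predicting neuronal responses from video features concrete, here is a minimal sketch of a generic encoding model. It is not the authors' Vi-ST architecture: the function name `predict_rates`, the array shapes, and the random stand-in features are all assumptions chosen for illustration. Per-frame feature vectors (standing in for the output of a pretrained vision model) are combined over a short temporal window by a linear filter, and a softplus keeps the predicted firing rate non-negative.

```python
import numpy as np

def predict_rates(features, weights, bias=0.0):
    """Predict one neuron's firing rate from a window of frame features.

    features: (n_frames, n_dims) array, one feature vector per video frame
              (hypothetical stand-in for pretrained vision-model features).
    weights:  (k_len, n_dims) temporal filter; weights[-1] applies to the
              most recent frame in each window.
    Returns an array of length n_frames - k_len + 1 (one value per full window).
    """
    n_frames, _ = features.shape
    k_len = weights.shape[0]
    n_out = n_frames - k_len + 1
    # Sum the contribution of each lag in the temporal window.
    drive = sum(features[k:k + n_out] @ weights[k] for k in range(k_len)) + bias
    # Softplus, computed stably, so that rates are never negative.
    return np.logaddexp(0.0, drive)

# Illustrative usage with random data (not real recordings).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))           # 50 frames, 8 feature dims
weights = rng.normal(scale=0.1, size=(5, 8))  # 5-frame temporal window
rates = predict_rates(features, weights)
print(rates.shape)  # (46,)
```

In a real setting the filter weights would be fitted to recorded responses rather than drawn at random, and a learned spatiotemporal network would replace the single linear filter.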

By aligning neuronal coding with a foundation vision model, the researchers aim to establish the temporal relationship between visual pixels and neuronal responses. The abstract reports that the proposed model, Vi-ST, shows robust predictive performance in generalization tests, and that ablation experiments demonstrate the significance of each temporal module.

The authors also introduce a visual coding evaluation metric designed to integrate temporal considerations, and use it to compare how different numbers of neuronal populations affect complementary coding. More broadly, the abstract notes that neuronal coding in the brain still largely lacks a deep understanding of its alignment with pixels, which indicates that much remains to be learned about how the brain encodes complex, dynamic visual scenes.
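Statements about how well a model matches neuronal responses are usually backed by a quantitative score. A common generic choice is the per-neuron Pearson correlation between predicted and recorded responses, sketched below. This is an assumption for illustration only and is not the evaluation metric introduced in the paper; the function name `per_neuron_correlation` and the toy arrays are hypothetical.

```python
import numpy as np

def per_neuron_correlation(predicted, recorded):
    """Pearson correlation between predicted and recorded responses.

    predicted, recorded: (n_timebins, n_neurons) arrays.
    Returns an (n_neurons,) array; a neuron whose predicted or recorded
    response is constant gets NaN, since correlation is undefined there.
    """
    p = predicted - predicted.mean(axis=0)
    r = recorded - recorded.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return (p * r).sum(axis=0) / np.where(denom == 0, np.nan, denom)

# Toy example: neuron 0 is predicted up to a scale factor,
# neuron 1 is predicted with the opposite trend.
recorded = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
predicted = np.array([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])
scores = per_neuron_correlation(predicted, recorded)
print(scores)
```

Because correlation ignores scale and offset, it rewards a model for getting the time course right even when absolute firing rates are off, which is one reason it is often paired with other measures.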

Critical Analysis

The paper provides a valuable contribution to our understanding of how the brain's neural representations align with modern computer vision models. The researchers used rigorous experimental methods and state-of-the-art AI models to conduct their analysis, which lends credibility to their findings.

That said, the study has limitations worth keeping in mind. The abstract describes modeling retinal neuronal populations, so the findings may not carry over directly to later stages of visual processing in the brain. Additionally, the AI models, while impressive, are still relatively simplistic compared to the brain's complex neural networks.

Further research is needed to explore the deeper layers of visual processing and to better understand the mechanisms underlying the brain's extraordinary ability to make sense of dynamic, real-world visual scenes. "Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Scenes" and "NeuroCine: Decoding Vivid Video Sequences from Human Brain Activity" are examples of research that delves into these more complex aspects of visual processing.

Conclusion

This paper represents an important step towards understanding the relationship between the brain's neural coding of dynamic visual scenes and the capabilities of modern computer vision models. By aligning these two domains, the researchers have gained valuable insights into the mechanisms of human vision and the potential of AI systems to mimic these natural processes.

While the results suggest that current foundation vision models can capture many aspects of the brain's neural representations, there is still much work to be done to fully bridge the gap between artificial and biological visual processing. Continued research in this area could lead to breakthroughs in both neuroscience and computer vision, with implications for a wide range of applications, from autonomous systems to medical diagnostic tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

Rining Wu, Feixiang Zhou, Ziwei Yin, Jian K. Liu

Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are entrapped in the neuronal responses of the retina. It is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have paved an advanced way of understanding image pixels. Yet, neuronal coding in the brain largely lacks a deep understanding of its alignment with pixels. Most previous studies employ static images or artificial videos derived from static images for emulating more real and complicated stimuli. Despite these simple scenarios effectively help to separate key factors influencing visual coding, complex temporal relationships receive no consideration. To decompose the temporal features of visual coding in natural scenes, here we propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior, aimed at unraveling the temporal-based encoding patterns of retinal neuronal populations. The model demonstrates robust predictive performance in generalization tests. Furthermore, through detailed ablation experiments, we demonstrate the significance of each temporal module. Furthermore, we introduce a visual coding evaluation metric designed to integrate temporal considerations and compare the impact of different numbers of neuronal populations on complementary coding. In conclusion, our proposed Vi-ST demonstrates a novel modeling framework for neuronal coding of dynamic visual scenes in the brain, effectively aligning our brain representation of video with neuronal activity. The code is available at https://github.com/wurining/Vi-ST.

7/16/2024

Neural Representations of Dynamic Visual Stimuli

Jacob Yeung, Andrew F. Luo, Gabriel Sarch, Margaret M. Henderson, Deva Ramanan, Michael J. Tarr

Humans experience the world through constantly changing visual stimuli, where scenes can shift and move, change in appearance, and vary in distance. The dynamic nature of visual perception is a fundamental aspect of our daily lives, yet the large majority of research on object and scene processing, particularly using fMRI, has focused on static stimuli. While studies of static image perception are attractive due to their computational simplicity, they impose a strong non-naturalistic constraint on our investigation of human vision. In contrast, dynamic visual stimuli offer a more ecologically-valid approach but present new challenges due to the interplay between spatial and temporal information, making it difficult to disentangle the representations of stable image features and motion. To overcome this limitation -- given dynamic inputs, we explicitly decouple the modeling of static image representations and motion representations in the human brain. Three results demonstrate the feasibility of this approach. First, we show that visual motion information as optical flow can be predicted (or decoded) from brain activity as measured by fMRI. Second, we show that this predicted motion can be used to realistically animate static images using a motion-conditioned video diffusion model (where the motion is driven by fMRI brain activity). Third, we show prediction in the reverse direction: existing video encoders can be fine-tuned to predict fMRI brain activity from video imagery, and can do so more effectively than image encoders. This foundational work offers a novel, extensible framework for interpreting how the human brain processes dynamic visual information.

6/6/2024

Time Cell Inspired Temporal Codebook in Spiking Neural Networks for Enhanced Image Generation

Linghao Feng, Dongcheng Zhao, Sicheng Shen, Yiting Dong, Guobin Shen, Yi Zeng

This paper presents a novel approach leveraging Spiking Neural Networks (SNNs) to construct a Variational Quantized Autoencoder (VQ-VAE) with a temporal codebook inspired by hippocampal time cells. This design captures and utilizes temporal dependencies, significantly enhancing the generative capabilities of SNNs. Neuroscientific research has identified hippocampal time cells that fire sequentially during temporally structured experiences. Our temporal codebook emulates this behavior by triggering the activation of time cell populations based on similarity measures as input stimuli pass through it. We conducted extensive experiments on standard benchmark datasets, including MNIST, FashionMNIST, CIFAR10, CelebA, and downsampled LSUN Bedroom, to validate our model's performance. Furthermore, we evaluated the effectiveness of the temporal codebook on neuromorphic datasets NMNIST and DVS-CIFAR10, and demonstrated the model's capability with high-resolution datasets such as CelebA-HQ, LSUN Bedroom, and LSUN Church. The experimental results indicate that our method consistently outperforms existing SNN-based generative models across multiple datasets, achieving state-of-the-art performance. Notably, our approach excels in generating high-resolution and temporally consistent data, underscoring the crucial role of temporal information in SNN-based generative modeling.

5/24/2024

Time-Dependent VAE for Building Latent Factor from Visual Neural Activity with Complex Dynamics

Liwei Huang, ZhengYu Ma, Liutao Yu, Huihui Zhou, Yonghong Tian

Seeking high-quality neural latent representations to reveal the intrinsic correlation between neural activity and behavior or sensory stimulation has attracted much interest. Currently, some deep latent variable models rely on behavioral information (e.g., movement direction and position) as an aid to build expressive embeddings while being restricted by fixed time scales. Visual neural activity from passive viewing lacks clearly correlated behavior or task information, and high-dimensional visual stimulation leads to intricate neural dynamics. To cope with such conditions, we propose Time-Dependent SwapVAE, following the approach of separating content and style spaces in Swap-VAE, on the basis of which we introduce state variables to construct conditional distributions with temporal dependence for the above two spaces. Our model progressively generates latent variables along neural activity sequences, and we apply self-supervised contrastive learning to shape its latent space. In this way, it can effectively analyze complex neural dynamics from sequences of arbitrary length, even without task or behavioral data as auxiliary inputs. We compare TiDe-SwapVAE with alternative models on synthetic data and neural data from mouse visual cortex. The results show that our model not only accurately decodes complex visual stimuli but also extracts explicit temporal neural dynamics, demonstrating that it builds latent representations more relevant to visual stimulation.

8/16/2024