arXiv submission date: 2026-05-05
📄 Abstract - A foundation model of vision, audition, and language for in-silico neuroscience

Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, which prevents a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, surpassing traditional linear encoding models and delivering several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.
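
To make the comparison concrete, below is a minimal sketch of the kind of "traditional linear encoding model" the abstract uses as a baseline: a ridge regression from stimulus features to voxel responses, evaluated by voxelwise correlation. All shapes, the synthetic data, and the use of scikit-learn's `Ridge` are illustrative assumptions, not the paper's actual pipeline and not TRIBE v2 itself.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: per-timepoint stimulus embeddings (e.g., concatenated
# video/audio/text features) and the corresponding fMRI voxel responses.
n_timepoints, n_features, n_voxels = 2000, 512, 1000
X = rng.standard_normal((n_timepoints, n_features))             # stimulus features
W_true = rng.standard_normal((n_features, n_voxels)) * 0.05     # synthetic ground truth
Y = X @ W_true + rng.standard_normal((n_timepoints, n_voxels))  # noisy voxel responses

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# A single ridge regression mapping stimulus features to all voxels at once.
encoder = Ridge(alpha=10.0)
encoder.fit(X_train, Y_train)
Y_pred = encoder.predict(X_test)

# Standard encoding-model evaluation: Pearson correlation between predicted
# and measured responses, computed per voxel and then averaged.
Y_pred_z = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
Y_test_z = (Y_test - Y_test.mean(0)) / Y_test.std(0)
voxelwise_r = (Y_pred_z * Y_test_z).mean(0)
print(f"mean voxelwise correlation: {voxelwise_r.mean():.3f}")
```

The abstract's claim is that a foundation model trained end-to-end across modalities and subjects yields several-fold higher voxelwise accuracy than this kind of per-paradigm linear mapping.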

Top-level tags: multi-modal systems machine learning
Detailed tags: foundation model neuroscience fmri prediction multisensory integration in-silico

A foundation model of vision, audition, and language for in-silico neuroscience


1️⃣ One-sentence summary

This paper introduces TRIBE v2, a multimodal (video, audio, and language) foundation model. Trained on over 1,000 hours of fMRI data from 720 subjects, it predicts brain responses to diverse stimuli with high accuracy and, in computer simulation, reproduces classic findings from decades of visual and language neuroscience experiments, offering a unified AI framework for studying brain function.
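
The "in silico experimentation" idea can also be sketched conceptually: once an encoding model is trained, one can present contrasting stimulus conditions and compare the predicted responses within a region of interest, without new scanning. The encoder, feature matrices, and ROI below are placeholders invented for illustration; the real paradigms and model are those described in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_voxels = 512, 1000

# Placeholder "trained" encoder: a fixed linear map from stimulus features to
# predicted voxel responses. In practice this would be the fitted model.
W = rng.standard_normal((n_features, n_voxels)) * 0.05

def brain_encoder(stimulus_features: np.ndarray) -> np.ndarray:
    """Map stimulus features to predicted voxel responses (placeholder)."""
    return stimulus_features @ W

# Hypothetical region of interest (e.g., a category-selective patch of cortex).
roi_mask = np.zeros(n_voxels, dtype=bool)
roi_mask[:50] = True

# Hypothetical feature matrices for two stimulus conditions (e.g., faces vs. scenes).
condition_a = rng.standard_normal((100, n_features))
condition_b = rng.standard_normal((100, n_features))

# Average the predicted response per condition, then contrast them inside the ROI.
contrast = brain_encoder(condition_a).mean(0) - brain_encoder(condition_b).mean(0)
print(f"mean predicted ROI contrast: {contrast[roi_mask].mean():.3f}")
```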

Source: arXiv: 2605.04326