A New Paper Decodes What Someone Sees From Their Brain Activity — Without Training on That Person
Fifteen researchers from CMU, HKU, Columbia, and Harvard report a meta-learning method that reads visual content out of fMRI signals on a completely new subject with no fine-tuning, using only a small set of in-context examples from that person. Until now, brain-decoding models have generally had to be trained or fine-tuned per person; the paper frames the result as "a critical step towards a generalizable foundation model for non-invasive brain decoding."

A preprint posted to arXiv on April 9, 2026, describes what the authors argue is the first fMRI visual-decoding system that generalizes to new people without any fine-tuning. The paper, "Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding", has 15 co-authors led by Mu Nan, Muquan Yu, and Weijian Mai, with senior authorship from Michael J. Tarr (Carnegie Mellon), Nikolaus Kriegeskorte (Columbia), Xiaoqing Hu (University of Hong Kong), and Andrew F. Luo (University of Hong Kong, formerly CMU).
The problem
Visual decoding from brain signals, meaning reconstructing what a person is seeing from their fMRI scans, has been a central problem at the intersection of computer vision and neuroscience for two decades. Working systems have largely shared the same bottleneck: neural representations vary enough from person to person that a decoder trained on one brain does not transfer directly to another. Each new subject requires either a bespoke model or at minimum a fine-tuning run with that subject's own brain-and-image data. That bottleneck is a major reason brain-decoding work has not scaled the way language or vision models have.
What the paper does
The authors train a meta-optimized decoder that learns how to learn a new subject's encoding model from a small in-context set of image–brain-activation pairs. At inference time, instead of fine-tuning, the model is shown a handful of examples from the new person and infers their specific per-voxel encoding parameters on the fly. It then performs decoding by hierarchical inference — inverting the inferred encoder to map fMRI activity back to the visual content.
The procedure, described in the abstract, has two steps:
- Per-voxel encoder inference. For multiple brain regions, the model constructs a context from the new subject's stimulus-and-response pairs and estimates the visual-response encoder parameters for each voxel.
- Aggregated functional inversion. Using those inferred encoder parameters and the voxel responses from a held-out stimulus, the model aggregates across voxels and inverts the encoder to produce a semantic visual decoding.
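The two steps above can be sketched with a toy linear stand-in. This is not the paper's method: the preprint uses a meta-learned model to infer encoder parameters in context, whereas the sketch below swaps in plain ridge regression for both the per-voxel encoder fit and the inversion, and runs on synthetic data. All names (`infer_voxel_encoders`, `invert_encoders`, `simulate`), the feature dimension, the voxel count, and the noise level are assumptions chosen for illustration.

```python
import numpy as np


def infer_voxel_encoders(ctx_feats, ctx_resp, ridge=1.0):
    """Step 1 (stand-in): fit one linear encoder per voxel from context pairs.

    ctx_feats: (n_ctx, d) image features for the context stimuli.
    ctx_resp:  (n_ctx, n_vox) voxel responses to those stimuli.
    Returns W of shape (d, n_vox); column v is the encoder for voxel v.
    """
    d = ctx_feats.shape[1]
    gram = ctx_feats.T @ ctx_feats + ridge * np.eye(d)
    return np.linalg.solve(gram, ctx_feats.T @ ctx_resp)


def invert_encoders(W, resp, ridge=1.0):
    """Step 2 (stand-in): pool all voxels and invert the encoders.

    Finds the feature vector z minimizing ||W.T @ z - resp||^2 + ridge * ||z||^2.
    """
    d = W.shape[0]
    gram = W @ W.T + ridge * np.eye(d)
    return np.linalg.solve(gram, W @ resp)


def simulate(d=16, n_vox=200, n_ctx=64, noise=0.1, seed=0):
    """Synthetic 'new subject': returns cosine similarity of decoded vs. true features."""
    rng = np.random.default_rng(seed)
    true_W = rng.normal(size=(d, n_vox))  # hypothetical ground-truth encoders

    # In-context set: stimulus features and the subject's noisy responses.
    ctx_feats = rng.normal(size=(n_ctx, d))
    ctx_resp = ctx_feats @ true_W + noise * rng.normal(size=(n_ctx, n_vox))
    W_hat = infer_voxel_encoders(ctx_feats, ctx_resp)

    # Held-out stimulus: only its voxel responses are given to the decoder.
    test_feat = rng.normal(size=d)
    test_resp = test_feat @ true_W + noise * rng.normal(size=n_vox)
    decoded = invert_encoders(W_hat, test_resp)

    return float(decoded @ test_feat
                 / (np.linalg.norm(decoded) * np.linalg.norm(test_feat)))


if __name__ == "__main__":
    print("cosine similarity (decoded vs. true):", simulate())
```

On this synthetic data the decoded feature vector should align closely with the true one. The point is only the shape of the computation: each voxel gets its own encoder estimated from the context pairs, and the decoder pools evidence across all voxels to recover a feature vector for the held-out stimulus.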
What it generalizes across
The paper claims four distinct axes of generalization for the same trained model, with no per-subject retraining:
- Across subjects. New people the model has never seen.
- Across scanners. Data from fMRI machines it wasn't trained on.
- Across visual backbones. The downstream image representation can be swapped.
- Without anatomical alignment or stimulus overlap. The new subject doesn't need to be aligned to a shared brain template, and the test stimuli don't need to overlap with anything the model has previously seen.
That last point is the strongest. Most cross-subject methods in brain decoding require either anatomical alignment (mapping everyone's brain to a shared template like MNI space) or shared stimuli (the new subject needs to see some of the same images as the training subjects). This paper claims to need neither.
Why the foundation-model framing matters
The abstract's closing sentence, "a critical step towards a generalizable foundation model for non-invasive brain decoding," is a deliberate and ambitious positioning. Foundation models in other domains (GPT in language, CLIP in vision, AlphaFold in biology) share a pattern: train once on a large heterogeneous corpus, then deploy zero- or few-shot on new instances in the domain. Brain decoding has had no clear equivalent, because each new subject has effectively been a new training problem. If the result holds up in peer review, the method would be among the closest yet to making fMRI decoding a prompt-and-go operation.
The practical stakes could be significant. Clinical applications, such as communication prosthetics for locked-in patients, have long been constrained by the per-subject training step. A working cross-subject decoder could cut per-patient onboarding from the extended scanner time that per-subject training requires to something closer to a single short session.
Caveats
This is a preprint and has not yet been peer-reviewed. The abstract reports strong generalization but does not give reconstruction-quality or task-performance numbers; those are in the full paper, not the abstract. The method decodes "semantic visual" content, which in the brain-decoding literature typically means identifying object categories or reconstructing low-frequency image content, rather than pixel-perfect reconstruction. The authors note the approach is "explicitly optimized for in-context learning," meaning its generalization is learned, not free, and will be constrained by the distribution of subjects and stimuli in its training data.
The paper is at arXiv:2604.08537.