Researchers compared tightly matched pairs of large language models (LLMs) and vision-language models (VLMs) in a strictly text-only setting to isolate the impact of multimodal training history on human alignment during natural reading. Alignment was evaluated on a human natural-reading dataset that combines whole-cortex fMRI responses with synchronized eye-tracking saccades. The findings indicate that multimodal pretraining may not provide a uniform, global advantage in human alignment; language-internal representations remain a key factor for modeling human text processing. A selective VLM advantage did appear for sentences with stronger visual semantic content, supported by both fMRI and eye-movement alignment. Multimodal pretraining therefore appears to contribute selectively rather than globally to human-like language representations during natural reading.
VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
A new study investigates whether vision-language models (VLMs) offer a global advantage over large language models (LLMs) in aligning with human language processing during natural reading. The research, conducted under text-only conditions, suggests that multimodal pretraining does not uniformly improve human alignment. Instead, any VLM advantage appears to be selective, emerging mainly for sentences with strong visual semantic content.
Why it matters
This research helps clarify how multimodal pretraining shapes language representations. It suggests that while VLMs can be beneficial, their advantage in matching human natural reading is not universal and is concentrated in specific contexts, particularly sentences with strong visual semantic content. This insight could help guide the development of more efficient and human-aligned AI models.