Why it matters
This research targets the reliability and interpretability of MLLMs used as automated evaluators. Mitigating perceptual judgment bias makes their judgments more trustworthy on tasks that require visual reasoning, and leads to more consistent, verifiable evaluations across applications.

Recent research identifies a critical limitation of multimodal large language models (MLLMs) acting as automated judges: a phenomenon termed 'Perceptual Judgment Bias'. The bias appears when an MLLM favors a text-based narrative even though it conflicts with the visual evidence, which makes its evaluations inconsistent and unreliable. To study this systematically, the work introduces the Perceptually Perturbed Judgment Dataset, whose minimally altered counterfactual responses isolate perceptual errors and provide verifiable supervision. Building on the dataset, it proposes a training framework that combines a structured GRPO-based reward mechanism with a batch-ranking objective, producing a coherent global ordering without explicit pairwise labels. Experiments on several MLLM-as-a-Judge benchmarks indicate that the approach significantly improves perceptual fidelity and ranking coherence, and aligns more closely with human evaluation. The findings suggest a scalable, generalizable way to build perceptually grounded, interpretable, and robust multimodal judges that handle visual-reasoning conflicts effectively.
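
The pairing of a group-relative reward with a batch-level ranking signal can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names, the `quality_levels` encoding (1 for an original response, 0 for a perceptually perturbed one), and the pair-agreement reward are all assumptions made for the example. The only part taken from GRPO itself is the standardization of each reward against the other samples in its group.

```python
from statistics import mean, pstdev


def batch_ranking_reward(judge_scores, quality_levels):
    """Fraction of comparable response pairs that the judge orders correctly.

    Hypothetical reward: quality_levels holds a verifiable level per response
    (assumed here: 1 = original, 0 = perturbed). Pairs with equal levels are
    skipped, so no explicit pairwise labels have to be supplied.
    """
    concordant = 0
    comparable = 0
    n = len(judge_scores)
    for i in range(n):
        for j in range(i + 1, n):
            level_gap = quality_levels[i] - quality_levels[j]
            if level_gap == 0:
                continue  # no known order between these two responses
            comparable += 1
            score_gap = judge_scores[i] - judge_scores[j]
            if score_gap * level_gap > 0:
                concordant += 1
    return concordant / comparable if comparable else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One batch of three responses: one original and two perturbed (assumed labels).
quality_levels = [1, 0, 0]

# Three sampled judge rollouts, each scoring the whole batch (made-up scores).
rollouts = [
    [9.0, 4.0, 5.0],  # original ranked above both perturbed responses
    [6.0, 7.0, 5.0],  # original ranked above only one perturbed response
    [3.0, 8.0, 7.0],  # original ranked below both perturbed responses
]

rewards = [batch_ranking_reward(scores, quality_levels) for scores in rollouts]
advantages = group_relative_advantages(rewards)
```

In this sketch, a rollout that ranks the original response above its perturbed variants receives a positive advantage, and one that prefers the perturbed text receives a negative one. That is the direction of pressure a perceptually grounded judge would need, though the actual reward structure in the paper may differ.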

Featured on AI Radar: Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge