Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge

Researchers have identified and analyzed "Perceptual Judgment Bias" in multimodal large language models (MLLMs) when used as evaluators. This bias causes MLLMs to prioritize plausible textual narratives over perceptually accurate visual evidence. To address this, they introduced the Perceptually Perturbed Judgment Dataset and a unified training framework combining GRPO-based rewards with a batch-ranking objective, aiming to improve perceptual fidelity and alignment with human evaluation.

RDR 83 | Confidence 88% | Multimodal LLMs | Evaluation | Bias Mitigation | Perceptual Reasoning | Reward Modeling

Why it matters

MLLMs are increasingly used as automated evaluators, so their reliability and interpretability matter. Mitigating perceptual judgment bias makes these judges more trustworthy on tasks that require visual reasoning, and it leads to more consistent and verifiable evaluations across applications.

Recent research highlights a critical limitation of multimodal large language models (MLLMs) used as automated judges, termed "Perceptual Judgment Bias". The judge favors a plausible text-based narrative even when it conflicts with the visual evidence, which produces inconsistent and unreliable evaluations.

To study the bias systematically, the researchers created the Perceptually Perturbed Judgment Dataset. It contains minimally altered counterfactual responses designed to isolate perceptual errors and provide verifiable supervision. Building on the dataset, they developed a training framework that integrates a structured reward mechanism based on GRPO (Group Relative Policy Optimization) with a batch-ranking objective, which yields a coherent global ordering without explicit pairwise labels.

Experiments on several MLLM-as-a-Judge benchmarks indicate that the approach improves perceptual fidelity and ranking coherence and aligns more closely with human evaluation. The findings suggest a scalable and generalizable way to build perceptually grounded, interpretable, and robust multimodal judges that handle conflicts between visual evidence and textual reasoning.
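
The summary does not give the framework's equations, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. The group-relative advantage (each reward standardized against the other samples in its group) is the standard GRPO normalization. Everything else is hypothetical: the function names, the binary `perceptual_reward` check on an original/perturbed response pair, and the use of a ListMLE (Plackett-Luce) loss as a stand-in for the batch-ranking objective, which assumes each response carries a scalar quality label rather than pairwise preferences.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def perceptual_reward(score_original: float, score_perturbed: float) -> float:
    """Hypothetical verifiable reward (an assumption, not the paper's definition):
    1.0 if the judge scores the faithful response above its minimally
    perturbed counterfactual, else 0.0."""
    return 1.0 if score_original > score_perturbed else 0.0


def listwise_ranking_loss(pred_scores: Sequence[float],
                          target_quality: Sequence[float]) -> float:
    """ListMLE loss, used here as a stand-in for a batch-ranking objective.

    The target ordering comes from per-response quality labels, so no
    explicit pairwise labels are needed.
    """
    # Indices of the batch sorted from best to worst target quality.
    order = sorted(range(len(pred_scores)),
                   key=lambda i: target_quality[i], reverse=True)
    loss = 0.0
    for pos, idx in enumerate(order):
        remaining = [pred_scores[i] for i in order[pos:]]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        loss += log_z - pred_scores[idx]
    return loss


if __name__ == "__main__":
    # Four sampled judgments for one item; two ranked the pair correctly.
    rewards = [perceptual_reward(0.9, 0.4), perceptual_reward(0.3, 0.6),
               perceptual_reward(0.8, 0.2), perceptual_reward(0.5, 0.7)]
    print(group_relative_advantages(rewards))
    print(listwise_ranking_loss([3.0, 2.0, 1.0], [3, 2, 1]))
```

In this sketch the ranking loss is smallest when the predicted scores sort the batch in the same order as the quality labels, which is one way to encourage a single consistent ordering across a batch instead of independent pairwise comparisons.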

Article ID: cmpweuxeu0. Featured on AI Radar.