Multimodal large language models (MLLMs) used as automated judges exhibit what recent research terms 'Perceptual Judgment Bias': they favor text-based narratives even when those narratives conflict with the visual evidence, which makes their evaluations inconsistent and unreliable. To study the bias systematically, the researchers created the Perceptually Perturbed Judgment Dataset, which contains minimally altered counterfactual responses designed to isolate perceptual errors and provide verifiable supervision. Building on the dataset, they developed a training framework that combines a structured reward based on Group Relative Policy Optimization (GRPO) with a batch-ranking objective, which yields a coherent global ordering of responses without explicit pairwise labels. Across several MLLM-as-a-Judge benchmarks, the approach is reported to improve perceptual fidelity and ranking coherence and to align more closely with human evaluation. The findings point to a scalable, generalizable way to build perceptually grounded, interpretable, and robust multimodal judges that can handle cases where visual evidence and textual reasoning conflict.
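To make the two training signals concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary gives no formulas, so the function names, the pair-agreement reward, and the idea that each response carries a scalar reference quality level (for example, unperturbed above perturbed) are all assumptions made for illustration. Only the group-relative standardization of rewards follows the standard GRPO formulation.

```python
from itertools import combinations
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def batch_ranking_reward(predicted_scores, reference_quality):
    """Hypothetical batch-level reward (an assumption, not the paper's formula).

    Returns the fraction of response pairs whose order under the judge's
    scores matches their order under the reference quality levels. Pairs
    with equal reference quality are skipped; tied predictions count as wrong.
    """
    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(predicted_scores)), 2):
        ref_diff = reference_quality[i] - reference_quality[j]
        if ref_diff == 0:
            continue  # no reference preference for this pair
        comparable += 1
        pred_diff = predicted_scores[i] - predicted_scores[j]
        if pred_diff * ref_diff > 0:
            concordant += 1
    return concordant / comparable if comparable else 0.0


# Illustrative batch: an original response and two perturbed variants.
reference_quality = [2, 1, 0]      # assumed levels: original > mild > heavy perturbation
judge_scores = [8.0, 5.0, 6.0]     # the judge wrongly ranks the heavy perturbation second
ranking_reward = batch_ranking_reward(judge_scores, reference_quality)
# 2 of the 3 pairs are ordered correctly

# Rewards of four sampled judgments for one prompt, turned into advantages.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In this sketch the ranking reward is computed from per-response quality levels alone, so no pairwise preference annotations are needed, and the standardized advantages would then weight the policy update for each sampled judgment.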
Featured on AI Radar: Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge