SOCO: A New Benchmark for Semantic Object Correspondence in Vision Foundation Models

Researchers have introduced SOCO, a new benchmark designed to evaluate Semantic Object Correspondence (SC) in vision foundation models and large vision-language models (LVLMs). SOCO provides a taxonomy of correspondence types and over 1 million keypoint annotations across 100 categories, including language descriptions for part-level understanding.


Why it matters

SOCO addresses the challenge of evaluating structured object understanding in vision models by offering a consistent and comprehensive benchmark. Its findings highlight two current limitations: vision models transfer correspondences poorly across categories, and LVLMs show a gap between language-grounded localization and fine-grained visual correspondence. The benchmark also shows that correspondence performance is a stronger predictor of performance on downstream tasks such as segmentation and 3D pose estimation than ImageNet classification performance.

The SOCO benchmark aims to systematically evaluate Semantic Object Correspondence (SC) in vision foundation models. SC assesses a model's ability to match object parts across different instances and categories, even with significant variations in appearance, viewpoint, and geometry. The benchmark features a taxonomy of correspondence types and provides over 1 million functionally meaningful keypoint annotations across 100 categories. Additionally, SOCO includes keypoint language descriptions, which enable the evaluation of large vision-language models (LVLMs) and their fine-grained, part-level understanding capabilities.
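To make the matching task concrete, here is a minimal sketch of one common way to probe correspondence in a frozen backbone: nearest-neighbor matching of dense features, scored by the fraction of keypoints that land near the ground truth. The summary above does not say that SOCO uses this procedure or this metric, so the function names (`match_keypoints`, `pck`), the feature-map layout, and the `alpha` threshold are all assumptions made for illustration.

```python
import numpy as np

def match_keypoints(src_feats, tgt_feats, src_kps):
    """Match each source keypoint to its most similar target location.

    Assumed inputs (hypothetical, not taken from SOCO):
      src_feats, tgt_feats: (H, W, C) dense feature maps from some backbone.
      src_kps: (N, 2) integer (row, col) keypoints on the source feature grid.
    Returns an (N, 2) array of predicted (row, col) positions on the target grid.
    """
    h, w, c = tgt_feats.shape
    # L2-normalize features so that a dot product equals cosine similarity.
    tgt = tgt_feats.reshape(-1, c)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    query = src_feats[src_kps[:, 0], src_kps[:, 1]]
    query = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    similarity = query @ tgt.T          # shape (N, H*W)
    best = similarity.argmax(axis=1)    # flat index of the best target cell
    return np.stack(np.unravel_index(best, (h, w)), axis=1)

def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    """Fraction of predictions within alpha * ref_size of the ground truth."""
    pred = np.asarray(pred_kps, dtype=float)
    gt = np.asarray(gt_kps, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist <= alpha * ref_size).mean())
```

In this sketch `ref_size` would typically be an object or image dimension, so the threshold scales with object size; that choice is also an assumption rather than something the summary specifies.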

Initial experiments using SOCO revealed several key insights:

1. Vision foundation backbones possess strong semantic structure, but they struggle to transfer correspondences across related categories and only partially capture object-part positions.
2. LVLMs perform better at text-prompted part localization than at visual-reference cross-image matching, indicating a gap between language-grounded localization and detailed visual correspondence.
3. Performance on semantic correspondence tasks, as measured by SOCO, is a more robust predictor of performance on dense downstream tasks (such as segmentation, tracking, 3D pose estimation, and 3D detection) than performance on ImageNet classification.
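One way to quantify a claim like the third finding is to compute a rank correlation, across a set of models, between each candidate predictor score and the downstream score, and then compare the two correlations. The summary does not state which statistic the authors used, so the Spearman correlation below is an assumption, and the helper name `spearman` is hypothetical. The sketch assumes there are no tied values in either input.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two equal-length score lists.

    Assumes no tied values; ties would need average ranks instead.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # argsort of argsort converts values to 0-based ranks.
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    # Spearman is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

Given per-model lists of correspondence scores, ImageNet accuracies, and downstream scores, one would compare `spearman(correspondence, downstream)` with `spearman(imagenet, downstream)`; the larger value indicates the better rank predictor under this assumed measure.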

Article ID - cmpuquzu00
Featured on AI Radar