The SOCO benchmark aims to systematically evaluate Semantic Object Correspondence (SC) in vision foundation models. SC assesses a model's ability to match object parts across different instances and categories, even with significant variations in appearance, viewpoint, and geometry. The benchmark features a taxonomy of correspondence types and provides over 1 million functionally meaningful keypoint annotations across 100 categories. Additionally, SOCO includes keypoint language descriptions, which enable the evaluation of large vision-language models (LVLMs) and their fine-grained, part-level understanding capabilities.
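For readers unfamiliar with how keypoint correspondence is typically scored, the following is a minimal sketch and not SOCO's evaluation code. It assumes a common setup from the semantic correspondence literature: each annotated source keypoint is matched to the most similar location in the target image's feature map by cosine similarity, and matches are scored with PCK (percentage of correct keypoints). The summary above does not say which metric SOCO uses, so the function names, the toy feature grid, and the `alpha=0.1` threshold are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_keypoint(src_feat, tgt_feats):
    """Return the (row, col) of the target feature most similar to src_feat.

    tgt_feats is a hypothetical H x W grid of feature vectors (nested lists),
    standing in for a dense feature map from a vision backbone.
    """
    best_score, best_pos = -2.0, None  # cosine similarity is always >= -1
    for r, row in enumerate(tgt_feats):
        for c, feat in enumerate(row):
            score = cosine(src_feat, feat)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


def pck(predictions, ground_truth, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * bbox_size of the annotation."""
    if not predictions:
        raise ValueError("need at least one prediction")
    threshold = alpha * bbox_size
    hits = sum(
        math.dist(p, g) <= threshold for p, g in zip(predictions, ground_truth)
    )
    return hits / len(predictions)


# Toy 2 x 2 target feature map with 2-dimensional features (made-up values).
tgt = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.7, 0.7], [-1.0, 0.0]],
]
pred = match_keypoint([0.0, 2.0], tgt)

# One correct match and one deliberately wrong match.
score = pck([pred, (1, 0)], [(0, 1), (0, 0)], bbox_size=2, alpha=0.1)
```

In this toy case the source feature points in the same direction as the feature at grid position `(0, 1)`, so that position is selected; the second, hand-written prediction misses its annotation by one cell, so half of the keypoints count as correct. A real evaluation would run the same loop over backbone features and annotated keypoint pairs.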
Initial experiments using SOCO revealed several key insights:
1. Vision foundation backbones possess strong semantic structure but struggle to transfer correspondences across related categories and only partially capture object-part positions.
2. LVLMs perform better at text-prompted part localization than at visual-reference cross-image matching, indicating a gap between language-grounded localization and detailed visual correspondence.
3. Performance on semantic correspondence, as measured by SOCO, is a more robust predictor of performance on dense downstream tasks (such as segmentation, tracking, 3D pose estimation, and 3D detection) than performance on ImageNet classification.
Featured on AI Radar: SOCO: A New Benchmark for Semantic Object Correspondence in Vision Foundation Models