Researchers have introduced LocateAnything, a unified generative framework for vision-language grounding and detection. Its core innovation is Parallel Box Decoding (PBD), which treats each geometric element, such as a bounding box or a point, as a single atomic unit rather than as a sequence of independent tokens. Conventional vision-language models (VLMs), by contrast, serialize each 2D box into multiple 1D tokens, which creates a sequential generation bottleneck and can weaken the geometric coherence of the predicted coordinates.
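To make the serialization bottleneck concrete, the following toy sketch counts decoding steps under two schemes: one token per coordinate versus one unit per box. It is an illustration only, not the LocateAnything implementation. The `<bin_N>` token format, the `NUM_BINS` resolution, and all function names are assumptions chosen for the example.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max), each normalized to [0, 1]
Box = Tuple[float, float, float, float]

NUM_BINS = 1000  # assumed quantization resolution, not taken from the source


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    clamped = min(max(value, 0.0), 1.0)
    # The top edge (1.0) falls into the last bin rather than overflowing.
    return min(int(clamped * num_bins), num_bins - 1)


def serialize_box(box: Box) -> List[str]:
    """Hypothetical conventional scheme: one box becomes four coordinate tokens."""
    return [f"<bin_{quantize(v)}>" for v in box]


def sequential_steps(boxes: List[Box]) -> int:
    """Token-by-token autoregressive decoding takes one step per token."""
    return sum(len(serialize_box(b)) for b in boxes)


def atomic_steps(boxes: List[Box]) -> int:
    """If every box is emitted as a single unit, one step per box is enough."""
    return len(boxes)


if __name__ == "__main__":
    boxes = [(0.25, 0.5, 0.75, 1.0), (0.0, 0.0, 0.5, 0.5), (0.125, 0.25, 0.5, 0.75)]
    print(serialize_box(boxes[0]))
    print("sequential steps:", sequential_steps(boxes))
    print("atomic steps:", atomic_steps(boxes))
```

Under these assumptions, the number of sequential steps grows with four times the number of boxes, while the atomic scheme grows with the number of boxes alone. The sketch only counts steps; it says nothing about how PBD produces the coordinates within a step.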
By decoding in parallel, PBD is designed to improve both decoding throughput and localization accuracy. The framework is accompanied by LocateAnything-Data, a large-scale dataset of over 138 million training samples, intended to increase the data diversity that high-precision localization depends on. Evaluations indicate that LocateAnything significantly advances the speed-accuracy frontier, delivering higher decoding throughput and better localization quality at high IoU (intersection over union) thresholds across various benchmarks. The findings point to the combined benefit of Parallel Box Decoding and large-scale training data for efficient, precise unified visual grounding and detection.
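For readers unfamiliar with the metric, IoU (intersection over union) measures how closely a predicted box overlaps the ground-truth box, and "high-IoU" quality refers to meeting strict overlap thresholds. The sketch below uses the standard IoU definition for axis-aligned boxes. The thresholds 0.5, 0.75, and 0.9 are illustrative assumptions, not the ones reported for LocateAnything.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max) in any consistent unit, e.g. pixels
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle, zero if the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def meets_threshold(pred: Box, truth: Box, threshold: float) -> bool:
    """True if the prediction counts as correct at the given IoU threshold."""
    return iou(pred, truth) >= threshold


if __name__ == "__main__":
    truth = (0.0, 0.0, 10.0, 10.0)
    pred = (1.0, 0.0, 11.0, 10.0)  # same size, shifted one unit to the right
    for t in (0.5, 0.75, 0.9):  # assumed example thresholds
        print(f"IoU >= {t}: {meets_threshold(pred, truth, t)}")
```

In this example a shift of one unit on a ten-unit box gives an IoU of 90/110, so the prediction passes the looser thresholds but fails at 0.9. This is why small coordinate errors matter much more at high IoU thresholds than at loose ones.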