Best alternatives to "Towards Robustness against Typographic Attack with Training-free Concept Localization"
Live, source-backed alternatives to "Towards Robustness against Typographic Attack with Training-free Concept Localization" in the vision-language category. Alternatives are selected from the same task category and are refreshed whenever the best-of index rebuilds.
Towards Robustness against Typographic Attack with Training-free Concept Localization
Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR. 
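The abstract mentions interventions applied to specific attention heads without further training. As a rough illustration of what a head-level intervention can look like in general, the sketch below rescales the outputs of selected heads before they are concatenated. This is a toy in plain Python, not the authors' method: the function name `combine_heads`, the `head_scales` mapping, the choice of head 1, and all numbers are hypothetical, and the paper's actual intervention may act on attention weights in a different way.

```python
from typing import Dict, List, Sequence


def combine_heads(
    head_outputs: Sequence[Sequence[float]],
    head_scales: Dict[int, float],
) -> List[float]:
    """Concatenate per-head output vectors, rescaling selected heads.

    Heads absent from `head_scales` keep a scale of 1.0.
    (Hypothetical helper for illustration only.)
    """
    combined: List[float] = []
    for index, output in enumerate(head_outputs):
        scale = head_scales.get(index, 1.0)
        combined.extend(scale * value for value in output)
    return combined


# Toy example: three heads, each emitting a 2-dimensional vector.
head_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Assumption: head 1 was flagged as lexically focused by some analysis.
baseline = combine_heads(head_outputs, {})
intervened = combine_heads(head_outputs, {1: 0.0})

print(baseline)    # all heads contribute unchanged
print(intervened)  # head 1's slice is zeroed out
```

In a real Vision Transformer the concatenated vector passes through a linear output projection, so scaling a head's slice scales that head's contribution to the layer output by the same factor. Which heads to scale, and by how much, is exactly what an analysis like the one described in the abstract would need to supply.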
Source: arXiv metadata. Categories: cs.CV, cs.CL.
| # | Alternative | Kind | Access | Fit | Why it appears | Source |
|---|---|---|---|---|---|---|
| 01 | NVIDIA NIM Model Catalog | service | Free endpoint | RDR83 | Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint | build.nvidia.com |
| 02 | Hugging Face Inference Providers | service | Paid API | RDR80 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | huggingface.co |
| 03 | Fireworks AI Serverless Models | service | Paid API | RDR79 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | docs.fireworks.ai |
| 04 | Together AI Serverless Models | service | Paid API | RDR79 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | docs.together.ai |
| 05 | amalia-llm/MATH-Vision-PT | model | Open weights | RDR78 | Matched vision-language, vision language, image-to-text; 2 source links; access model: Open weights; freshly updated | huggingface.co |
| 06 | RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning | paper | Research-only | RDR75 | Matched vision-language, vision language, multimodal; 2 source links; access model: Research-only; freshly updated | arxiv.org |
| 07 | Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models | paper | Research-only | RDR74 | Matched vision-language, vision language, multimodal; 1 source link; access model: Research-only | arxiv.org |