Topic

Visual Localization

Updated 2026.03.31 · 401 papers

NeedleDB: A Generative-AI Based System for Accurate and Efficient Image Retrieval using Complex Natural Language Queries Mahdi Erfanian, Abolfazl Asudeh Updated 2026-03-29

We demonstrate NeedleDB, an open-source, deployment-ready database system for answering complex natural language queries over image data. Unlike existing approaches that rely on contrastive-learning embeddings (e.g., CLIP), which degrade on compositional or nuanced queries, NeedleDB leverages generative AI to synthesize guide images that represent the query in the visual domain, transforming the text-to-image retrieval problem into a more tractable image-to-image search. The system aggregates nearest-neighbor results across multiple vision embedders using a weighted rank-fusion strategy grounded in a Monte Carlo estimator with provable error bounds. NeedleDB ships with a full-featured command-line interface (needlectl), a browser-based Web UI, and a modular microservice architecture backed by PostgreSQL and Milvus. On challenging benchmarks, it improves Mean Average Precision by up to 93% over the strongest baseline while maintaining sub-second query latency. In our demonstration, attendees interact with NeedleDB through three hands-on scenarios that showcase its retrieval capabilities, data ingestion workflow, and pipeline configurability.
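The abstract mentions a weighted rank-fusion step over several vision embedders but does not spell out the estimator. The sketch below is therefore not NeedleDB's method; it uses plain weighted reciprocal rank fusion as a stand-in, and the embedder names, weights, and the constant `k` are all hypothetical.

```python
from collections import defaultdict


def weighted_rank_fusion(rankings, weights, k=60):
    """Fuse ranked ID lists from several embedders into one ranking.

    rankings: dict mapping embedder name -> list of image IDs, best first
    weights:  dict mapping embedder name -> non-negative weight (hypothetical)
    k:        smoothing constant of reciprocal rank fusion (assumed value)
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 0.0)
        for rank, image_id in enumerate(ranked_ids, start=1):
            # Higher-ranked items contribute more; the weight scales each embedder.
            scores[image_id] += w / (k + rank)
    # Sort by fused score, descending; break ties by ID for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))


# Nearest-neighbor results for one guide image from two made-up embedders.
rankings = {
    "embedder_a": ["img3", "img1", "img2"],
    "embedder_b": ["img1", "img3", "img4"],
}
weights = {"embedder_a": 0.7, "embedder_b": 0.3}
fused = weighted_rank_fusion(rankings, weights)
print(fused)
```

Because fusion works on ranks rather than raw distances, embedders with incomparable score scales can be combined without calibration.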

TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval David G. Shatwell, Sirnam Swetha, Mubarak Shah Updated 2026-03-28

Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, location, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and propose TIGeR, a unified framework for Time, Images and Geo-location Retrieval. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By preserving the underlying location identity despite large appearance changes, TIGeR enables retrieval based on where and when a scene was captured, rather than purely on visual similarity. To support this task, we design a multistage data curation pipeline and propose a new diverse dataset of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% on geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni Updated 2026-03-27

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that further improves upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
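For readers unfamiliar with the metric under discussion, here is a minimal sketch of how MACs are usually counted for a 2D convolution. The layer shapes are made up for illustration and are not taken from LowFormer.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored).

    Each output element needs (c_in / groups) * kernel * kernel MACs.
    """
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


# Hypothetical 56x56 feature map with 64 input and 64 output channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)

print(standard)               # MACs of one standard 3x3 convolution
print(depthwise + pointwise)  # MACs of a depthwise-separable replacement
print(standard / (depthwise + pointwise))
```

The separable variant needs roughly 7.9 times fewer MACs here. The abstract's point is that such a ratio need not carry over to measured execution time, which is why it compares MAC counts against actual latency.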

HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, Yupeng Hu Updated 2026-03-27

Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. However, addressing this limitation is not an easy task due to two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus improving the upper performance of CIR models in complex scenarios. Our HINT model achieves optimal performance on all metrics across two CIR benchmark datasets, demonstrating the superiority of our HINT model. Codes are available at https://github.com/zh-mingyu/HINT.

4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation Ningyuan Huang, Zhiheng Li, Zheng Fang Updated 2026-03-27

Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.

Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods Ofer Idan, Vladi Vexler, Gil Lederman, Dima Sivov, Aviad Cohen Zada, Shir Niego Komforti Updated 2026-03-26

Pre-trained vision-language models (VLMs) excel in multimodal tasks, commonly encoding images as embedding vectors for storage in databases and retrieval via approximate nearest neighbor search (ANNS). However, these models struggle with compositional queries and out-of-distribution (OOD) image-text pairs. Inspired by human cognition's ability to learn from minimal examples, we address this performance gap through few-shot learning approaches specifically designed for image retrieval. We introduce the Few-Shot Text-to-Image Retrieval (FSIR) task and its accompanying benchmark dataset, FSIR-BD, the first to explicitly target image retrieval by text accompanied by reference examples, focusing on challenging compositional and OOD queries. The compositional part is divided into urban scenes and nature species, both in specific situations or with distinctive features. FSIR-BD contains 38,353 images and 303 queries, with 82% comprising the test corpus (averaging 37 positives, i.e., ground-truth matches, per query, plus a significant number of hard negatives) and 18% forming the few-shot reference corpus (FSR) of exemplar positive and hard negative images. Additionally, we propose two novel retrieval optimization methods that leverage single-shot or few-shot reference examples in the FSR to improve performance. Both methods are compatible with any pre-trained image encoder, making them applicable to existing large-scale environments. Our experiments demonstrate that: (1) FSIR-BD provides a challenging benchmark for image retrieval; and (2) our optimization methods outperform existing baselines as measured by mean Average Precision (mAP). Further research into FSIR optimization methods will help narrow the gap between machine and human-level understanding, particularly for compositional reasoning from limited examples.
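Several entries on this page report mean Average Precision (mAP). As a reminder of what the number measures, here is a minimal sketch of the usual definition. The benchmark's exact protocol (rank cut-off, tie handling) is not given in the abstract, so the details below are assumptions, and the image IDs are invented.

```python
def average_precision(ranked_ids, positives):
    """Average of precision@rank taken at each rank where a positive appears."""
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in positives:
            hits += 1
            precision_sum += hits / rank
    # Normalizing by all positives penalizes positives that were never retrieved.
    return precision_sum / len(positives)


def mean_average_precision(results):
    """results: list of (ranked_ids, positives) pairs, one per query."""
    return sum(average_precision(r, p) for r, p in results) / len(results)


results = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"y"}),            # AP = 1/2
]
map_score = mean_average_precision(results)
print(round(map_score, 4))
```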

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz Updated 2026-03-26

Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
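The reported metric, Recall@1 within a distance threshold, can be sketched as follows. This assumes that a query counts as correct when the top-1 predicted location lies within the threshold of the ground truth by great-circle distance; the benchmark's exact definition is not stated in the abstract, and the coordinates are invented.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def recall_at_1_within(predictions, ground_truth, threshold_m):
    """Fraction of queries whose top-1 prediction is within threshold_m meters."""
    hits = sum(
        haversine_m(*pred, *truth) <= threshold_m
        for pred, truth in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)


ground_truth = [(40.0, -83.0)] * 3
predictions = [
    (40.0, -83.0),     # exact match
    (40.0006, -83.0),  # roughly 67 m north
    (40.002, -83.0),   # roughly 222 m north
]
r50 = recall_at_1_within(predictions, ground_truth, 50)
r100 = recall_at_1_within(predictions, ground_truth, 100)
print(r50, r100)
```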

On-Demand Instructional Material Providing Agent Based on MLLM for Tutoring Support Takumi Kato, Masato Kikuchi, Tadachika Ozono Updated 2026-03-26

Effective instruction in tutoring requires promptly providing instructional materials that match the needs of each student (e.g., in response to questions). In this study, we introduce an agent that automatically delivers supplementary materials on demand during one-on-one tutoring sessions. Our agent uses a multimodal large language model to analyze spoken dialogue between the instructor and the student, automatically generate search queries, and retrieve relevant Web images. Evaluation experiments demonstrate that our agent reduces the average image retrieval time by 44.4 s compared to cases without support and successfully provides images of acceptable quality in 85.7% of trials. These results indicate that our agent effectively supports instructors during tutoring sessions.

Sparse Autoencoders for Interpretable Medical Image Representation Learning Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis Updated 2026-03-24

Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity (R² up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation, and (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: https://github.com/pwesp/sail.

ARGENT: Adaptive Hierarchical Image-Text Representations Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar Updated 2026-03-24

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics, which are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.

Retrieval-Guided Photovoltaic Inventory Estimation from Satellite Imagery for Distribution Grid Planning Muhao Guo, Lihao Mai, Erik Blasch, Jafarali Parol, Turki Rakan, Yang Weng Updated 2026-03-24

The rapid expansion of distributed rooftop photovoltaic (PV) systems introduces increasing uncertainty in distribution grid planning, hosting capacity assessment, and voltage regulation. Reliable estimation of rooftop PV deployment from satellite imagery is therefore essential for accurate modeling of distributed generation at feeder and service-territory scales. However, conventional computer vision approaches rely on fixed learned representations and globally averaged visual correlations. This makes them sensitive to geographic distribution shifts caused by differences in roof materials, urban morphology, and imaging conditions across regions. To address these challenges, this paper proposes Solar Retrieval-Augmented Generation (Solar-RAG), a context-grounded framework for photovoltaic assessment that integrates similarity-based image retrieval with multimodal vision-language reasoning. Instead of producing predictions solely from internal model parameters, the proposed approach retrieves visually similar rooftop scenes with verified annotations and performs comparative reasoning against these examples during inference. This retrieval-guided mechanism provides geographically contextualized references that improve robustness under heterogeneous urban environments without requiring model retraining. The method outperforms both conventional deep vision models and standalone vision-language models. Furthermore, feeder-level case studies show that improved PV inventory estimation reduces errors in voltage deviation analysis and hosting capacity assessment. The results demonstrate that the proposed method provides a scalable and geographically robust approach for monitoring distributed PV deployment. This enables more reliable integration of remote sensing data into distribution grid planning and distributed energy resource management.

SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts Khanh Binh Nguyen, Chae Jung Park Updated 2026-03-24

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.

HyFI: Hyperbolic Feature Interpolation for Brain-Vision Alignment Sangmin Jo, Wootaek Jeong, Da-Woon Heo, Yoohwan Hwang, Heung-Il Suk Updated 2026-03-24

Recent progress in artificial intelligence has encouraged numerous attempts to understand and decode the human visual system from brain signals. These prior works typically align neural activity independently with semantic and perceptual features extracted from images using pre-trained vision models. However, they fail to account for two key challenges: (1) the modality gap arising from the natural difference in the information level of representation between brain signals and images, and (2) the fact that semantic and perceptual features are highly entangled within neural activity. To address these issues, we utilize hyperbolic space, which is well-suited for considering differences in the amount of information and has the geometric property that geodesics between two points naturally bend toward the origin, where the representational capacity is lower. Leveraging these properties, we propose a novel framework, Hyperbolic Feature Interpolation (HyFI), which interpolates between semantic and perceptual visual features along hyperbolic geodesics. This enables both the fusion and compression of perceptual and semantic information, effectively reflecting the limited expressiveness of brain signals and the entangled nature of these features. As a result, it facilitates better alignment between brain and visual features. We demonstrate that HyFI achieves state-of-the-art performance in zero-shot brain-to-image retrieval, outperforming prior methods with Top-1 accuracy improvements of up to +17.3% on THINGS-EEG and +9.1% on THINGS-MEG.

ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang, Guanxuan Li, Richard Mccreadie, Zijun Long Updated 2026-03-23

Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.

SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval Qunjie Huang, Weina Zhu Updated 2026-03-21

Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert, subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, L2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding.
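To make the hubness problem concrete, the sketch below applies standard Cross-domain Similarity Local Scaling (CSLS) to a small similarity matrix. It is not SATTC's adaptive variant, which the abstract does not specify; the matrix values and the neighborhood size `k` are made up.

```python
def csls(sim, k=2):
    """Standard CSLS rescoring of a similarity matrix.

    sim[i][j] is the cosine similarity between query i and candidate j.
    Each score becomes 2 * sim minus the mean similarity of the query to its
    k nearest candidates and of the candidate to its k nearest queries.
    """
    n_queries, n_candidates = len(sim), len(sim[0])

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    r_query = [mean_top_k(row) for row in sim]
    r_cand = [mean_top_k([sim[i][j] for i in range(n_queries)]) for j in range(n_candidates)]
    return [
        [2 * sim[i][j] - r_query[i] - r_cand[j] for j in range(n_candidates)]
        for i in range(n_queries)
    ]


def top1(scores):
    """Index of the best-scoring candidate for each query."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Candidate 1 is a hub: it is similar to every query.
sim = [
    [0.90, 0.80, 0.10],
    [0.20, 0.85, 0.30],
    [0.10, 0.80, 0.70],
]
raw_top1 = top1(sim)
csls_top1 = top1(csls(sim))
print(raw_top1, csls_top1)
```

With raw cosine similarity the hub wins two of the three queries. After CSLS its scores are penalized for being close to everything, and the third query switches to candidate 2.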

A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation Ling Xiao, Toshihiko Yamasaki Updated 2026-03-21

Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available at https://github.com/Dr-LingXiao/MCL-FIR.
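The two building blocks named here, InfoNCE and an EMA teacher, have well-known generic forms, sketched below. This is not MCL-FIR's implementation: how the paper forms doublets, which parameters are averaged, and the decay and temperature values are all assumptions.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE loss for one anchor: -log softmax(sim / T) at the positive."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[positive_index]


def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student's value."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# One anchor scored against its positive (index 0) and two negatives.
loss = info_nce([0.9, 0.2, 0.1], positive_index=0)

# Scalar stand-ins for model weights (hypothetical names).
teacher = {"head.weight": 1.0}
student = {"head.weight": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
print(loss, teacher)
```

In EMA distillation the slowly moving teacher typically supplies targets for the student on earlier classes, so no separate frozen copy per increment has to be trained.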

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov Updated 2026-03-20

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1 Jun Wang, Xiaoyan Huang Updated 2026-03-20

Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
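The abstract names a rotational homography H_inf without giving its form. In standard multi-view geometry the homography induced by a pure rotation is H_inf = K R K^-1. The sketch below applies that textbook formula to a single pixel, assuming one pinhole camera with shared intrinsics for both views; how IUP-Pose applies it to feature maps, and its rotation convention, are not shown here.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def infinite_homography(fx, fy, cx, cy, rotation):
    """H_inf = K R K^-1 for a pinhole camera with zero skew."""
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    # Closed-form inverse of the intrinsics matrix.
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(k, rotation), k_inv)


def warp_pixel(h, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w


# Hypothetical intrinsics and a 90-degree rotation about the optical axis.
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
h_inf = infinite_homography(500.0, 500.0, 320.0, 240.0, rot_z_90)
warped = warp_pixel(h_inf, 330.0, 240.0)
print(warped)
```

A pixel 10 px to the right of the principal point lands 10 px below it, as expected for a quarter turn about the optical axis. Removing the rotation this way leaves only translation-induced parallax between the two views.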

MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval Xuri Ge, Chunhao Wang, Xindi Wang, Zheyun Qin, Zhumin Chen, Xin Xin Updated 2026-03-18

Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.

VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan Updated 2026-03-18

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, achieves an accuracy of only 47.6%, while the proprietary Deep Research model, o3-deep-research, achieves an accuracy of only 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench

Visual Product Search Benchmark Karthik Sulthanpete Govindappa Updated 2026-03-17

Reliable product identification from images is a critical requirement in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows where incorrect matches can lead to costly downstream failures. At the core of such systems lies the visual search component, which must retrieve and rank the exact object instance from large and continuously evolving catalogs under diverse imaging conditions. This report presents a structured benchmark of modern visual embedding models for instance-level image retrieval, with a focus on industrial applications. A curated set of open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models is evaluated under a unified image-to-image retrieval protocol. The benchmark includes curated datasets, among them industrial datasets derived from production deployments in Manufacturing, Automotive, DIY, and Retail, as well as established public benchmarks. Evaluation is conducted without post-processing, isolating the retrieval capability of each model. The results provide insight into how well contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks, and how they compare to models explicitly trained for industrial applications. By emphasizing realistic constraints, heterogeneous image conditions, and exact instance matching requirements, this benchmark aims to inform both practitioners and researchers about the strengths and limitations of current visual embedding approaches in production-level product identification systems. An interactive companion website presenting the benchmark results, evaluation details, and additional visualizations is available at https://benchmark.nyris.io.

Retrieving Counterfactuals Improves Visual In-Context Learning Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang Updated 2026-03-17

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.

HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture Aojie Yuan Updated 2026-03-17

Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.
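
The Hamming-distance search mentioned in the abstract can be illustrated with a minimal sketch. It assumes hash codes are stored as Python integers and uses a brute-force scan; the names `hamming_distance`, `hamming_search`, and the toy 8-bit codes are hypothetical and are not taken from HMAR, whose codes are 64-bit or 128-bit and produced by KAN layers.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def hamming_search(query: int, database: dict[str, int], k: int = 3) -> list[tuple[str, int]]:
    """Return the k database items closest to the query in Hamming distance."""
    scored = [(name, hamming_distance(query, code)) for name, code in database.items()]
    # Sort by distance, then by name so ties are resolved deterministically.
    scored.sort(key=lambda item: (item[1], item[0]))
    return scored[:k]


# Toy 8-bit codes (assumption for illustration only).
database = {
    "ct_001": 0b10110010,
    "ct_002": 0b10110011,
    "ct_003": 0b01001101,
}
results = hamming_search(0b10110000, database, k=2)
```

Because the comparison reduces to an XOR and a bit count, a linear scan like this stays cheap even for large galleries, which is the usual motivation for binary hash codes in retrieval.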

Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty Mangyu Kong, Jaewon Lee, Seongwon Lee, Euntai Kim Updated 2026-03-17

3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.

Evaluation of Visual Place Recognition Methods for Image Pair Retrieval in 3D Vision and Robotics Dennis Haitz, Athradi Shritish Shetty, Michael Weinmann, Markus Ulrich Updated 2026-03-14

Visual Place Recognition (VPR) is a core component in computer vision, typically formulated as an image retrieval task for localization, mapping, and navigation. In this work, we instead study VPR as an image pair retrieval front-end for registration pipelines, where the goal is to find top-matching image pairs between two disjoint image sets for downstream tasks such as scene registration, SLAM, and Structure-from-Motion. We comparatively evaluate state-of-the-art VPR families - NetVLAD-style baselines, classification-based global descriptors (CosPlace, EigenPlaces), feature-mixing (MixVPR), and foundation-model-driven methods (AnyLoc, SALAD, MegaLoc) - on three challenging datasets: object-centric outdoor scenes (Tanks and Temples), indoor RGB-D scans (ScanNet-GS), and autonomous-driving sequences (KITTI). We show that modern global descriptor approaches are increasingly suitable as off-the-shelf image pair retrieval modules in challenging scenarios including perceptual aliasing and incomplete sequences, while exhibiting clear, domain-dependent strengths and weaknesses that are critical when choosing VPR components for robust mapping and registration.

Sky2Ground: A Benchmark for Site Modeling under Varying Altitude Zengyan Wang, Sirshapan Mitra, Rajat Modi, Grace Lim, Yogesh Rawat Updated 2026-03-14

We introduce Sky2Ground, a three-view dataset designed for varying altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state-of-the-art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that the use of satellite imagery often degrades performance, highlighting the challenges under large altitude variations. We also examine reconstruction methods, highlighting the challenges introduced by sparse geometric overlap, varying perspectives, and the use of real imagery, which often introduces noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model that enhances cross-view consistency when incorporating satellite imagery, using a curriculum-based training strategy to progressively incorporate more satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in terms of absolute performance. Sky2Ground and SkyNet together establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research.

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan Updated 2026-03-13

While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.

Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval Jing Yang, Hui Xue, Shipeng Zhu, Pengfei Fang Updated 2026-03-13

This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images Liangzheng Sun, Mengfan He, Xingyu Shao, Binbin Li, Zhiqiang Yan, Chunyu Li, Ziyang Meng, Fei Xing Updated 2026-03-13

Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, cross-modal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluations. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semi-dense, and dense methods. These methods are evaluated on different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resources will be available at: https://github.com/SLZ98/CM-Bench.

FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval Chenchen Zhao, Jianhuan Zhuo, Muxi Chen, Zhaohua Zhang, Wenyu Jiang, Tianwen Jiang, Qiuyong Xiao, Jihong Zhang, Qiang Xu Updated 2026-03-12

Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on these analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.

Efficient Cross-View Localization in 6G Space-Air-Ground Integrated Network Min Hao, Yanbing Xu, Maoqiang Wu, Jinglin Huang, Chen Shang, Jiacheng Wang, Ruichen Zhang, Jiawen Kang, Dusit Niyato, Zhu Han, Wei Ni Updated 2026-03-12

Recently, visual localization has become an important supplement to improve localization reliability, and cross-view approaches can greatly enhance coverage and adaptability. Meanwhile, future 6G will enable a globally covered mobile communication system, with a space-air-ground integrated network (SAGIN) serving as key supporting architecture. Inspired by this, we explore an integration of cross-view localization (CVL) with 6G SAGIN, thereby enhancing its performance in latency, energy consumption, and privacy protection. First, we provide a comprehensive review of CVL and SAGIN, highlighting their capabilities, integration opportunities, and potential applications. Benefiting from the fast and extensive image collection and transmission capabilities of the 6G SAGIN architecture, CVL achieves higher localization accuracy and faster processing speed. Then, we propose a split-inference framework for implementing CVL, which fully leverages the distributed communication and computing resources of the 6G SAGIN architecture. Subsequently, we conduct joint optimization of communication, computation, and confidentiality within the proposed split-inference framework, aiming to provide a paradigm and a direction for making CVL efficient. Experimental results validate the effectiveness of the proposed framework and provide solutions to the optimization problem. Finally, we discuss potential research directions for 6G SAGIN-enabled CVL.

Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia, Harvey Lui, Chunqi Chang, Z. Jane Wang, Tim K. Lee Updated 2026-03-10

Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
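
As a rough illustration of a convex weighting between global and local similarity, the sketch below assumes that per-mask local similarities are averaged and that a single weight `alpha` in [0, 1] controls the emphasis on local evidence. The function name `fused_similarity`, the averaging step, and the default `alpha` are assumptions for illustration; the abstract does not specify how the weights are chosen.

```python
def fused_similarity(global_sim: float, local_sims: list[float], alpha: float = 0.7) -> float:
    """Convex combination of local and global similarity scores.

    alpha weights the local evidence; (1 - alpha) weights the global score.
    Averaging over attention masks is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if not local_sims:
        raise ValueError("at least one local similarity is required")
    local_sim = sum(local_sims) / len(local_sims)
    return alpha * local_sim + (1.0 - alpha) * global_sim


# Two attention masks give local similarities 0.9 and 0.7; the global score is 0.5.
score = fused_similarity(0.5, [0.9, 0.7], alpha=0.75)
```

Keeping the weights non-negative and summing to one means the fused score always lies between the local and global scores, so neither source of evidence can push the result outside their range.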

$L^3$: Scene-agnostic Visual Localization in the Wild Yu Zhang, Muhua Zhu, Yifei Xue, Tie Ji, Yizhen Lao Updated 2026-03-09

Standard visual localization methods typically require offline pre-processing of scenes to obtain 3D structural information for better performance. This inevitably introduces additional computational and time costs, as well as the overhead of storing scene representations. Can we visually localize in a wild scene without any offline pre-processing step? In this paper, we leverage the online inference capabilities of feed-forward 3D reconstruction networks to propose a novel map-free visual localization framework, $L^3$. Specifically, by performing direct online 3D reconstruction on RGB images, followed by two-stage metric scale recovery and pose refinement based on 2D-3D correspondences, $L^3$ achieves high accuracy without the need to pre-build or store any offline scene representations. Extensive experiments demonstrate not only that the performance of $L^3$ is comparable to state-of-the-art solutions on various benchmarks, but also that it exhibits significantly superior robustness in sparse scenes (fewer reference images per scene).

QdaVPR: A novel query-based domain-agnostic model for visual place recognition Shanshan Wan, Lai Kang, Yingmei Wei, Tianrui Shen, Haixuan Wang, Chao Zuo Updated 2026-03-08

Visual place recognition (VPR), which aims to predict the location of an image based solely on its visual features, is a fundamental task in robotics and autonomous systems. Domain variation remains one of the main challenges in VPR and is relatively unexplored. Existing VPR models attempt to achieve domain agnosticism either by training on large-scale datasets that inherently contain some domain variations, or by being specifically adapted to particular target domains. In practice, the former lacks explicit domain supervision, while the latter generalizes poorly to unseen domain shifts. This paper proposes a novel query-based domain-agnostic VPR model called QdaVPR. First, a dual-level adversarial learning framework is designed to encourage domain invariance for both the query features forming the global descriptor and the image features from which these query features are derived. Then, a triplet supervision based on query combinations is designed to enhance the discriminative power of the global descriptors. To support the learning process, we augment a large-scale VPR dataset using style transfer methods, generating various synthetic domains with corresponding domain labels as auxiliary supervision. Extensive experiments show that QdaVPR achieves state-of-the-art performance on multiple VPR benchmarks with significant domain variations. Specifically, it attains the best Recall@1 and Recall@10 on nearly all test scenarios: 93.5%/98.6% on Nordland (seasonal changes), 97.5%/99.0% on Tokyo24/7 (day-night transitions), and the highest Recall@1 across almost all weather conditions on the SVOX dataset. Our code will be released at https://github.com/shuimushan/QdaVPR.

T2Nav Algebraic Topology Aware Temporal Graph Memory and Loop Detection for ZeroShot Visual Navigation Quang-Anh N. D., Duc Pham, Minh-Anh Nguyen, Tung Doan, Tuan Dang Updated 2026-03-06

Deploying autonomous agents in real-world environments is challenging, particularly for navigation, where systems must adapt to situations they have not encountered before. Traditional learning approaches require substantial amounts of data, constant tuning, and, sometimes, starting over for each new task. That makes them hard to scale and not very flexible. Recent breakthroughs in foundation models, such as large language models and vision language models, enable systems to attempt new navigation tasks without requiring additional training. However, many of these methods only work with specific input types, employ relatively basic reasoning, and fail to fully exploit the details they observe or the structure of the spaces. Here, we introduce T2Nav, a zero-shot navigation system that integrates heterogeneous data and employs graph-based reasoning. By directly incorporating visual information into the graph and matching it to the environment, our approach enables the system to strike a good balance between exploration and goal attainment. This strategy allows robust obstacle avoidance, reliable loop closure detection, and efficient path planning while eliminating redundant exploration patterns. The system demonstrates flexibility by handling goals specified using reference images of target object instances, making it particularly suitable for scenarios in which agents must navigate to visually similar yet spatially distinct instances. Experiments demonstrate that our approach is efficient and adapts well to unknown environments, moving toward practical zero-shot instance-image navigation capabilities.

EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition Adam D. Hines, Gokul B. Nair, Nicolás Marticorena, Michael Milford, Tobias Fischer Updated 2026-03-06

Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global patch feature embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work achieves state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, all whilst being fully capable of running in real time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: https://eventgemvpr.github.io/

Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval Donghoon Han, Eunhwan Park, Seunghyeon Seo Updated 2026-03-06

Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq$ 0.993, enabling a two-stage pipeline that reranks only $K{=}200$ candidates per query and recovers near-dense accuracy within $0.2\%$ on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
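
To make the scoring idea concrete, here is a minimal sketch of Okapi BM25 applied to bags of visual-word ids. It assumes each image is represented as a list of integer word ids (one per activated patch) and uses the common smoothed IDF variant with default `k1` and `b`; the helper names, the toy gallery, and these parameter values are illustrative assumptions, not the BM25-V implementation.

```python
import math
from collections import Counter


def build_index(gallery: dict[str, list[int]]) -> tuple[dict[int, float], float]:
    """Compute per-word IDF and the average document length for a gallery."""
    n_docs = len(gallery)
    doc_freq: Counter = Counter()
    for words in gallery.values():
        doc_freq.update(set(words))  # count each word once per image
    idf = {w: math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0) for w, df in doc_freq.items()}
    avg_len = sum(len(words) for words in gallery.values()) / n_docs
    return idf, avg_len


def bm25_score(query_words, doc_words, idf, avg_len, k1=1.2, b=0.75) -> float:
    """Okapi BM25 score of one gallery image for a bag of query visual words."""
    tf = Counter(doc_words)
    length_norm = k1 * (1.0 - b + b * len(doc_words) / avg_len)
    score = 0.0
    for w in set(query_words):
        if w in tf:
            score += idf[w] * tf[w] * (k1 + 1.0) / (tf[w] + length_norm)
    return score


# Toy gallery: word 1 appears in every image (low IDF), word 7 is rarer (higher IDF).
gallery = {
    "img_a": [1, 1, 7, 9],
    "img_b": [1, 2, 3, 4],
    "img_c": [1, 5, 7, 7],
}
idf, avg_len = build_index(gallery)
query = [1, 7]
ranking = sorted(gallery, key=lambda i: bm25_score(query, gallery[i], idf, avg_len), reverse=True)
```

In this toy gallery the ubiquitous word contributes little to any score, while the rarer word dominates the ranking, which mirrors the IDF behaviour the abstract describes. A production system would score through an inverted index rather than scanning every image.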

Loop Closure via Maximal Cliques in 3D LiDAR-Based SLAM Javier Laserna, Saurabh Gupta, Oscar Martinez Mozos, Cyrill Stachniss, Pablo San Segundo Updated 2026-03-05

Reliable loop closure detection remains a critical challenge in 3D LiDAR-based SLAM, especially under sensor noise, environmental ambiguity, and viewpoint variation conditions. RANSAC is often used in the context of loop closures for geometric model fitting in the presence of outliers. However, this approach may fail, leading to map inconsistency. We introduce a novel deterministic algorithm, CliReg, for loop closure validation that replaces RANSAC verification with a maximal clique search over a compatibility graph of feature correspondences. This formulation avoids random sampling and increases robustness in the presence of noise and outliers. We integrated our approach into a real-time pipeline employing binary 3D descriptors and a Hamming distance embedding binary search tree-based matching. We evaluated it on multiple real-world datasets featuring diverse LiDAR sensors. The results demonstrate that our proposed technique consistently achieves a lower pose error and more reliable loop closures than RANSAC, especially in sparse or ambiguous conditions. Additional experiments on 2D projection-based maps confirm its generality across spatial domains, making our approach a robust and efficient alternative for loop closure detection.
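
The general idea of a clique search over a compatibility graph can be sketched as follows. This is not the CliReg algorithm: it assumes a simple pairwise test (two correspondences are compatible when they preserve the distance between their points, as a rigid transform would) and uses a textbook Bron-Kerbosch search with pivoting. The function names, the tolerance `tol`, and the toy points are hypothetical.

```python
import math
from itertools import combinations


def build_compatibility_graph(src, dst, tol=0.1):
    """Connect correspondences i and j if they preserve pairwise distance within tol."""
    adj = {i: set() for i in range(len(src))}
    for i, j in combinations(range(len(src)), 2):
        if abs(math.dist(src[i], src[j]) - math.dist(dst[i], dst[j])) <= tol:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def largest_clique(adj):
    """Deterministic Bron-Kerbosch search (with pivoting) for the largest clique."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


# Four correspondences related by a pure translation, plus one outlier (index 4).
src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
dst = [(5, 0, 0), (6, 0, 0), (5, 1, 0), (5, 0, 1), (5, 5, 9)]
inliers = largest_clique(build_compatibility_graph(src, dst))
```

Unlike RANSAC, nothing here is sampled at random, so repeated runs on the same correspondences return the same inlier set. The worst-case cost of clique enumeration is exponential, so practical systems depend on sparse graphs and stronger pruning.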

PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing Rohan Mahadev, Joyce Yuan, Patrick Poirson, David Xue, Hao-Yu Wu, Dmitry Kislyuk Updated 2026-03-04

Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks. The best methods, while achieving an mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.

SSR: A Generic Framework for Text-Aided Map Compression for Localization Mohammad Omama, Po-han Li, Harsh Goel, Minkyu Choi, Behdad Chalaki, Vaishnav Tadiparthi, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Sandeep P. Chinchali Updated 2026-03-04

Mapping is crucial in robotics for localization and downstream decision-making. As robots are deployed in ever-broader settings, the maps they rely on continue to increase in size. However, storing these maps indefinitely (cold storage), transferring them across networks, or sending localization queries to cloud-hosted maps imposes prohibitive memory and bandwidth costs. We propose a text-enhanced compression framework that reduces both memory and bandwidth footprints while retaining high-fidelity localization. The key idea is to treat text as an alternative modality: one that can be losslessly compressed with large language models. We propose leveraging lightweight text descriptions combined with very small image feature vectors, which capture "complementary information" as a compact representation for the mapping task. Building on this, our novel technique, Similarity Space Replication (SSR), learns an adaptive image embedding in one shot that captures only the information "complementary" to the text descriptions. We validate our compression framework on multiple downstream localization tasks, including Visual Place Recognition as well as object-centric Monte Carlo localization in both indoor and outdoor settings. SSR achieves 2 times better compression than competing baselines on state-of-the-art datasets, including TokyoVal, Pittsburgh30k, Replica, and KITTI.

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark Martin Kvisvik Larsen, Oscar Pizarro Updated 2026-03-04

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
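
The effect of the ground-truth definition on Recall@K can be shown with a small sketch. It assumes Recall@K is the fraction of queries with at least one true positive among their top-K retrieved references; the function `recall_at_k` and the toy rankings and positive sets are hypothetical and do not come from the dataset.

```python
def recall_at_k(rankings: dict[str, list[str]], positives: dict[str, set[str]], k: int) -> float:
    """Fraction of queries with at least one true positive in their top-k results."""
    hits = sum(1 for query, ranked in rankings.items() if set(ranked[:k]) & positives[query])
    return hits / len(rankings)


# Retrieved references per query, best match first (toy data).
rankings = {"q1": ["r3", "r1"], "q2": ["r2", "r4"]}

# Footprint-based positives: only references whose seafloor footprints overlap the query.
footprint_gt = {"q1": {"r1"}, "q2": {"r4"}}

# Distance-threshold positives: r3 is nearby but shares no visual content with q1.
distance_gt = {"q1": {"r1", "r3"}, "q2": {"r4"}}

footprint_recall = recall_at_k(rankings, footprint_gt, k=1)
distance_recall = recall_at_k(rankings, distance_gt, k=1)
```

In this toy case the distance-threshold ground truth counts `r3` as a correct match for `q1` and so reports a higher Recall@1 than the footprint-based ground truth, illustrating the kind of overestimation the abstract reports.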

VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep Updated 2026-02-26

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang Updated 2026-02-26

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

Autoregressive Visual Decoding from EEG Signals Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye Updated 2026-02-26

Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.

Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning Guoyizhe Wei, Yang Jiao, Nan Xi, Zhishen Huang, Jingjing Meng, Rama Chellappa, Yan Gao Updated 2026-02-26

Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.

Global-Aware Edge Prioritization for Pose Graph Initialization Tong Wei, Giorgos Tolias, Jiri Matas, Daniel Barath Updated 2026-02-25

The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The code and trained models are available at https://github.com/weitong8591/global_edge_prior.
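For orientation, the retrieval-based initialization that the abstract describes as the existing practice can be sketched as below. This covers only that $k$-nearest-neighbor baseline, not the proposed GNN ranking, spanning-tree construction, or score modulation. The function name `knn_edges` and the similarity values are hypothetical.

```python
def knn_edges(similarity, k):
    """Connect each image to its k most similar images.

    similarity: square list of lists, similarity[i][j] between images i and j
    Returns a sorted list of undirected candidate edges (i, j) with i < j.
    Each image picks its neighbors independently of every other image.
    """
    n = len(similarity)
    edges = set()
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: similarity[i][j],
            reverse=True,
        )
        for j in neighbors[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Hypothetical global-descriptor similarities for four images.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]
edges = knn_edges(sim, k=1)
print(edges)
```

Because every image chooses its neighbors in isolation, nothing in this baseline guarantees a connected or well-conditioned graph, which is the gap the edge-prioritization approach targets.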

Automatic Map Density Selection for Locally-Performant Visual Place Recognition Somayeh Hussaini, Tobias Fischer, Michael Milford Updated 2026-02-25

A key challenge in translating Visual Place Recognition (VPR) from the lab to long-term deployment is ensuring a priori that a system can meet user-specified performance requirements across different parts of an environment, rather than just on average globally. A critical mechanism for controlling local VPR performance is the density of the reference mapping database, yet this factor is largely neglected in existing work, where benchmark datasets with fixed, engineering-driven (sensors, storage, GPS frequency) sampling densities are typically used. In this paper, we propose a dynamic VPR mapping approach that uses pairs of reference traverses from the target environment to automatically select an appropriate map density to satisfy two user-defined requirements: (1) a target Local Recall@1 level, and (2) the proportion of the operational environment over which this requirement must be met or exceeded, which we term the Recall Achievement Rate (RAR). Our approach is based on the hypothesis that match patterns between multiple reference traverses, evaluated across different map densities, can be modelled to predict the density required to meet these performance targets on unseen deployment data. Through extensive experiments across multiple VPR methods and the Nordland and Oxford RobotCar benchmarks, we show that our system consistently achieves or exceeds the specified local recall level over at least the user-specified proportion of the environment. Comparisons with alternative baselines demonstrate that our approach reliably selects the correct operating point in map density, avoiding unnecessary over-densification. Finally, ablation studies and analysis evaluate sensitivity to reference map choice and local space definitions, and reveal that conventional global Recall@1 is a poor predictor of the often more operationally meaningful RAR metric.
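The two user-defined requirements can be made concrete with a short sketch. It assumes, purely for illustration, that a "local" region is a fixed-size window of consecutive queries along a traverse; the abstract does not specify the local space definition, and the function names and data below are hypothetical.

```python
def local_recall_at_1(correct, segment_size):
    """Recall@1 computed separately for each window of consecutive queries.

    correct: list of 0/1 flags, 1 if the top-1 match of that query is correct
    segment_size: number of consecutive queries per local segment (assumed)
    """
    recalls = []
    for start in range(0, len(correct), segment_size):
        window = correct[start:start + segment_size]
        recalls.append(sum(window) / len(window))
    return recalls


def recall_achievement_rate(local_recalls, target):
    """Proportion of local segments whose Recall@1 meets or exceeds the target."""
    if not local_recalls:
        return 0.0
    met = sum(1 for r in local_recalls if r >= target)
    return met / len(local_recalls)


# Hypothetical per-query top-1 correctness along one traverse (12 queries).
correct = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]

local = local_recall_at_1(correct, segment_size=4)
rar = recall_achievement_rate(local, target=0.75)
global_recall = sum(correct) / len(correct)
print(local, rar, global_recall)
```

Here the global Recall@1 of about 0.67 hides the fact that one of the three segments reaches only 0.25, which is the kind of local shortfall the RAR is meant to expose.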

Seeing Through Words: Controlling Visual Retrieval Quality with Language Models Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu Updated 2026-02-24

Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLM) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.

LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments Zeyu Jiang, Kuan Xu, Changhao Chen Updated 2026-02-24

Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.

Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan Updated 2026-02-24

Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.

Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, Kening Zheng, Xu Zheng, Philip S. Yu, James Kwok, Xuming Hu Updated 2026-02-23

With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.

VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments Jingyi Xu, Zhangshuo Qi, Zhongmiao Yan, Xuyu Gao, Qianyun Jiao, Songpengcheng Xia, Xieyuanli Chen, Ling Pei Updated 2026-02-23

In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically-rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.

Evaluating the Impact of Data Anonymization on Image Retrieval Marvin Chen, Manuel Eberhardinger, Johannes Maucher Updated 2026-02-23

With the growing importance of privacy regulations such as the General Data Protection Regulation, anonymizing visual data is becoming increasingly relevant across institutions. However, anonymization can negatively affect the performance of Computer Vision systems that rely on visual features, such as Content-Based Image Retrieval (CBIR). Despite this, the impact of anonymization on CBIR has not been systematically studied. This work addresses this gap, motivated by the DOKIQ project, an artificial intelligence-based system for document verification actively used by the State Criminal Police Office Baden-Württemberg. We propose a simple evaluation framework: retrieval results after anonymization should match those obtained before anonymization as closely as possible. To this end, we systematically assess the impact of anonymization using two public datasets and the internal DOKIQ dataset. Our experiments span three anonymization methods, four anonymization degrees, and four training strategies, all based on the state-of-the-art backbone Self-Distillation with No Labels (DINO)v2. Our results reveal a pronounced retrieval bias in favor of models trained on original data, which produce the most similar retrievals after anonymization. The findings of this paper offer practical insights for developing privacy-compliant CBIR systems while preserving performance.

Knowledge-aware Visual Question Generation for Remote Sensing Images Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia Updated 2026-02-22

With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing image retrieval. However, automatically generated image-based questions tend to be simplistic and template-based, which hinders the real deployment of question answering or visual dialogue systems. To enrich and diversify the questions, we propose a knowledge-aware remote sensing visual question generation model, KRSVQG, that incorporates external knowledge related to the image content to improve the quality and contextual understanding of the generated questions. The model takes an image and a related knowledge triplet from external knowledge sources as inputs and leverages image captioning as an intermediary representation to enhance the image grounding of the generated questions. To assess the performance of KRSVQG, we utilized two datasets that we manually annotated: NWPU-300 and TextRS-300. Results on these two datasets demonstrate that KRSVQG outperforms existing methods and leads to knowledge-enriched questions, grounded in both image and domain knowledge.

Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia Updated 2026-02-22

With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy, enabling the model's adaptation to low data regimes. To evaluate the proposed KRSVQG model, we construct two knowledge-aware remote sensing visual question generation datasets: the NWPU-300 dataset and the TextRS-300 dataset. Evaluations, including metrics and human assessment, demonstrate that KRSVQG outperforms existing methods and leads to rich questions, grounded in both image and domain knowledge. As a key practice in vision-language research, knowledge-aware visual question generation advances the understanding of image content beyond pixels, facilitating the development of knowledge-enriched vision-language systems with vision-grounded human commonsense.

IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, Zhizhong Su Updated 2026-02-21

Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.

VQPP: Video Query Performance Prediction Benchmark Adrian Catalin Lutu, Eduard Poesina, Radu Tudor Ionescu Updated 2026-02-19

Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as a reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.

DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition Ji Li, Zhiwei Li, Shihao Li, Zhenjiang Yu, Boyang Wang, Haiou Liu Updated 2026-02-12

Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.

Arbitrary Ratio Feature Compression via Next Token Prediction Yufan Liu, Daoyuan Ren, Zhipeng Zhang, Wenyang Luo, Bing Li, Weiming Hu, Stephen Maybank Updated 2026-02-12

Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou Updated 2026-02-11

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning Mert Sonmezer, Serge Vasylechko, Duygu Atasoy, Seyda Ertekin, Sila Kurugol Updated 2026-02-10

Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at https://github.com/quin-med-harvard-edu/WristMIR.

OSCAR: Optimization-Steered Agentic Planning for Composed Image Retrieval Teng Wang, Rong Shan, Jianghao Lin, Junjie Wu, Tianyi Xu, Jianping Zhang, Wenteng Chen, Changwang Zhang, Zhaoxiang Wang, Weinan Zhang, Jun Wang Updated 2026-02-09

Composed image retrieval (CIR) requires complex reasoning over heterogeneous visual and textual constraints. Existing approaches largely fall into two paradigms: unified embedding retrieval, which suffers from single-model myopia, and heuristic agentic retrieval, which is limited by suboptimal, trial-and-error orchestration. To this end, we propose OSCAR, an optimization-steered agentic planning framework for composed image retrieval. We are the first to reformulate agentic CIR from a heuristic search process into a principled trajectory optimization problem. Instead of relying on heuristic trial-and-error exploration, OSCAR employs a novel offline-online paradigm. In the offline phase, we model CIR via atomic retrieval selection and composition as a two-stage mixed-integer programming problem, mathematically deriving optimal trajectories that maximize ground-truth coverage for training samples via rigorous boolean set operations. These trajectories are then stored in a golden library to serve as in-context demonstrations that steer the VLM planner at online inference time. Extensive experiments on three public benchmarks and a private industrial benchmark show that OSCAR consistently outperforms SOTA baselines. Notably, it achieves superior performance using only 10% of training data, demonstrating strong generalization of planning logic rather than dataset-specific memorization.

A Sketch+Text Composed Image Retrieval Dataset for Thangka Jinyu Xu, Yi Sun, Jiangling Zhang, Qing Xie, Daomin Ji, Zhifeng Bao, Jiachen Li, Yanchun Ma, Yongjian Liu Updated 2026-02-09

Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.

UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science Jie Zhang, Xingtong Yu, Yuan Fang, Rudi Stouffs, Zdravko Trivic Updated 2026-02-09

Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks -- including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built upon Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and over 30% and 22% gains respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.

SDR-CIR: Semantic Debias Retrieval Framework for Training-Free Zero-Shot Composed Image Retrieval Yi Sun, Jinyu Xu, Qing Xie, Jiachen Li, Yanchun Ma, Yongjian Liu Updated 2026-02-05

Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and modification text. Recent training-free zero-shot methods often employ Multimodal Large Language Models (MLLMs) with Chain-of-Thought (CoT) to compose a target image description for retrieval. However, due to the fuzzy matching nature of ZS-CIR, the generated description is prone to semantic bias relative to the target image. We propose SDR-CIR, a training-free Semantic Debias Ranking method based on CoT reasoning. First, Selective CoT guides the MLLM to extract visual content relevant to the modification text during image understanding, thereby reducing visual noise at the source. We then introduce a Semantic Debias Ranking with two steps, Anchor and Debias, to mitigate semantic bias. In the Anchor step, we fuse reference image features with target description features to reinforce useful semantics and supplement omitted cues. In the Debias step, we explicitly model the visual semantic contribution of the reference image to the description and incorporate it into the similarity score as a penalty term. By supplementing omitted cues while suppressing redundancy, SDR-CIR mitigates semantic bias and improves retrieval performance. Experiments on three standard CIR benchmarks show that SDR-CIR achieves state-of-the-art results among one-stage methods while maintaining high efficiency. The code is publicly available at https://github.com/suny105/SDR-CIR.

SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation David F. Ramirez, Tim Overman, Kristen Jaskie, Joe Marvin, Andreas Spanias Updated 2026-02-04

We present a visual-context image retrieval-augmented generation (ImageRAG) assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR). SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples with known true target types, our SAR-RAG system can compare similar vehicle categories, achieving improved ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition Dhyey Manish Rajani, Michael Milford, Tobias Fischer Updated 2026-02-04

Visual Place Recognition (VPR) is a key component for localisation in GNSS-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that, given a user-defined precision requirement, automatically selects the operating point of a VPR system to maximise recall. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets, making the method robust to sampling variability. Experiments with multiple state-of-the-art VPR techniques and datasets show that the proposed approach consistently outperforms the state-of-the-art, delivering up to 25% higher recall in high-precision operating regimes. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code will be released upon acceptance.
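
One plausible reading of the quantile-transfer idea is sketched below: pick the lowest threshold that meets the precision requirement on the calibration traversal, express it as a quantile of the calibration score distribution, and read off the same quantile from the deployment scores. This is a hedged illustration rather than the authors' code. All function names and the toy scores are assumptions, and the sketch assumes distinct similarity scores.

```python
from bisect import bisect_left

def pick_threshold(scores, is_correct, target_precision):
    """Lowest calibration threshold whose accepted matches still meet the
    precision target (accepting more matches means higher recall)."""
    pairs = sorted(zip(scores, is_correct), reverse=True)
    best = None
    true_positives = 0
    for accepted, (score, ok) in enumerate(pairs, start=1):
        true_positives += ok
        if true_positives / accepted >= target_precision:
            best = score
    return best

def score_to_quantile(sorted_scores, threshold):
    """Fraction of scores strictly below the threshold."""
    return bisect_left(sorted_scores, threshold) / len(sorted_scores)

def quantile_to_score(sorted_scores, q):
    """Score sitting at quantile q of a sorted score list."""
    idx = min(int(q * len(sorted_scores)), len(sorted_scores) - 1)
    return sorted_scores[idx]

# Calibration traversal with known correspondences (toy values).
calib_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
calib_ok = [1, 1, 1, 0, 1, 0, 0, 0]
threshold = pick_threshold(calib_scores, calib_ok, target_precision=0.75)
q = score_to_quantile(sorted(calib_scores), threshold)

# Deployment scores follow a shifted distribution; reuse the quantile, not the raw value.
deploy_scores = sorted([0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45])
deploy_threshold = quantile_to_score(deploy_scores, q)
print(threshold, q, deploy_threshold)
```

Reusing the raw calibration threshold (0.5) would reject every deployment match in this toy, whereas the transferred quantile yields a threshold that sits at the same relative position in the new score distribution.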

Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement Zipeng Zhu, Zhanghao Hu, Qinglin Zhu, Yuxi Hong, Yijun Liu, Jingyong Su, Yulan He, Lin Gui Updated 2026-02-04

Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.

Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang Updated 2026-02-04

Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.

LaVPR: Benchmarking Language and Vision for Place Recognition Ofer Idan, Dan Badur, Yosi Keller, Yoli Shavit Updated 2026-02-03

Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.

ObjEmbed: Towards Universal Multimodal Object Embeddings Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng Updated 2026-02-03

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
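
As a small illustration of how a semantic score and a predicted localization quality might be combined, the sketch below multiplies the two. The abstract does not state the combination rule, so the product used here is an assumption, as are the function names and the toy numbers.

```python
def match_score(semantic_sim, predicted_iou):
    """Assumed combination rule: a plain product of the two signals.
    The abstract only says the two are combined, not how."""
    return semantic_sim * predicted_iou

def best_region(regions):
    """Return the name of the region with the highest combined score."""
    return max(regions, key=lambda r: match_score(r["sim"], r["iou"]))["name"]

# Two candidate boxes for the same phrase (toy values).
regions = [
    {"name": "loose_box", "sim": 0.90, "iou": 0.50},
    {"name": "tight_box", "sim": 0.85, "iou": 0.90},
]
winner = best_region(regions)
print(winner)
```

Here the loosely localized box has the higher semantic similarity, but the tighter box wins once predicted localization quality is taken into account.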

Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss Enguang Fan Updated 2026-02-02

Loop closure detection (LCD) is a core component of simultaneous localization and mapping (SLAM): it identifies revisited places and enables pose-graph constraints that correct accumulated drift. Classic bag-of-words approaches such as DBoW are efficient but often degrade under appearance change and perceptual aliasing. In parallel, deep learning-based visual place recognition (VPR) descriptors (e.g., NetVLAD and Transformer-based models) offer stronger robustness, but their computational cost is often viewed as a barrier to real-time SLAM. In this paper, we empirically evaluate NetVLAD as an LCD module and compare it against DBoW on the KITTI dataset. We introduce a Fine-Grained Top-K precision-recall curve that better reflects LCD settings where a query may have zero or multiple valid matches. With Faiss-accelerated nearest-neighbor search, NetVLAD achieves real-time query speed while improving accuracy and robustness over DBoW, making it a practical drop-in alternative for LCD in SLAM.

ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval Tianyu Yang, ChenWei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua Updated 2026-02-02

Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.

Interacted Planes Reveal 3D Line Mapping Zeran Ke, Bin Tan, Gui-Song Xia, Yujun Shen, Nan Xue Updated 2026-02-01

3D line mapping from multi-view RGB images provides a compact and structured visual representation of scenes. We study the problem from a physical and topological perspective: a 3D line most naturally emerges as the edge of a finite 3D planar patch. We present LiP-Map, a line-plane joint optimization framework that explicitly models learnable line and planar primitives. This coupling enables accurate and detailed 3D line mapping while maintaining strong efficiency (typically completing a reconstruction in 3 to 5 minutes per scene). LiP-Map pioneers the integration of planar topology into 3D line mapping, not by imposing pairwise coplanarity constraints but by explicitly constructing interactions between plane and line primitives, thus offering a principled route toward structured reconstruction in man-made environments. On more than 100 scenes from ScanNetV2, ScanNet++, Hypersim, 7Scenes, and Tanks&Temple, LiP-Map improves both accuracy and completeness over state-of-the-art methods. Beyond line mapping quality, LiP-Map significantly advances line-assisted visual localization, establishing strong performance on 7Scenes. Our code is released at https://github.com/calmke/LiPMAP for reproducible research.

Variance & Greediness: A comparative study of metric-learning losses Donghuo Zeng, Hao Niu, Zhi Li, Masato Taya Updated 2026-01-29

Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. We introduce a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), to compare seven representative losses, i.e., Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, and CCL, across five image-retrieval datasets. Our analysis reveals that Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. In contrast, Contrastive and InfoNCE compact embeddings quickly through many small updates, accelerating convergence but potentially oversimplifying class structures. N-pair achieves a large mean separation but with uneven spacing. These insights reveal a form of efficiency-granularity trade-off and provide practical guidance: prefer Triplet/SCL when diversity preservation and hard-sample discrimination are critical, and Contrastive/InfoNCE when faster embedding compaction is desired.
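
To make the intra-/inter-class variance diagnostic concrete, here is a minimal sketch using common textbook definitions: mean squared distance of samples to their class centroid, and mean squared distance of class centroids to their overall mean. The paper's exact definitions may differ; the function names and toy embeddings are assumptions.

```python
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variance_stats(embeddings, labels):
    """Return (intra_class_variance, inter_class_variance)."""
    by_class = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_class[label].append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_class.items()}
    # Intra-class: how far samples sit from their own class centroid.
    intra = sum(
        sq_dist(vec, centroids[label]) for vec, label in zip(embeddings, labels)
    ) / len(embeddings)
    # Inter-class: how far class centroids sit from the mean of centroids.
    global_centroid = centroid(list(centroids.values()))
    inter = sum(
        sq_dist(c, global_centroid) for c in centroids.values()
    ) / len(centroids)
    return intra, inter

embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
intra, inter = variance_stats(embeddings, labels)
print(intra, inter)
```

Under these definitions, a loss that compacts each class aggressively drives the first number toward zero, while a loss that preserves within-class diversity keeps it higher for the same class separation.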

When Vision Meets Texts in Listwise Reranking Hongyi Cai Updated 2026-01-28

Recent advancements in information retrieval have highlighted the potential of integrating visual and textual information, yet effective reranking for image-text documents remains challenging due to the modality gap and scarcity of aligned datasets. Meanwhile, existing approaches often rely on large models (7B to 32B parameters) with reasoning-based distillation, incurring unnecessary computational overhead while primarily focusing on textual modalities. In this paper, we propose Rank-Nexus, a multimodal image-text document reranker that performs listwise qualitative reranking on retrieved lists incorporating both images and texts. To bridge the modality gap, we introduce a progressive cross-modal training strategy. We first train modalities separately: leveraging abundant text reranking data, we distill knowledge into the text branch. For images, where data is scarce, we construct distilled pairs from multimodal large language model (MLLM) captions on image retrieval benchmarks. Subsequently, we distill a joint image-text reranking dataset. Rank-Nexus achieves outstanding performance on text reranking benchmarks (TREC, BEIR) and the challenging image reranking benchmarks (INQUIRE, MMDocIR), using only a lightweight 2B pretrained visual-language model. This efficient design ensures strong generalization across diverse multimodal scenarios without excessive parameters or reasoning overhead.

Eliminating Hallucination in Diffusion-Augmented Interactive Text-to-Image Retrieval Zhuocheng Zhang, Kangheng Liang, Guanxuan Li, Paul Henderson, Richard Mccreadie, Zijun Long Updated 2026-01-28

Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is a promising paradigm that improves retrieval performance by generating query images via diffusion models and using them as additional "views" of the user's intent. However, these generative views can be incorrect because diffusion generation may introduce hallucinated visual cues that conflict with the original query text. Indeed, we empirically demonstrate that these hallucinated cues can substantially degrade DAI-TIR performance. To address this, we propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a hallucination-robust training framework that casts DAI-TIR as joint optimization over representations of query intent and the target image. DMCL introduces semantic-consistency and diffusion-aware contrastive objectives to align textual and diffusion-generated query views while suppressing hallucinated query signals. This yields an encoder that acts as a semantic filter, effectively mapping hallucinated cues into a null space, improving robustness to spurious cues and better representing the user's intent. Attention visualization and geometric embedding-space analyses corroborate this filtering behavior. Across five standard benchmarks, DMCL delivers consistent improvements in multi-round Hits@10, reaching as high as 7.37% over prior fine-tuned and zero-shot baselines, which indicates it is a general and robust training framework for DAI-TIR.

VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction Dominic Maggio, Luca Carlone Updated 2026-01-27

We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system which substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. Firstly, we remove high-dimensional 15-degree-of-freedom drift and planar degeneracy from VGGT-SLAM by creating a new factor graph design while still addressing the reconstruction ambiguity of VGGT given unknown camera intrinsics. Secondly, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image retrieval verification for free without additional training, which both enables rejecting false-positive matches and allows completing more loop closures. Finally, we conduct a suite of experiments which includes showing that VGGT-SLAM 2.0 can easily be adapted for open-set object detection and demonstrating real-time performance while running online onboard a ground robot using a Jetson Thor. We also test in environments ranging from cluttered indoor apartments and office scenes to a 4,200 square foot barn, and we demonstrate that VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.

Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong Updated 2026-01-27

Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.

X-Aligner: Composed Visual Retrieval without the Bells and Whistles Yuqian Zheng, Mariana-Iuliana Georgescu Updated 2026-01-23

Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module, X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family architectures, namely BLIP and BLIP-2, and train it on the Webvid-CoVR data set. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) data sets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu Updated 2026-01-22

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning Haomiao Tang, Jinpeng Wang, Minyi Zhao, Guanghao Meng, Ruisheng Luo, Long Chen, Shu-Tao Xia Updated 2026-01-22

Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.

Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration Xinyuan Zhang, Lina Zhang, Lisung Chen, Guangyao Liu, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang Updated 2026-01-21

Multimodal retrieval systems typically employ Vision Language Models (VLMs) that encode images and text independently into vectors within a shared embedding space. Despite incorporating text encoders, VLMs consistently underperform specialized text models on text-only retrieval tasks. Moreover, introducing additional text encoders increases storage and inference overhead and exacerbates retrieval inefficiencies, especially in multilingual settings. To address these limitations, we propose a multi-task learning framework that unifies the feature representation across images, long and short texts, and intent-rich queries. To our knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and natural language understanding (NLU) tasks within a single framework. Our approach integrates image and text retrieval with a shared text encoder that is enhanced by NLU features for intent understanding and retrieval accuracy.

LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval Chao Gao, Siqiao Xue, Yimin Peng, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou Updated 2026-01-21

In this paper, we present LookBench (We use the term "look" to reflect retrieval that mirrors how people shop -- finding the exact item, a close substitute, or a visually consistent alternative.), a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval across. Our experiments reveal that LookBench poses a significant challenge to strong baselines, with many models achieving below 60% Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
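
For readers unfamiliar with the Recall@1 figure quoted above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes the "hit rate" convention (a query counts as correct if any relevant item appears in its top K results); the abstract does not say which convention LookBench uses, and the function name, data layout, and identifiers below are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 1) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    if not ranked:
        raise ValueError("no queries to evaluate")
    hits = 0
    for query_id, candidates in ranked.items():
        # A query counts as a hit if any of its top-k results is relevant.
        wanted = relevant.get(query_id, set())
        if any(item in wanted for item in candidates[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data with made-up identifiers (not from any real benchmark).
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
}
relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"z"}}

r1 = recall_at_k(ranked, relevant, k=1)  # only q1 hits at rank 1
r3 = recall_at_k(ranked, relevant, k=3)  # q1 and q2 hit within the top 3
```

Under this convention the score can only stay equal or grow as K increases, which is why Recall@1 is the strictest setting.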

XR: Cross-Modal Agents for Composed Image Retrieval Zhongyu Yang, Wei Pang, Yingfang Yuan Updated 2026-01-20

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.

Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration Yongcong Ye, Kai Zhang, Yanghai Zhang, Enhong Chen, Longfei Li, Jun Zhou Updated 2026-01-20

Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which involves using a pre-trained captioning model to generate multiple captions for the reference image, followed by leveraging an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations. Extensive experiments on three public datasets (CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at https://github.com/yyc6631/CVSI.

Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning Hongbo Bai, Yujin Zhou, Yile Wu, Chi-Min Chan, Pengcheng Wen, Kunhao Pan, Sirui Han, Yike Guo Updated 2026-01-20

Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models for further exploration soon.

DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition Hanyu Zhu, Zhihao Zhan, Yuhang Ming, Liang Li, Dibo Hou, Javier Civera, Wanzeng Kong Updated 2026-01-19

One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs and robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query--residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, resulting in improved stability and the preservation of fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.

SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM Xulei Shi, Maoyu Wang, Yuning Peng, Guanbo Wang, Xin Wang, Qi Chen, Pengjie Tao Updated 2026-01-17

Image retrieval is a critical step for alleviating the quadratic complexity of image matching in unconstrained Structure-from-Motion (SfM). However, in this context, image retrieval typically focuses more on geometrically matchable image pairs than on semantically similar ones, a nuance that most existing deep learning-based methods guided by batched binaries (overlapping vs. non-overlapping pairs) fail to capture. In this paper, we introduce SupScene, a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for SfM. First, to better underline co-visible regions, we employ a subgraph-based training strategy that moves beyond equally important isolated pairs, leveraging ground-truth geometric overlapping relationships with various weights to provide fine-grained supervision via a soft supervised contrastive loss. Second, we introduce DiVLAD, a DINO-inspired VLAD aggregator that leverages the inherent multi-head attention maps from the last block of ViT. A learnable gating mechanism is then designed to adaptively combine these semantically salient cues with visual features, enabling a more discriminative global descriptor. Extensive experiments on the GL3D dataset demonstrate that our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters. Furthermore, we show that the proposed training strategy brings consistent gains across different aggregation techniques. Code and models are available at https://anonymous.4open.science/r/SupScene-5B73.

Simple Models, Rich Representations: Visual Decoding from Primate Intracortical Neural Signals Matteo Ciferri, Matteo Ferrante, Nicola Toschi Updated 2026-01-16

Understanding how neural activity gives rise to perception is a central challenge in neuroscience. We address the problem of decoding visual information from high-density intracortical recordings in primates, using the THINGS Ventral Stream Spiking Dataset. We systematically evaluate the effects of model architecture, training objectives, and data scaling on decoding performance. Results show that decoding accuracy is mainly driven by modeling temporal dynamics in neural signals, rather than architectural complexity. A simple model combining temporal attention with a shallow MLP achieves up to 70% top-1 image retrieval accuracy, outperforming linear baselines as well as recurrent and convolutional approaches. Scaling analyses reveal predictable diminishing returns with increasing input dimensionality and dataset size. Building on these findings, we design a modular generative decoding pipeline that combines low-resolution latent reconstruction with semantically conditioned diffusion, generating plausible images from 200 ms of brain activity. This framework provides principles for brain-computer interfaces and semantic neural decoding.

Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text Piyush Singh Pasi Updated 2026-01-15

Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9 percent Recall at 10) and achieves strong zero-shot transfer (89.5 percent Recall at 10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints at https://github.com/m2m-codebase/M2M , as well as multilingual evaluation datasets including MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k ), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual ), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual ), to facilitate further research.

UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval Xiaoxu Ma, Runhao Li, Hanwen Liu, Xiangbo Zhang, Zhenyu Weng Updated 2026-01-14

Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.

Hybrid guided variational autoencoder for visual place recognition Ni Wang, Zihan You, Emre Neftci, Thorben Schoepe Updated 2026-01-14

Autonomous agents such as cars, robots and drones need to precisely localize themselves in diverse environments, including in GPS-denied indoor environments. One approach for precise localization is visual place recognition (VPR), which estimates the place of an image based on previously seen places. State-of-the-art VPR models require high amounts of memory, making them unwieldy for mobile deployment, while more compact models lack robustness and generalization capabilities. This work overcomes these limitations for robotics using a combination of event-based vision sensors and a novel event-based guided variational autoencoder (VAE). The encoder part of our model is based on a spiking neural network model which is compatible with power-efficient low latency neuromorphic hardware. The VAE successfully disentangles the visual features of 16 distinct places in our new indoor VPR dataset with a classification performance comparable to other state-of-the-art approaches, while also showing robust performance under various illumination conditions. When tested with novel visual inputs from unknown scenes, our model can distinguish between these places, which demonstrates a high generalization capability by learning the essential features of location. Our compact and robust guided VAE with generalization capabilities is a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.

Keyframe-based Dense Mapping with the Graph of View-Dependent Local Maps Krzysztof Zielinski, Dominik Belter Updated 2026-01-13

In this article, we propose a new keyframe-based mapping system. The proposed method updates local Normal Distribution Transform maps (NDT) using data from an RGB-D sensor. The cells of the NDT are stored in 2D view-dependent structures to better utilize the properties and uncertainty model of RGB-D cameras. This method naturally represents an object closer to the camera origin with higher precision. The local maps are stored in the pose graph, which allows correcting the global map after loop closure detection. We also propose a procedure that allows merging and filtering local maps to obtain a global map of the environment. Finally, we compare our method with Octomap and NDT-OM and provide example applications of the proposed mapping method.

Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation Kang Fu, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Jun Zhao, Xiongkuo Min, Jia Wang, Guangtao Zhai Updated 2026-01-13

Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs' IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images with corresponding Mean Opinion Scores (MOSs) for the input image. These retrieved images and the input image are integrated into a specific prompt. The retrieved images provide the LMM with a visual perception anchor for the IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.

Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization Miao Pan, Wangjie Gan, Jintao Chen, Wenqi Zhang, Bing Sun, Jianwei Yin, Xuhong Zhang Updated 2026-01-13

While Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse tasks, their practical deployment is severely hindered by hallucination issues, which become particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucinations in MLLMs under RL training, identifying three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions or redundant information anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a comprehensive framework comprising three core modules. First, we enhance visual localization by introducing dedicated planning and captioning stages before the reasoning phase, employing a quality-based caption reward to ensure accurate initial anchoring. Second, to improve exploration, we categorize samples based on the mean and variance of their reward distributions, prioritizing samples with high variance to focus the model on diverse and informative data. Finally, to mitigate sample interference, we regulate NTK similarity by grouping sample pairs and applying an InfoNCE loss to push overly similar pairs apart and pull dissimilar ones closer, thereby guiding gradient interactions toward a balanced range. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.

Multi-task Cross-modal Learning for Chest X-ray Image Retrieval Zhaohui Liang, Sivaramakrishnan Rajaraman, Niccolo Marini, Zhiyun Xue, Sameer Antani Updated 2026-01-08

CLIP and BiomedCLIP are examples of vision-language foundation models and offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports using chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements to CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, demonstrating the model's enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.
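
To make term (3) of the composite loss above more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of matched image-text pairs, written in plain Python. It is not the authors' implementation: the function names are hypothetical, the temperature of 0.07 is an assumed value, and the binary cross-entropy and supervised contrastive terms, as well as the weights that combine the three terms, are not shown because the abstract does not specify them.

```python
import math
from typing import List


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_loss(sim: List[List[float]], temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over an n x n image-text similarity matrix.

    sim[i][j] is the similarity between image i and text j; matched pairs
    are assumed to sit on the diagonal.
    """
    n = len(sim)
    if n == 0 or any(len(row) != n for row in sim):
        raise ValueError("sim must be a non-empty square matrix")
    logits = [[v / temperature for v in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = _cross_entropy_diag(logits)      # rows: images
    text_to_image = _cross_entropy_diag(transposed)  # rows: texts
    return 0.5 * (image_to_text + text_to_image)


# Perfectly aligned pairs versus similarities that carry no information.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

loss_aligned = clip_loss(aligned)
loss_uniform = clip_loss(uninformative)
```

When every similarity is identical, the loss equals the natural log of the batch size, which is a convenient sanity check when wiring such a term into a larger objective.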

ImLoc: Revisiting Visual Localization with Image-based Representation Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys Updated 2026-01-07

Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at https://github.com/cvg/Hierarchical-Localization.

CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, Xiaoshuai Sun Updated 2026-01-07

Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.

BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li, Dong Yi, Hongbin Liu Updated 2026-01-07

Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM's ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang Updated 2026-01-07

Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.

Comparative Analysis of Binarization Methods For Medical Image Hashing On Odir Dataset Nedim Muzoglu Updated 2026-01-07

In this study, we evaluated four binarization methods, Locality-Sensitive Hashing (LSH), Iterative Quantization (ITQ), Kernel-based Supervised Hashing (KSH), and Supervised Discrete Hashing (SDH), on the ODIR dataset using deep feature embeddings. Experimental results show that SDH achieved the best performance, with an mAP@100 of 0.9184 using only 32-bit codes, outperforming LSH, ITQ, and KSH. Compared with prior studies, our method proved highly competitive: Fang et al. reported 0.7528 (Fundus-iSee, 48 bits) and 0.8856 (ASOCT-Cataract, 48 bits), while Wijesinghe et al. achieved 94.01 (KVASIR, 256 bits). Despite using significantly fewer bits, our SDH-based framework reached retrieval accuracy close to the state-of-the-art. These findings demonstrate that SDH is the most effective approach among those tested, offering a practical balance of accuracy, storage, and efficiency for medical image retrieval and device inventory management.
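
As an illustration of what binarizing an embedding into a 32-bit code can look like, the sketch below uses sign random projections, one common LSH variant for cosine similarity. This is an assumption for illustration only: the abstract does not say which LSH variant was evaluated, the 8-dimensional toy vectors stand in for real deep features, and all function names are hypothetical. The supervised methods (KSH, SDH) learn their projections from labels and are not shown.

```python
import random
from typing import List


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binarize(vec: List[float], planes: List[List[float]]) -> int:
    """Pack the sign of each projection into an integer hash code."""
    code = 0
    for plane in planes:
        dot = sum(v * w for v, w in zip(vec, plane))
        code = (code << 1) | (1 if dot > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


# Toy 8-dimensional "embeddings" (made up for illustration).
planes = make_hyperplanes(dim=8, n_bits=32)
x = [0.3, -1.2, 0.8, 0.1, 2.0, -0.5, 0.7, 1.1]
near = [v + 0.01 for v in x]
opposite = [-v for v in x]

code_x = binarize(x, planes)
d_same = hamming(code_x, binarize(x, planes))
d_near = hamming(code_x, binarize(near, planes))
d_opp = hamming(code_x, binarize(opposite, planes))
```

Retrieval then ranks database items by Hamming distance to the query code, which is what makes short codes attractive for storage and lookup speed.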

Loop Closure using AnyLoc Visual Place Recognition in DPV-SLAM Wenzheng Zhang, Kazuki Adachi, Yoshitaka Hara, Sousuke Nakamura Updated 2026-01-06

Loop closure is crucial for maintaining the accuracy and consistency of visual SLAM. We propose a method to improve loop closure performance in DPV-SLAM. Our approach integrates AnyLoc, a learning-based visual place recognition technique, as a replacement for the classical Bag of Visual Words (BoVW) loop detection method. In contrast to BoVW, which relies on handcrafted features, AnyLoc utilizes deep feature representations, enabling more robust image retrieval across diverse viewpoints and lighting conditions. Furthermore, we propose an adaptive mechanism that dynamically adjusts the similarity threshold based on environmental conditions, removing the need for manual tuning. Experiments on both indoor and outdoor datasets demonstrate that our method significantly outperforms the original DPV-SLAM in terms of loop closure accuracy and robustness. The proposed method offers a practical and scalable solution for enhancing loop closure performance in modern SLAM systems.

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang Updated 2026-01-05

Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
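To make the idea of a coordinate-aligned reward concrete, the sketch below computes the Haversine (great-circle) distance between a predicted and a ground-truth coordinate and maps it to a value in (0, 1]. The Haversine formula itself is standard; the exponential reward shape, the `scale_km` parameter, and the function names are assumptions for illustration and are not taken from Geo-R.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_reward(pred: tuple[float, float],
                    target: tuple[float, float],
                    scale_km: float = 750.0) -> float:
    """Hypothetical reward: 1.0 for an exact hit, decaying with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)


paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
reward = distance_reward(paris, london)
```

A smooth, distance-based reward of this kind gives partial credit for near misses, which a hard correct/incorrect label on the city or country would not.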

OCP-LS: An Efficient Algorithm for Visual Localization Jindi Zhong, Hongxia Wang, Huanshui Zhang Updated 2025-12-31

This paper proposes a novel second-order optimization algorithm. It addresses large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.

Geometric Multi-Session Map Merging with Learned Local Descriptors Yanlong Ma, Nakul S. Joshi, Christa S. Robison, Philip R. Osteen, Brett T. Lopez Updated 2025-12-30

Multi-session map merging is crucial for extended autonomous operations in large-scale environments. In this paper, we present GMLD, a learning-based local descriptor framework for large-scale multi-session point cloud map merging that systematically aligns maps collected across different sessions with overlapping regions. The proposed framework employs a keypoint-aware encoder and a plane-based geometric transformer to extract discriminative features for loop closure detection and relative pose estimation. To further improve global consistency, we include inter-session scan matching cost factors in the factor-graph optimization stage. We evaluate our framework on public datasets as well as self-collected data from diverse environments. The results show accurate and robust map merging with low error, and the learned features deliver strong performance in both loop closure detection and relative pose estimation.

Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation Guo Ye, Zexi Zhang, Xu Zhao, Shang Wu, Haoran Lu, Shihan Lu, Han Liu Updated 2025-12-29

Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, highlighting the importance of understanding physical contact for robust, touch-aware robotic agents.

MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang, Wei Wang, Hongfu Yang, Jiqing Zhan, Chunpeng Zhou, Zheng Zhou, Hao Ma, Tao Wei, Pan Zhou, Wei Chen Updated 2025-12-29

Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.

Anomaly Detection by Effectively Leveraging Synthetic Images Sungho Kang, Hyunkyu Park, Yeonho Lee, Hanbyul Lee, Mijoo Jeong, YeongHyeon Park, Injae Lee, Juneho Yi Updated 2025-12-29

Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two-stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost of data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.

UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang Updated 2025-12-28

Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub at https://github.com/dtc111111/UniPR-3D.

Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer Tianchen Deng, Wenhua Wu, Kunzhen Wu, Guangming Wang, Siting Zhu, Shenghai Yuan, Xun Chen, Guole Shen, Zhe Liu, Hesheng Wang Updated 2025-12-26

Visual localization has traditionally been formulated as a pair-wise pose regression problem. Existing approaches mainly estimate relative poses between two images and employ a late-fusion strategy to obtain absolute pose estimates. However, the late motion average is often insufficient for effectively integrating spatial information, and its accuracy degrades in complex environments. In this paper, we present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism, enabling robust operation in both structured and unstructured environments. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry, and we introduce a pose tokenizer and projection module to more effectively exploit spatial relationships from multiple database views. Furthermore, we propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention, thereby enabling real-time performance at scale. Trained on approximately eight million posed image pairs, Reloc-VGGT demonstrates strong accuracy and remarkable generalization ability. Extensive experiments across diverse public datasets consistently validate the effectiveness and efficiency of our approach, delivering high-quality camera pose estimates in real time while maintaining robustness to unseen environments. Our code and models will be publicly released upon acceptance at https://github.com/dtc111111/Reloc-VGGT.

Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen Updated 2025-12-24

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
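The first-stage candidate filter described above can be sketched with a plain BM25 scorer over entity tokens. This is a minimal stand-alone version, assuming captions have already been reduced to lists of entity strings; the `k1` and `b` defaults, the smoothed IDF variant, and the function names are assumptions for illustration, and the second-stage BEiT-3 reranker is deliberately left out.

```python
import math
from collections import Counter


def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays non-negative for common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_candidates(query_terms: list[str], docs: list[list[str]], k: int) -> list[int]:
    """Indices of the k best-scoring documents, dropping zero-score ones."""
    scores = bm25_scores(query_terms, docs)
    order = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    return [i for i in order[:k] if scores[i] > 0]


# Hypothetical entity lists extracted from three captions.
captions = [
    ["flood", "hanoi", "2020"],
    ["football", "match", "hanoi"],
    ["flood", "rescue", "boat", "flood"],
]
candidates = top_candidates(["flood", "hanoi"], captions, k=2)
```

Only the surviving candidates would then be passed to the heavier multimodal model, which is where the efficiency of a two-stage design comes from.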

Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints Youjin Jung, Seongwoo Cho, Hyun-seok Min, Sungchul Choi Updated 2025-12-23

Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. While Zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, tending to dilute key information and failing to account for what they wish to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (LLMs) to extract two complementary constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid) constraints. These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. We then guide multimodal LLMs to rewrite the modification text to focus on one target, while referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying ambiguity levels. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.
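The reward-or-penalize re-ranking idea can be illustrated with a simple additive score adjustment. The sketch below assumes each candidate already has a base retrieval score plus a similarity to the must-have text and to the must-avoid text; the linear combination, the `alpha` and `beta` weights, and all names and numbers are assumptions for this example rather than SoFT's actual scoring rule.

```python
def soft_filter_rerank(base: dict[str, float],
                       must_have: dict[str, float],
                       must_avoid: dict[str, float],
                       alpha: float = 0.5,
                       beta: float = 0.5) -> list[str]:
    """Re-rank candidates: reward must-have similarity, penalize must-avoid.

    All arguments map a candidate image id to a score; missing ids count as 0.
    The base retriever is left untouched, only its output order changes.
    """
    adjusted = {
        image_id: score
        + alpha * must_have.get(image_id, 0.0)
        - beta * must_avoid.get(image_id, 0.0)
        for image_id, score in base.items()
    }
    # Highest adjusted score first; ties broken by id for determinism.
    return sorted(adjusted, key=lambda image_id: (-adjusted[image_id], image_id))


# Hypothetical scores for three candidate images.
base_scores = {"a": 0.80, "b": 0.78, "c": 0.60}
prescriptive = {"a": 0.2, "b": 0.9, "c": 0.5}
proscriptive = {"a": 0.7, "b": 0.1, "c": 0.0}
ranking = soft_filter_rerank(base_scores, prescriptive, proscriptive)
```

In this toy case image "a" leads on the base score alone but drops to last place once its strong match to the must-avoid constraint is penalized.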

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark Hao Guo et.al. Updated 2025-12-23

Abstract unavailable in cached data. It will appear after the next refresh.

Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis Argha Kamal Samanta et.al. Updated 2025-12-22

Abstract unavailable in cached data. It will appear after the next refresh.

Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation Connor Kilrain et.al. Updated 2025-12-22

Abstract unavailable in cached data. It will appear after the next refresh.

Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments Saeideh Yousefzadeh et.al. Updated 2025-12-21

Abstract unavailable in cached data. It will appear after the next refresh.

Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval Dimitrios Georgoulopoulos et.al. Updated 2025-12-20

Abstract unavailable in cached data. It will appear after the next refresh.

Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors Son Tung Nguyen et.al. Updated 2025-12-19

Abstract unavailable in cached data. It will appear after the next refresh.

The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining Jasmine Vu et.al. Updated 2025-12-18

Abstract unavailable in cached data. It will appear after the next refresh.

MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval Amna Amir et.al. Updated 2025-12-18

Abstract unavailable in cached data. It will appear after the next refresh.

CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer Xianwei Cao et.al. Updated 2025-12-16

Abstract unavailable in cached data. It will appear after the next refresh.

Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries Emanuele Mezzi et.al. Updated 2025-12-16

Abstract unavailable in cached data. It will appear after the next refresh.

Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing Jaeyoon Kim et.al. Updated 2025-12-15

Abstract unavailable in cached data. It will appear after the next refresh.

Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching Wonseok Choi et.al. Updated 2025-12-14

Abstract unavailable in cached data. It will appear after the next refresh.

Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval J. Xiao et.al. Updated 2025-12-11

Abstract unavailable in cached data. It will appear after the next refresh.

YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos Ryan Meegan et.al. Updated 2025-12-10

Abstract unavailable in cached data. It will appear after the next refresh.

Adaptive Thresholding for Visual Place Recognition using Negative Gaussian Mixture Statistics Nick Trinh et.al. Updated 2025-12-09

Abstract unavailable in cached data. It will appear after the next refresh.

Generalized Referring Expression Segmentation on Aerial Photos Luís Marnoto et.al. Updated 2025-12-08

Abstract unavailable in cached data. It will appear after the next refresh.

Spatial Retrieval Augmented Autonomous Driving Xiaosong Jia et.al. Updated 2025-12-07

Abstract unavailable in cached data. It will appear after the next refresh.

Language-driven Fine-grained Retrieval Shijie Wang et.al. Updated 2025-12-06

Abstract unavailable in cached data. It will appear after the next refresh.

GuideNav: User-Informed Development of a Vision-Only Robotic Navigation Assistant For Blind Travelers Hochul Hwang et.al. Updated 2025-12-05

Abstract unavailable in cached data. It will appear after the next refresh.

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning Shengyuan Ding et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark Haobo Yuan et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding Abhigyan Bhattacharya et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

Revealing stimulus-dependent dynamics through statistical complexity Edson V. de Paula et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

Influence of Object Affordance on Action Language Understanding: Evidence from Dynamic Causal Modeling Analysis Supriya Bordoloi et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging Zhijian Shu et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

Terahertz Fourier Ptychographic Imaging Pitambar Mukherjee et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

TEMPO-VINE: A Multi-Temporal Sensor Fusion Dataset for Localization and Mapping in Vineyards Mauro Martini et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

MemLoRA: Distilling Expert Adapters for On-Device Memory Systems Massimo Bini et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

Spectral micro-CT for quantitative analysis of calcification in fibrocartilage Vittoria Mazzini et.al. Updated 2025-12-04

Abstract unavailable in cached data. It will appear after the next refresh.

HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval Zhiwei Chen et.al. Updated 2025-12-02

Abstract unavailable in cached data. It will appear after the next refresh.

GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization Zixuan Song et.al. Updated 2025-12-02

Abstract unavailable in cached data. It will appear after the next refresh.

Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval Xin Wang et.al. Updated 2025-12-01

Abstract unavailable in cached data. It will appear after the next refresh.

Winning Solutions for the Rayan AI Contest: Compositional Retrieval, Zero-Shot Anomaly Detection, and Backdoor Detection Ali Nafisi et.al. Updated 2025-12-01

Abstract unavailable in cached data. It will appear after the next refresh.

MARVO: Marine-Adaptive Radiance-aware Visual Odometry Sacchin Sundar et.al. Updated 2025-11-28

Abstract unavailable in cached data. It will appear after the next refresh.

UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries Hoang-Bao Le et.al. Updated 2025-11-27

Abstract unavailable in cached data. It will appear after the next refresh.

Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models Naifu Zhang et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Fast 3D Ultrasound Localization Microscopy via Projection-based Processing Framework Jingke Zhang et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Qwen3-VL Technical Report Shuai Bai et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy Teng Hu et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

FITRep: Attention-Guided Item Representation via MLLMs Guoxiao Zhang et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning Xin Gu et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

HTTM: Head-wise Temporal Token Merging for Faster VGGT Weitian Wang et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Low-dose Chemically Specific Bioimaging via Deep-UV Lensless Holographic Microscopy on a Standard Camera Piotr Arcab et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Adaptive Lighting Control in Visible Light Systems: An Integrated Sensing, Communication, and Illumination Framework Xinyan Xie et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition Baoli Sun et.al. Updated 2025-11-26

Abstract unavailable in cached data. It will appear after the next refresh.

Wigner and Gabor phase-space analysis of propagators for evolution equations Elena Cordero et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments Jorge Ortigoso-Narro et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

In-vivo imaging with a low-cost MRI scanner and cloud data processing in low-resource settings Teresa Guallart-Naval et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

Can Modern Vision Models Understand the Difference Between an Object and a Look-alike? Itay Cohen et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation Moazzam Umer Gondal et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

Graph-based 3D Human Pose Estimation using WiFi Signals Jichao Chen et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach Fan Nie et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

LAA3D: A Benchmark of Detecting and Tracking Low-Altitude Aircraft in 3D Space Hai Wu et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

Multi-Agent Monocular Dense SLAM With 3D Reconstruction Priors Haihang Wu et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting Qiyang Yu et.al. Updated 2025-11-24

Abstract unavailable in cached data. It will appear after the next refresh.

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization Yikun Wang et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

First Frame Is the Place to Go for Video Content Customization Jingxi Chen et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning Tao Hu et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

Multi-Text Guided Few-Shot Semantic Segmentation Qiang Jiao et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome Dabin Jeong et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation Linyin Luo et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

The Empowerment of Science of Science by Large Language Models: New Tools and Methods Guoqiang Liang et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models Nayoung Oh et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval Qing Wang et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation Jin Wang et.al. Updated 2025-11-19

Abstract unavailable in cached data. It will appear after the next refresh.

Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments Laura Alejandra Encinar Gonzalez et.al. Updated 2025-11-07

Abstract unavailable in cached data. It will appear after the next refresh.

DAFM: Dynamic Adaptive Fusion for Multi-Model Collaboration in Composed Image Retrieval Yawei Cai et.al. Updated 2025-11-07

Abstract unavailable in cached data. It will appear after the next refresh.

Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA Itbaan Safwan et.al. Updated 2025-11-06

Abstract unavailable in cached data. It will appear after the next refresh.

An Efficient Algorithm for Learning-Based Visual Localization Jindi Zhong et.al. Updated 2025-11-06

Abstract unavailable in cached data. It will appear after the next refresh.

Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization Tao Liu et.al. Updated 2025-11-04

LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment Rohan Wandre et.al. Updated 2025-11-04

SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment Xinyu Mao et.al. Updated 2025-11-03

Evaluating Perspectival Biases in Cross-Modal Retrieval Teerapol Saengsukhiran et.al. Updated 2025-11-03

Dynamic Multi-level Weighted Alignment Network for Zero-shot Sketch-based Image Retrieval Hanwen Su et.al. Updated 2025-11-02

Multi-Mapcher: Loop Closure Detection-Free Heterogeneous LiDAR Multi-Session SLAM Leveraging Outlier-Robust Registration for Autonomous Vehicles Hyungtae Lim et.al. Updated 2025-11-01

Approximate Diverse $k$-nearest Neighbor Search in Vector Database Jiachen Zhao et.al. Updated 2025-10-31

Scaling Image Geo-Localization to Continent Level Philipp Lindenberger et.al. Updated 2025-10-30

Instance-Level Composed Image Retrieval Bill Psomas et.al. Updated 2025-10-29

DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts Binbin Li et.al. Updated 2025-10-28

Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment Hongyi Wang et.al. Updated 2025-10-27

Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models Yang Zhang et.al. Updated 2025-10-26

TWC-SLAM: Multi-Agent Cooperative SLAM with Text Semantics and WiFi Features Integration for Similar Indoor Environments Chunyu Li et.al. Updated 2025-10-26

Cross-view Localization and Synthesis -- Datasets, Challenges and Opportunities Ningli Xu et.al. Updated 2025-10-26

STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models Mahiro Ukai et.al. Updated 2025-10-26

Bag-of-Word-Groups (BoWG): A Robust and Efficient Loop Closure Detection Method Under Perceptual Aliasing Xiang Fei et.al. Updated 2025-10-26

BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models Ziheng Zhang et.al. Updated 2025-10-24

Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection Ji Du et.al. Updated 2025-10-21

ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization Yuanhe Guo et.al. Updated 2025-10-21

DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing Luxuan Li et.al. Updated 2025-10-21

Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition Timur Ismagilov et.al. Updated 2025-10-20

Small Language Models Offer Significant Potential for Science Community Jian Zhang et.al. Updated 2025-10-18

Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval Keima Abe et.al. Updated 2025-10-16

Through the Lens of Doubt: Robust and Efficient Uncertainty Estimation for Visual Place Recognition Emily Miller et.al. Updated 2025-10-15

Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval Eric He et.al. Updated 2025-10-13

Hierarchical Scheduling for Multi-Vector Image Retrieval Maoliang Li et.al. Updated 2025-10-10

DarkHash: A Data-Free Backdoor Attack Against Deep Hashing Ziqi Zhou et.al. Updated 2025-10-09

CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning Weihuang Lin et.al. Updated 2025-10-09

Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision Xiaoxu Ma et.al. Updated 2025-10-09

Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Retrieval Didrik Bergström et.al. Updated 2025-10-08

CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval Bin Kang et.al. Updated 2025-10-07

Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy" Bruno Korbar et.al. Updated 2025-10-06

Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition Yu Kiu et.al. Updated 2025-10-05

The Overlooked Value of Test-time Reference Sets in Visual Place Recognition Mubariz Zaffar et.al. Updated 2025-10-04

Novel UWB Synthetic Aperture Radar Imaging for Mobile Robot Mapping Charith Premachandra et.al. Updated 2025-10-03

Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation -- Technical Report for IROS 2025 RoboSense Challenge Track 4 Lingfeng Zhang et.al. Updated 2025-10-03

EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory Jiahao Wang et.al. Updated 2025-10-01

A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features Axel Barroso-Laguna et.al. Updated 2025-10-01

Semantic Visual Simultaneous Localization and Mapping: A Survey on State of the Art, Challenges, and Future Directions Thanh Nguyen Canh et.al. Updated 2025-10-01

Video Object Segmentation-Aware Audio Generation Ilpo Viertola et.al. Updated 2025-09-30

SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval Ren-Di Wu et.al. Updated 2025-09-30

SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval Yuqi Xiao et.al. Updated 2025-09-30

SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition Shunpeng Chen et.al. Updated 2025-09-30

Robust Visual Localization in Compute-Constrained Environments by Salient Edge Rendering and Weighted Hamming Similarity Tu-Hoa Pham et.al. Updated 2025-09-29

Performance-Efficiency Trade-off for Fashion Image Retrieval Julio Hurtado et.al. Updated 2025-09-29

Prepare for Warp Speed: Sub-millisecond Visual Place Recognition Using Event Cameras Vignesh Ramanathan et.al. Updated 2025-09-28

Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation Jinpeng Lu et.al. Updated 2025-09-26

Efficient Multimodal Dataset Distillation via Generative Models Zhenghao Zhao et.al. Updated 2025-09-25

A Versatile Foundation Model for AI-enabled Mammogram Interpretation Fuxiang Huang et.al. Updated 2025-09-24

SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment Binod Singh et.al. Updated 2025-09-23

Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions Ioanna Ntinou et.al. Updated 2025-09-23

OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata Oussema Dhaouadi et.al. Updated 2025-09-22

Learning Attribute-Aware Hash Codes for Fine-Grained Image Retrieval via Query Optimization Peng Wang et.al. Updated 2025-09-21

SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models Thong Nguyen et.al. Updated 2025-09-18

PRISM: Product Retrieval In Shopping Carts using Hybrid Matching Arda Kabadayi et.al. Updated 2025-09-18

Chain-of-Thought Re-ranking for Image Retrieval Tasks Shangrong Wu et.al. Updated 2025-09-18

DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising Li Gao et.al. Updated 2025-09-18

Event-LAB: Towards Standardized Evaluation of Neuromorphic Localization Methods Adam D. Hines et.al. Updated 2025-09-18

Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models Ilyass Moummad et.al. Updated 2025-09-17

CSMoE: An Efficient Remote Sensing Foundation Model with Soft Mixture-of-Experts Leonard Hackel et.al. Updated 2025-09-17

DiffHash: Text-Guided Targeted Attack via Diffusion Models against Deep Hashing Image Retrieval Zechao Liu et.al. Updated 2025-09-17

Semantic-Enhanced Cross-Modal Place Recognition for Robust Robot Localization Yujia Lin et.al. Updated 2025-09-16

MapAnything: Universal Feed-Forward Metric 3D Reconstruction Nikhil Keetha et.al. Updated 2025-09-16

Bridging Vision Language Models and Symbolic Grounding for Video Question Answering Haodi Ma et.al. Updated 2025-09-15

Listening for "You": Enhancing Speech Image Retrieval via Target Speaker Extraction Wenhao Yang et.al. Updated 2025-09-11

Aerial-ground Cross-modal Localization: Dataset, Ground-truth, and Benchmark Yandi Yang et.al. Updated 2025-09-09

Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval Emil Demić et.al. Updated 2025-09-08

Towards an Accurate and Effective Robot Vision (The Problem of Topological Localization for Mobile Robots) Emanuela Boros et.al. Updated 2025-09-05

FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph Zhangding Liu et.al. Updated 2025-09-05

Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking Dror Aiger et.al. Updated 2025-09-05

DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval Ruohong Yang et.al. Updated 2025-09-04

Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time Jintao Cheng et.al. Updated 2025-09-02

Ensemble-Based Event Camera Place Recognition Under Varying Illumination Therese Joseph et.al. Updated 2025-09-02

M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision Che Liu et.al. Updated 2025-09-01

ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization Thinh-Phuc Nguyen et.al. Updated 2025-09-01

FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval Jeong-Woo Park et.al. Updated 2025-07-17

MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval Jeong-Woo Park et.al. Updated 2025-07-17

QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval Jaehyun Kwak et.al. Updated 2025-07-16

CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning Peiwen Xia et.al. Updated 2025-07-16

GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space David G. Shatwell et.al. Updated 2025-07-14

Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources Daniele Rege Cambrin et.al. Updated 2025-07-14

Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures Xinlong Ding et.al. Updated 2025-07-14

RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features Inye Na et.al. Updated 2025-07-11

LiDAR, GNSS and IMU Sensor Alignment through Dynamic Time Warping to Construct 3D City Maps Haitian Wang et.al. Updated 2025-07-11

Deep Hashing with Semantic Hash Centers for Image Retrieval Li Chen et.al. Updated 2025-07-11

SCREP: Scene Coordinate Regression and Evidential Learning-based Perception-Aware Trajectory Generation Juyeop Han et.al. Updated 2025-07-10

VP-SelDoA: Visual-prompted Selective DoA Estimation of Target Sound via Semantic-Spatial Matching Yu Chen et.al. Updated 2025-07-10

Evaluating Attribute Confusion in Fashion Text-to-Image Generation Ziyue Liu et.al. Updated 2025-07-09

MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval Naoya Sogi et.al. Updated 2025-07-09

Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval Haiwen Li et.al. Updated 2025-07-08

OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval Zhiwei Chen et.al. Updated 2025-07-08

What's Making That Sound Right Now? Video-centric Audio-Visual Localization Hahyeon Choi et.al. Updated 2025-07-08

Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model Mengyao Xu et.al. Updated 2025-07-07

An analysis of vision-language models for fabric retrieval Francesco Giuliari et.al. Updated 2025-07-07

Simultaneous Localization and Mapping Using Active mmWave Sensing in 5G NR Tao Du et.al. Updated 2025-07-07

U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration Xiaofan Li et.al. Updated 2025-07-06

Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition Jiuhong Xiao et.al. Updated 2025-07-04

LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment Juelin Zhu et.al. Updated 2025-07-01

Utilizing a Novel Deep Learning Method for Scene Categorization in Remote Sensing Data Ghufran A. Omran et.al. Updated 2025-06-28

Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval Li-Cheng Shen et.al. Updated 2025-06-28

MatChA: Cross-Algorithm Matching with Feature Augmentation Paula Carbó Cubero et.al. Updated 2025-06-27

OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography Caoshuo Li et.al. Updated 2025-06-26

Referring Expression Instance Retrieval and A Strong End-to-End Baseline Xiangzhao Hao et.al. Updated 2025-06-26

Visualizing intercalation effects in 2D materials using AFM based techniques Karmen Kapustić et.al. Updated 2025-06-25

On the Burstiness of Faces in Set Jiong Wang et.al. Updated 2025-06-25

jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval Michael Günther et.al. Updated 2025-06-24

Class Agnostic Instance-level Descriptor for Visual Instance Search Qi-Ying Sun et.al. Updated 2025-06-20

MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval Chao He et.al. Updated 2025-06-19

Fine-grained Image Retrieval via Dual-Vision Adaptation Xin Jiang et.al. Updated 2025-06-19

Adversarial Attacks and Detection in Visual Place Recognition for Safer Robot Navigation Connor Malone et.al. Updated 2025-06-19

Semantic and Feature Guided Uncertainty Quantification of Visual Localization for Autonomous Vehicles Qiyuan Wu et.al. Updated 2025-06-18

ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections Ziling Huang et.al. Updated 2025-06-18

HARMONY: A Scalable Distributed Vector Database for High-Throughput Approximate Nearest Neighbor Search Qian Xu et.al. Updated 2025-06-17

TACS-Graphs: Traversability-Aware Consistent Scene Graphs for Ground Robot Indoor Localization and Mapping Jeewon Kim et.al. Updated 2025-06-17

Hierarchical Multi-Positive Contrastive Learning for Patent Image Retrieval Kshitij Kavimandan et.al. Updated 2025-06-17

A Semantically-Aware Relevance Measure for Content-Based Medical Image Retrieval Evaluation Xiaoyang Wei et.al. Updated 2025-06-16

EmbodiedPlace: Learning Mixture-of-Features with Embodied Constraints for Visual Place Recognition Bingxi Liu et.al. Updated 2025-06-16

SuperPlace: The Renaissance of Classical Feature Aggregation for Visual Place Recognition in the Era of Foundation Models Bingxi Liu et.al. Updated 2025-06-16

Feature Complementation Architecture for Visual Place Recognition Weiwei Wang et.al. Updated 2025-06-14

Towards a general-purpose foundation model for fMRI analysis Cheng Wang et.al. Updated 2025-06-11

Improving Personalized Search with Regularized Low-Rank Parameter Updates Fiona Ryan et.al. Updated 2025-06-11

Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints Xiangkai Zhang et.al. Updated 2025-06-11

Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment Tianyu Chen et.al. Updated 2025-06-10

Robust Visual Localization via Semantic-Guided Multi-Scale Transformer Zhongtao Tian et.al. Updated 2025-06-10

Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs Yikun Ji et.al. Updated 2025-06-08

Zero Shot Composed Image Retrieval Santhosh Kakarla et.al. Updated 2025-06-07

GenIR: Generative Visual Feedback for Mental Image Retrieval Diji Yang et.al. Updated 2025-06-06

Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning Sheng Chen et.al. Updated 2025-06-06

HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition Suhan Woo et.al. Updated 2025-06-05

Deep Learning Reforms Image Matching: A Survey and Outlook Shihua Zhang et.al. Updated 2025-06-05

Entity Image and Mixed-Modal Image Retrieval Datasets Cristian-Ioan Blaga et.al. Updated 2025-06-02

Quantization-based Bounds on the Wasserstein Metric Jonathan Bobrutsky et.al. Updated 2025-06-01

SORCE: Small Object Retrieval in Complex Environments Chunxu Liu et.al. Updated 2025-05-30

Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch Aneeshan Sain et.al. Updated 2025-05-29

4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians Hidenobu Matsuki et.al. Updated 2025-05-28

UAVPairs: A Challenging Benchmark for Match Pair Retrieval of Large-scale UAV Images Junhuan Liu et.al. Updated 2025-05-28

Fast Feature Matching of UAV Images via Matrix Band Reduction-based GPU Data Schedule San Jiang et.al. Updated 2025-05-28

Visual Loop Closure Detection Through Deep Graph Consensus Martin Büchner et.al. Updated 2025-05-27

QuARI: Query Adaptive Retrieval Improvement Eric Xing et.al. Updated 2025-05-27

ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval Eric Xing et.al. Updated 2025-05-27

Visualized Text-to-Image Retrieval Di Wu et.al. Updated 2025-05-26

Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval Rong-Cheng Tu et.al. Updated 2025-05-26

Can Visual Encoder Learn to See Arrows? Naoyuki Terashita et.al. Updated 2025-05-26

TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition Oliver Grainge et.al. Updated 2025-05-22

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval Siting Li et al. Updated 2025-05-21

SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval Nikolaos Chaidos et al. Updated 2025-05-21

Multimodal RAG-driven Anomaly Detection and Classification in Laser Powder Bed Fusion using Large Language Models Kiarash Naghavi Khanghah et al. Updated 2025-05-20

MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark Yiwei Ou et al. Updated 2025-05-18

Improved Bag-of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization Aaron Wilhelm et al. Updated 2025-05-16

Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing Mathis Jürgen Adler et al. Updated 2025-05-16

SafeNav: Safe Path Navigation using Landmark Based Localization in a GPS-denied Environment Ganesh Sapkota et al. Updated 2025-05-13

Thermal-LiDAR Fusion for Robust Tunnel Localization in GNSS-Denied and Low-Visibility Conditions Lukas Schichler et al. Updated 2025-05-06

LiftFeat: 3D Geometry-Aware Local Feature Matching Yepeng Liu et al. Updated 2025-05-06

Seeing the Abstract: Translating the Abstract Language for Vision Language Models Davide Talon et al. Updated 2025-05-06

OBD-Finder: Explainable Coarse-to-Fine Text-Centric Oracle Bone Duplicates Discovery Chongsheng Zhang et al. Updated 2025-05-04

NeuroLoc: Encoding Navigation Cells for 6-DOF Camera Localization Xun Li et al. Updated 2025-05-02

GSFeatLoc: Visual Localization Using Feature Correspondence on 3D Gaussian Splatting Jongwon Lee et al. Updated 2025-05-01

From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval Yabing Wang et al. Updated 2025-04-25

A Guide to Structureless Visual Localization Vojtech Panek et al. Updated 2025-04-24

Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval Xin Jiang et al. Updated 2025-04-23

Media Content Atlas: A Pipeline to Explore and Investigate Multidimensional Media Space using Multimodal LLMs Merve Cerit et al. Updated 2025-04-22

A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling Kyle Buettner et al. Updated 2025-04-19

SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs Haoxuan Li et al. Updated 2025-04-17

Generalized Visual Relation Detection with Diffusion Models Kaifeng Gao et al. Updated 2025-04-16

Visual Re-Ranking with Non-Visual Side Information Gustav Hanning et al. Updated 2025-04-15

TMCIR: Token Merge Benefits Composed Image Retrieval Chaoyang Wang et al. Updated 2025-04-15

Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition Changwei Wang et al. Updated 2025-04-14

Evolved Hierarchical Masking for Self-Supervised Learning Zhanzhou Feng et al. Updated 2025-04-12

HAL-NeRF: High Accuracy Localization Leveraging Neural Radiance Fields Asterios Reppas et al. Updated 2025-04-11

Hypergraph Vision Transformers: Images are More than Nodes, More than Edges Joshua Fixelle et al. Updated 2025-04-11

FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations Cheng-Yu Hsieh et al. Updated 2025-04-11

PNE-SGAN: Probabilistic NDT-Enhanced Semantic Graph Attention Network for LiDAR Loop Closure Detection Xiong Li et al. Updated 2025-04-11

Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval Zehong Ma et al. Updated 2025-04-10

A Pointcloud Registration Framework for Relocalization in Subterranean Environments David Akhihiero et al. Updated 2025-04-09

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception Ruotian Peng et al. Updated 2025-04-09

To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition Davide Sferrazza et al. Updated 2025-04-08

NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval Peng Gao et al. Updated 2025-04-06

Re-thinking Temporal Search for Long-Form Video Understanding Jinhui Ye et al. Updated 2025-04-06

REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval Shabnam Choudhury et al. Updated 2025-04-04

A Chefs KISS -- Utilizing semantic information in both ICP and SLAM framework Sven Ochs et al. Updated 2025-04-02

Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval Yuji Nozawa et al. Updated 2025-04-02

IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval Bangwei Liu et al. Updated 2025-04-01

Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data Yiqun Duan et al. Updated 2025-04-01

CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization Yingrui Ji et al. Updated 2025-03-31

LiM-Loc: Visual Localization with Dense and Accurate 3D Reference Maps Directly Corresponding 2D Keypoints to 3D LiDAR Point Clouds Masahiko Tsuji et al. Updated 2025-03-31

Multiview Image-Based Localization Cameron Fiore et al. Updated 2025-03-30

LOCORE: Image Re-ranking with Long-Context Sequence Modeling Zilin Xiao et al. Updated 2025-03-27

Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck Adrian Bulat et al. Updated 2025-03-27

UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation Yehui Shen et al. Updated 2025-03-27

FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval Zixu Li et al. Updated 2025-03-27

Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing Shuai Li et al. Updated 2025-03-27

CoLLM: A Large Language Model for Composed Image Retrieval Chuong Huynh et al. Updated 2025-03-25

Scene-agnostic Pose Regression for Visual Localization Junwei Zheng et al. Updated 2025-03-25

From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting Zhiwei Huang et al. Updated 2025-03-25

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval Haoqiang Lin et al. Updated 2025-03-25

LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space Zhangyu Wang et al. Updated 2025-03-23

Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning Xiang Fang et al. Updated 2025-03-23

What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images Dongheng Lin et al. Updated 2025-03-23

good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval Pranavi Kolouju et al. Updated 2025-03-22

Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval Yuanmin Tang et al. Updated 2025-03-21

Autonomous Exploration-Based Precise Mapping for Mobile Robots through Stepwise and Consistent Motions Muhua Zhang et al. Updated 2025-03-21

PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval Qiang Zou et al. Updated 2025-03-20

Automating 3D Dataset Generation with Neural Radiance Fields P. Schulz et al. Updated 2025-03-20

3D Densification for Multi-Map Monocular VSLAM in Endoscopy X. Anadón et al. Updated 2025-03-18

A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios Huy-Hoang Bui et al. Updated 2025-03-18

Scale Efficient Training for Large Datasets Qing Zhou et al. Updated 2025-03-17

Multi-Platform Teach-and-Repeat Navigation by Visual Place Recognition Based on Deep-Learned Local Features Václav Truhlařík et al. Updated 2025-03-17

All You Need to Know About Training Image Retrieval Models Gabriele Berton et al. Updated 2025-03-17

ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning Pengfei Luo et al. Updated 2025-03-13

Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark Yibin Ye et al. Updated 2025-03-12

Revisiting Medical Image Retrieval via Knowledge Consolidation Yang Nan et al. Updated 2025-03-12

CQVPR: Landmark-aware Contextual Queries for Visual Place Recognition Dongyue Li et al. Updated 2025-03-11

Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization Michael Green et al. Updated 2025-03-10

Zero-Shot Hashing Based on Reconstruction With Part Alignment Yan Jiang et al. Updated 2025-03-10

Improving Visual Place Recognition with Sequence-Matching Receptiveness Prediction Somayeh Hussaini et al. Updated 2025-03-10

RoboDesign1M: A Large-scale Dataset for Robot Design Understanding Tri Le et al. Updated 2025-03-09

StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition Yanqing Shen et al. Updated 2025-03-09

TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification Huaqi Tao et al. Updated 2025-03-09

NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features Hongjia Zhai et al. Updated 2025-03-08
