AnyLoc: Towards Universal Visual Place Recognition
Full Transcript
We present AnyLoc, an approach towards universal visual place recognition, or VPR.
Imagine a robot exploring a place for the first time:
it creates a reference map from the images it captures along the way.
Now consider the same robot returning to the place and observing a new image,
which we call the query image. The task of VPR is to find the
best matching image for this query from the pre-built reference database.
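As a minimal sketch of this retrieval step (assuming one precomputed, L2-normalized
descriptor per image; the function below is illustrative and not part of the video),
matching a query against the reference database reduces to a nearest-neighbour search:

    import numpy as np

    def best_match(query_desc: np.ndarray, reference_descs: np.ndarray) -> int:
        """Return the index of the reference image whose descriptor best matches the query.

        query_desc:      (D,)   L2-normalized place descriptor of the query image.
        reference_descs: (N, D) L2-normalized descriptors of the reference map.
        """
        # Cosine similarity reduces to a dot product for normalized vectors.
        similarities = reference_descs @ query_desc
        return int(np.argmax(similarities))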
We call a VPR system universal if it
works across any type of environment,
is robust to short- and long-term appearance changes,
and works across extreme viewpoint variations.
To evaluate if current VPR approaches can meet these ambitious standards,
we assess their applicability in a diverse range of scenarios:
urban, indoor, significant viewpoint shifts, diametrically opposite views with minimal overlap,
underwater, subterranean and degraded, aerial, and across day-night transitions.
When we test the current state-of-the-art approaches on this diverse suite, we observe that,
while they excel in urban driving scenarios similar to their training distribution,
they do not generalize to the remaining diverse conditions, which is a key requirement for a universal VPR solution.
Hence, in this paper, we explore self-supervised foundation models such as CLIP and DINO,
which have demonstrated remarkable visual and semantic capabilities at the pixel level.
When we use the per-image descriptors from these models as-is, we observe the results to be subpar.
Key to our approach, AnyLoc, is a deeper dive into the process of extracting and aggregating
features from these foundation models for VPR. Here, we use the DINOv2 Vision Transformer and
extract per-pixel features across layers and facets, exploring their various properties.
The shallower ViT layers display a strong position encoding bias and capture local structure.
On the flip side, features from the final layer capture global structure and semantics
but lack the precision needed for VPR aggregation.
So how do you get the best of both these properties?
After further analysis, we observed that selecting features from deeper layers such as
layer 31 and the value facet offers the best mix of background contrast and positioning accuracy.
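As a rough sketch of how such per-patch value-facet features could be pulled from a deep
DINOv2 layer, one can attach a forward hook to the fused qkv projection of the chosen
transformer block. The torch.hub entry point and the blocks[31].attn.qkv module path below
follow the public facebookresearch/dinov2 release and are assumptions for illustration,
not details stated in this video:

    import torch

    # Load the publicly released DINOv2 ViT-G/14 backbone (40 transformer blocks).
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14").eval()

    captured = {}

    def save_qkv(module, inputs, output):
        # Output of the fused qkv projection: (batch, tokens, 3 * embed_dim).
        captured["qkv"] = output

    # Hook the attention qkv layer of block 31 (0-indexed).
    handle = model.blocks[31].attn.qkv.register_forward_hook(save_qkv)

    image = torch.rand(1, 3, 224, 224)  # placeholder; use a real, normalized image
    with torch.no_grad():
        model(image)
    handle.remove()

    qkv = captured["qkv"]
    embed_dim = qkv.shape[-1] // 3
    values = qkv[..., 2 * embed_dim:]   # keep only the value facet
    patch_features = values[:, 1:, :]   # drop the class token -> per-patch descriptors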
Once we extract these per-pixel ViT features, we apply unsupervised local feature
aggregation methods, such as VLAD and GeM, to convert
the per-pixel visual and semantic descriptors into place-level descriptors useful for VPR.
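As a simplified sketch of the VLAD side of this aggregation (assuming the per-patch features
above and a k-means vocabulary; scikit-learn here is a stand-in for however the cluster
centres are actually built), each local feature is assigned to its nearest cluster centre,
residuals are summed per cluster, and the normalized sums are concatenated into one
place-level descriptor:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(all_local_features: np.ndarray, num_clusters: int = 32) -> np.ndarray:
        """Fit cluster centres on local features pooled from the reference map."""
        return KMeans(n_clusters=num_clusters, n_init=10).fit(all_local_features).cluster_centers_

    def vlad(local_features: np.ndarray, centres: np.ndarray) -> np.ndarray:
        """Aggregate (N, D) local features into a single (K * D,) place descriptor."""
        # Hard-assign each local feature to its nearest cluster centre.
        distances = np.linalg.norm(local_features[:, None, :] - centres[None, :, :], axis=-1)
        assignments = np.argmin(distances, axis=1)
        K, D = centres.shape
        desc = np.zeros((K, D))
        for k in range(K):
            members = local_features[assignments == k]
            if len(members):
                desc[k] = (members - centres[k]).sum(axis=0)  # sum of residuals
        # Intra-cluster and then global L2 normalization.
        desc /= np.linalg.norm(desc, axis=1, keepdims=True) + 1e-12
        flat = desc.reshape(-1)
        return flat / (np.linalg.norm(flat) + 1e-12)

GeM, in contrast, is a single generalized-mean pooling over the local features and yields a
much more compact descriptor.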
We can clearly observe how the features computed by AnyLoc are more discriminative for VPR
than those of existing methods by visualizing low-dimensional projections of the feature space.
For MixVPR, the top-performing prior method, we see that the
features computed across multiple datasets tend to cluster very closely together.
For AnyLoc, however, the features are spread much further apart and exhibit better separability.
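The kind of feature-space visualization described here can be reproduced with an off-the-shelf
2D projection; t-SNE, scikit-learn, and matplotlib below are illustrative assumptions rather
than the exact tooling used:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_descriptor_space(descriptors: np.ndarray, dataset_labels: np.ndarray) -> None:
        """descriptors: (N, D) place descriptors; dataset_labels: (N,) integer dataset ids."""
        projected = TSNE(n_components=2, init="pca").fit_transform(descriptors)
        plt.scatter(projected[:, 0], projected[:, 1], c=dataset_labels, s=5, cmap="tab10")
        plt.title("2D projection of place descriptors, coloured by dataset")
        plt.show()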
All these aspects contribute to AnyLoc performing significantly better than prior
approaches over a wide range of environments and challenging conditions. Now, let's take
a look at the qualitative retrieval videos across diverse domains showcasing the prowess of AnyLoc.
For more information regarding AnyLoc, and to see universal VPR
in action through interactive demos, head over to our website!