The objective of this project is to develop a location recognition model by processing street view images of a city and constructing a database of feature embeddings for location matching.
To obtain Street View panoramas of Kyiv, the Google Street View Static API was used, restricted to outdoor panoramas suitable for image recognition. Each panorama is composed of three consecutive images captured at headings of 0°, 120°, and 240°, each with a 120° field of view (FoV).
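A minimal sketch of how such a panorama could be fetched, assuming the requests library; the metadata pre-check, the fetch_panorama helper, and the YOUR_API_KEY placeholder are illustrative, not the repository's actual code.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder, assumed to be supplied by the user
STATIC_URL = "https://maps.googleapis.com/maps/api/streetview"
META_URL = "https://maps.googleapis.com/maps/api/streetview/metadata"

def fetch_panorama(lat: float, lon: float, size: str = "640x640"):
    """Download the three 120° views that make up one panorama."""
    # Check that an outdoor panorama exists near this point before spending image quota.
    meta = requests.get(META_URL, params={
        "location": f"{lat},{lon}", "source": "outdoor", "key": API_KEY,
    }).json()
    if meta.get("status") != "OK":
        return None

    images = []
    for heading in (0, 120, 240):  # three 120° views cover the full 360°
        resp = requests.get(STATIC_URL, params={
            "location": f"{lat},{lon}",
            "size": size,
            "heading": heading,
            "fov": 120,
            "source": "outdoor",  # skip indoor panoramas
            "key": API_KEY,
        })
        resp.raise_for_status()
        images.append(resp.content)  # raw JPEG bytes
    return meta.get("pano_id"), images
```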
To scrape data from a specific district, the process begins with a square region of a given size, centered on a specified point and aligned with the cardinal directions (north, south, east, and west). This square is then covered by smaller circles of predefined radii, with the centers of these circles serving as the initial coordinates for panorama retrieval.
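As a rough illustration of this sampling scheme (not the project's actual scraper), the circle centers could be generated as below; the regular grid spacing, the single shared radius, and the circle_centers helper are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def circle_centers(center_lat: float, center_lon: float,
                   half_side_m: float, circle_radius_m: float):
    """Yield (lat, lon) centers of circles tiling a north-aligned square.

    The square has side 2 * half_side_m and is centered on (center_lat, center_lon);
    adjacent circle centers are spaced 2 * circle_radius_m apart on a regular grid.
    """
    # Convert the spacing from metres to degrees (longitude shrinks with latitude).
    dlat = math.degrees(2 * circle_radius_m / EARTH_RADIUS_M)
    dlon = dlat / math.cos(math.radians(center_lat))

    steps = int(half_side_m // (2 * circle_radius_m))
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            yield center_lat + i * dlat, center_lon + j * dlon
```

For example, circle_centers(50.4501, 30.5234, 2000, 100) would tile a 4 km × 4 km square centered on Kyiv with candidate points 200 m apart.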
After specifying multiple districts, the collected data is visualized in output_map.html.
While the entire city of Kyiv was not scraped (due to cost constraints), the dataset still presents significant challenges for location matching: many locations contain visually similar foliage or buildings (Troieshchyna is a good example).
As a first approach, local features are extracted with the SIFT detector, specifically its RootSIFT variant, which L1-normalizes each descriptor and takes its element-wise square root so that Euclidean comparisons approximate the Hellinger kernel, improving codebook clustering. The extracted features are then aggregated with VLAD (Vector of Locally Aggregated Descriptors) using a codebook of a specified size, and the resulting vectors are stored in ChromaDB for efficient retrieval.
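The sketch below shows these two steps using the standard RootSIFT and VLAD formulations, assuming OpenCV's SIFT implementation and a k-means codebook learned offline; names and normalization details may differ from the repository's code.

```python
import cv2
import numpy as np

def rootsift(image_gray: np.ndarray) -> np.ndarray:
    """Extract SIFT descriptors and convert them to RootSIFT."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)
    if desc is None:
        return np.empty((0, 128), dtype=np.float32)
    desc /= desc.sum(axis=1, keepdims=True) + 1e-7  # L1 normalization
    return np.sqrt(desc)                            # element-wise square root (Hellinger mapping)

def vlad(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Aggregate RootSIFT descriptors into a single VLAD vector.

    `codebook` is a (k, 128) array of k-means centroids learned offline.
    """
    k = codebook.shape[0]
    # Assign each descriptor to its nearest centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)

    v = np.zeros_like(codebook, dtype=np.float32)
    for c in range(k):
        members = descriptors[assignments == c]
        if len(members):
            v[c] = (members - codebook[c]).sum(axis=0)  # accumulate residuals per centroid

    v = np.sign(v) * np.sqrt(np.abs(v))  # power (signed square-root) normalization
    v = v.flatten()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```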
As a second approach, a Vision Transformer (ViT) is employed: a ViT-B/16 model pretrained on ImageNet serves as the feature extractor, and its embeddings are also stored in ChromaDB for efficient retrieval.
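A sketch of this embedding-and-storage path, assuming torchvision's pretrained ViT-B/16 weights and the chromadb Python client (the repository may load the model differently); the file names, collection name, and persistence path are placeholders.

```python
import chromadb
import torch
from PIL import Image
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads = torch.nn.Identity()  # drop the classifier head, keep the 768-dim CLS embedding
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(image_path: str) -> list[float]:
    """Return the ViT feature embedding of a single image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return model(img).squeeze(0).tolist()

client = chromadb.PersistentClient(path="chroma_db")  # assumed storage path
collection = client.get_or_create_collection("panoramas")

# Index one view of a panorama together with its coordinates (illustrative id and metadata).
collection.add(
    ids=["pano_0_heading_0"],
    embeddings=[embed("pano_0_heading_0.jpg")],
    metadatas=[{"lat": 50.4501, "lon": 30.5234}],
)

# Retrieve the closest stored views for a query image.
result = collection.query(query_embeddings=[embed("query.jpg")], n_results=5)
```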
Since the default panorama size is
A comparison of the two retrieval approaches can be found in Comparison.ipynb. The results clearly demonstrate that ViT-based retrieval outperforms VLAD + RootSIFT on this dataset.
To run geoguessr.py, follow these steps:
- Create a Conda environment from environment.yml (e.g., conda env create -f environment.yml).
- Run the script with the following arguments:
  python geoguessr.py --data_dir "path_to_store_data" --query_image_path "query_image_path" --verbose
- Omnidirectional Image Distortions: Both SIFT and ViT struggle with feature extraction due to distortions inherent in panoramic images.
- Fine-tuning ViT: Training ViT on panoramic images with associated geographic coordinates could enhance feature extraction and location recognition performance.