
ostapkhm/GeoGuessr


Overall Goal of the Project

The objective of this project is to develop a location recognition model by processing street view images of a city and constructing a database of feature embeddings for location matching.


Data Scraping

To obtain Street View panoramas in Kyiv, the Google Street View Static API was used, selecting only outdoor panoramas suitable for image recognition. Each panorama is composed of three consecutive images captured at headings of 0°, 120°, and 240°, each with a 120° field of view (FoV), so the three views together cover the full 360° panorama.
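The project's actual scraping code is not reproduced here, but a minimal sketch of this step could look as follows. The `requests` usage, image size, and API key are illustrative assumptions; only the heading/FoV scheme and the outdoor-only filter come from the description above.

```python
import requests

STATIC_API = "https://maps.googleapis.com/maps/api/streetview"
METADATA_API = "https://maps.googleapis.com/maps/api/streetview/metadata"
API_KEY = "YOUR_API_KEY"  # placeholder

def fetch_panorama(lat, lon, size="640x400"):
    """Download three 120-degree-FoV views covering a full panorama."""
    # Check that an outdoor panorama exists near this point first
    # (metadata requests are free, image requests are billed).
    meta = requests.get(METADATA_API, params={
        "location": f"{lat},{lon}",
        "source": "outdoor",   # restrict the search to outdoor panoramas
        "key": API_KEY,
    }).json()
    if meta.get("status") != "OK":
        return None

    views = []
    for heading in (0, 120, 240):  # three headings tile the full 360 degrees
        resp = requests.get(STATIC_API, params={
            "size": size,
            "pano": meta["pano_id"],  # pin all three views to one panorama
            "heading": heading,
            "fov": 120,
            "key": API_KEY,
        })
        resp.raise_for_status()
        views.append(resp.content)  # raw JPEG bytes
    return views
```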

To scrape data from a specific district, the process begins with a square region of a given size, centered on a specified point and aligned with the cardinal directions (north, south, east, and west). This square is then subdivided into smaller circles of predefined radii, with the centers of these circles serving as the initial coordinates for panorama retrieval.
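As a rough illustration of this subdivision (assuming a single circle radius, a grid of touching circles, and a flat-earth meters-to-degrees approximation; the helper name is ours, not the project's):

```python
import math

def circle_centers(center_lat, center_lon, half_size_m, radius_m):
    """Tile a square aligned with the cardinal directions with circles of
    a fixed radius; return the circle centers as (lat, lon) pairs."""
    # Rough meters-per-degree conversion (adequate at city scale).
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(center_lat))

    step = 2 * radius_m                     # touching circles
    n = math.ceil(half_size_m / radius_m)   # grid half-extent in steps

    centers = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            dy, dx = i * step, j * step     # north/east offsets in meters
            if abs(dy) <= half_size_m and abs(dx) <= half_size_m:
                centers.append((center_lat + dy / m_per_deg_lat,
                                center_lon + dx / m_per_deg_lon))
    return centers
```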

For example:

[Figure: Example1]

After specifying multiple districts, the collected data is visualized in output_map.html.
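The repository's visualization code is not shown here; a sketch using folium (an assumed library choice, reusing the hypothetical circle_centers helper from above) could produce a similar output_map.html:

```python
import folium

# Hypothetical inputs: one district center plus its sampled circle centers.
district_center = (50.4501, 30.5234)  # central Kyiv, for illustration
centers = circle_centers(*district_center, half_size_m=2000, radius_m=250)

m = folium.Map(location=district_center, zoom_start=12)
for lat, lon in centers:
    folium.Circle(location=(lat, lon), radius=250,
                  color="blue", fill=True, fill_opacity=0.2).add_to(m)
m.save("output_map.html")
```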

While the entire city of Kyiv was not scraped (due to cost constraints), the dataset still presents significant challenges for location matching: many locations share similar foliage or buildings (Troieshchyna is a good example).


Image Retrieval Using VLAD and RootSIFT

As a first approach, local features are extracted with the SIFT detector, specifically its RootSIFT variant, which improves codebook clustering by effectively comparing descriptors with the Hellinger kernel rather than plain Euclidean distance. The extracted descriptors are then aggregated with VLAD (Vector of Locally Aggregated Descriptors) using a codebook of a specified size, and the resulting vectors are stored in ChromaDB for efficient retrieval.
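A compact sketch of this pipeline is shown below. OpenCV's SIFT and scikit-learn's k-means stand in for whatever the project actually uses; the power/L2 normalization steps are common VLAD defaults, not confirmed choices.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def rootsift(image_gray):
    """Extract RootSIFT descriptors from a grayscale image.

    RootSIFT = L1-normalize each SIFT descriptor, then take the square
    root; Euclidean distance between RootSIFT vectors then corresponds
    to the Hellinger kernel on the original SIFT vectors.
    """
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)
    if desc is None:
        return np.empty((0, 128), dtype=np.float32)
    desc /= desc.sum(axis=1, keepdims=True) + 1e-7  # L1 normalization
    return np.sqrt(desc)

def vlad(descriptors, codebook):
    """Aggregate local descriptors into a single VLAD vector.

    `codebook` is a fitted sklearn KMeans model, e.g.
    KMeans(n_clusters=64).fit(sample_of_rootsift_descriptors).
    """
    centers = codebook.cluster_centers_
    k, d = centers.shape
    assignments = codebook.predict(descriptors)
    v = np.zeros((k, d), dtype=np.float32)
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            v[i] = (members - centers[i]).sum(axis=0)  # residual sum
    v = v.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalization
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalization
```

The codebook would be fit once, on a sample of RootSIFT descriptors drawn from the database images, and reused for every panorama.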


Image Retrieval Using Vision Transformer (ViT)

As a second approach, a Vision Transformer (ViT) is employed, specifically a ViT-B/16 model pretrained on ImageNet and used purely as a feature extractor. These embeddings are likewise stored in ChromaDB for efficient retrieval.

Since the default panorama size is $1920 \times 400$ pixels, each panorama is divided into three $640 \times 400$ segments, and ViT extracts features from each segment individually. The three $768$-dimensional embeddings are then concatenated, yielding a $3 \times 768 = 2304$-dimensional feature vector.
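Putting the pieces together, a minimal sketch of the ViT extraction and ChromaDB storage might look like this. The torchvision model, collection name, file names, and coordinates are illustrative assumptions; only the three-segment split and the 2304-dimensional concatenation come from the description above.

```python
import torch
import chromadb
from PIL import Image
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads = torch.nn.Identity()  # drop the classifier; keep 768-d features
model.eval()
preprocess = weights.transforms()  # resize/normalize to the model's input

@torch.no_grad()
def embed_panorama(path):
    """Split a 1920x400 panorama into three 640x400 segments and
    concatenate their ViT-B/16 embeddings into one 2304-d vector."""
    pano = Image.open(path).convert("RGB")
    w, h = pano.size
    segs = [pano.crop((i * w // 3, 0, (i + 1) * w // 3, h)) for i in range(3)]
    batch = torch.stack([preprocess(s) for s in segs])
    feats = model(batch)             # shape (3, 768)
    return feats.flatten().tolist()  # 3 * 768 = 2304

# Store and query embeddings in ChromaDB (names and values are placeholders).
client = chromadb.PersistentClient(path="embeddings_db")
collection = client.get_or_create_collection("panoramas")
collection.add(ids=["pano_0001"],
               embeddings=[embed_panorama("pano_0001.jpg")],
               metadatas=[{"lat": 50.45, "lon": 30.52}])
result = collection.query(query_embeddings=[embed_panorama("query.jpg")],
                          n_results=5)
```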


Comparison of Retrieval Methods

A comparison of the two retrieval approaches can be found in Comparison.ipynb. The results clearly demonstrate that ViT-based retrieval outperforms VLAD + RootSIFT on this dataset.


How to Run the Model

To run geoguessr.py, follow these steps:

  1. Create a Conda environment from environment.yml (e.g. conda env create -f environment.yml).
  2. Run the script with the following arguments:

    python geoguessr.py --data_dir "path_to_store_data" --query_image_path "query_image_path" --verbose

Limitations and Future Work

  • Omnidirectional Image Distortions: Both SIFT and ViT struggle with feature extraction due to distortions inherent in panoramic images.

  • Fine-tuning ViT: Training ViT on panoramic images with associated geographic coordinates could enhance feature extraction and location recognition performance.
