The objective of this project is to develop a location recognition model by processing street view images of a city and constructing a database of feature embeddings for location matching.
To obtain Street View panoramas of Kyiv, the Google Street View Static API was used, restricted to outdoor panoramas suitable for image recognition. Each panorama is composed of three consecutive images captured at headings of 0°, 120°, and 240°, each with a 120° field of view (FoV).
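A minimal sketch of how such a panorama could be fetched, assuming the requests library; the metadata pre-check, the fetch_panorama helper, and the YOUR_API_KEY placeholder are illustrative, not the repository's actual code.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder, assumed to be supplied by the user
STATIC_URL = "https://maps.googleapis.com/maps/api/streetview"
META_URL = "https://maps.googleapis.com/maps/api/streetview/metadata"

def fetch_panorama(lat: float, lon: float, size: str = "640x640"):
    """Download the three 120° views that make up one panorama."""
    # Check that an outdoor panorama exists near this point before spending image quota.
    meta = requests.get(META_URL, params={
        "location": f"{lat},{lon}", "source": "outdoor", "key": API_KEY,
    }).json()
    if meta.get("status") != "OK":
        return None

    images = []
    for heading in (0, 120, 240):  # three 120° views cover the full 360°
        resp = requests.get(STATIC_URL, params={
            "location": f"{lat},{lon}",
            "size": size,
            "heading": heading,
            "fov": 120,
            "source": "outdoor",  # skip indoor panoramas
            "key": API_KEY,
        })
        resp.raise_for_status()
        images.append(resp.content)  # raw JPEG bytes
    return meta.get("pano_id"), images
```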
To scrape data from a specific district, the process begins with a square region of a given size, centered on a specified point and aligned with the cardinal directions (north, south, east, and west). This square is then covered by smaller circles of predefined radii, with the centers of these circles serving as the initial coordinates for panorama retrieval.
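As a rough illustration of this sampling scheme (not the project's actual scraper), the circle centers could be generated as below; the regular grid spacing, the single shared radius, and the circle_centers helper are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def circle_centers(center_lat: float, center_lon: float,
                   half_side_m: float, circle_radius_m: float):
    """Yield (lat, lon) centers of circles tiling a north-aligned square.

    The square has side 2 * half_side_m and is centered on (center_lat, center_lon);
    adjacent circle centers are spaced 2 * circle_radius_m apart on a regular grid.
    """
    # Convert the spacing from metres to degrees (longitude shrinks with latitude).
    dlat = math.degrees(2 * circle_radius_m / EARTH_RADIUS_M)
    dlon = dlat / math.cos(math.radians(center_lat))

    steps = int(half_side_m // (2 * circle_radius_m))
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            yield center_lat + i * dlat, center_lon + j * dlon
```

For example, circle_centers(50.4501, 30.5234, 2000, 100) would tile a 4 km × 4 km square centered on Kyiv with candidate points 200 m apart.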
After specifying multiple districts, the collected data is visualized in output_map.html.
While the entire city of Kyiv was not scraped (due to cost constraints), the dataset still presents significant challenges for location matching: many locations contain visually similar foliage or buildings (Troieshchyna is a good example).
As a first approach, local features are extracted with the SIFT detector, specifically its RootSIFT variant, which L1-normalizes each descriptor and takes its element-wise square root so that Euclidean comparisons approximate the Hellinger kernel, improving codebook clustering. The extracted features are then aggregated with VLAD (Vector of Locally Aggregated Descriptors) using a codebook of a specified size, and the resulting vectors are stored in ChromaDB for efficient retrieval.
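The sketch below shows these two steps using the standard RootSIFT and VLAD formulations, assuming OpenCV's SIFT implementation and a k-means codebook learned offline; names and normalization details may differ from the repository's code.

```python
import cv2
import numpy as np

def rootsift(image_gray: np.ndarray) -> np.ndarray:
    """Extract SIFT descriptors and convert them to RootSIFT."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)
    if desc is None:
        return np.empty((0, 128), dtype=np.float32)
    desc /= desc.sum(axis=1, keepdims=True) + 1e-7  # L1 normalization
    return np.sqrt(desc)                            # element-wise square root (Hellinger mapping)

def vlad(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Aggregate RootSIFT descriptors into a single VLAD vector.

    `codebook` is a (k, 128) array of k-means centroids learned offline.
    """
    k = codebook.shape[0]
    # Assign each descriptor to its nearest centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)

    v = np.zeros_like(codebook, dtype=np.float32)
    for c in range(k):
        members = descriptors[assignments == c]
        if len(members):
            v[c] = (members - codebook[c]).sum(axis=0)  # accumulate residuals per centroid

    v = np.sign(v) * np.sqrt(np.abs(v))  # power (signed square-root) normalization
    v = v.flatten()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```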
As a second approach, a Vision Transformer (ViT) is employed: a ViT-B/16 model pretrained on ImageNet serves as the feature extractor, and its embeddings are also stored in ChromaDB for efficient retrieval.
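A sketch of this embedding-and-storage path, assuming torchvision's pretrained ViT-B/16 weights and the chromadb Python client (the repository may load the model differently); the file names, collection name, and persistence path are placeholders.

```python
import chromadb
import torch
from PIL import Image
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads = torch.nn.Identity()  # drop the classifier head, keep the 768-dim CLS embedding
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(image_path: str) -> list[float]:
    """Return the ViT feature embedding of a single image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return model(img).squeeze(0).tolist()

client = chromadb.PersistentClient(path="chroma_db")  # assumed storage path
collection = client.get_or_create_collection("panoramas")

# Index one view of a panorama together with its coordinates (illustrative id and metadata).
collection.add(
    ids=["pano_0_heading_0"],
    embeddings=[embed("pano_0_heading_0.jpg")],
    metadatas=[{"lat": 50.4501, "lon": 30.5234}],
)

# Retrieve the closest stored views for a query image.
result = collection.query(query_embeddings=[embed("query.jpg")], n_results=5)
```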
Since the default panorama size is
A comparison of the two retrieval approaches can be found in Comparison.ipynb. The results clearly demonstrate that ViT-based retrieval outperforms VLAD + RootSIFT on this dataset.
To run geoguessr.py, follow these steps:
- Create a Conda environment from environment.yml (e.g., conda env create -f environment.yml).
- Run the script with the following arguments:
  python geoguessr.py --data_dir "path_to_store_data" --query_image_path "query_image_path" --verbose
- Omnidirectional Image Distortions: Both SIFT and ViT struggle with feature extraction due to distortions inherent in panoramic images.
- Fine-tuning ViT: Training ViT on panoramic images with associated geographic coordinates could enhance feature extraction and location recognition performance.