Characterizing the Urban Road Network for Automated Mobility: A Scalable Typology for Evidence-Based Policy
This project provides a complete pipeline for downloading, processing, and analyzing how ready current road infrastructure is to support Connected and Automated Vehicles (CAVs), using data from OpenStreetMap (OSM) and Mapillary. It downloads the road network from OSM, enriches it with traffic sign data from Mapillary, performs k-Prototypes clustering to identify street types, and generates statistical reports and visualizations from the results.
This repository is a companion to our academic research. We aim for full reproducibility, which is why we also include our full results here. For a detailed explanation of the methodology, cluster definitions, and findings, please refer to our paper:
Characterizing the Urban Road Network for Automated Mobility: A Scalable Typology for Evidence-Based Policy (link in the future)
If you use this project, its data, or its findings in your research, please cite our work as follows:
Citation in the future
- Automated Data Ingestion: Downloads street network and traffic sign data for any given city using `OSMnx` and the Mapillary API.
- Data Conflation: Intelligently merges data from multiple sources (OSM, Mapillary, transit agencies) to create a single, enriched dataset.
- Advanced Clustering: Uses the k-Prototypes algorithm to cluster street segments based on both numerical (e.g., traffic sign counts) and categorical (e.g., highway type) features.
- Reproducible Workflow: Organized into a series of Jupyter notebooks that handle each stage of the process from data collection to final analysis.
- Comprehensive Analysis: Generates detailed cluster profiles, city-level statistics, and regional comparisons based on World Bank classifications.
The repository is organized to separate code, data, and results.
├── input_data/ # Raw, original input data (read-only).
│ ├── countries_signs/ # Country-specific traffic sign mappings.
│ └── manual_boundaries/ # Manual GeoJSON boundaries for cities.
│
├── notebooks/ # Jupyter notebooks containing the project's code.
│ ├── 01_main_data_processing.ipynb
│ ├── 02_clustering_all_features.ipynb
│ ├── 03_clustering_filtered.ipynb
│ └── 04_statistics_and_viz.ipynb
│
├── data/ # Intermediate, processed data generated by the first notebook.
│
├── results/ # All final outputs from the notebooks.
│ ├── figures/ # All generated plots and charts.
│ └── *.csv / *.gpkg # Final datasets and statistical reports.
│
├── LICENSE # Project license file.
└── README.md # You are here!
A. Create an Environment:
It is highly recommended to use a virtual environment to manage dependencies.
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
B. Requirements:
This project was developed using Python 3.9. It is recommended to use the same version to ensure compatibility with the specified library versions.
All required Python packages are listed in the requirements.txt file. You can install them using the following command:
pip install -r requirements.txt
Please note: at the time of this research, we had to use a Mapillary SDK that had not been updated for two years. For compatibility, we therefore also had to pin old versions of common libraries such as pandas and numpy. The Mapillary team has since updated their library, but we have not tested it or upgraded the requirements, to avoid breaking anything.
C. Handle API Key (very important!)
This project requires a Mapillary API token to download Mapillary data (not required for notebooks 2 and 3). For security, it should not be stored directly in the code.
To do so, please create a file named .env in the project's root directory.
Add your token to this file like so:
MLY_ACCESS_TOKEN="MLY|your_long_token_here"
To get a Client Access Token, you will need to register an application at https://www.mapillary.com/dashboard/developer
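The notebooks can then read the token from this file. The original code may rely on a helper such as python-dotenv; the snippet below is a minimal stdlib-only sketch of the same idea (the function name is hypothetical):

```python
import os
from pathlib import Path

def load_mapillary_token(env_path=".env"):
    """Read MLY_ACCESS_TOKEN from a .env file, falling back to the
    process environment if the file or the key is missing."""
    path = Path(env_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.startswith("MLY_ACCESS_TOKEN="):
                # Strip surrounding quotes, if any
                return line.split("=", 1)[1].strip().strip('"')
    return os.getenv("MLY_ACCESS_TOKEN")
```

Keeping the token out of the notebooks means the .env file can be git-ignored and the code shared safely.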
- If you would like to try the code and generate the data yourself, please delete the results/ and data/ folders before running any code. The code reads those files and, if they exist, skips the corresponding data download/generation. If you only wish to read the full results, or do not want to run everything again, please download the extra files listed in results/links_to_download_full_results.txt (we could not upload these heavy files to GitHub).
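The skip-if-exists behaviour follows a simple caching pattern. A hypothetical sketch of that pattern (function and file names are illustrative, not the project's actual API):

```python
from pathlib import Path

def load_or_download(path, download_fn):
    """Return cached bytes if the file already exists; otherwise download
    and cache them. Deleting data/ and results/ therefore forces a rerun."""
    p = Path(path)
    if p.exists():
        return p.read_bytes()
    data = download_fn()
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(data)
    return data
```

This is why stale files in data/ or results/ silently shadow a fresh download: the cache check comes first.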
- Please change the user agent in this line, using a very specific name for your project:
geolocator = Nominatim(user_agent="cav-road-typology-your-name")
If you do not, you risk OSM blocking your access to the data.
- Change the cities/regions you would like to download in these two lines (first notebook):
# --- Fallback Dictionary (Option 1) ---
if __name__ == "__main__":
You can use the code with other cities, and with neighborhoods as well. It can also work with entire countries, but that may take a very long time and is not guaranteed to work. If the area boundary is not mapped in OSM, you can place a .geojson file containing the boundary inside the input_data/manual_boundaries/ folder. For the statistics to work properly with your chosen cities, you will likely have to change the core_city_to_region_map = { part in the third notebook.
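For illustration, that mapping is a plain dictionary from core city name to a World Bank region label; the entries below are hypothetical examples, not the project's actual list:

```python
# Hypothetical entries -- replace with the cities you actually download
core_city_to_region_map = {
    "Berlin": "Europe & Central Asia",
    "Nairobi": "Sub-Saharan Africa",
    "Sao Paulo": "Latin America & Caribbean",
}
```

Any city missing from this mapping will be left out of (or mislabeled in) the regional comparison statistics.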
- Run the notebooks in numerical order from the notebooks/ directory.
01_main_data_processing.ipynb: This notebook downloads all raw data for the specified cities, processes it, and saves the cleaned, individual city networks to the data/ folder and the combined network to the results/ folder.
02_clustering_all_features.ipynb and 03_clustering_filtered.ipynb: These two notebooks load the combined network (generated in the previous notebook), perform k-Prototypes clustering to find street types, and save the final clustered dataset to the results/ folder. All clustering figures (elbow method, silhouette plots) are saved to results/figures/clustering/. The only difference between them is the feature set: the first uses all features; after identifying that some were not significant, we tweaked the features (explained in more detail in the paper) and used the new set in 03_clustering_filtered.
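For intuition, k-Prototypes combines squared Euclidean distance on the numeric features with a gamma-weighted mismatch count on the categorical ones. A minimal sketch of that mixed dissimilarity (the notebooks use a library implementation; this function and its names are illustrative only):

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=0.5):
    """k-Prototypes distance between a segment and a cluster prototype:
    squared Euclidean on numeric features (e.g. traffic sign counts) plus
    gamma * number of mismatched categorical features (e.g. highway type)."""
    num_part = float(np.sum((np.asarray(x_num, dtype=float)
                             - np.asarray(proto_num, dtype=float)) ** 2))
    cat_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return num_part + gamma * cat_part
```

The gamma weight controls how strongly categorical agreement (like sharing a highway type) pulls segments into the same cluster relative to the numeric features.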
Note: if you are running the full data (~7 million segments), you will very likely run out of RAM during the silhouette method. We therefore recommend running it on a sample (the "Determining Optimal k using the Silhouette Method on a Sample" part). Feel free to tweak the range of k values and the sample size.
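Because the silhouette computation builds pairwise distances, memory grows roughly quadratically with the number of rows, so scoring a random subsample is the practical route. A sketch of the sampling step (names are hypothetical):

```python
import numpy as np

def sample_for_silhouette(X, labels, sample_size=10_000, seed=42):
    """Draw a reproducible random subsample of segments so the
    silhouette score fits in memory on millions of rows."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    return X[idx], labels[idx]
```

Fixing the seed keeps the reported silhouette values reproducible across reruns, at the cost of some sampling noise relative to the full dataset.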
04_statistics_and_viz.ipynb: This notebook loads the clustered data and performs a statistical analysis. It generates cluster profiles, city-level reports, and regional comparisons, saving all outputs to the results/ and results/figures/statistics/ folders.
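At its core, a cluster profile is a per-cluster aggregation of the segment features. A hypothetical pandas sketch of such a profile table (column names are illustrative, not the project's actual schema):

```python
import pandas as pd

def cluster_profile(df, cluster_col="cluster"):
    """Mean of each numeric feature per cluster, plus the segment count --
    the general shape of the profile tables this notebook produces."""
    profile = df.groupby(cluster_col).mean(numeric_only=True)
    profile["n_segments"] = df.groupby(cluster_col).size()
    return profile
```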
This project uses a dual-license model:
- The data (all files inside the data/ and input_data/manual_boundaries/ folders) is licensed under the Open Data Commons Open Database License (ODbL), since these files are derived from OpenStreetMap.
- Everything else, including the source code (.ipynb notebooks), is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This project was developed with significant assistance from Large Language Models (LLMs) to accelerate tasks such as code generation, refactoring, and debugging. All key logic has been reviewed and validated.
The Python data science ecosystem, especially within the geospatial domain, is highly sensitive to library versions. A combination of packages that works today might produce errors or warnings in the future as individual libraries are updated.
This project was confirmed to be fully functional in July 2025 using Python 3.9.
To ensure you can reproduce the results, it is essential to:
- Use a virtual environment to isolate project dependencies.
- Install the exact package versions specified in the requirements.txt file using pip install -r requirements.txt.
This code is provided 'as-is' under the aforementioned License. It is intended for educational and portfolio purposes. While every effort has been made to ensure its functionality at the time of development, we offer no warranties and accept no liability for its use.