
[Draft Proposal] Building a performance benchmark framework #2714

@airnez

This proposal was written following the March 2026 iTowns Hackathon.
Involved contributors: @mgermerie, @PierreAntoineChiron and @airnez

Context

Developing features for iTowns, we often ask ourselves:

  • To what extent does this feature / contribution impact performance?
  • How do we benchmark iTowns performance?
  • What metrics should we monitor?
  • Is there a memory leak somewhere?

This proposal aims to define an architecture for automated performance tests for iTowns that answers those questions.

Description of the proposal

We'll soon try to provide a first PR partially fulfilling the requirements stated below. It will be a starting point for performance test development.

Identified use-cases:

  1. Automatically checking for performance regressions against the master branch before merging a pull request
  2. Checking for performance regressions when bumping dependencies
  3. Providing a reliable performance tracing toolbox for debugging and bottleneck identification

For now, we'll focus on use-case n°1 while enabling the other ones to be addressed later by the same test architecture.

What this proposal does NOT aim to solve

  • Providing better live performance debugging tools: this is a different job (Improve debug tools with rendering stats #2020)
  • Providing automated bottleneck identification: a human will still be needed to read the metrics exposed by those tests

Implementation

Functional Implementation

Test scenarios types

  1. "Functional" performance tests: instantiating a view and running real-use scenarios
  2. "Unit" performance tests: only benchmarking an iTowns sub-system (parsing, reprojection...). This second approach is not as good as n°1, but might be needed for specific parts of the code.

We'll provide a first working "Functional" performance test to begin with.
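To illustrate what a "unit" performance test could look like, here is a minimal sampling sketch. The `bench` helper and the parser-like workload are hypothetical illustrations, not existing iTowns code:

```javascript
// Minimal sampling harness for "unit" performance tests (hypothetical helper).
function bench(fn, { warmup = 5, samples = 30 } = {}) {
    // Warm up first so JIT compilation does not pollute the measurements.
    for (let i = 0; i < warmup; i++) fn();
    const durations = [];
    for (let i = 0; i < samples; i++) {
        const start = performance.now();
        fn();
        durations.push(performance.now() - start);
    }
    return durations;
}

// Example: benchmarking a parser-like workload.
const samples = bench(() => JSON.parse('{"type":"FeatureCollection","features":[]}'));
console.log(`collected ${samples.length} samples`);
```

The raw sample array is returned unmodified so the statistics described later in this proposal can be computed on top of it.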

What are the metrics to monitor?

We identified a first list of metrics that would be interesting to have for "Functional" performance tests. Here is a non-exhaustive list (feel free to suggest other ones):

| Metric | Description |
| --- | --- |
| Update time | Time required to perform an update |
| Frame time | Time to render a frame |
| Time to first frame | Time from test start to first render |
| Time to first tile | Time from test start to first tile rendered |
| Test time | Time to perform the test itself |
| Data parsing and conversion time | Time spent parsing / converting data |
| Draw calls | Number of draw calls per frame |
| Shader compilation time | Time required to compile shaders; can be responsible for startup slow-down |
| Textures count | The number of active textures |
| Triangle count | The number of rendered triangle primitives |
| Geometries count | The number of active geometries |
| Number of shaders | The number of shader programs |
| JS heap size | Heap memory used after garbage collecting |
| Number and duration of long tasks | Counting tasks >50 ms blocking the main thread and summing their duration |

Metrics that are measured for each frame / render should then be accumulated into statistics:

  • Min / Max
  • Average
  • 95th percentile (P95)
  • Standard deviation
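As a sketch, the accumulation above could look like this. The `summarize` helper is hypothetical, and the percentile uses the nearest-rank method:

```javascript
// Accumulates per-frame samples into the statistics listed above
// (hypothetical helper, not existing iTowns code).
function summarize(values) {
    const sorted = [...values].sort((a, b) => a - b);
    const n = sorted.length;
    const mean = sorted.reduce((s, v) => s + v, 0) / n;
    const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
    return {
        min: sorted[0],
        max: sorted[n - 1],
        mean,
        // Nearest-rank 95th percentile.
        p95: sorted[Math.min(n - 1, Math.ceil(0.95 * n) - 1)],
        stdDev: Math.sqrt(variance),
    };
}

// Ten frame times in ms: min 16, max 33, mean 18.1, P95 hits the outlier.
console.log(summarize([16, 17, 16, 33, 16, 18, 16, 17, 16, 16]));
```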

Statistical significance

When looking for reliable measurements (e.g. for automated testing), we should be able to run the tests multiple times and measure the statistical difference between tested options in a round-robin manner.
Quoting the Google Tachometer README:

Even if you run the same JavaScript, on the same browser, on the same machine, on the same day, you'll still get a different result every time. But if you take enough repeated samples and apply the right statistics, you can reliably identify even tiny differences in runtime.
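The round-robin measurement described above can be sketched as follows. The `roundRobin` helper and the dummy workloads are hypothetical; real runs would execute the compared iTowns builds:

```javascript
// Interleaves samples across variants so environmental drift (thermal
// throttling, background load) affects all variants equally, instead of
// measuring variant A fully and then variant B (hypothetical helper).
function roundRobin(variants, rounds) {
    const samples = Object.fromEntries(Object.keys(variants).map(k => [k, []]));
    for (let round = 0; round < rounds; round++) {
        for (const [label, run] of Object.entries(variants)) {
            samples[label].push(run());
        }
    }
    return samples;
}

// Dummy workloads just record the execution order.
const order = [];
const samples = roundRobin({
    master: () => { order.push('master'); return 1; },
    branch: () => { order.push('branch'); return 2; },
}, 3);
console.log(order.join(','));  // master,branch,master,branch,master,branch
```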

How to compare official release performance over time?

  • We cannot test performance for each version independently and expect the metrics to be comparable: results are environment and browser-version dependent. We will always have to run all compared versions during the same test run, like MapTiler did for Mapbox and MapLibre.

Technical Implementation details

Identified third-party tool candidates

  • Test harness: running each test many times and comparing the results
    • Tachometer: it is designed to only work with time measures. This might not perfectly fit our needs; we will probably just implement what we need ourselves.
  • Browser control API
    • Playwright: cool and trendy, but we did not consider it (see the reason just below)
    • Puppeteer: we are already using it for functional tests, so we'll keep it for performance tests for now (arbitrary and debatable)
  • Tracing performance metrics for debugging
    • The Chrome DevTools Protocol seems unavoidable for that use case. We don't need it for automated tests, but it is the go-to for a human deep-dive investigation

Dependency on the Chrome browser

The Chrome environment seems to stand above its competitors for performance metrics collection, BUT it would be nice to be able to run those tests on Firefox at least. Since Puppeteer now supports Firefox, it should be feasible.

How to measure each metric?

Puppeteer coupled with standard browser APIs should already be enough to get most of what we need without modifying iTowns core code.

| Metric | How to measure |
| --- | --- |
| Update time | MAIN_LOOP_EVENTS UPDATE_START / UPDATE_END |
| Frame time | MAIN_LOOP_EVENTS BEFORE_RENDER / AFTER_RENDER |
| Time to first frame | Compute it from the first AFTER_RENDER event |
| Time to first tile | Check for a visible level0Node in view.tileLayer |
| Time to stable view | Check for an empty scheduler and RENDERING_PAUSED |
| Data parsing and conversion time | Hard to track: requires specific tests or new logs / events |
| Draw calls | renderer.info.render.calls |
| Shader compilation time | Sketchy but possible: call renderer.compile beforehand |
| Textures count | renderer.info.memory.textures |
| Triangle count | renderer.info.render.triangles |
| Geometries count | renderer.info.memory.geometries |
| Number of shaders | renderer.info.programs.length |
| JS heap size | performance.memory after forcing a GC (how?) |
| Number and duration of long tasks | PerformanceObserver: observe({ entryTypes: ['longtask'] }) |

⚠️ Those metrics are the ones expected for an automated headless "functional" performance test with no tracing. We want to implement different types of performance monitoring watchers depending on the use-case.
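As an illustration, the update-time metric from the table above could be collected in the page roughly like this, assuming iTowns' `View.addFrameRequester` and `MAIN_LOOP_EVENTS` API; the recorder itself is a hypothetical helper:

```javascript
// Records per-update durations from a view's main-loop events.
// `view` is expected to expose iTowns' addFrameRequester(when, callback) API;
// the recorder is a hypothetical helper, not existing iTowns code.
function recordUpdateTimes(view, MAIN_LOOP_EVENTS) {
    const durations = [];
    let updateStart = 0;
    view.addFrameRequester(MAIN_LOOP_EVENTS.UPDATE_START, () => {
        updateStart = performance.now();
    });
    view.addFrameRequester(MAIN_LOOP_EVENTS.UPDATE_END, () => {
        durations.push(performance.now() - updateStart);
    });
    return durations;
}

// Puppeteer would install this in the page and read `durations` back once
// the scenario has finished (e.g. via page.evaluate).
```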

Good practices

  1. Any network-related behavior should be mocked: data has to be pre-fetched before testing or be available locally.
  2. The build method used for the performance tests should be as close to the production build as possible.
  3. When comparing versions, we will need to build them separately. MapLibre has an interesting approach, building for each version a minified JS test file providing all the test and core code: see the maplibre-gl-js benchmark README.
  4. In the end, it would be nice to have dedicated hardware WITH GPUs, unlike what we have on GitHub, to run those tests consistently. For now they will run on developers' hardware.
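Good practice n°1 could be sketched with Puppeteer's request-interception API as follows; the `fixtures` layout and the handler factory are hypothetical:

```javascript
// Creates a Puppeteer 'request' handler that serves pre-fetched fixtures.
// `fixtures` maps request URLs to locally stored response bodies
// (hypothetical layout, not existing iTowns test code).
function makeFixtureHandler(fixtures) {
    return (request) => {
        const body = fixtures[request.url()];
        if (body !== undefined) {
            // Serve the pre-fetched payload instead of hitting the network.
            request.respond({ status: 200, body });
        } else {
            // Fail fast so an unmocked request cannot skew the timings.
            request.abort();
        }
    };
}

// Wiring with Puppeteer (sketch):
// await page.setRequestInterception(true);
// page.on('request', makeFixtureHandler(fixtures));
```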

Architecture proposal

@mgermerie will complete this section
