
[Draft Proposal] Building a performance benchmark framework #2714

@airnez

This proposal was written following the March 2026 iTowns Hackathon.
Involved contributors: @mgermerie, @PierreAntoineChiron and @airnez

Context

Developing features for iTowns, we often ask ourselves:

  • To what extent does this feature / contribution impact performance?
  • How do we benchmark iTowns performance?
  • What metrics should we monitor?
  • Is there a memory leak somewhere?

This proposal aims to define an architecture for automated performance tests for iTowns that answers those questions.

Description of the proposal

We'll soon try to provide a first PR partially fulfilling the requirements stated below. It will be a starting point for performance test development.

Identified use-cases:

  1. Automatically checking for performance regressions against the master branch before merging a pull request
  2. Checking for performance regressions when bumping dependencies
  3. Providing a reliable performance tracing toolbox for debugging and bottleneck identification

For now, we'll focus on use-case n°1 while enabling the other ones to be addressed later by the same test architecture.

What this proposal does NOT aim to solve

  • Providing better live performance debugging tools: this is a different job (Improve debug tools with rendering stats #2020)
  • Providing automated bottleneck identification: a human will still be needed to read the metrics exposed by those tests

Implementation

Functional Implementation

Test scenarios types

  1. "Functional" performance tests: instantiating a view and running real-use scenarios
  2. "Unit" performance tests: only benchmarking an iTowns sub-system (parsing, reprojection...). This second approach is not as good as n°1, but might be needed for specific parts of the code.

We'll provide a first working "Functional" performance test to begin with.
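To illustrate what a "unit" performance test could look like, here is a minimal sampling sketch. The `bench` helper and the parser-like workload are hypothetical illustrations, not existing iTowns code:

```javascript
// Minimal sampling harness for "unit" performance tests (hypothetical helper).
function bench(fn, { warmup = 5, samples = 30 } = {}) {
    // Warm up first so JIT compilation does not pollute the measurements.
    for (let i = 0; i < warmup; i++) fn();
    const durations = [];
    for (let i = 0; i < samples; i++) {
        const start = performance.now();
        fn();
        durations.push(performance.now() - start);
    }
    return durations;
}

// Example: benchmarking a parser-like workload.
const samples = bench(() => JSON.parse('{"type":"FeatureCollection","features":[]}'));
console.log(`collected ${samples.length} samples`);
```

The raw sample array is returned unmodified so the statistics described later in this proposal can be computed on top of it.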

What are the metrics to monitor?

We identified a first list of metrics that would be interesting to have for "Functional" performance tests. Here is a non-exhaustive list (feel free to suggest other ones):

| Metric | Description |
| --- | --- |
| Update time | Time required to perform an update |
| Frame time | Time to render a frame |
| Time to first frame | Time from test start to first render |
| Time to first tile | Time from test start to first tile rendered |
| Test time | Time to perform the test itself |
| Data parsing and conversion time | Time spent parsing / converting data |
| Draw calls | Number of draw calls per frame |
| Shader compilation time | Time required to compile shaders; can be responsible for startup slow-down |
| Textures count | The number of active textures |
| Triangle count | The number of rendered triangle primitives |
| Geometries count | The number of active geometries |
| Number of shaders | The number of shader programs |
| JS heap size | Heap memory used after garbage collecting |
| Number and duration of long tasks | Counting tasks >50 ms blocking the main thread and summing their duration |

Metrics that are measured for each frame / render should then be accumulated into statistics:

  • Min / Max
  • Average
  • 95th percentile (P95)
  • Standard deviation
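As a sketch, the accumulation above could look like this. The `summarize` helper is hypothetical, and the percentile uses the nearest-rank method:

```javascript
// Accumulates per-frame samples into the statistics listed above
// (hypothetical helper, not existing iTowns code).
function summarize(values) {
    const sorted = [...values].sort((a, b) => a - b);
    const n = sorted.length;
    const mean = sorted.reduce((s, v) => s + v, 0) / n;
    const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
    return {
        min: sorted[0],
        max: sorted[n - 1],
        mean,
        // Nearest-rank 95th percentile.
        p95: sorted[Math.min(n - 1, Math.ceil(0.95 * n) - 1)],
        stdDev: Math.sqrt(variance),
    };
}

// Ten frame times in ms: min 16, max 33, mean 18.1, P95 hits the outlier.
console.log(summarize([16, 17, 16, 33, 16, 18, 16, 17, 16, 16]));
```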

Statistical significance

When looking for reliable measurements (e.g. for automated testing), we should be able to run the tests multiple times and measure the statistical difference between tested options in a round-robin manner.
Quoting the Google Tachometer README:

Even if you run the same JavaScript, on the same browser, on the same machine, on the same day, you'll still get a different result every time. But if you take enough repeated samples and apply the right statistics, you can reliably identify even tiny differences in runtime.
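The round-robin measurement described above can be sketched as follows. The `roundRobin` helper and the dummy workloads are hypothetical; real runs would execute the compared iTowns builds:

```javascript
// Interleaves samples across variants so environmental drift (thermal
// throttling, background load) affects all variants equally, instead of
// measuring variant A fully and then variant B (hypothetical helper).
function roundRobin(variants, rounds) {
    const samples = Object.fromEntries(Object.keys(variants).map(k => [k, []]));
    for (let round = 0; round < rounds; round++) {
        for (const [label, run] of Object.entries(variants)) {
            samples[label].push(run());
        }
    }
    return samples;
}

// Dummy workloads just record the execution order.
const order = [];
const samples = roundRobin({
    master: () => { order.push('master'); return 1; },
    branch: () => { order.push('branch'); return 2; },
}, 3);
console.log(order.join(','));  // master,branch,master,branch,master,branch
```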

How to compare official release performance over time?

  • We cannot test performance for each version independently and expect the metrics to be comparable: results are environment and browser-version dependent. We will always have to run all compared versions during the same test run, like MapTiler did for Mapbox and MapLibre.

Technical Implementation details

Identified third-party tool candidates

  • Test harness: running each test many times and comparing the results
    • Tachometer: it is designed to only work with time measures. This might not perfectly fit our needs; we will probably just implement what we need ourselves.
  • Browser control API
    • Playwright: cool and trendy, but we did not consider it (see the reason just below)
    • Puppeteer: we are already using it for functional tests, so we'll keep it for performance tests for now (arbitrary and debatable)
  • Tracing performance metrics for debugging
    • The Chrome DevTools Protocol seems unavoidable for that use case. We don't need it for automated tests, but it is the go-to for a human deep-dive investigation

Dependency on the Chrome browser

The Chrome environment seems to stand above its competitors for performance metrics collection, BUT it would be nice to be able to run those tests on Firefox at least. Since Puppeteer now supports Firefox, it should be feasible.

How to measure each metric?

Puppeteer coupled with standard browser APIs should already be enough to get most of what we need without modifying iTowns core code.

| Metric | How to measure |
| --- | --- |
| Update time | MAIN_LOOP_EVENTS UPDATE_START / UPDATE_END |
| Frame time | MAIN_LOOP_EVENTS BEFORE_RENDER / AFTER_RENDER |
| Time to first frame | Compute it from the first AFTER_RENDER event |
| Time to first tile | Check for a visible level0Node in view.tileLayer |
| Time to stable view | Check for an empty scheduler and RENDERING_PAUSED |
| Data parsing and conversion time | Hard to track: requires specific tests or new logs / events |
| Draw calls | renderer.info.render.calls |
| Shader compilation time | Sketchy but possible: call renderer.compile beforehand |
| Textures count | renderer.info.memory.textures |
| Triangle count | renderer.info.render.triangles |
| Geometries count | renderer.info.memory.geometries |
| Number of shaders | renderer.info.programs.length |
| JS heap size | performance.memory after forcing a GC (how?) |
| Number and duration of long tasks | PerformanceObserver: observe({ entryTypes: ['longtask'] }) |

⚠️ Those metrics are the ones expected for an automated headless "functional" performance test with no tracing. We want to implement different types of performance monitoring watchers depending on the use-case.
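As an illustration, the update-time metric from the table above could be collected in the page roughly like this, assuming iTowns' `View.addFrameRequester` and `MAIN_LOOP_EVENTS` API; the recorder itself is a hypothetical helper:

```javascript
// Records per-update durations from a view's main-loop events.
// `view` is expected to expose iTowns' addFrameRequester(when, callback) API;
// the recorder is a hypothetical helper, not existing iTowns code.
function recordUpdateTimes(view, MAIN_LOOP_EVENTS) {
    const durations = [];
    let updateStart = 0;
    view.addFrameRequester(MAIN_LOOP_EVENTS.UPDATE_START, () => {
        updateStart = performance.now();
    });
    view.addFrameRequester(MAIN_LOOP_EVENTS.UPDATE_END, () => {
        durations.push(performance.now() - updateStart);
    });
    return durations;
}

// Puppeteer would install this in the page and read `durations` back once
// the scenario has finished (e.g. via page.evaluate).
```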

Good practices

  1. Any network-related behavior should be mocked: data has to be pre-fetched before testing or be available locally.
  2. The build method used for the performance tests should be as close to the production build as possible.
  3. When comparing versions, we will need to build them separately. MapLibre has an interesting approach, building for each version a minified JS test file providing all the test and core code: see the maplibre-gl-js benchmark README.
  4. In the end, it would be nice to have dedicated hardware WITH GPUs, unlike what we have on GitHub, to run those tests consistently. For now they will run on developers' hardware.
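Good practice n°1 could be sketched with Puppeteer's request-interception API as follows; the `fixtures` layout and the handler factory are hypothetical:

```javascript
// Creates a Puppeteer 'request' handler that serves pre-fetched fixtures.
// `fixtures` maps request URLs to locally stored response bodies
// (hypothetical layout, not existing iTowns test code).
function makeFixtureHandler(fixtures) {
    return (request) => {
        const body = fixtures[request.url()];
        if (body !== undefined) {
            // Serve the pre-fetched payload instead of hitting the network.
            request.respond({ status: 200, body });
        } else {
            // Fail fast so an unmocked request cannot skew the timings.
            request.abort();
        }
    };
}

// Wiring with Puppeteer (sketch):
// await page.setRequestInterception(true);
// page.on('request', makeFixtureHandler(fixtures));
```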

Architecture proposal

@mgermerie will complete this section
