
Multi-provider results can contain duplicate pages #9

@oritwoen

Description


When using multiple providers (via `providers.all()` or an array), `combineResults` in `src/archive.ts:115-155` merges all pages and sorts by timestamp, but doesn't deduplicate.

The same URL archived at the same time can appear from both Wayback and CommonCrawl (they share CDX-based data). The result array ends up with near-identical entries that only differ in `_meta.provider`.

This matters when users rely on `.pages.length` or iterate over results: the count is inflated and the same snapshot gets processed twice.

A reasonable dedup key would be `url + timestamp` (or `url + snapshot`), keeping the first occurrence in provider order. Dedup could also be controlled by an option, for cases where preserving every provider's entry is wanted.
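To make the proposal concrete, here is a minimal sketch of what that dedup step might look like. It is not the code in `src/archive.ts`: the `ArchivedPage` shape, the `dedupePages` name, the `dedupe` option, and the provider and timestamp values are all assumptions for illustration. It also assumes the input is already ordered so that the preferred provider's entry comes first among duplicates.

```typescript
// Hypothetical page shape; the library's real type may have more fields.
interface ArchivedPage {
  url: string;
  timestamp: string;
  _meta: { provider: string };
}

// Hypothetical option: pass `dedupe: false` to keep every provider's entry.
interface DedupeOptions {
  dedupe?: boolean;
}

// Keeps the first page seen for each (url, timestamp) pair and drops later ones.
function dedupePages(
  pages: ArchivedPage[],
  options: DedupeOptions = {},
): ArchivedPage[] {
  if (options.dedupe === false) return pages.slice();

  const seen = new Set<string>();
  const result: ArchivedPage[] = [];
  for (const page of pages) {
    // JSON.stringify on a tuple avoids collisions that plain concatenation could cause.
    const key = JSON.stringify([page.url, page.timestamp]);
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(page);
  }
  return result;
}

// Illustrative data only: two providers reporting the same snapshot.
const merged: ArchivedPage[] = [
  { url: "https://example.com/", timestamp: "20240101000000", _meta: { provider: "wayback" } },
  { url: "https://example.com/", timestamp: "20240101000000", _meta: { provider: "commoncrawl" } },
  { url: "https://example.com/about", timestamp: "20240102000000", _meta: { provider: "commoncrawl" } },
];

const unique = dedupePages(merged);
const everything = dedupePages(merged, { dedupe: false });
```

In this sketch the Wayback entry survives because it appears first in the input. If dedup ran after the timestamp sort instead, the outcome would depend on the sort being stable, so running it before sorting (or on the provider-ordered list) seems the safer choice.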

Labels: bug (Something isn't working)
