Spark score for retrievals using any available provider #254

Open
bajtos opened this issue Mar 24, 2025 · 6 comments

bajtos commented Mar 24, 2025

In https://space-meridian.slack.com/archives/C06RPCL6QGL/p1742802306479569, we discussed how different storage DePINs address the availability of retrievals.

  • In Walrus, each sliver is stored on N nodes. The content is considered retrievable when at least F nodes serve the slivers, where F<N.
  • In Spark, we consider content retrievable only if the SP under test serves retrieval. We ignore copies stored with other SPs or even on IPFS nodes outside of Filecoin.

This difference makes it difficult to compare retrievability scores for different storage networks.

Let's add a new Spark score that measures how many deals (CIDs) can be retrieved from the network using any available retrieval provider, including non-Filecoin nodes running IPFS.

  • If a piece is stored with multiple SPs but only one of them serves retrievals, the new score should flag this content as retrievable. That matches the experience of retrieval clients: they wanted to retrieve a CID and got their content back; all was good.
  • The new score will demonstrate the real-world benefits of content addressing and IPFS-based retrievals.
  • The new score can potentially double the observed RSR of data stored on Filecoin.

Notes:

  • This new score is useful as a network-wide metric only. It must not affect the current per-miner/per-client/per-allocator RSR metrics. We shouldn't even collect it with per-miner/per-client/per-allocator granularity.
  • This new check should be added to the existing Spark infrastructure, similarly to how we added HTTP HEAD retrieval checking (see "Test HEAD requests before GET", spark-checker#104).
  • Proposed algorithm (a sketch follows after this list):
    • If the current retrieval check passes, the outcome of the new check is "OK".
    • If the current retrieval check fails because of an IPNI error (e.g. 404), the outcome of the new check is the same.
    • Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to retrieve the payload CID from all providers found in the IPNI lookup response (potentially de-duplicating entries from the same provider advertised over different protocols).
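
A minimal sketch of this decision flow, assuming hypothetical names (networkWideRetrievalResult, dedupeByPeerId, the retrieveFrom callback, and the ALL_PROVIDERS_FAILED value are illustrative, not existing spark-checker APIs):

```js
// Hypothetical sketch of the proposed algorithm; all names are illustrative.
async function networkWideRetrievalResult (measurement, ipniProviders, retrieveFrom) {
  // If the current per-SP retrieval check passes, the new check is "OK" too.
  if (measurement.currentOutcome === 'OK') return 'OK'

  // Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to
  // retrieve the payload CID from the other providers found in the IPNI
  // response, de-duplicated by peer ID across protocols.
  if (measurement.indexerResult === 'NO_VALID_ADVERTISEMENT') {
    for (const provider of dedupeByPeerId(ipniProviders)) {
      if (await retrieveFrom(provider)) return 'OK'
    }
    return 'ALL_PROVIDERS_FAILED' // illustrative value: nobody served the CID
  }

  // Otherwise (e.g. the IPNI lookup returned 404), the outcome of the new
  // check is the same as the outcome of the current check.
  return measurement.currentOutcome
}

// Keep one entry per provider, even when the same peer advertises the
// content over multiple protocols.
function dedupeByPeerId (providers) {
  const seen = new Set()
  return providers.filter(p => !seen.has(p.peerId) && seen.add(p.peerId))
}
```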

pyropy commented Apr 8, 2025

Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to retrieve the payload CID from all providers found in the IPNI lookup response (potentially de-duplicating entries from the same provider advertised over different protocols).

Would we really want to check all providers, or check them only until we receive the payload from at least one of them?

On other subnets we randomly check one node for the given blob ID / transaction hash. I don't think it would be fair to them if we checked ALL providers serving this retrieval.

I propose that we pick one random node from the IPNI lookup response (excluding the SP node we already probed) and try to perform the retrieval from that node.
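
A minimal sketch of this selection step, with illustrative names (pickAlternativeProvider and the peerId field are assumptions, not existing spark-checker code):

```js
// Pick one random provider from the IPNI lookup response, excluding the SP
// node we already probed. Returns undefined when there is no alternative.
function pickAlternativeProvider (ipniProviders, testedPeerId) {
  const candidates = ipniProviders.filter(p => p.peerId !== testedPeerId)
  if (candidates.length === 0) return undefined
  return candidates[Math.floor(Math.random() * candidates.length)]
}
```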

cc @bajtos

@pyropy pyropy moved this from 📥 next to 📋 planned in Space Meridian Apr 8, 2025
@pyropy pyropy self-assigned this Apr 8, 2025

bajtos commented Apr 8, 2025

Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to retrieve the payload CID from all providers found in the IPNI lookup response (potentially de-duplicating entries from the same provider advertised over different protocols).

Would we really want to check all providers, or check them only until we receive the payload from at least one of them?

On other subnets we randomly check one node for the given blob ID / transaction hash. I don't think it would be fair to them if we checked ALL providers serving this retrieval.

I propose that we pick one random node from the IPNI lookup response (excluding the SP node we already probed) and try to perform the retrieval from that node.

SGTM. We can start with what you proposed and then iteratively improve the solution later as needed.

It would be great if we could find a simple heuristic for preferring a Filecoin retrieval provider. (IPFS nodes can advertise to IPNI too, and I am concerned about how many of them actually serve retrievals.) Here are some ideas:


pyropy commented Apr 8, 2025

SGTM. We can start with what you proposed and then iteratively improve the solution later as needed.

It would be great if we could find a simple heuristic for preferring a Filecoin retrieval provider. (IPFS nodes can advertise to IPNI too, and I am concerned about how many of them actually serve retrievals.) Here are some ideas:

That sounds great to me!

I only wonder if we should include more stats about this measurement or keep it simple and just include the measurement status (a status code or a boolean status)? I think it would be okay to go with the latter and add more fields later in a backwards-compatible way.


bajtos commented Apr 9, 2025

I only wonder if we should include more stats about this measurement or keep it simple and just include the measurement status (a status code or a boolean status)? I think it would be okay to go with the latter and add more fields later in a backwards-compatible way.

I agree, let's keep it simple. I prefer a new status code (one number) over a boolean status; it will give us much more information for troubleshooting at minimal overhead.

We should also add information about which retrieval provider we picked. That information will be important for troubleshooting, too. You can report the provider peer ID, similarly to how we report stats.providerId now.
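
A sketch of what the extra fields could look like; the field names below are assumptions for illustration, not the existing Spark measurement schema:

```js
// Hypothetical additions to the submitted measurement (names are illustrative).
const measurement = {
  // ...existing fields (indexer_result, status_code, end_at, ...)
  alternative_provider: '12D3KooW...', // peer ID of the provider picked from IPNI
  alternative_status_code: 200         // one number: richer than a boolean,
                                       // still cheap to collect and aggregate
}
```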


pyropy commented Apr 9, 2025

@bajtos We're evaluating retrieval success on a few fields: timeout, car_too_large, status_code, end_at, and indexer_result.

In this case we won't evaluate the network-wide measurement success status by indexer_result, but we would still need the other fields to evaluate the success status.

Do you think it would be overkill to include all these fields?


bajtos commented Apr 10, 2025

@bajtos We're evaluating retrieval success on a few fields: timeout, car_too_large, status_code, end_at, and indexer_result.

In this case we won't evaluate the network-wide measurement success status by indexer_result, but we would still need the other fields to evaluate the success status.

Do you think it would be overkill to include all these fields?

I don't have a strong opinion. If we need these fields to capture the retrieval check result, then we must include them.

I think it would be nice to refactor the checker to signal timeout and car_too_large via a new status_code value; see https://github.com/CheckerNetwork/spark-checker/blob/9c29967ebdc68afe07f9b154d62e0c387230f426/lib/spark.js#L360-L397. For example, timeout can become status_code: 803 (an error related to network communication) and car_too_large can become status_code: 905 (related to content verification).
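
A sketch of how the checker could fold those boolean flags into status_code, assuming the 803/905 values suggested above (the helper name is illustrative, not part of spark-checker):

```js
// Hypothetical helper: map the legacy boolean flags onto the proposed
// status_code values (field names follow the measurement fields listed above).
function normalizeStatusCode (measurement) {
  if (measurement.timeout) return 803        // network-communication error family
  if (measurement.car_too_large) return 905  // content-verification error family
  return measurement.status_code             // otherwise keep the HTTP status code
}
```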

If you decide to make such a change, then please open a standalone set of pull requests for that. Remember that we must support both flavours (timeout and status_code: 803) for a while because it will take some time until all checkers upgrade to the new version.
