Spark score for retrievals using any available provider #254

Open
bajtos opened this issue Mar 24, 2025 · 6 comments

bajtos commented Mar 24, 2025

In https://space-meridian.slack.com/archives/C06RPCL6QGL/p1742802306479569, we discussed how different storage DePINs address the availability of retrievals.

  • In Walrus, each sliver is stored on N nodes. The content is considered retrievable when at least F nodes serve the slivers, where F<N.
  • In Spark, we consider content retrievable only if the SP under test serves retrieval. We ignore copies stored with other SPs or even on IPFS nodes outside of Filecoin.

This difference makes it difficult to compare retrievability scores for different storage networks.

Let's add a new Spark score that measures how many deals (CIDs) can be retrieved from the network using any available retrieval provider, including non-Filecoin nodes running IPFS.

  • If a piece is stored with multiple SPs but only one of them serves retrievals, the new score should flag this content as retrievable. That matches the experience of retrieval clients: they wanted to retrieve a CID and got their content back; all was good.
  • The new score will demonstrate the real-world benefits of content addressing and IPFS-based retrievals.
  • The new score can potentially double the observed RSR of data stored on Filecoin.

Notes:

  • This new score is useful as a network-wide metric only. It must not affect the current per-miner/per-client/per-allocator RSR metrics. We shouldn't even collect it with per-miner/per-client/per-allocator granularity.
  • This new check should be added to the existing Spark infrastructure, similarly to how we added HTTP HEAD retrieval checking (see "Test HEAD requests before GET", spark-checker#104).
  • Proposed algorithm (a sketch follows after this list):
    • If the current retrieval check passes, the outcome of the new check is "OK".
    • If the current retrieval check fails because of an IPNI error (e.g. 404), the outcome of the new check is the same.
    • Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to retrieve the payload CID from all providers found in the IPNI lookup response (potentially de-duplicating entries from the same provider advertised over different protocols).
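
A minimal sketch of this decision flow, assuming hypothetical names (networkWideRetrievalResult, dedupeByPeerId, the retrieveFrom callback, and the ALL_PROVIDERS_FAILED value are illustrative, not existing spark-checker APIs):

```js
// Hypothetical sketch of the proposed algorithm; all names are illustrative.
async function networkWideRetrievalResult (measurement, ipniProviders, retrieveFrom) {
  // If the current per-SP retrieval check passes, the new check is "OK" too.
  if (measurement.currentOutcome === 'OK') return 'OK'

  // Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to
  // retrieve the payload CID from the other providers found in the IPNI
  // response, de-duplicated by peer ID across protocols.
  if (measurement.indexerResult === 'NO_VALID_ADVERTISEMENT') {
    for (const provider of dedupeByPeerId(ipniProviders)) {
      if (await retrieveFrom(provider)) return 'OK'
    }
    return 'ALL_PROVIDERS_FAILED' // illustrative value: nobody served the CID
  }

  // Otherwise (e.g. the IPNI lookup returned 404), the outcome of the new
  // check is the same as the outcome of the current check.
  return measurement.currentOutcome
}

// Keep one entry per provider, even when the same peer advertises the
// content over multiple protocols.
function dedupeByPeerId (providers) {
  const seen = new Set()
  return providers.filter(p => !seen.has(p.peerId) && seen.add(p.peerId))
}
```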

pyropy commented Apr 8, 2025

Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to retrieve the payload CID from all providers found in the IPNI lookup response (potentially de-duplicating entries from the same provider advertised over different protocols).

Would we really want to check all providers, or check them only until we receive the payload from at least one of them?

On other subnets we randomly check one node for the given blob ID / transaction hash. I don't think it would be fair to them if we checked ALL providers serving this retrieval.

I propose that we pick one random node from the IPNI lookup response (excluding the SP node we already probed) and try to perform the retrieval from that node.
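
A minimal sketch of this selection step, with illustrative names (pickAlternativeProvider and the peerId field are assumptions, not existing spark-checker code):

```js
// Pick one random provider from the IPNI lookup response, excluding the SP
// node we already probed. Returns undefined when there is no alternative.
function pickAlternativeProvider (ipniProviders, testedPeerId) {
  const candidates = ipniProviders.filter(p => p.peerId !== testedPeerId)
  if (candidates.length === 0) return undefined
  return candidates[Math.floor(Math.random() * candidates.length)]
}
```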

cc @bajtos

@pyropy pyropy moved this from 📥 next to 📋 planned in Space Meridian Apr 8, 2025
@pyropy pyropy self-assigned this Apr 8, 2025

bajtos commented Apr 8, 2025

Only when the IPNI lookup fails with NO_VALID_ADVERTISEMENT do we try to retrieve the payload CID from all providers found in the IPNI lookup response (potentially de-duplicating entries from the same provider advertised over different protocols).

Would we really want to check all providers, or check them only until we receive the payload from at least one of them?

On other subnets we randomly check one node for the given blob ID / transaction hash. I don't think it would be fair to them if we checked ALL providers serving this retrieval.

I propose that we pick one random node from the IPNI lookup response (excluding the SP node we already probed) and try to perform the retrieval from that node.

SGTM. We can start with what you proposed and then iteratively improve the solution later as needed.

It would be great if we could find a simple heuristic for preferring a Filecoin retrieval provider. (IPFS nodes can advertise to IPNI too, and I am concerned about how many of them actually serve retrievals.) Here are some ideas:


pyropy commented Apr 8, 2025

SGTM. We can start with what you proposed and then iteratively improve the solution later as needed.

It would be great if we could find a simple heuristic for preferring a Filecoin retrieval provider. (IPFS nodes can advertise to IPNI too, and I am concerned about how many of them actually serve retrievals.) Here are some ideas:

That sounds great to me!

I only wonder if we should include more stats about this measurement or keep it simple and just include the measurement status (a status code or a boolean status)? I think it would be okay to go with the latter and add more fields later in a backwards-compatible way.


bajtos commented Apr 9, 2025

I only wonder if we should include more stats about this measurement or keep it simple and just include the measurement status (a status code or a boolean status)? I think it would be okay to go with the latter and add more fields later in a backwards-compatible way.

I agree, let's keep it simple. I prefer a new status code (one number) over a boolean status; it will give us much more information for troubleshooting at minimal overhead.

We should also add information about which retrieval provider we picked. That information will be important for troubleshooting, too. You can report the provider peer ID, similarly to how we report stats.providerId now.
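
A sketch of what the extra fields could look like; the field names below are assumptions for illustration, not the existing Spark measurement schema:

```js
// Hypothetical additions to the submitted measurement (names are illustrative).
const measurement = {
  // ...existing fields (indexer_result, status_code, end_at, ...)
  alternative_provider: '12D3KooW...', // peer ID of the provider picked from IPNI
  alternative_status_code: 200         // one number: richer than a boolean,
                                       // still cheap to collect and aggregate
}
```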


pyropy commented Apr 9, 2025

@bajtos We're evaluating retrieval success on a few fields: timeout, car_too_large, status_code, end_at, and indexer_result.

In this case we won't evaluate the network-wide measurement success status by indexer_result, but we would still need the other fields to evaluate the success status.

Do you think it would be overkill to include all these fields?


bajtos commented Apr 10, 2025

@bajtos We're evaluating retrieval success on a few fields: timeout, car_too_large, status_code, end_at, and indexer_result.

In this case we won't evaluate the network-wide measurement success status by indexer_result, but we would still need the other fields to evaluate the success status.

Do you think it would be overkill to include all these fields?

I don't have a strong opinion. If we need these fields to capture the retrieval check result, then we must include them.

I think it would be nice to refactor the checker to signal timeout and car_too_large via a new status_code value; see https://github.com/CheckerNetwork/spark-checker/blob/9c29967ebdc68afe07f9b154d62e0c387230f426/lib/spark.js#L360-L397. For example, timeout can become status_code: 803 (an error related to network communication) and car_too_large can become status_code: 905 (related to content verification).
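
A sketch of how the checker could fold those boolean flags into status_code, assuming the 803/905 values suggested above (the helper name is illustrative, not part of spark-checker):

```js
// Hypothetical helper: map the legacy boolean flags onto the proposed
// status_code values (field names follow the measurement fields listed above).
function normalizeStatusCode (measurement) {
  if (measurement.timeout) return 803        // network-communication error family
  if (measurement.car_too_large) return 905  // content-verification error family
  return measurement.status_code             // otherwise keep the HTTP status code
}
```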

If you decide to make such a change, then please open a standalone set of pull requests for that. Remember that we must support both flavours (timeout and status_code: 803) for a while because it will take some time until all checkers upgrade to the new version.
