Secondary provider identifer #33

Open
LexLuthr opened this issue Nov 4, 2024 · 14 comments

Comments

@LexLuthr

LexLuthr commented Nov 4, 2024

Currently, all providers are identified using the libp2p peer ID. This creates a problem for providers which do not have a libp2p subsystem (e.g. an HTTP provider).

My requirement:
Curio does not use the same libp2p peerID for the IPNI provider and for libp2p. The libp2p ID is shared between multiple minerIDs, which makes it impossible to use for IPNI. I need a reliable way to establish a relation between a peerID and a miner ID within the indexer, i.e. with no external lookup.

Possible solutions:

  1. If we can change the provider identifier to be any arbitrary yet unique string, it would make things much easier. Or maybe just allow minerID.
  2. Or maybe allow additional metadata with advertisement to allow adding "minerID" or other data.
@masih masih transferred this issue from ipni/storetheindex Nov 4, 2024
@bajtos

bajtos commented Nov 6, 2024

Here is the Spark perspective, which most likely applies to any other networks performing retrieval testing based on on-chain data.

  1. A Filecoin deal is defined (roughly) as (minerId, clientId, pieceCid, pieceSize, sectorId). Both the miner and the client are identified by their f0 addresses.
  2. In order to verify that the SP (the miner) is providing retrievals, we need to filter the retrieval info returned by IPNI to find the record that was created by the SP under test.

Currently:

  • Spark assumes that the miner's libp2p PeerID returned by the MinerInfo method is the same as the value in the field .MultihashResults[].ProviderResults[].Provider.ID. (Going forward, I'll refer to the latter as ProviderID.)
  • Based on the information from @LexLuthr, this assumption will no longer be true.
  • (In hindsight, it was a bit optimistic to make this assumption on our side.)
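As a rough illustration of the matching step described above, here is a minimal Go sketch that checks whether an IPNI find response contains a result published under a given ProviderID. The struct only mirrors the field path named in the comment; the sample JSON and the peer IDs in it are made up, and this is not Spark's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// findResponse mirrors only the fields of an IPNI find response that the
// comment above refers to; every other field is ignored.
type findResponse struct {
	MultihashResults []struct {
		ProviderResults []struct {
			Provider struct {
				ID string
			}
		}
	}
}

// matchesProvider reports whether any provider result in the response was
// published under the given ProviderID.
func matchesProvider(raw []byte, peerID string) (bool, error) {
	var resp findResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return false, err
	}
	for _, mh := range resp.MultihashResults {
		for _, pr := range mh.ProviderResults {
			if pr.Provider.ID == peerID {
				return true, nil
			}
		}
	}
	return false, nil
}

// sampleResponse is made-up data with placeholder peer IDs.
const sampleResponse = `{"MultihashResults":[{"ProviderResults":[
  {"Provider":{"ID":"12D3KooWExampleA"}},
  {"Provider":{"ID":"12D3KooWExampleB"}}]}]}`

func main() {
	// The peer ID here stands in for the value returned by MinerInfo.
	ok, err := matchesProvider([]byte(sampleResponse), "12D3KooWExampleA")
	fmt.Println(ok, err)
}
```

The check is a plain string comparison, so it only works while the on-chain PeerID and the IPNI ProviderID are the same value, which is the assumption this thread calls into question.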

Curio does not use the same libp2p peerID for the IPNI provider and for libp2p. The libp2p ID is shared between multiple minerIDs, which makes it impossible to use for IPNI.

It may be worth exploring what this means for retrieval testing in general.

  • If one Curio instance runs multiple miners, then maybe retrieval tests don't need to distinguish between different miners - they can map multiple miner IDs to the same IPNI Provider ID that identifies the Curio instance.
  • That works only if Curio provides the same retrievability for all deals, irrespective of which miner ID they are linked to. @LexLuthr is that true?

A bigger question is how much can retrieval checkers trust the MinerID-to-ProviderID mapping.

  • For example, can a malicious SP advertise ProviderID of somebody else who does provide good retrievability so that checker nodes search for IPNI records matching that other ProviderID? Then the malicious SP does not need to serve retrievals at all and can still get a non-zero Spark RSR.
  • What happens when a malicious SP (an index provider) uses the same Provider ID as somebody else? Is this an attack vector allowing malicious index providers to overwrite the advertisement chain of somebody else?

A potential solution I see - let's discuss if it's viable?

  • Curio "merges" all IPNI advertisements from all miners it operates into a single shared chain.
  • This shared chain uses Curio's Filecoin libp2p PeerID as the Provider ID.
  • Spark can keep looking up this Filecoin PeerID / IPNI ProviderID via the Filecoin.MinerInfo method.

I don't know if this makes the implementation any simpler on the Curio side? Also, this is viable only if Curio provides the same retrievability for all deals, irrespective of which miner ID they are linked to.

@aschmahmann

@LexLuthr and @masih tagged me on this issue, likely related to my interest in #20 and that IMO the binding of provider IDs to peerIDs that's currently in place is not a good idea. I'll try and give my understanding of both the current situation for IPNI and IPFS as well as what's needed with Curio + Spark which isn't necessarily the same.

TLDR on recommendations:

  • It should be possible to advertise a provider with multiaddresses where it's not assumed that /p2p/ProviderID is added to the end
  • PeerIDs should be kept as the identifiers for the advertisement chain itself
  • It's worth considering adding a small amount of metadata that can be associated with a provider as a whole (like addresses) rather than per-advertisement
  • Curio / Spark should consider using libp2p over HTTP with PeerID auth with their IPFS HTTP Gateway endpoints, not to gate access to the gateway but to prove they own it (Spark would have to be careful about whether they follow HTTP redirects so they don't get tricked). Note: IIUC not doing this is an abuse vector for Spark today

My current understanding (but someone like @masih should double check me) is that there are two reasons for the current structure of having a provider PeerID + a set of (multi)addresses.

  1. So that there's some proof / acknowledgement that the peer has acknowledged the advertisement since they sign with their key
  2. To manually compress the bytes in the address field so there aren't 5 addresses that all end in the binary version of /p2p/<peerID>

These are both fairly unimportant reasons though because:

  1. Providers and clients are already circumventing this proof / acknowledgement thing by using HTTP multiaddrs with bogus peerIDs associated with providers that aren't checked by clients (e.g. Boost and web3.storage / Storacha as providers, Lassie and Helia as clients)
  2. Saving these bytes via manual compression seems not particularly important when standard on the wire compression could be used

The downsides of operating this way are mostly the inconvenience and confusion of bogus libp2p peerIDs sometimes being added to multiaddrs. For example, when encountering /dns/foo.tld/tcp/443/https/p2p/12D3Foobar are you supposed to drop the /p2p/... component because you know it was added as a hack, or does this indicate that you're trying to do libp2p over HTTP with PeerID auth? Overall this makes using IPNI for content routing with systems that are not using libp2p peerIDs (e.g. CA authenticated HTTPS addresses, HTTP to Tor hidden services, BitTorrent ....) more painful and hacky for no real benefit.
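To make the ambiguity concrete, here is a small Go sketch that splits a trailing /p2p/<peerID> component off the textual form of a multiaddr. It works on plain strings rather than a multiaddr library, and the function name splitP2P is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitP2P separates a trailing /p2p/<peerID> component from the textual
// form of a multiaddr. It returns the address without the suffix, the peer
// ID (empty if there was none), and whether a suffix was found.
func splitP2P(addr string) (base, peerID string, found bool) {
	i := strings.LastIndex(addr, "/p2p/")
	if i < 0 {
		return addr, "", false
	}
	id := addr[i+len("/p2p/"):]
	if id == "" || strings.Contains(id, "/") {
		// Not a trailing component; leave the address untouched.
		return addr, "", false
	}
	return addr[:i], id, true
}

func main() {
	base, id, found := splitP2P("/dns/foo.tld/tcp/443/https/p2p/12D3Foobar")
	fmt.Println(base, id, found)
}
```

The split itself is mechanical. What the code cannot tell is whether the peer ID it found was added as a placeholder or is meant to be authenticated, which is exactly the question a client is left with.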


My requirement:
Curio does not use the same libp2p peerID for the IPNI provider and for libp2p. The libp2p ID is shared between multiple minerIDs, which makes it impossible to use for IPNI. I need a reliable way to establish a relation between a peerID and a miner ID within the indexer, i.e. with no external lookup.

I'd wonder what @masih and @willscott think, but as I understand it:

  • IPNI already supports a somewhat YOLO metadata field associated with advertisements that can have whatever you want in it
  • The metadata field is currently used more "per advertisement" than "per provider" and doesn't update automatically like addresses do since it's expected that addresses are per-provider
  • This might make a good case for adding space for per provider information that, like addresses, is always updated. This could be within the advertisement chain (like with addresses), or we could start separating them out (e.g. enable peer routing to be handled separately from content routing).

@LexLuthr Having more information about how you need this mapping to show up / be used would likely make discussion easier.


A bigger question is how much can retrieval checkers trust the MinerID-to-ProviderID mapping.

@bajtos let's back up a bit to consider what the attack model is, before figuring out the solution. Some examples:

  1. If you don't want a miner you're testing to be able to serve requests cheaply by on-the-fly fetching from some other provider -> you seem out of luck, this is why Filecoin does PoRep with all its associated tradeoffs
  2. If you don't want a miner you're testing to be able to serve requests cheaply on-the-fly fetching from some other provider unless they pay some penalty (e.g. they proxy all the bytes through an endpoint they control) -> just make sure the data transfer protocol is authenticated (e.g. if using HTTP you can require something like libp2p over HTTP's PeerID auth or some other auth scheme run on the same domain)

What happens when a malicious SP (an index provider) uses the same Provider ID as somebody else? Is this an attack vector allowing malicious index providers to overwrite the advertisement chain of somebody else?

This seems resolvable by keeping peerIDs / public keys as the identifiers for the advertisement chain itself, but still figuring out a way to associate arbitrary data with a provider (e.g. the metadata field or something else). Using something other than a cryptographic key here to identify the mutable data that is the advertisement chain seems like a bigger ordeal (e.g. it looks a lot like the entire DID space).

@LexLuthr
Author

LexLuthr commented Nov 6, 2024

I will try to answer all the questions directed at me as best as I can. In case I have missed something, please feel free to tag me in.

  1. Curio cannot/should not publish Ads from multiple minerIDs using the same peerID. Spark has no way to distinguish who is running which minerID on which Curio cluster. This would make it impossible for Spark to work in any reliable way.
  2. Curio will be wrapping up the retrievals in UCANs soon. So retrieval will be decided by the client. There is nothing in Curio that prevents retrieving any data from any minerIDs it is operating.
  3. HTTP over libp2p is not the solution Curio is planning to switch over to. HTTP in general is a very mature and widely used protocol with much less overhead. We plan to stick to plain HTTP for the foreseeable future.
  4. Curio can still sign Ads with a specific peerID per minerID it maintains, but these peerIDs are not on chain. We need some way to tell the indexer which peerID is for which minerID. This will allow Spark to work as is.
  5. Curio right now uses HTTP metadata, as this is how a provider can tell a client how to retrieve the data. The metadata field seems too specific for this, and maybe https://github.com/ipni/go-libipni/blob/f9b76606526a41b3291c3cce97d24a5d078645be/announce/message/message.go#L20-L23 can be used for it if we start processing it. It can be applied on the provider level instead of the advertisement level. I am not clear on what this field is actually used for or is supposed to be used for.

@aschmahmann

aschmahmann commented Nov 6, 2024

HTTP over libp2p is not the solution Curio is planning to switch over to. HTTP in general is very mature and widely used protocol with much less overhead. We plan to stick to plain HTTP for foreseeable future.

Note: HTTP over libp2p != libp2p over HTTP

  • HTTP over libp2p: Likely you mean sending HTTP 1.1 requests over libp2p streams (e.g. TCP+Yamux, QUIC, WebRTC, etc.) https://github.com/libp2p/specs/tree/master/http#using-http-semantics-over-stream-transports
  • libp2p over HTTP or really just HTTP PeerID authentication: https://github.com/libp2p/specs/blob/master/http/peer-id-auth.md, allows the client and/or server to authenticate with each other using peerIDs
    • My recommendation for Curio and Spark was to consider this as a mechanism of binding the IPNI advertising entity to the HTTP provider so that if two SPs Alice and Bob both host the same data Alice can't just advertise in IPNI using Bob's HTTP endpoints and pass spark tests
    • But of course you can use whatever auth you want with Spark, this just might be helpful in letting you reuse some pieces

@aschmahmann

We need some way to tell indexer which peerID is for which minerID. This will allow spark to work as is.

Perhaps my own cluelessness about Spark, but how is this not abusable? If Spark wants to prove "minerX has advertised CID Y to IPNI and it's downloadable from an endpoint controlled by minerX" then there needs to be some proof binding minerX to the peerID (i.e. not just some text mapping) and some proof binding the HTTP endpoint to either the peerID or the minerID. It sounds like both are missing.

You could relax the condition and say minerX doesn't have to advertise their CIDs as long as somebody out there advertises that minerX has CID Y at an endpoint minerX controls. Doing this would mean an "advisory" mapping of peerID -> minerX in IPNI could be ok, but it comes with the potential of added work / attack surface for Spark since what if someone who isn't minerX also publishes an advisory mapping but to an endpoint that doesn't resolve properly?

@LexLuthr
Author

LexLuthr commented Nov 7, 2024

HTTP over libp2p is not the solution Curio is planning to switch over to. HTTP in general is very mature and widely used protocol with much less overhead. We plan to stick to plain HTTP for foreseeable future.

Note: HTTP over libp2p != libp2p over HTTP

  • HTTP over libp2p: Likely you mean sending HTTP 1.1 requests over libp2p streams (e.g. TCP+Yamux, QUIC, WebRTC, etc.) https://github.com/libp2p/specs/tree/master/http#using-http-semantics-over-stream-transports

  • libp2p over HTTP or really just HTTP PeerID authentication: https://github.com/libp2p/specs/blob/master/http/peer-id-auth.md, allows the client and/or server to authenticate with each other using peerIDs

    • My recommendation for Curio and Spark was to consider this as a mechanism of binding the IPNI advertising entity to the HTTP provider so that if two SPs Alice and Bob both host the same data Alice can't just advertise in IPNI using Bob's HTTP endpoints and pass spark tests
    • But of course you can use whatever auth you want with Spark, this just might be helpful in letting you reuse some pieces

Thanks for clarifying this.

  1. We generate a new ad for, say, Alice and Bob. Even if they share a piece, we will generate individual ads for both. Both Alice and Bob will sign their respective ads using their individual (not shared) libp2p private keys. The "head" is then signed and announced to IPNI.
  2. When IPNI reaches out to Bob or Alice, it reaches out to the same HTTP server but at a different path, and data will only be sent if the requested data exists for the said provider (i.e. peerID).
  3. My ask is to allow an additional name for a provider, so that the provider with peerID Alice can be called t01000 and Bob can be called t01001. Of course, all this will be signed by the libp2p key as part of the ad.
  4. My understanding of the provider system was that it can advertise for anyone and the actual data provider need not be the index provider. Or has this changed?
  5. As for using libp2p over HTTP, I still don't see an advantage in Curio<>IPNI. This is not really two-way communication, as the indexer decides when to sync from a provider and simply requests specific signed data. So why would I want to go through the pain of initial auth and other steps when I can simply sign and send the data? The signature can easily be verified by the indexer to establish authenticity. This is in conjunction with my point 5. It would be nice if you could point out the exact advantage that makes the overhead worth it.
  6. One obvious issue with what I requested is that Alice and Bob can both claim to be t01000. So how do we establish that identity? The easiest fix I could think of is that we sign the text with the worker wallet address of the miner. This can again be verified easily on the indexer side.

Maybe something like below. It would allow more flexibility around what this extra binding info can be.

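// ExtraMetadataType names the kind of binding carried in ExtraMetadata.
type ExtraMetadataType string

const (
	FilecoinSP ExtraMetadataType = "miner"
	IPFS       ExtraMetadataType = "IPFS"
	// ... extend as required
)

// ExtraMetadata carries extra binding info together with a signature over it.
type ExtraMetadata struct {
	Type ExtraMetadataType
	Data []byte
	Sig  crypto.Signature // the crypto package providing this type is not specified here
}

// GetType returns the kind of binding this metadata describes.
func (e ExtraMetadata) GetType() ExtraMetadataType {
	return e.Type
}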

@aschmahmann

The easiest fix I could think of is that we sign the text with the worker wallet address of the miner. This can again be verified easily on the indexer side.

Not an IPNI maintainer or anything, but IMO requiring IPNI nodes to understand Filecoin seems like a bad idea / mismatch of concerns. IPNI does not care at all about the minerID, Spark cares about the minerID and so it seems like those systems should negotiate the relationship.

This is a way bigger ask than an extra metadata field, it's asking IPNI nodes to understand Filecoin and run their own nodes and/or outsource to some trusted RPC provider.

As for using libp2p over HTTP, I still don't see an advantage in Curio<>IPNI.

My suggestion wasn't to use it for Curio<>IPNI, but to use it between Curio<>Spark because my understanding is that Spark needs some way to know that the HTTP endpoint Alice advertises belongs to her and she's not just pointing you at Bob's endpoint.

There is a flaw here in that I assumed that there was a mapping of minerID -> peerID somewhere that Spark could trust. This seems to indicate that you can either:

  1. Find a way for Spark to discover a verifiable/trusted mapping from the minerID to a peerID/public key, then bind the HTTP address to the peerID (e.g. with libp2p over HTTP)
    • Some examples of how to bind the minerID to a peerID / public key:
      • Have some provable on chain mapping between the two and have Spark check either via a trusted RPC or something more like a light client
      • Have some attestation signed by the keys underlying the minerID attesting to the peerID that is distributed offchain (e.g. if small enough then in the IPNI metadata, or if too large then some HTTP endpoint in Curio), but Spark will likely still need a trusted RPC or light client to validate that it's correct
  2. Bind the HTTP address directly to the minerID (probably more work, but certainly fewer steps)
  3. Not care about the attack vector

@LexLuthr
Author

LexLuthr commented Nov 7, 2024

The easiest fix I could think of is that we sign the text with the worker wallet address of the miner. This can again be verified easily on the indexer side.

Not an IPNI maintainer or anything, but IMO requiring IPNI nodes to understand Filecoin seems like a bad idea / mismatch of concerns. IPNI does not care at all about the minerID, Spark cares about the minerID and so it seems like those systems should negotiate the relationship.

This is a way bigger ask than an extra metadata field, it's asking IPNI nodes to understand Filecoin and run their own nodes and/or outsource to some trusted RPC provider.

I agree that the IPNI protocol should not get Filecoin specific. But it is libp2p specific right now. This is a problem for anyone trying to get away from libp2p. The next iteration of the deal protocol will be pure HTTP. The on-chain libp2p peerID won't matter after that.

As for using libp2p over HTTP, I still don't see an advantage in Curio<>IPNI.

My suggestion wasn't to use it for Curio<>IPNI, but to use it between Curio<>Spark because my understanding is that Spark needs some way to know that the HTTP endpoint Alice advertises belongs to her and she's not just pointing you at Bob's endpoint.

Spark is just another retrieval client for Curio. Curio does not distinguish between who requested what from which minerID. You request some data and if Curio has it then it will respond. Charlie can retrieve deals made with both Alice and Bob using the same endpoint. This is by design. This is HTTP retrieval for a full piece and an IPFS gateway.

There is a flaw here in that I assumed that there was a mapping of minerID -> peerID somewhere that Spark could trust. This seems to indicate that you can either:

  1. Find a way for Spark to discover a verifiable/trusted mapping from the minerID to a peerID/public key, then bind the HTTP address to the peerID (e.g. with libp2p over HTTP)

MinerID to IPNI Provider PeerID mapping is Curio internal at the moment. There is no existing format on chain that we can use to update it.

  • Some examples of how to bind the minerID to a peerID / public key:

    • Have some provable on chain mapping between the two and have Spark check either via a trusted RPC or something more like a light client
    • Have some attestation signed by the keys underlying the minerID attesting to the peerID that is distributed offchain (e.g. if small enough then in the IPNI metadata, or if too large then some HTTP endpoint in Curio), but Spark will likely still need a trusted RPC or light client to validate that it's correct
  2. Bind the HTTP address directly to the minerID (probably more work, but certainly fewer steps)

HTTP addresses are per cluster and not per minerID. So the on-chain address of multiple minerIDs can be the same.

  3. Not care about the attack vector

I am not sure why this particular attack vector is important, or maybe I am misunderstanding it. All Spark does is verify retrievability, unless I am wrong here. It should not matter how the backend serves the data or from which source. The only thing we should care about is that I looked up on IPNI a piece which was sealed with Alice, I got the address to retrieve the said piece (or part of a piece), and I was able to retrieve it. It doesn't matter if it was served from some sector Bob might be holding.

@aschmahmann

I am not sure why this particular attack vector is important or maybe I am misunderstanding it....
It doesn't matter if it was served from some sector Bob might be holding.

Maybe I'm the one not understanding Spark's purpose and so someone will correct me, but IIUC it matters to Spark who is serving the data. Aside from one Curio instance being able to back many miners with different IDs, if Spark's goal is to figure out which SPs are serving data well I can do the following:

  1. look at the FIL+ requirements and see that many require storing with multiple SPs
  2. only take on clients with FIL+ looking to store with multiple SPs
  3. if another SP is serving the data then I can cheat and just HTTP redirect to them and/or be a proxy for them and then I don't have to actually do any of the things that I might if I was serving the data (e.g. indexing, storing the unsealed data, etc.) but still look good on the Spark metrics
    • This means that as an SP I can trick clients, FIL+ allocators, etc. who rely on Spark's data into thinking I'm a provider that behaves well at serving data when in reality I don't do that at all

I agree that IPNI protocol should not get Filecoin specific. But it is libp2p specific right now. This is a problem for anyone trying to get away from libp2p. The next iteration of deal protocol will be pure HTTP. The on chain libp2p peerID won't matter after that.

Can you walk me through the libp2p-specific parts? Below I've tried to list every place I can recall libp2p being used within IPNI and almost everything seems optional, and certainly anything at the transport layer looks optional.

  1. The identifier for the chain (i.e. when you publish an advertisement chain who is it signed by that allows it to do an update) -> uses a libp2p peerID.
    • As long as IPNI allows for mutable updates you'll need some sort of verifiable identifier for the advertisement chain. I suppose you could ditch libp2p peerIDs for did:key, some other public key format, or any other identification system decentralized or otherwise. My guess is a public key based format is probably the way to go here because it puts the fewest requirements on IPNI but maybe demand is high enough to consider something else
  2. The provider ID -> Also uses a libp2p peerID .... BUT is also meaningless. People do have libp2p peerIDs they put in the providers field for extended providers and then implicitly throw away at the application layer.
  3. Index providers can/should advertise their HEADs over libp2p gossipsub so it can reach any IPNI nodes
    • Yes, this is part of the IPNI spec that enables multiple providers to exist. It's also optional and you can just choose some IPNI providers to update directly and hope they sync up on the backend
  4. Index providers can serve their advertisements over HTTP or HTTP over libp2p -> yes, it's an option since not everyone has a public IP with a domain name and TLS cert
  5. The data transports you can retrieve with
    • These aren't libp2p specific, you can see that because HTTP is already supported

So what's the libp2p-specific thing you're concerned with? The only thing you're actually stuck with is using the libp2p peerID format for encoding public keys instead of a different public key encoding format or pushing for to be used instead. Is it just the inelegance of the libp2p peerID format?

@LexLuthr
Author

LexLuthr commented Nov 7, 2024

Maybe I'm not the one understanding Spark's purpose and so someone will correct me, but IIUC it matters to Spark who is serving the data. Aside from one Curio instance being able to back many miners with different IDs, if Spark's goal is to figure out which SPs are serving data well I can do the following:

  1. look at the FIL+ requirements and see that many require storing with multiple SPs

  2. only take on clients with FIL+ looking to store with multiple SPs

  3. if another SP is serving the data then I can cheat and just HTTP redirect to them and/or be a proxy for them and then I don't have to actually do any of the things that I might if I was serving the data (e.g. indexing, storing the unsealed data, etc.) but still look good on the Spark metrics

    • This means that as an SP I can trick clients, FIL+ allocators, etc. who rely on Spark's data into thinking I'm a provider that behaves well at serving data when in reality I don't do that at all

Spark doesn't announce which data it will look up and when. So how can an SP that is serving the retrievals be a bad SP? As an SP, I should have every right to save space and b/w as long as I don't compromise on the provided service quality. Another thing: all minerIDs served by Curio have the same peerID for making deals, i.e. the on-chain peerID. So, again, there is no way to know who signed the data. Forcing SPs to have separate keys just to sign retrievals seems too much.
Maybe @bajtos or @willscott can clarify the requirements for Spark.

Can you walk me through the libp2p-specific parts? Below I've tried to list every place I can recall libp2p being used within IPNI and almost everything seems optional, and certainly anything at the transport layer looks optional.

  1. The identifier for the chain (i.e. when you publish an advertisement chain who is it signed by that allows it to do an update) -> uses a libp2p peerID.

    • As long as IPNI allows for mutable updates you'll need some sort of verifiable identifier for the advertisement chain. I suppose you could ditch libp2p peerIDs for did:key, some other public key format, or any other identification system decentralized or otherwise. My guess is a public key based format is probably the way to go here because it puts the fewest requirements on IPNI but maybe demand is high enough to consider something else

I would love it if this identifier could be an arbitrary public key. But if I look at the code, it is not. That is what makes this libp2p specific. If we can make this use bls3 or other public keys along with peerID, then Curio or other providers can simply sign with worker wallets (or other relevant keys). This would make the whole thing cleaner and make it easier to look things up on chain.

  2. The provider ID -> Also uses a libp2p peerID .... BUT is also meaningless. People do have libp2p peerIDs they put in the providers field for extended providers and then implicitly throw away at the application layer.

  3. Index providers can/should advertise their HEADs over libp2p gossipsub so it can reach any IPNI nodes

GossipSub is deprecated AFAIK. New versions all use http-libp2p or HTTP. Curio is an HTTP-only provider.

  • Yes, this is part of the IPNI spec that enables multiple providers to exist. It's also optional and you can just choose some IPNI providers to update directly and hope they sync up on the backend
  4. Index providers can serve their advertisements over HTTP or HTTP over libp2p -> yes, it's an option since not everyone has a public IP with a domain name and TLS cert

This works in Boost right now without any TLS or domain name. All data is still signed by the libp2p key for data auth on the indexer side. In fact, Gossipsub perf was really bad with Boost. All of our users using libp2p-only announcements had sync issues. Some of them were not even found by the indexer.

  5. The data transports you can retrieve with

    • These aren't libp2p specific, you can see that because HTTP is already supported

So what's the libp2p-specific thing you're concerned with? The only thing you're actually stuck with is using the libp2p peerID format for encoding public keys instead of a different public key encoding format or pushing for to be used instead. Is it just the inelegance of the libp2p peerID format?

I am not partial to any format, but the libp2p-key-only approach is the part I want to change.

We mostly agree on how things should work. Maybe just supporting more formats is the solution.

@bajtos

bajtos commented Nov 20, 2024

Hey, great discussion!

Regarding the Spark attack vector, where SPs delegate serving retrievals to other SPs.

It is a valid attack vector, but the impact is very low right now - most SPs don't serve retrievals at all (less than 15% of retrieval checks succeed), and from what we have seen, people operating SPs are not sophisticated enough to deploy such a solution. They are struggling to even properly configure Boost + booster-http + IPNI integration.

From our perspective, we need to have an idea of how to mitigate this attack vector in the future (6+ months), but we don't need the solution to be designed & implemented right now.

Potential options I see:

  • Leverage libp2p over HTTP with PeerID auth, as suggested above. The downside is that this extra request leaks a signal that can be used by SPs to discriminate Spark checkers from other clients.
  • In the longer term, we would like to introduce retrieval attestations (see IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway ipfs/specs#431) to verify that the untrusted checker did make a retrieval request to the SP, as opposed to getting the metadata about the retrieval check result by other means. I think this mechanism can also verify that the retrieval was served by the miner being tested and not by some other SP.

If Spark wants to prove "minerX has advertised CID Y to IPNI and it's downloadable from an endpoint controlled by minerX" then there needs to be some proof binding minerX to the peerID (i.e. not just some text mapping) and some proof binding the HTTP endpoint to either the peerID or the minerID. It sounds like both are missing.

You could relax the condition and say minerX doesn't have to advertise their CIDs as long as somebody out there advertises that minerX has CID Y at an endpoint minerX controls. Doing this would mean an "advisory" mapping of peerID -> minerX in IPNI could be ok, but it comes with the potential of added work / attack surface for Spark since what if someone who isn't minerX also publishes an advisory mapping but to an endpoint that doesn't resolve properly?

I have a slightly different view.

  • "minerX has advertised CID Y to IPNI" - this is a requirement; we are already enforcing this in Spark v1, and want to keep enforcing
  • "and it's downloadable from an endpoint controlled by minerX" - this is nuanced. We want to give SPs flexibility in how they provide retrievals. For example, we consider it a legitimate strategy when multiple SPs create a joint operation where they keep only one hot copy for serving retrieval requests.

Checking whether "minerX has advertised CID Y to IPNI" is good enough for now, as far as we are concerned.

Checking whether "it's downloadable from an endpoint controlled by minerX" is one of many improvements we will eventually need to implement, and we need to prioritise it relative to other improvements needed.

Linking MinerId to IndexProvider PeerID

In Spark, we have only two requirements:

  1. We need a way to map the MinerID (f0abc) to the index ProviderID used in the IPNI advertisements announced by this miner.
  2. The solution must be secure in the sense that only the miner can establish the link between their MinerID and the IPNI index ProviderID.

We are open-minded about which solution to use. Spark can support multiple ways of linking miners to index/retrieval providers, if necessary.

Another thing: all minerIDs served by Curio have the same peerID for making deals, i.e. the on-chain peerID. So, again, there is no way to know who signed the data. Forcing SPs to have separate keys just to sign retrievals seems too much.

I don't see why it seems too much to ask Curio to have a unique IPNI provider ID for each miner it serves. Having said that, I don't have a strong opinion. If we can find a solution that works within your constraints, then we can adopt it.

Cross-posting from a Slack discussion thread in the #ipni channel:

https://filecoinproject.slack.com/archives/C06GD1SS56Y/p1731947311574889?thread_ts=1731531597.840889&cid=C06GD1SS56Y

So in the curio world:

  • Everything runs as a cluster
  • A cluster has a pool of machines
  • A cluster can manage multiple Miner Actors
  • There is just one real libp2p node per cluster, with one PeerID
    • It is still HA: when the libp2p node dies, a new machine is elected to run it. The other machines, which listen on the all-in-one HTTP endpoint, redirect websocket connections to the currently running node
    • All miner actors in the cluster will have this PeerID / Multiaddrs in the actor configuration
  • IPNI in curio uses "virtual" libp2p keys - each miner gets its own, and there is a separate ad chain per miner
    • Those PeerIDs only exist in IPNI, and only to sign the advertisements. There is no on-chain mapping for them currently, though I imagine writing a small Solidity contract to do that wouldn't be hard
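
The cluster layout described above can be modelled roughly as follows. This is only an illustrative sketch: the class, field names and placeholder IDs are hypothetical and are not taken from Curio.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CurioCluster:
    """Illustrative model only; names are hypothetical, not Curio's own types."""

    libp2p_peer_id: str  # the single on-chain PeerID shared by every miner actor
    ipni_peer_ids: Dict[str, str] = field(default_factory=dict)  # minerID -> "virtual" IPNI PeerID

    def on_chain_peer_id(self, miner_id: str) -> str:
        # Every miner actor in the cluster is configured with the same PeerID.
        if miner_id not in self.ipni_peer_ids:
            raise KeyError(miner_id)
        return self.libp2p_peer_id

    def miners_for_on_chain_peer_id(self, peer_id: str) -> List[str]:
        # The on-chain PeerID maps back to all miners, so it cannot single one out.
        return sorted(self.ipni_peer_ids) if peer_id == self.libp2p_peer_id else []


cluster = CurioCluster(
    libp2p_peer_id="12D3KooW-cluster",  # placeholder value
    ipni_peer_ids={
        "f01000": "12D3KooW-ipni-a",  # placeholder values
        "f01001": "12D3KooW-ipni-b",
    },
)
```

In this model the on-chain PeerID resolves to every miner in the cluster, while each IPNI PeerID belongs to exactly one miner but has no on-chain record, which is the gap this issue is about.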

Based on all that has been written so far, I'd like to propose the following solution.

(1)
Use IPNI index provider metadata to communicate a list of miner ids served by the Curio instance (the index provider). The list can contain 1 item if there is 1:1 mapping between miners and index providers. The list will contain N items if a single index provider serves N miners.

(2)
To establish trust, each minerId-providerId pair must be signed by a wallet linked to the miner ID. I am not familiar with the different miner-related wallet types, but I can imagine we can allow any of the owner, worker or control addresses to sign the item.

The signature must be over a data structure that includes (indexProviderId, minerId).
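
To make the idea of a signed data structure more concrete, here is a minimal sketch of one possible byte encoding for the pair. Everything in it is an assumption for illustration: the domain tag, the length-prefixed layout and the choice of BLAKE2b-256 are not specified anywhere in this thread, and the actual signing step is left to the wallet.

```python
import hashlib

DOMAIN_TAG = b"ipni-miner-binding-v1"  # hypothetical domain-separation tag


def _length_prefixed(value: bytes) -> bytes:
    # 4-byte big-endian length, then the bytes, so field boundaries are unambiguous.
    return len(value).to_bytes(4, "big") + value


def binding_payload(index_provider_id: str, miner_id: str) -> bytes:
    """Deterministic bytes covering (indexProviderId, minerId)."""
    parts = (DOMAIN_TAG, index_provider_id.encode("utf-8"), miner_id.encode("utf-8"))
    return b"".join(_length_prefixed(part) for part in parts)


def binding_digest(index_provider_id: str, miner_id: str) -> bytes:
    # BLAKE2b-256 is an assumption here; the thread does not name a hash function.
    payload = binding_payload(index_provider_id, miner_id)
    return hashlib.blake2b(payload, digest_size=32).digest()
```

The length prefixes keep ("ab", "c") and ("a", "bc") from encoding to the same bytes, and the tag keeps a signature over this payload from being reused as a signature over some other message.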

Spark can obtain the owner/worker/control wallet address using the RPC API method Filecoin.StateMinerInfo.
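
A rough sketch of that lookup on the checker side: it builds the JSON-RPC request body and collects the candidate signer addresses from the result. The helper names are hypothetical and the sample values are placeholders. As far as I understand, the addresses in the result are ID addresses, so a verifier would still need to resolve them to key addresses (for example via Filecoin.StateAccountKey) before checking a signature.

```python
from typing import List


def state_miner_info_request(miner_id: str, request_id: int = 1) -> dict:
    # JSON-RPC 2.0 body; the second parameter is the tipset key (None = current head).
    return {
        "jsonrpc": "2.0",
        "method": "Filecoin.StateMinerInfo",
        "params": [miner_id, None],
        "id": request_id,
    }


def candidate_signers(miner_info: dict) -> List[str]:
    """Collect owner, worker and control addresses from a StateMinerInfo result."""
    addresses = [miner_info.get("Owner"), miner_info.get("Worker")]
    addresses.extend(miner_info.get("ControlAddresses") or [])
    # Drop missing entries and duplicates while keeping the original order.
    unique: List[str] = []
    for address in addresses:
        if address and address not in unique:
            unique.append(address)
    return unique


# Placeholder result, shaped like the fields this sketch reads.
sample_info = {"Owner": "f0100", "Worker": "f0101", "ControlAddresses": ["f0102", "f0101"]}
signers = candidate_signers(sample_info)
```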

Curio can produce the signature using the existing infrastructure for signing messages with the owner/worker/control wallet. (Is this feasible & reasonably easy to implement?)

The obvious downside is metadata size - for each miner listed in the index provider metadata, we need to include the signature (64 bytes when using ECDSA + secp256k1).
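
For a back-of-the-envelope feel for that overhead, here is a small estimate. The 64-byte signature size comes from the paragraph above; the 10-byte allowance for the encoded miner ID and the function name are my own assumptions, and any framing added by the metadata encoding is ignored. If the serialized signature carries an extra recovery byte, pass 65 instead.

```python
def metadata_overhead_bytes(
    miner_count: int,
    signature_bytes: int = 64,  # figure quoted in the comment above
    miner_id_bytes: int = 10,   # assumed allowance for one encoded miner ID
) -> int:
    """Approximate extra metadata bytes for listing `miner_count` signed miner IDs."""
    if miner_count < 0:
        raise ValueError("miner_count must not be negative")
    return miner_count * (signature_bytes + miner_id_bytes)


one_miner = metadata_overhead_bytes(1)
ten_miners = metadata_overhead_bytes(10)
```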

@masih what is the (practical) limit on how many bytes index providers can put into boost extended providers or metadata at root ad?

@bajtos

bajtos commented Dec 4, 2024

Depending on the timing and the outcome of this discussion, we may want to include the results in my FRC documenting Retrieval Checking Requirements - see filecoin-project/FIPs#1089

@masih
Member

masih commented Dec 17, 2024

Taking a step back: the simple requirement here is to map Peer ID to miner ID.

The simple solution to this is to publish the miner ID as a metadata value of the top level ad.
To look it up then one can scan the /providers endpoint or IPNI can provide that lookup API as we are talking about < 500 entries here.
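
A client-side scan along those lines might look like the sketch below. The record shape (`peer_id` and `miner_id` keys) is hypothetical: it stands in for whatever the /providers response would carry once the miner ID is published as metadata, and the sample values are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_by_miner(provider_records: Iterable[dict]) -> Dict[str, List[str]]:
    """Group provider peer IDs by the miner ID each record claims."""
    claims: Dict[str, List[str]] = defaultdict(list)
    for record in provider_records:
        miner_id = record.get("miner_id")
        if miner_id:  # providers without a published miner ID are skipped
            claims[miner_id].append(record["peer_id"])
    return dict(claims)


records = [
    {"peer_id": "12D3KooW-ipni-a", "miner_id": "f01000"},  # placeholder values
    {"peer_id": "12D3KooW-ipni-b", "miner_id": "f01001"},
    {"peer_id": "12D3KooW-other"},
]
lookup = index_by_miner(records)
```

Keeping a list per miner ID means that two providers claiming the same miner show up as two entries rather than one silently overwriting the other.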

What have I missed?

@bajtos

bajtos commented Dec 17, 2024

Taking a step back: the simple requirement here is to map Peer ID to miner ID.

The simple solution to this is to publish the miner ID as a metadata value of the top level ad. To look it up then one can scan the /providers endpoint or IPNI can provide that lookup API as we are talking about < 500 entries here.

That works for us (retrieval checking & Spark).

However, we need measures to prevent an adversary index provider from claiming a miner ID they don't control.

See the second half of my comment #33 (comment) for more details and a possible solution.

I guess we need to hear from @LexLuthr @steven004 whether such a solution is feasible for Curio and Venus Droplet.
