Extract PieceCID from ContextID #118

Closed
bajtos opened this issue Feb 13, 2025 · 7 comments · Fixed by #144
@bajtos
Member

bajtos commented Feb 13, 2025

To allow Spark to link deals to content advertised to IPNI, SP software like Curio creates the ContextID from Piece info. This provides an alternate way of extracting PieceCID from IPNI advertisements.

We need to enhance piece-indexer to support both options: PieceCID extracted from ContextID and PieceCID extracted from Graphsync metadata.

Spec:
https://github.com/CheckerNetwork/FIPs/blob/frc-retrieval-checking-requirements/FRCs/frc-retrieval-checking-requirements.md#construct-ipni-contextid-from-piececid-piecesize

Optimisation:
ContextID values following the spec above start with the prefix ghsA. If a ContextID value does not start with this prefix, we can skip it (there is no need to attempt base64 and CBOR decoding).
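As a quick sanity check (a sketch using only Node.js built-ins), the prefix can be derived from the CBOR framing itself: the spec'd payload starts with 0x82 (array of two items) followed by 0x1b (a uint64 follows), and for PieceSize values below 2^56 the first uint64 byte is 0x00. The base64 encoding of these three bytes is exactly ghsA:

```javascript
// The spec'd ContextID payload begins with:
//   0x82 - CBOR array of two items
//   0x1b - a uint64 follows in the next 8 bytes
//   0x00 - most-significant PieceSize byte, zero for pieces < 2^56 bytes
const prefix = Buffer.from([0x82, 0x1b, 0x00]).toString('base64')
console.log(prefix) // → 'ghsA'
```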

See #117 for an example index provider that uses this new format.

Here is the place where we currently extract PieceCID from Graphsync metadata; the new ContextID-based PieceCID extraction can be added there:

const meta = parseMetadata(advertisement.Metadata['/'].bytes)
const pieceCid = meta.deal?.PieceCID.toString()
if (!pieceCid) {
  debug('advertisement %s has no PieceCID in metadata: %j', advertisementCid, meta.deal)
  return {
    error: /** @type {const} */('MISSING_PIECE_CID'),
    previousAdvertisementCid
  }
}

@bajtos
Member Author

bajtos commented Feb 21, 2025

Example ContextID value in the new format:

ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA+itvJhrDRju06ThagW6ULKw2

Example Node.js code showing how to parse it:

import { decode as decodeDagCbor } from '@ipld/dag-cbor'

const ContextID = 'ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA+itvJhrDRju06ThagW6ULKw2'

const bytes = Buffer.from(ContextID, 'base64')
const [pieceSize, pieceCID] = decodeDagCbor(bytes)
console.log('PieceCID:', pieceCID.toString())
// CID(baga6ea4seaqfpjfs473fni3pv22fjrv3yd5cw3zgdlbumo5u5e4fvalosqwkynq)
console.log('PieceSize:', pieceSize)
// 34359738368

Note: we cannot assume that the decoded value will always be a pair of [pieceSize, pieceCID]. We should check that:

  • decodeDagCbor(bytes) returned an array (not an object)
  • the array has exactly two items
  • typeof pieceSize === 'number'
  • the pieceCID item is a CID link (encoded using custom CBOR tag 42) - this can be validated on the JavaScript side as follows:
    typeof pieceCID === 'object' && pieceCID?.constructor?.name === 'CID'
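For illustration only, the layout above can be cross-checked without @ipld/dag-cbor, using Node.js built-ins. This is a sketch that assumes the exact spec'd shape (array of two items, uint64 PieceSize, tag-42 CID link with a one-byte length header and a leading 0x00 identity prefix); the real implementation should use decodeDagCbor plus the validation checks listed above:

```javascript
// Sketch: parse the spec'd ContextID layout byte-by-byte (no CBOR library).
// parseSparkContextId is a hypothetical helper name used here for illustration.
function parseSparkContextId (contextIdBase64) {
  const buf = Buffer.from(contextIdBase64, 'base64')
  // 0x82 = CBOR array of two items, 0x1b = uint64 in the next 8 bytes
  if (buf[0] !== 0x82 || buf[1] !== 0x1b) return null
  const pieceSize = Number(buf.readBigUInt64BE(2))
  // 0xd8 0x2a = CBOR tag 42 (CID link), 0x58 = byte string with 1-byte length
  if (buf[10] !== 0xd8 || buf[11] !== 0x2a || buf[12] !== 0x58) return null
  const length = buf[13]
  // CBOR tag-42 byte strings carry a leading 0x00 multibase identity prefix
  if (buf[14] !== 0x00) return null
  const cidBytes = buf.subarray(15, 15 + length - 1)
  if (cidBytes.length !== length - 1) return null
  return { pieceSize, cidBytes }
}

const parsed = parseSparkContextId(
  'ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA+itvJhrDRju06ThagW6ULKw2'
)
console.log(parsed.pieceSize) // 34359738368
```

Running it against the example ContextID above yields PieceSize 34359738368 and a 39-byte CID (version 1, as indicated by the leading 0x01 byte).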

@NikolasHaimerl
Contributor

NikolasHaimerl commented Feb 27, 2025

Implementation Plan for ContextID-based PieceCID Extraction

Overview

This implementation plan covers the enhancement of piece-indexer to support extracting PieceCID from both Graphsync metadata and IPNI ContextID, following the FRC retrieval checking requirements specification.

Background

  • Spark needs to link deals to content advertised to IPNI
  • SP software like Curio creates ContextID from Piece info
  • ContextID provides an alternate method for extracting PieceCID from IPNI advertisements
  • We need to support both PieceCID extraction methods:
    1. From Graphsync metadata (existing)
    2. From ContextID (new)

Implementation Steps

1. Create Utility Function for ContextID Parsing

Create the extractPieceCidFromContextID function with the following key elements:

export function extractPieceCidFromContextID (contextID, logDebugMessage = debug) {
  // Check that ContextID exists with the expected structure
  if (!contextID || !contextID['/'] || !contextID['/'].bytes) {
    return null
  }

  // The `bytes` field holds the base64-encoded payload. ContextID values
  // following the spec start with the prefix "ghsA" in their base64 form,
  // so we can check the prefix before attempting to decode (optimization).
  const contextIDBase64 = contextID['/'].bytes
  if (!contextIDBase64.startsWith('ghsA')) {
    return null
  }

  try {
    // Decode using DAG-CBOR and validate the structure
    const bytes = Buffer.from(contextIDBase64, 'base64')
    const decoded = decodeDagCbor(bytes)

    if (!Array.isArray(decoded) || decoded.length !== 2) {
      logDebugMessage('ContextID is not a two-item array: %j', decoded)
      return null
    }
    const [pieceSize, pieceCid] = decoded
    if (typeof pieceSize !== 'number') {
      logDebugMessage('ContextID PieceSize is not a number: %j', pieceSize)
      return null
    }
    // CID links (CBOR tag 42) are decoded by @ipld/dag-cbor into CID instances
    if (typeof pieceCid !== 'object' || pieceCid?.constructor?.name !== 'CID') {
      logDebugMessage('ContextID PieceCID is not a CID link: %j', pieceCid)
      return null
    }

    return { pieceCid, pieceSize }
  } catch (err) {
    logDebugMessage('Cannot decode ContextID: %s', err)
    return null
  }
}

2. Modify PieceCID Extraction in advertisement-walker.js

Update the existing code to use the new function:

// First, try to get PieceCID from Graphsync metadata (existing approach)
const meta = parseMetadata(advertisement.Metadata['/'].bytes)
let pieceCid = meta.deal?.PieceCID.toString()

// If not found in metadata, try to extract from ContextID
if (!pieceCid) {
  const extractedData = extractPieceCidFromContextID(advertisement.ContextID, debug)
  pieceCid = extractedData?.pieceCid?.toString()
  
  // If still not found, return error
  if (!pieceCid) {
    debug('advertisement %s has no PieceCID in metadata or ContextID', advertisementCid)
    return {
      error: /** @type {const} */('MISSING_PIECE_CID'),
      previousAdvertisementCid
    }
  }
}

3. Testing

ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA

4. Performance Considerations

  • The prefix check (startsWith('ghsA')) provides quick filtering of irrelevant ContextIDs. The downside of relying on the prefix alone is that we cannot differentiate between the types of errors (not an array, not exactly two entries in the array, ...).

@juliangruber
Member

Great plan, @NikolasHaimerl 👏

Do you think it's useful to get visibility into Context IDs that we can't parse? Or rather: Is it expected that there are Context IDs that we can't parse, or should we be able to parse all? If it's the former, no need to do anything, if it's the latter, let's add logging/telemetry.

Regarding the performance optimization, it's a smart and simple idea 👍 But, as always: Have you measured what time we add if we don't have the optimization? And are we parsing Context IDs a lot, therefore the cycle saving is relevant?

For now, the implementation suggests to try looking at metadata, then at the Context ID. Can it happen that both metadata and Context ID have Piece CID, but they are different?

@NikolasHaimerl
Contributor

Great plan, @NikolasHaimerl 👏

Do you think it's useful to get visibility into Context IDs that we can't parse? Or rather: Is it expected that there are Context IDs that we can't parse, or should we be able to parse all? If it's the former, no need to do anything, if it's the latter, let's add logging/telemetry.

Regarding the performance optimization, it's a smart and simple idea 👍 But, as always: Have you measured what time we add if we don't have the optimization? And are we parsing Context IDs a lot, therefore the cycle saving is relevant?

For now, the implementation suggests to try looking at metadata, then at the Context ID. Can it happen that both metadata and Context ID have Piece CID, but they are different?

  1. I believe it makes sense to gain insight into whether ContextIDs can be parsed or not. If a ContextID is not parseable, the reason should also concern us; it will tell us whether our assumptions were wrong. I will be sure to add logging and telemetry for it.
  2. The performance optimization comes from not having to decode the ContextID using CBOR. CBOR is supposed to be more efficient and faster than JSON encoding. I am not sure whether the performance increase justifies the loss of insight into why a ContextID could not be parsed as expected. From what I understood, the prefix ghsA only occurs if the PieceSize is 32GB, the value is an array, and it has exactly two entries, so checking for the prefix would eliminate the checks for any of the other possible reasons why the ContextID is not what we expected. @bajtos, please correct me if I am wrong here. I have not measured the performance gain yet.
  3. I am not sure whether it can happen that metadata and ContextID have different piece CIDs. Maybe @bajtos knows more.

@juliangruber
Member

The performance optimization comes from not having to decode the context ID using cbor

I understand that part, but what is the impact of the optimization? I.e., how many ms / cycles are we saving? And how often does this saving occur?

@bajtos bajtos moved this from 📥 next to 📋 planned in Space Meridian Mar 3, 2025
@bajtos
Member Author

bajtos commented Mar 3, 2025

Great description of the plan, @NikolasHaimerl! 👏🏻

Do you think it's useful to get visibility into Context IDs that we can't parse? Or rather: Is it expected that there are Context IDs that we can't parse, or should we be able to parse all? If it's the former, no need to do anything, if it's the latter, let's add logging/telemetry.

  1. I believe it makes sense to gain insight into whether ContextIDs can be parsed or not. If a ContextID is not parseable, the reason should also concern us; it will tell us whether our assumptions were wrong. I will be sure to add logging and telemetry for it.

Yes, it is expected that there will be many Context IDs that we can't parse.

  • All advertisements from non-Filecoin providers, e.g. IPFS nodes.
  • Advertisements from Filecoin SW not implementing Spark's ContextID format, e.g. all miners running the current Boost version.

Regarding the performance optimization, it's a smart and simple idea 👍 But, as always: Have you measured what time we add if we don't have the optimization? And are we parsing Context IDs a lot, therefore the cycle saving is relevant?

  1. The performance optimization comes from not having to decode the ContextID using CBOR. CBOR is supposed to be more efficient and faster than JSON encoding. I am not sure whether the performance increase justifies the loss of insight into why a ContextID could not be parsed as expected. From what I understood, the prefix ghsA only occurs if the PieceSize is 32GB, the value is an array, and it has exactly two entries, so checking for the prefix would eliminate the checks for any of the other possible reasons why the ContextID is not what we expected. @bajtos, please correct me if I am wrong here. I have not measured the performance gain yet.
  • A small correction: the prefix ghsA only occurs when ContextID is a string containing a base64-encoded binary payload representing a CBOR-encoded array with exactly two items, where the first item has type uint64 and the value of the first item has 0s in the most-significant 8 bits. (PieceSize must be less than 2^(64-8), which covers both 32GiB and 64GiB pieces that seem to be the norm these days).

  • We can measure the fraction of index providers using this new format by iterating over all providers returned by https://cid.contact/providers. For each provider, download the latest advertisement and extract the ContextID version value.

  • However, this is such a trivial optimisation that I feel this discussion has already cost us more time than it would take to implement & review it.

Let's make a decision and move on. I am fine either way, with a slight preference for implementing this optimisation.
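If we do run that survey, the classification step could be sketched as follows (iterating over https://cid.contact/providers and fetching each provider's latest advertisement is elided; the label spark-piece-info is a hypothetical name, and classification needs only the prefix check, no CBOR decoding):

```javascript
// Sketch: classify a ContextID as it appears in advertisement JSON, i.e. a
// base64 string, without decoding it. Only the prefix check is needed here.
const SPARK_CONTEXT_ID_PREFIX = 'ghsA'

function classifyContextId (contextIdBase64) {
  return typeof contextIdBase64 === 'string' &&
    contextIdBase64.startsWith(SPARK_CONTEXT_ID_PREFIX)
    ? 'spark-piece-info'
    : 'other'
}
```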

For now, the implementation suggests to try looking at metadata, then at the Context ID. Can it happen that both metadata and Context ID have Piece CID, but they are different?

  1. I am not sure whether it can happen that metadata and ContextID have different piece CIDs. Maybe @bajtos knows more.

Yes, a miner using the new ContextID format and advertising Graphsync retrievals will set PieceCID in both places (ContextID, Graphsync metadata).

I would expect these values to be always the same, but that depends on the implementation in Miner SW.

Let's search for PieceCID in ContextID first and treat Graphsync metadata as a fall-back option in case we cannot find PieceCID in ContextID.

  • PieceCID in ContextID is the new convention that we expect the ecosystem to eventually adopt.
  • Graphsync metadata is a short-term workaround to support existing miner SW. Graphsync is deprecated; most people want to move away from it.
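A minimal sketch of that lookup order (hypothetical names and input shapes, not the actual advertisement-walker.js code):

```javascript
// Sketch: prefer the PieceCID extracted from ContextID; fall back to the
// PieceCID from Graphsync metadata. Inputs are hypothetical extraction
// results of shape { pieceCid } or null/undefined.
function pickPieceCid (fromContextId, fromGraphsyncMetadata) {
  return fromContextId?.pieceCid ?? fromGraphsyncMetadata?.pieceCid ?? null
}
```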

@NikolasHaimerl
Contributor

Yes, it is expected that there will be many Context IDs that we can't parse.

I would go with the more detailed error logging rather than the optimization. Should the performance of the piece-indexer become a concern, we can always add the optimization later. Since the Curio implementation is new to the entire stack, there is a chance our assumptions about how Curio works/interacts with Spark are incorrect. Detailed error logging will be quite important here IMHO.
