Extract PieceCID from ContextID #118

Closed
bajtos opened this issue Feb 13, 2025 · 7 comments · Fixed by #144
@bajtos
Member

bajtos commented Feb 13, 2025

To allow Spark to link deals to content advertised to IPNI, SP software like Curio creates the ContextID from Piece info. This provides an alternate way of extracting PieceCID from IPNI advertisements.

We need to enhance piece-indexer to support both options: PieceCID extracted from ContextID and PieceCID extracted from Graphsync metadata.

Spec:
https://github.com/CheckerNetwork/FIPs/blob/frc-retrieval-checking-requirements/FRCs/frc-retrieval-checking-requirements.md#construct-ipni-contextid-from-piececid-piecesize

Optimisation:
ContextID values following the spec above start with the prefix ghsA. If a ContextID value does not start with this prefix, we can skip it (there is no need to attempt base64 and CBOR decoding).
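As a quick sanity check (a sketch using only Node.js built-ins), the prefix can be derived from the CBOR framing itself: the spec'd payload starts with 0x82 (array of two items) followed by 0x1b (a uint64 follows), and for PieceSize values below 2^56 the first uint64 byte is 0x00. The base64 encoding of these three bytes is exactly ghsA:

```javascript
// The spec'd ContextID payload begins with:
//   0x82 - CBOR array of two items
//   0x1b - a uint64 follows in the next 8 bytes
//   0x00 - most-significant PieceSize byte, zero for pieces < 2^56 bytes
const prefix = Buffer.from([0x82, 0x1b, 0x00]).toString('base64')
console.log(prefix) // → 'ghsA'
```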

See #117 for an example index provider that uses this new format.

Here is the place where we currently extract PieceCID from Graphsync metadata; the new ContextID-based PieceCID extraction can be added there:

const meta = parseMetadata(advertisement.Metadata['/'].bytes)
const pieceCid = meta.deal?.PieceCID.toString()
if (!pieceCid) {
  debug('advertisement %s has no PieceCID in metadata: %j', advertisementCid, meta.deal)
  return {
    error: /** @type {const} */('MISSING_PIECE_CID'),
    previousAdvertisementCid
  }
}

@bajtos
Member Author

bajtos commented Feb 21, 2025

Example ContextID value in the new format:

ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA+itvJhrDRju06ThagW6ULKw2

Example Node.js code showing how to parse it:

import { decode as decodeDagCbor } from '@ipld/dag-cbor'

const ContextID = 'ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA+itvJhrDRju06ThagW6ULKw2'

const bytes = Buffer.from(ContextID, 'base64')
const [pieceSize, pieceCID] = decodeDagCbor(bytes)
console.log('PieceCID:', pieceCID.toString())
// CID(baga6ea4seaqfpjfs473fni3pv22fjrv3yd5cw3zgdlbumo5u5e4fvalosqwkynq)
console.log('PieceSize:', pieceSize)
// 34359738368

Note: we cannot assume that the decoded value will always be a pair of [pieceSize, pieceCID]. We should check that:

  • decodeDagCbor(bytes) returned an array (not an object)
  • the array has exactly two items
  • typeof pieceSize === 'number'
  • the pieceCID item is a CID link (encoded using custom CBOR tag 42) - this can be validated on the JavaScript side as follows:
    typeof pieceCID === 'object' && pieceCID?.constructor?.name === 'CID'
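For illustration only, the layout above can be cross-checked without @ipld/dag-cbor, using Node.js built-ins. This is a sketch that assumes the exact spec'd shape (array of two items, uint64 PieceSize, tag-42 CID link with a one-byte length header and a leading 0x00 identity prefix); the real implementation should use decodeDagCbor plus the validation checks listed above:

```javascript
// Sketch: parse the spec'd ContextID layout byte-by-byte (no CBOR library).
// parseSparkContextId is a hypothetical helper name used here for illustration.
function parseSparkContextId (contextIdBase64) {
  const buf = Buffer.from(contextIdBase64, 'base64')
  // 0x82 = CBOR array of two items, 0x1b = uint64 in the next 8 bytes
  if (buf[0] !== 0x82 || buf[1] !== 0x1b) return null
  const pieceSize = Number(buf.readBigUInt64BE(2))
  // 0xd8 0x2a = CBOR tag 42 (CID link), 0x58 = byte string with 1-byte length
  if (buf[10] !== 0xd8 || buf[11] !== 0x2a || buf[12] !== 0x58) return null
  const length = buf[13]
  // CBOR tag-42 byte strings carry a leading 0x00 multibase identity prefix
  if (buf[14] !== 0x00) return null
  const cidBytes = buf.subarray(15, 15 + length - 1)
  if (cidBytes.length !== length - 1) return null
  return { pieceSize, cidBytes }
}

const parsed = parseSparkContextId(
  'ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA+itvJhrDRju06ThagW6ULKw2'
)
console.log(parsed.pieceSize) // 34359738368
```

Running it against the example ContextID above yields PieceSize 34359738368 and a 39-byte CID (version 1, as indicated by the leading 0x01 byte).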

@NikolasHaimerl
Contributor

NikolasHaimerl commented Feb 27, 2025

Implementation Plan for ContextID-based PieceCID Extraction

Overview

This implementation plan covers the enhancement of piece-indexer to support extracting PieceCID from both Graphsync metadata and IPNI ContextID, following the FRC retrieval checking requirements specification.

Background

  • Spark needs to link deals to content advertised to IPNI
  • SP software like Curio creates ContextID from Piece info
  • ContextID provides an alternate method for extracting PieceCID from IPNI advertisements
  • We need to support both PieceCID extraction methods:
    1. From Graphsync metadata (existing)
    2. From ContextID (new)

Implementation Steps

1. Create Utility Function for ContextID Parsing

Create the extractPieceCidFromContextID function with the following key elements:

export function extractPieceCidFromContextID (contextID, logDebugMessage = debug) {
  // Check that ContextID exists with the expected structure
  if (!contextID || !contextID['/'] || !contextID['/'].bytes) {
    return null
  }

  // The `bytes` field holds the base64-encoded payload. ContextID values
  // following the spec start with the prefix "ghsA" in their base64 form,
  // so we can check the prefix before attempting to decode (optimization).
  const contextIDBase64 = contextID['/'].bytes
  if (!contextIDBase64.startsWith('ghsA')) {
    return null
  }

  try {
    // Decode using DAG-CBOR and validate the structure
    const bytes = Buffer.from(contextIDBase64, 'base64')
    const decoded = decodeDagCbor(bytes)

    if (!Array.isArray(decoded) || decoded.length !== 2) {
      logDebugMessage('ContextID is not a two-item array: %j', decoded)
      return null
    }
    const [pieceSize, pieceCid] = decoded
    if (typeof pieceSize !== 'number') {
      logDebugMessage('ContextID PieceSize is not a number: %j', pieceSize)
      return null
    }
    // CID links (CBOR tag 42) are decoded by @ipld/dag-cbor into CID instances
    if (typeof pieceCid !== 'object' || pieceCid?.constructor?.name !== 'CID') {
      logDebugMessage('ContextID PieceCID is not a CID link: %j', pieceCid)
      return null
    }

    return { pieceCid, pieceSize }
  } catch (err) {
    logDebugMessage('Cannot decode ContextID: %s', err)
    return null
  }
}

2. Modify PieceCID Extraction in advertisement-walker.js

Update the existing code to use the new function:

// First, try to get PieceCID from Graphsync metadata (existing approach)
const meta = parseMetadata(advertisement.Metadata['/'].bytes)
let pieceCid = meta.deal?.PieceCID.toString()

// If not found in metadata, try to extract from ContextID
if (!pieceCid) {
  const extractedData = extractPieceCidFromContextID(advertisement.ContextID, debug)
  pieceCid = extractedData?.pieceCid?.toString()
  
  // If still not found, return error
  if (!pieceCid) {
    debug('advertisement %s has no PieceCID in metadata or ContextID', advertisementCid)
    return {
      error: /** @type {const} */('MISSING_PIECE_CID'),
      previousAdvertisementCid
    }
  }
}

3. Testing

ghsAAAAIAAAAANgqWCgAAYHiA5IgIFeksuf2VqNvrrRUxrvA

4. Performance Considerations

  • The prefix check (startsWith('ghsA')) provides quick filtering of irrelevant ContextIDs. The downside of relying on the prefix alone is that we cannot differentiate between the types of errors (not an array, not exactly two entries in the array, ...).

@juliangruber
Member

Great plan, @NikolasHaimerl 👏

Do you think it's useful to get visibility into Context IDs that we can't parse? Or rather: Is it expected that there are Context IDs that we can't parse, or should we be able to parse all? If it's the former, no need to do anything, if it's the latter, let's add logging/telemetry.

Regarding the performance optimization, it's a smart and simple idea 👍 But, as always: Have you measured what time we add if we don't have the optimization? And are we parsing Context IDs a lot, therefore the cycle saving is relevant?

For now, the implementation suggests to try looking at metadata, then at the Context ID. Can it happen that both metadata and Context ID have Piece CID, but they are different?

@NikolasHaimerl
Contributor

Great plan, @NikolasHaimerl 👏

Do you think it's useful to get visibility into Context IDs that we can't parse? Or rather: Is it expected that there are Context IDs that we can't parse, or should we be able to parse all? If it's the former, no need to do anything, if it's the latter, let's add logging/telemetry.

Regarding the performance optimization, it's a smart and simple idea 👍 But, as always: Have you measured what time we add if we don't have the optimization? And are we parsing Context IDs a lot, therefore the cycle saving is relevant?

For now, the implementation suggests to try looking at metadata, then at the Context ID. Can it happen that both metadata and Context ID have Piece CID, but they are different?

  1. I believe it makes sense to gain insight into whether ContextIDs can be parsed or not. If a ContextID is not parseable, the reason should also concern us; it will tell us whether our assumptions were wrong. I will be sure to add logging and telemetry for it.
  2. The performance optimization comes from not having to decode the ContextID using CBOR. CBOR is supposed to be more efficient and faster than JSON encoding. I am not sure whether the performance increase justifies the loss of insight into why a ContextID could not be parsed as expected. From what I understood, the prefix ghsA only occurs if the PieceSize is 32GB, the value is an array, and it has exactly two entries, so checking for the prefix would eliminate the checks for any of the other possible reasons why the ContextID is not what we expected. @bajtos, please correct me if I am wrong here. I have not measured the performance gain yet.
  3. I am not sure whether it can happen that metadata and ContextID have different piece CIDs. Maybe @bajtos knows more.

@juliangruber
Member

The performance optimization comes from not having to decode the context ID using cbor

I understand that part, but what is the impact of the optimization? I.e., how many ms / cycles are we saving? And how often does this saving occur?

@bajtos bajtos moved this from 📥 next to 📋 planned in Space Meridian Mar 3, 2025
@bajtos
Member Author

bajtos commented Mar 3, 2025

Great description of the plan, @NikolasHaimerl! 👏🏻

Do you think it's useful to get visibility into Context IDs that we can't parse? Or rather: Is it expected that there are Context IDs that we can't parse, or should we be able to parse all? If it's the former, no need to do anything, if it's the latter, let's add logging/telemetry.

  1. I believe it makes sense to gain insight into whether ContextIDs can be parsed or not. If a ContextID is not parseable, the reason should also concern us; it will tell us whether our assumptions were wrong. I will be sure to add logging and telemetry for it.

Yes, it is expected that there will be many Context IDs that we can't parse.

  • All advertisements from non-Filecoin providers, e.g. IPFS nodes.
  • Advertisements from Filecoin SW not implementing Spark's ContextID format, e.g. all miners running the current Boost version.

Regarding the performance optimization, it's a smart and simple idea 👍 But, as always: Have you measured what time we add if we don't have the optimization? And are we parsing Context IDs a lot, therefore the cycle saving is relevant?

  1. The performance optimization comes from not having to decode the ContextID using CBOR. CBOR is supposed to be more efficient and faster than JSON encoding. I am not sure whether the performance increase justifies the loss of insight into why a ContextID could not be parsed as expected. From what I understood, the prefix ghsA only occurs if the PieceSize is 32GB, the value is an array, and it has exactly two entries, so checking for the prefix would eliminate the checks for any of the other possible reasons why the ContextID is not what we expected. @bajtos, please correct me if I am wrong here. I have not measured the performance gain yet.
  • A small correction: the prefix ghsA only occurs when ContextID is a string containing a base64-encoded binary payload representing a CBOR-encoded array with exactly two items, where the first item has type uint64 and the value of the first item has 0s in the most-significant 8 bits. (PieceSize must be less than 2^(64-8), which covers both 32GiB and 64GiB pieces that seem to be the norm these days).

  • We can measure the fraction of index providers using this new format by iterating over all providers returned by https://cid.contact/providers. For each provider, download the latest advertisement and extract the ContextID version value.

  • However, this is such a trivial optimisation that I feel this discussion has already cost us more time than it would take to implement & review it.

Let's make a decision and move on. I am fine either way, with a slight preference for implementing this optimisation.
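If we do run that survey, the classification step could be sketched as follows (iterating over https://cid.contact/providers and fetching each provider's latest advertisement is elided; the label spark-piece-info is a hypothetical name, and classification needs only the prefix check, no CBOR decoding):

```javascript
// Sketch: classify a ContextID as it appears in advertisement JSON, i.e. a
// base64 string, without decoding it. Only the prefix check is needed here.
const SPARK_CONTEXT_ID_PREFIX = 'ghsA'

function classifyContextId (contextIdBase64) {
  return typeof contextIdBase64 === 'string' &&
    contextIdBase64.startsWith(SPARK_CONTEXT_ID_PREFIX)
    ? 'spark-piece-info'
    : 'other'
}
```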

For now, the implementation suggests to try looking at metadata, then at the Context ID. Can it happen that both metadata and Context ID have Piece CID, but they are different?

  1. I am not sure whether it can happen that metadata and ContextID have different piece CIDs. Maybe @bajtos knows more.

Yes, a miner using the new ContextID format and advertising Graphsync retrievals will set PieceCID in both places (ContextID, Graphsync metadata).

I would expect these values to be always the same, but that depends on the implementation in Miner SW.

Let's search for PieceCID in ContextID first and treat Graphsync metadata as a fall-back option in case we cannot find PieceCID in ContextID.

  • PieceCID in ContextID is the new convention that we expect the ecosystem to eventually adopt.
  • Graphsync metadata is a short-term workaround to support existing miner SW. Graphsync is deprecated; most people want to move away from it.
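A minimal sketch of that lookup order (hypothetical names and input shapes, not the actual advertisement-walker.js code):

```javascript
// Sketch: prefer the PieceCID extracted from ContextID; fall back to the
// PieceCID from Graphsync metadata. Inputs are hypothetical extraction
// results of shape { pieceCid } or null/undefined.
function pickPieceCid (fromContextId, fromGraphsyncMetadata) {
  return fromContextId?.pieceCid ?? fromGraphsyncMetadata?.pieceCid ?? null
}
```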

@NikolasHaimerl
Contributor

Yes, it is expected that there will be many Context IDs that we can't parse.

I would go with the more detailed error logging rather than the optimization. Should the performance of the piece-indexer become a concern, we can always add the optimization later. Since the Curio implementation is new to the entire stack, there is a chance our assumptions about how Curio works/interacts with Spark are incorrect. Detailed error logging will be quite important here IMHO.
