Add the responsible program's account index and inner instruction index to each InstructionError #74

Open
wants to merge 1 commit into base: master

Contributor

@steveluscher steveluscher commented Mar 6, 2025

Problem

Consider a transaction error that originates from a cross-program invocation (i.e. an inner instruction). Currently, TransactionError gives you only the index of the outer instruction, which does nothing to help you correlate the content of the error message with the program that actually raised it.

Summary of changes

Added the transaction-level account index of the erroring program to the TransactionError::InstructionError variant, which will help consumers correlate instruction errors with their actual source.

The account index of the program is designed to be from the perspective of the transaction and not the perspective of the instruction to make this change safe from SIMD-163.

Addresses anza-xyz/agave#5152 and anza-xyz/kit#149.
Blocks anza-xyz/agave#6083.
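
To make the intent concrete, here is a minimal sketch of how a consumer might turn a transaction-level account index back into a program address. Everything named here is an assumption for illustration: `Pubkey` is a bare 32-byte array, `SketchTransactionError` is a simplified stand-in with named fields, and `responsible_program` is a hypothetical helper. None of these are the types in this PR.

```rust
/// Stand-in for a real public key type; an assumption for this sketch.
type Pubkey = [u8; 32];

/// Simplified stand-in for `InstructionError`.
#[derive(Debug, Clone, PartialEq)]
enum SketchInstructionError {
    Custom(u32),
}

/// Simplified stand-in for `TransactionError::InstructionError`,
/// written with named fields for readability.
#[derive(Debug, Clone, PartialEq)]
struct SketchTransactionError {
    outer_instruction_index: u8,
    err: SketchInstructionError,
    /// Index into the transaction's account list, not the instruction's.
    responsible_program_account_index: Option<u8>,
}

/// Hypothetical helper: look the responsible program up in the
/// transaction-level account list. Returns `None` when the error carries
/// no index or the index is out of range.
fn responsible_program<'a>(
    account_keys: &'a [Pubkey],
    err: &SketchTransactionError,
) -> Option<&'a Pubkey> {
    let index = err.responsible_program_account_index?;
    account_keys.get(usize::from(index))
}
```

With the address in hand, a client could pick the right decoder for a `Custom` error code, which is the correlation the description above is after.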

@steveluscher steveluscher requested a review from a team as a code owner March 6, 2025 21:44
@steveluscher
Contributor Author

steveluscher commented Mar 6, 2025

This is a bit of a spitball pr. Things I'm thinking about:

  1. This will require a major version bump, because Rust. It will probably require a major version bump in a lot of downstream packages too. Perhaps all software ever written.
  2. I imagine that this will increase the storage cost of transactions by the 32 bytes of the pubkey. Maybe this is fine, but perhaps there is an alternative, like storing the index of the program in the accounts array and then reconstructing the program address later (ie. in the RPC method that loads and vends the InstructionError)?
  3. We'll have to teach the program runtime to include the program address when it throws InstructionErrors.

@steveluscher steveluscher force-pushed the add_program_address_to_instruction_error branch 4 times, most recently from 270d3c8 to d7fbd88 Compare March 6, 2025 22:42
/// third element indicates the address of the program that raised the error, if applicable; the
/// error could after all have been raised during a cross-program invocation (ie. in an inner
/// instruction).
InstructionError(u8, InstructionError, Option<Pubkey>),
Contributor Author

We won't have a program address in cases where the error is InstructionError::UnsupportedProgramId, so this is made an Option.

Contributor

Can you elaborate what cases UnsupportedProgramId is for?

Contributor Author

Yeah, for sure. It's thrown, for instance, here:

https://github.com/anza-xyz/agave/blob/1bcf743f445270407bccf55681f74e06ef8b9a48/program-runtime/src/invoke_context.rs#L250-L252

ie. if you tried to issue an instruction without providing the address of a program.

The larger point here is that there exist InstructionErrors that are more like ‘there was a structural problem with this instruction’ rather than ‘a program died and here's why.’ The latter has a program address associated with it while the former may not.

@kevinheavey
Contributor

btw this is a breaking change, but we're already due a bump to 3.0 because of #46

/// third element indicates the address of the program that raised the error, if applicable; the
/// error could after all have been raised during a cross-program invocation (ie. in an inner
/// instruction).
InstructionError(u8, InstructionError, Option<Pubkey>),
Contributor

adding Option<Pubkey> here means TransactionError is now 64 bytes instead of 32 bytes. I'm not sure how often we move/clone these on the agave side, but this could have some small perf implications. I'm aware of at least a few places we clone these values.

Contributor

^ this is not a comment on why we shouldn't do this - just a note that we should be aware of this size change, and be more aware of where we're cloning these in our processing pipeline so we can avoid performance hits
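
The size concern can be checked with `std::mem::size_of`. The sketch below compares two simplified stand-in enums, one carrying an optional 32-byte address and one carrying an optional `u8` index. The enum shapes are assumptions that only loosely mirror the real types, so the exact byte counts will differ from the 32 and 64 bytes quoted above; the point is the relative growth.

```rust
use std::mem::size_of;

/// Stand-in for a real public key type; an assumption for this sketch.
type Pubkey = [u8; 32];

/// Simplified stand-in for `InstructionError` (the real enum has many more variants).
#[allow(dead_code)]
enum SketchInstructionError {
    GenericError,
    Custom(u32),
    BorshIoError(String),
}

/// Variant payload carries the full program address.
#[allow(dead_code)]
enum WithPubkey {
    AccountInUse,
    InstructionError(u8, SketchInstructionError, Option<Pubkey>),
}

/// Variant payload carries only a transaction-level account index.
#[allow(dead_code)]
enum WithAccountIndex {
    AccountInUse,
    InstructionError(u8, SketchInstructionError, Option<u8>),
}

/// Returns `(size with address, size with index)` in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<WithPubkey>(), size_of::<WithAccountIndex>())
}
```

Because every move or clone copies the whole enum regardless of which variant is active, the larger layout is paid for even by errors that carry no address.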

Contributor Author

Feels bad. I should probably endeavour to store just the index of the program account in the accounts array, and then make the downstream thing that really wants to know what the program was look it up.

I'll rifle through the Agave code and see if that's tractable (ie. the instruction will need to be available wherever something needs to know what program address was involved).

Contributor Author

Updated the PR! This is now implemented as a u8 that points to the transaction-level account index of the program responsible for the error.

@steveluscher
Contributor Author

So, here's what I'm thinking. There are three ways to approach this.

Option 1 – Add program address to TransactionError::InstructionError

- TransactionError::InstructionError(u8, InstructionError)
+ TransactionError::InstructionError(u8, InstructionError, Option<Pubkey>)
  • the address (or address index) will have to be Option, because not all InstructionErrors are related to a program
  • breaks people who construct TransactionError::InstructionError (ie. the SVM)
  • does not break people who construct InstructionError

Option 2 – Add program address to every InstructionError that is related to a program

  ArithmeticOverflow,  // Leave as is; not related to a program
- InstructionError::CallDepth,
+ InstructionError::CallDepth(Pubkey)
  • does not break people who construct TransactionError::InstructionError (ie. the SVM)
  • breaks people who construct InstructionError

Note

There still might exist some InstructionErrors that sometimes have programs related to them and other times do not. These would still, unfortunately for complexity, be Option<Pubkey>

Option 3 – Create a TransactionError::InstructionErrorCausedByProgram type

  InstructionError(u8, InstructionError),
+
+ InstructionErrorCausedByProgram(u8, InstructionError, Pubkey)
  • no need for making the address (or address index) Option; just return the relevant TransactionError::InstructionError
  • would need to make a separate enum to represent the union of both, or use Either (ie. functions would have to have return types like Result<(), Either<InstructionError, InstructionErrorCausedByProgram>>)
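
For comparison, the three shapes can be written side by side. This is a sketch under stated assumptions: `Pubkey` is a bare 32-byte array, each enum is cut down to the variants mentioned above, and the `program` helpers are hypothetical, added only to show how a consumer would read the address back out of each shape.

```rust
/// Stand-in for a real public key type; an assumption for this sketch.
type Pubkey = [u8; 32];

/// Option 1: one optional address on the transaction-level variant.
#[allow(dead_code)]
mod option_1 {
    use super::Pubkey;
    pub enum InstructionError { ArithmeticOverflow, CallDepth }
    pub enum TransactionError {
        InstructionError(u8, InstructionError, Option<Pubkey>),
    }
    pub fn program(err: &TransactionError) -> Option<Pubkey> {
        match err {
            TransactionError::InstructionError(_, _, program) => *program,
        }
    }
}

/// Option 2: the address lives inside each program-related instruction error.
#[allow(dead_code)]
mod option_2 {
    use super::Pubkey;
    pub enum InstructionError { ArithmeticOverflow, CallDepth(Pubkey) }
    pub enum TransactionError { InstructionError(u8, InstructionError) }
    pub fn program(err: &TransactionError) -> Option<Pubkey> {
        match err {
            TransactionError::InstructionError(_, InstructionError::CallDepth(p)) => Some(*p),
            TransactionError::InstructionError(_, InstructionError::ArithmeticOverflow) => None,
        }
    }
}

/// Option 3: a separate variant for errors known to come from a program.
#[allow(dead_code)]
mod option_3 {
    use super::Pubkey;
    pub enum InstructionError { ArithmeticOverflow, CallDepth }
    pub enum TransactionError {
        InstructionError(u8, InstructionError),
        InstructionErrorCausedByProgram(u8, InstructionError, Pubkey),
    }
    pub fn program(err: &TransactionError) -> Option<Pubkey> {
        match err {
            TransactionError::InstructionError(..) => None,
            TransactionError::InstructionErrorCausedByProgram(_, _, p) => Some(*p),
        }
    }
}
```

In all three shapes the consumer still ends up with an `Option<Pubkey>`; the options differ mainly in who has to change their construction code.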

@joncinque
Collaborator

So good news, I went through all of the InstructionError variants, and they pretty much all pertain to a certain program, or preparing the execution of a certain program. [1]

To go with that, the TransactionError::InstructionError variant uses a u8 for the index of the top-level instruction in the transaction, which means that there's already some notion of "this error requires some greater context to truly understand".

I'm down to go with the option 1, but using a u8 for the index of the account in the transaction, as mentioned earlier. Looking at the storage protos for transaction errors, I think we should just be able to add a new field for the originating account: https://github.com/anza-xyz/agave/blob/bc09ffa335d9773fd6c4b354e61c44b8fc36724a/storage-proto/proto/transaction_by_addr.proto#L21

It'll be a bit of a slog to get through all the changes, but it should be mostly straightforward. Happy to hear other ideas though!

[1] if I'm incorrect, then we should fix the usage of that particular error variant rather than torpedo the design

@steveluscher
Contributor Author

steveluscher commented Mar 18, 2025

…pretty much all pertain to a certain program, or preparing the execution of a certain program

I started then scrapped a PR that did option 1 and I seem to remember it being really hard to obtain the program address in places. Here's cargo check after adding a u8 to TransactionError::InstructionError (gist).

I can try again, but a few on spec:

  • UnsupportedProgramId doesn't, by its very nature. (link)
  • Do MissingAccount and NotEnoughAccountKeys ever happen before we know what the program address is? (link, link)

The big problem is here, because it takes in a dynamic error after long having forgotten what program is responsible for the error: (link)

So basically you have to follow process_precompile and process_instruction all the way down, and make sure that they throw the responsible program address up through the Result, which means now you're not throwing an InstructionError there, you're throwing either an InstructionErrorWithResponsibleProgram or a (InstructionError, u8). I found this to get really hairy. (link)
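
The plumbing problem described here can be shown in miniature. In the sketch below, `run_program` and `run_program_with_context` are hypothetical functions, not Agave's `process_precompile` or `process_instruction`; the idea is only that the error has to pick up the program's account index at the last point where the caller still knows it, which changes the error type of everything above that point.

```rust
/// Simplified stand-in for `InstructionError`.
#[derive(Debug, PartialEq)]
enum SketchInstructionError {
    Custom(u32),
}

/// Hypothetical program entrypoint: it knows only its own error,
/// not which account index it was invoked under.
fn run_program(fail_with: Option<u32>) -> Result<(), SketchInstructionError> {
    match fail_with {
        Some(code) => Err(SketchInstructionError::Custom(code)),
        None => Ok(()),
    }
}

/// Hypothetical wrapper: the caller still knows the program's account
/// index, so it attaches that index to the error on the way up.
fn run_program_with_context(
    program_account_index: u8,
    fail_with: Option<u32>,
) -> Result<(), (SketchInstructionError, u8)> {
    run_program(fail_with).map_err(|err| (err, program_account_index))
}
```

Every caller of the wrapper now has to handle the tuple error type, which is the source of the hairiness mentioned above.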

@joncinque
Collaborator

UnsupportedProgramId doesn't, by its very nature. (link)

That one looks at the owner of the first program account in the transaction context, so we could probably use that, right? https://github.com/anza-xyz/agave/blob/ccbf3f25f332cf4bfa4f1a9bf27db8d3333b3064/program-runtime/src/invoke_context.rs#L523

Do MissingAccount and NotEnoughAccountKeys ever happen before we know what the program address is? (link, link)

If they do, then that's an issue with their usage, which should be fixable.

The big problem is here, because it takes in a dynamic error after long having forgotten what program is responsible for the error: (link)

That plumbing does look a bit involved, but should end up similar to any big refactor.

@joncinque joncinque added the breaking PR contains breaking changes label Mar 31, 2025
@buffalojoec
Contributor

buffalojoec commented Apr 7, 2025

@joncinque @steveluscher Hey guys, sorry to come in here late!

Another major issue I see with obtaining the program ID for an inner instruction from the transaction accounts is that CPI callees will eventually no longer be in that list at all.
https://github.com/solana-foundation/solana-improvement-documents/blob/main/proposals/0163-lift-cpi-caller-restriction.md

If a program decides to hard-code TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA and CPI to it, the RPC isn't going to know about it without parsing either the transaction logs or the inner instructions payload.

Rather than plumbing callee IDs all the way from SVM, what about just enabling inner instructions recording on Bank all of the time, and walking that payload to grab the program ID? The array of inner instructions comes back empty when there's an error, but we can just change that behavior.

@steveluscher
Contributor Author

steveluscher commented Apr 29, 2025

Rather than plumbing callee IDs all the way from SVM, what about just enabling inner instructions recording on Bank all of the time, and walking that payload to grab the program ID?

Oh, I like this. Also, we wouldn't have to enable CPI logging all the time; I think we could lazily walk the TransactionContext to figure out what the last instruction in the trace was before it died. I think all that's involved is to see where the logs ended, to know in what program the error originated. I'll give this a shot.

Something like

let status = status.map(|_| ()).map_err(|e| {
    // Walk the `TransactionContext` to figure out which program was responsible for `e`.
});

Comment on lines 46 to 48
/// the index of the outer instruction in which the error occurred, and the third the account
/// index of the program responsible for the error (ie. the error may have originated from an
/// inner instruction). The account index of the responsible program may be `None` for
Contributor

I'm not sure this approach actually clears up much ambiguity. There could be multiple inner instructions that invoke the same program. In order to pinpoint the failing inner instruction you would still need to rely on the inner ix metadata that we already provide.

Contributor Author

@steveluscher steveluscher May 5, 2025

Good catch! Though the original goal was to be able to correlate what program Custom(###) pertains to so that you can properly decode and handle ###, we could also add the path to the actual inner instruction.

Contributor

Sorry I don't really follow your response. Is the "original goal" still the current goal? When you say "we could also add the path to the actual inner instruction" is that rhetorical? If this is a "good catch," does this warrant any change with your current approach?

Contributor

Is it correct to say that you're just trying to map a custom program error code to the actual program that returned it?

Contributor

If we're just mapping custom error codes, do we need to add the tx account index for the failing program to all instruction error variants?

Contributor Author

@steveluscher steveluscher May 5, 2025

If we're just mapping custom error codes…

We're not. That was the original impetus for this change, but the exploration led to the insight that most InstructionErrors would benefit from knowing what program caused them.

Contributor Author

If this is a "good catch," does this warrant any change with your current approach?

Yep! It warrants me adding the inner instruction index so that you can – for instance – tell that an InsufficientFundsForRent came from instruction 1.4 as opposed to 1.2, even though 1.2 and 1.4 have the same program address.
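
A dotted path like "1.4" can be derived from the outer index plus the optional inner index. The helper below is a hypothetical sketch, not part of the PR; it prints the indices exactly as stored and makes no assumption about whether they are zero- or one-based.

```rust
/// Hypothetical helper: render an error location as `outer.inner`,
/// or just `outer` when the error came from the top-level instruction.
fn instruction_path(outer_instruction_index: u8, inner_instruction_index: Option<u8>) -> String {
    match inner_instruction_index {
        Some(inner) => format!("{outer_instruction_index}.{inner}"),
        None => outer_instruction_index.to_string(),
    }
}
```

Two inner instructions that invoke the same program then produce different paths, which the program address alone cannot do.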

Contributor Author

I added the index of the inner instruction here in the type, and here in the implementation, @jstarry!

@steveluscher steveluscher force-pushed the add_program_address_to_instruction_error branch from 821245c to d2fd373 Compare May 5, 2025 22:31
@joncinque joncinque self-requested a review May 6, 2025 20:11
@jstarry
Contributor

jstarry commented May 7, 2025

I think I'm a bit more up to speed on the evolution of the problem we're trying to solve here. I think it's a very good idea to add an indication that a given instruction error was actually caused by an inner instruction if that was the case. I'm not super excited about making a breaking change to the InstructionError variant to solve that problem.

Also, I think we're now trying to solve more than just the problem of knowing whether an error came from an inner instruction. We're trying to know which inner program produced the error and also where in the call tree the error was produced (partly my fault we went down that path due to this thread: #74 (comment))

I think that it's ok to force users to debug errors by first fetching tx metadata for now. They need metadata to resolve ALT's, to have full context of the call tree leading to the failure, as well as any pertinent logs. I think later we could separately add a better way to get rich error context from the SVM which includes the full call stack in errors so that local development doesn't require metadata fetching.

For differentiating top level vs inner instruction errors, I think I'm most in favor of option 3 where we add a new transaction error variant. As discussed in the PR, we can just use the program account index instead of including the full pubkey, so something like:

enum TransactionError {
  // ..existing variants..
  InnerInstructionError(InstructionError),
}

And then have the user figure out where in the call tree this happened from the tx metadata. I acknowledge this might be too minimalist; adding the top-level tx instruction index and the account index of the program invoked in the inner instruction could be nice too.

It might also be worth exploring whether adding a new InstructionError variant would be less disruptive to users. We wouldn't want crazy recursive nesting but something like this could work:

enum InstructionError {
  // ..existing variants..
  InnerInstructionFailure(InnerInstructionError),
}

enum InnerInstructionError {
  // subset of InstructionError variants?
}

/// (ie. the error may have originated from an inner instruction). The inner instruction index
/// may be `None` if the error originated from the top-level program call. The account index of
/// the responsible program may be `None` for transactions created before it was introduced.
InstructionError(u8, InstructionError, Option<u8>, Option<u8>),
Contributor

If we're going to make a backwards incompatible change, might as well start naming these fields. Pretty easy to misuse this tuple as proposed

Collaborator

Agreed, I had to look back at this PR a few times while reviewing the other one to remember which field was which

Contributor Author

I did not end up doing this, because we'd have to write a custom serde serializer to serialize the new struct variant with named fields in the old format the RPC requires.

What I did do, however, is to make type aliases, which gives you a bit of guidance when you're constructing one of these things.
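
A sketch of what the alias approach looks like, using the alias names that appear in the diff quoted later in this thread. The surrounding enums are simplified stand-ins and `example_error` is a hypothetical constructor. Note that all three aliases resolve to `u8`, so they document the positions but the compiler will not catch two of them being swapped.

```rust
type OuterInstructionIndex = u8;
type ResponsibleProgramAccountIndex = u8;
type InnerInstructionIndex = u8;

/// Simplified stand-in for `InstructionError`.
#[derive(Debug, PartialEq)]
enum SketchInstructionError {
    Custom(u32),
}

/// Simplified stand-in for `TransactionError`, keeping the tuple shape.
#[derive(Debug, PartialEq)]
enum SketchTransactionError {
    InstructionError(
        OuterInstructionIndex,
        SketchInstructionError,
        Option<ResponsibleProgramAccountIndex>,
        Option<InnerInstructionIndex>,
    ),
}

/// Hypothetical constructor: the annotated locals show which slot is which.
fn example_error() -> SketchTransactionError {
    let outer: OuterInstructionIndex = 1;
    let program: ResponsibleProgramAccountIndex = 5;
    let inner: InnerInstructionIndex = 4;
    SketchTransactionError::InstructionError(
        outer,
        SketchInstructionError::Custom(42),
        Some(program),
        Some(inner),
    )
}
```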


Contributor

@buffalojoec buffalojoec May 17, 2025

I did not end up doing this, because we'd have to write a custom serde serializer to serialize the new struct variant with named fields in the old format the RPC requires.

this shouldn't be too bad to do, just a bit verbose.

/// An error occurred while processing an instruction. The first element of the tuple indicates
/// the index of the outer instruction in which the error occurred, the third the index of the
/// inner instruction if the program responsible for the error was called by cross-program
/// invocation (CPI), and the fourth the account index of the program responsible for the error
Contributor

The program invoked by the inner instruction might be loaded from an ALT so downstream users may still need to fetch metadata to get the offending program id fyi

Contributor Author

Ugh. This really sucks. I'll have to think about this for a hot second. Ideally we'd just store the program address, but that would increase the size of stored errors quite a bit, which we should try to avoid.

Contributor

@buffalojoec buffalojoec May 17, 2025

I thought we were just dealing in indices all over the place? Why do we need the ALT if we know the index? Are you guys talking about the eventual end-user on the client side having to resolve a program ID from the index?

Contributor

Yeah when I said downstream users I meant the client side end user experience
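
The client-side problem can be sketched as follows. The code assumes that a transaction's account indices run over the static keys first, then the writable addresses loaded from lookup tables, then the read-only loaded addresses; that ordering, the `LoadedAddresses` struct, and the `resolve_account_index` helper are all assumptions of this sketch rather than anything defined in this PR.

```rust
/// Stand-in for a real public key type; an assumption for this sketch.
type Pubkey = [u8; 32];

/// Addresses resolved from address lookup tables (ALTs). On the client
/// these usually have to be fetched as part of the transaction metadata.
struct LoadedAddresses {
    writable: Vec<Pubkey>,
    readonly: Vec<Pubkey>,
}

/// Hypothetical helper: resolve a transaction-level account index.
/// Returns `None` if the index points past the static keys and no loaded
/// addresses are available, or if it is out of range altogether.
fn resolve_account_index(
    index: u8,
    static_keys: &[Pubkey],
    loaded: Option<&LoadedAddresses>,
) -> Option<Pubkey> {
    let index = usize::from(index);
    if let Some(key) = static_keys.get(index) {
        return Some(*key);
    }
    // `index >= static_keys.len()` here, so the subtraction cannot underflow.
    let loaded = loaded?;
    loaded
        .writable
        .iter()
        .chain(loaded.readonly.iter())
        .nth(index - static_keys.len())
        .copied()
}
```

Without the loaded addresses, any index past the static keys stays unresolved, which is why an index alone may not spare the client a metadata fetch.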

Collaborator

@joncinque joncinque left a comment

Just a couple of comments, looks good otherwise

Comment on lines 177 to 182
// NOTE: We intentionally do not augment the error message in the event that the error
// carries the index of the inner instruction or the account index of the responsible
// program. While it would add value to the log, to do so now would also break any log
// parser that presumed the log format to be immutable for all time (eg.
// https://tinyurl.com/3uuczr68).
=> write!(f, "Error processing Instruction {index}: {err}"),
Collaborator

I think that's fine for now, people can also use the Debug formatting to get every piece of data.

We could consider creating a new enum variant, but then the UMI parser would be totally useless, which is even worse in my opinion.
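
The Display-versus-Debug split discussed in this thread can be sketched with simplified stand-in types. The outer message format is taken from the quoted diff; the inner `custom program error` wording and the enum shapes are assumptions of the sketch.

```rust
use std::fmt;

/// Simplified stand-in for `InstructionError`.
#[derive(Debug)]
enum SketchInstructionError {
    Custom(u32),
}

impl fmt::Display for SketchInstructionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Custom(code) => write!(f, "custom program error: {code:#x}"),
        }
    }
}

/// Simplified stand-in for `TransactionError` with the two new fields.
#[derive(Debug)]
enum SketchTransactionError {
    InstructionError(u8, SketchInstructionError, Option<u8>, Option<u8>),
}

impl fmt::Display for SketchTransactionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // The two new fields are deliberately left out so that the
            // message stays byte-for-byte what existing log parsers expect.
            Self::InstructionError(index, err, _, _) => {
                write!(f, "Error processing Instruction {index}: {err}")
            }
        }
    }
}
```

Formatting with `{}` yields the unchanged message, while `{:?}` prints all four fields for anyone who wants the extra context.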

@steveluscher steveluscher force-pushed the add_program_address_to_instruction_error branch from d2fd373 to 4e4a04a Compare May 16, 2025 05:44
@steveluscher steveluscher changed the title Add the affected program address to each InstructionError Add the responsible program's account index and inner instruction index to each InstructionError May 16, 2025
@steveluscher
Contributor Author

Thanks for the detailed thoughts, @jstarry.

I think that it's ok to force users to debug errors by first fetching tx metadata for now.

tx metadata is, unfortunately, insufficient.

Consider these three transactions.

  • -> Program A
    • -> Program B
      • -> Program C
      • <- Program C (ERROR)
    • <- Program B
  • <- Program A

  • -> Program A
    • -> Program B
      • -> Program C
      • <- Program C
    • <- Program B (ERROR)
  • <- Program A

  • -> Program A
    • -> Program B
      • -> Program C
      • <- Program C
    • <- Program B
  • <- Program A (ERROR)

The inner instruction trace will look the same in all of these cases, making it impossible to discern where the error actually came from.

They need metadata to resolve ALT's

Guh, this sucks. In some cases you'll have the ALT information locally (eg. you used @solana/kit to create the transaction, and the error happened in simulation) but not in all cases for sure. I'll have to consult with @steviez to see if we can stand another 32 bytes of context data in stored errors or not, because the ideal thing would be to just bake the address itself into the error, rather than the index of the program account.

I think later we could separately add a better way to get rich error context from the SVM…

This is underway in https://github.com/anza-xyz/agave/pull/6083/commits

@jstarry
Contributor

jstarry commented May 16, 2025

The inner instruction trace will look the same in all of these cases, making it impossible to discern where the error actually came from.

Ah yeah thanks for pointing that out, definitely insufficient to just look at metadata as it is right now, you would need to know the stack height of the last instruction which was running. You could probably parse the logs for this but that's not a great solution.

I guess my suggestion with going with option 3 that you listed earlier would be contingent on adding the stack height to the new error variant (something like InnerInstructionError { stack_height: u8, err: InstructionError}). If we had that, then we could actually be ok with having users fetch metadata for now.
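
The role of stack height can be illustrated with the three A, B, C scenarios from earlier in the thread. In this sketch the `TraceEntry` type and `responsible_entry` function are hypothetical. It assumes the trace lists invocations in the order they were entered and that execution stops at the failure, so the failing frame is the most recently entered one at the failing height.

```rust
/// Hypothetical record of one invocation in the inner instruction trace.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TraceEntry {
    program: char,
    stack_height: u8,
}

/// Given the trace and the stack height that was executing when the
/// transaction failed, return the entry responsible for the error.
fn responsible_entry(trace: &[TraceEntry], failing_stack_height: u8) -> Option<TraceEntry> {
    trace
        .iter()
        .rev()
        .find(|entry| entry.stack_height == failing_stack_height)
        .copied()
}
```

All three scenarios share the trace A (height 1), B (height 2), C (height 3); only the failing stack height tells them apart, selecting C, B, or A respectively.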

Comment on lines 57 to 62
InstructionError(
    OuterInstructionIndex,
    InstructionError,
    Option<ResponsibleProgramAccountIndex>,
    Option<InnerInstructionIndex>,
),
Contributor

The aliasing helps read this file, but anywhere you access these values in the runtime it's still going to be a tuple-index access, which means maintainers would have to go back to this file to double-check which index is what.

Personally I would rather see a struct here.

Comment on lines 187 to 191
// NOTE: We intentionally do not augment the error message in the event that the error
// carries the account index of the responsible program. While it would add value to
// the log, to do so at this point would also break any log parser that presumes a
// stable log format (eg. https://tinyurl.com/3uuczr68).
=> write!(f, "Error processing Instruction {idx}: {err}"),
Contributor

Makes sense! I'm pretty sure if we did change this, it would break consensus.

@steveluscher steveluscher force-pushed the add_program_address_to_instruction_error branch from 4e4a04a to ec46e8b Compare May 28, 2025 20:22
@steveluscher steveluscher marked this pull request as draft May 28, 2025 20:22
@steveluscher steveluscher force-pushed the add_program_address_to_instruction_error branch from ec46e8b to 6a11e9b Compare May 29, 2025 05:52
…tructionError` and the inner instruction index that points to its location in the outer instruction

This will help app developers correlate an error with the program from which it originated in cases where the instruction index alone is insufficient to do so (eg. when the program that caused the error is in an inner instruction / CPI)

Addresses: anza-xyz/agave#5152
@steveluscher steveluscher force-pushed the add_program_address_to_instruction_error branch from 6a11e9b to 7040146 Compare May 29, 2025 05:55
@steveluscher
Contributor Author

OK, this puppy is ready to land, and is the prerequisite/companion to anza-xyz/agave#6083.

@steveluscher steveluscher marked this pull request as ready for review May 29, 2025 05:59
Labels
breaking PR contains breaking changes
6 participants