feat: Lazer publisher agent #2718
Conversation
At a high level, is there a reason we can't put this in the Pyth Publisher SDK? Will review this in more detail later today when I get the time.
The SDK is a library crate and this is a binary which will have some docker/devops stuff as well. No strong feelings though.
Gave it a preliminary review. Broadly speaking, it seems this PR is running along the idea of getting them to run it first, then have them send the new type to a new endpoint after?
Retaining the V1 and V2 endpoints maintains the confusion behind all the types. I don't think providing them a new reasonably easy type to work with is that large a lift after getting them to run the agent, which is probably harder? But yeah curious to hear what you had in mind in terms of the onboarding process.
pyth-lazer-agent/config/config.toml (outdated)
@@ -0,0 +1,5 @@
relayer_urls = ["ws://localhost:1235/v1/transaction", "ws://localhost:1335/v1/transaction"]
This is local tilt Lazer. What's your plan for preparing staging and prod URLs/Configs? Different config files for each stage?
Yeah, I shouldn't have checked this in here; I'll provide some sort of reference config and figure out how to run this with tilt (which is what I've been doing for local testing).
authorization_token = "token1"
publish_keypair_path = "/path/to/solana/id.json"
These should be CLI args or something.
Yep, good call.
As a principle I like us to be as similar to agent as possible wrt config, so let's leave it here.
pyth-lazer-agent/config/config.toml (outdated)
authorization_token = "token1"
publish_keypair_path = "/path/to/solana/id.json"
listen_address = "0.0.0.0:1234"
publish_interval_duration = "50ms"
We submit once every 0.5ms. So 500 microseconds.
}

impl RelayerSender {
    async fn send_price_update(
This is sending a transaction, not a price update?
We should be correct in terms of semantics. A SignedLazerTransaction contains the LazerTransaction. A Transaction can be a PublisherUpdate. The PublisherUpdate contains a batch of FeedUpdates. Each FeedUpdate targets one Feed ID and can be one of some set of types of updates such as PriceUpdate or FundingRateUpdate.
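For concreteness, the nesting described above looks roughly like the following sketch. The real types are generated from the Lazer protobuf definitions, so the names and fields here are assumptions, not the actual schema.

// Illustrative shape only; the actual definitions live in the Lazer protobufs.
struct SignedLazerTransaction {
    signature_data: Vec<u8>, // signature and pubkey over the serialized payload
    payload: LazerTransaction,
}

enum LazerTransaction {
    PublisherUpdate(PublisherUpdate),
    // ...other transaction variants
}

struct PublisherUpdate {
    updates: Vec<FeedUpdate>, // one batch per transaction
}

struct FeedUpdate {
    feed_id: u32, // each update targets exactly one feed
    source_timestamp_us: u64,
    data: FeedData,
}

enum FeedData {
    PriceUpdate { price: i64 },
    FundingRateUpdate { rate: i64 },
}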
Yeah, the log line was just copy-pasted from other agent/publisher code, that's my mistake. I do understand and agree with your description above and indeed the intent is to send an entire SLT each call here.
//pub const MAX_EXPIRY_TIME_US: u64 = 5_000_000; // 5s
//pub const MAX_TIMESTAMP_ERROR_US: u64 = 50_000; // 50ms
Remove? These seem like leftover comments.
Right, these are from the relayer code which does timestamp validation, but I'm not sure if we should bother with that in the agent.
_ = ping_interval.tick() => {
    send_ping(&mut ws_sender).await;
Pinging them is not necessary as Agent will not be able to collect and send this info anywhere. Maybe you can put it behind a flag to log the ping results, but for now it's not necessary.
_ = &mut publisher_timeout => {
    bail!("no updates received from publisher after {:?}", PUBLISHER_TIMEOUT);
}
What's the value in d/c'ing them from Agent? These are both being run on their end.
Holdover code from relayer.
if let Some(Incoming::Pong(_)) = receive_type {
    continue;
}
Let's remove the ping code if it's not going to do anything.
    }
}

pub async fn send_ping<T: AsyncRead + AsyncWrite + Unpin>(sender: &mut Sender<T>) {
Same deal here. Don't need it if we don't use it.
            continue;
        }
    }
} //_ => bail!("Publisher API request set with invalid context"),
what is this?
Oh, that's left over from the relayer code, will remove.
for relayer_sender in self.relayer_senders.iter() {
    if let Err(e) = relayer_sender
        .sender
        .send(signed_lazer_transaction.clone())
        .await
    {
        error!("Error sending transaction to Lazer relayer session: {e:?}");
    }
}
is this sync? does it wait for 1 relayer send to finish before running the next one? In that case we should fix it.
Ideally concurrent, at least async.
This is just sending to the channel for each relayer thread to pick up, but I guess I shouldn't call await for that.
It is sync, though the docs do mention "This method will never block the current thread.". Regardless, it'd be nice to do them concurrently.
All right, then I'm going to use futures::future::join_all whenever possible for fan-out cases like this.
Ah, that's the wrong mpsc sender anyway. We're using tokio's; that one is async. But yeah, use join_all. Be careful about failure cases though. Honestly, sending into an MPSC should not fail, but if it does, we want to still be able to send to the rest. I guess it's possible for the channel to close due to filling up and so on.
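A minimal sketch of that fan-out, assuming tokio mpsc senders and a cloneable item type; each failed send is logged without stopping the remaining sends:

use futures::future::join_all;
use tokio::sync::mpsc;

// Fan a value out to every relayer session channel concurrently.
// A send only fails if the receiving task has dropped its end,
// so we log the failure and keep going.
async fn fan_out<T: Clone>(senders: &[mpsc::Sender<T>], item: T) {
    let sends = senders.iter().map(|sender| sender.send(item.clone()));
    for (i, result) in join_all(sends).await.into_iter().enumerate() {
        if let Err(e) = result {
            tracing::error!("error sending to relayer session {i}: {e:?}");
        }
    }
}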
impl RelayerSessionTask {
    pub async fn run(&mut self) {
        let mut failure_count = 0;
        let retry_duration = Duration::from_secs(1);
might wanna have expo backoff. Having 1 sec down time for a random disconnect is not good.
Also make it configurable.
Pavel used the backoff crate for configuring expo retries. Perhaps you can use that?
Ah, I see it used in a couple places in lazer, will do that here.
Done.
Nice, I think this is pretty good. Some small comments. After that, I think we'll be good to go.
pyth-lazer-agent/src/config.rs (outdated)
fn default_publish_interval() -> Duration {
    Duration::from_millis(50)
}
Again, this is not the interval we want to publish at. We want to publish at 0.5ms interval. On tick, we send any new feed updates over. The interval is overall, not per feed.
If no new data has arrived, we can skip the tick, up until perhaps 25ms where we instead send all current feed data over at once.
The publisher update rate will now have to count each update with a unique source timestamp. We'll probably need to update some metrics for this.
IMO, the above makes sense as an approach.
Yeah, I'd updated my own un-checked-in config and not the default here 😄.
So we'll maintain most recent update per feed and then send an entire snapshot if we hit this expiry period?
I don't quite understand the wording of the last statement.
Hmm no I guess on each tick you check if something hasn't been updated/sent for 25ms, and you send whatever you have. Otherwise, just keep sending only the new data that has come in.
The logic on tick is:
- Collect all new updates
- Collect latest data of feeds not updated for 25ms
- Mark all feeds whose data is being sent as sent, in whatever cache we are using
- Send the data
Something like that. Basically, we make sure we always send data on 0.5ms ticks if new data is coming in, and always send data every 25ms for sure for each feed.
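A rough sketch of that tick logic under the assumptions above; the cache layout, type names, and the 25ms constant are illustrative rather than the PR's actual code:

use std::collections::HashMap;
use std::time::{Duration, Instant};

const STALE_AFTER: Duration = Duration::from_millis(25);

#[derive(Clone)]
struct FeedUpdate { feed_id: u32, price: i64 } // stand-in for the real type

struct FeedCacheEntry {
    latest: FeedUpdate,
    dirty: bool,                // new data arrived since the last send
    last_sent: Option<Instant>, // when this feed's data last went out
}

// On each publish tick: take everything new, plus a refresh of any feed
// whose latest data hasn't been sent for STALE_AFTER, and mark them sent.
fn collect_batch(cache: &mut HashMap<u32, FeedCacheEntry>, now: Instant) -> Vec<FeedUpdate> {
    let mut batch = Vec::new();
    for entry in cache.values_mut() {
        let stale = entry
            .last_sent
            .map_or(true, |sent| now.duration_since(sent) >= STALE_AFTER);
        if entry.dirty || stale {
            batch.push(entry.latest.clone());
            entry.dirty = false;
            entry.last_sent = Some(now);
        }
    }
    batch // the caller can skip sending if this comes back empty
}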
So, if a feed hasn't been published for 25ms we retransmit its latest update? Makes sense, although I admit I don't see what this sort of heartbeating buys us at this step in the data flow.
Its hard as a server to know for sure if a client is behaving appropriately. If the client consistently sends something over, we can expect it is functioning properly. The connection being open is not sufficient to know this. We also still want to trigger aggregations and such. If the data itself (source timestamp) gets old enough, it will be expired anyway.
@darunrs 0.5ms doesn't sound right. That means 2000 updates per second per relayer. 50ms is also too much, I think we should fine-tune this number. I'm thinking somewhere between 2-10 ms.
#[derive(Parser)]
#[command(version)]
struct Cli {
Yeah I can agree with that. Ali actually mentioned wanting to keep the interface the same as Pyth Agent and that's a good point too.
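For reference, a minimal CLI in that spirit, assuming a single flag pointing at the config file (the flag name and default path here are assumptions, not pyth-agent's exact interface):

use clap::Parser;

#[derive(Parser)]
#[command(version)]
struct Cli {
    /// Path to the TOML config file; everything else stays in the config,
    /// mirroring how pyth-agent is invoked.
    #[arg(short, long, default_value = "config/config.toml")]
    config: std::path::PathBuf,
}

fn main() {
    let cli = Cli::parse();
    println!("loading config from {}", cli.config.display());
}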
pub struct PublisherConnectionContext {
    pub request_type: http_server::Request,
    pub _remote_addr: SocketAddr,
Remove the leading underscore.
Just did this temporarily for clippy because we don't log or metric this anywhere yet.
The field is initialized in the struct but is never read.
Ah ok. If we don't intend to use the field, and we don't have a follow up PR already planned that enables the use of that field, let's remove it.
Perhaps at some point we log it or it goes in a metric?
Nice! LGTM other than two smaller comments.
There aren't really any tests written up here. I would put down a Linear task to add unit tests and add them in where possible. And also manually test it.
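As a starting point for that task, a self-contained sketch of the kind of unit test that could cover the channel fan-out path; the channel type and assertions are illustrative:

#[cfg(test)]
mod tests {
    use tokio::sync::mpsc;

    #[tokio::test]
    async fn fan_out_reaches_all_open_channels() {
        let (tx_a, mut rx_a) = mpsc::channel::<u32>(8);
        let (tx_b, mut rx_b) = mpsc::channel::<u32>(8);

        // A real test would call the agent's fan-out helper here.
        for tx in [&tx_a, &tx_b] {
            tx.send(42).await.unwrap();
        }

        assert_eq!(rx_a.recv().await, Some(42));
        assert_eq!(rx_b.recv().await, Some(42));
    }
}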
}
futures::future::join_all(
    self.relayer_senders
        .iter_mut()
Is iter_mut() needed? send should only need a regular reference.
Nope. Was experimenting with AI suggestions and left this in. Removing.
let mut failure_count = 0;
let retry_duration = Duration::from_secs(1);

loop {
Instead of having a manual loop, use backoff's retry. Something like this:
retry(expo_backoff_config, || { match connection().await { Ok(()) => 'shutdown', Err(err) => handle_err } })
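Fleshed out slightly, a sketch using the backoff crate (with its async/futures feature enabled); run_relayer_connection is a hypothetical stand-in for the real session logic:

use std::time::Duration;
use backoff::ExponentialBackoffBuilder;

async fn run_relayer_connection(url: &str) -> anyhow::Result<()> {
    // connect and pump messages until the stream ends or errors
    let _ = url;
    Ok(())
}

async fn run_with_retries(url: String) {
    let policy = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(100))
        .with_max_interval(Duration::from_secs(5))
        .with_max_elapsed_time(None) // keep retrying indefinitely
        .build();

    let result = backoff::future::retry(policy, || async {
        // Every error is treated as transient (retryable) in this sketch.
        run_relayer_connection(&url)
            .await
            .map_err(backoff::Error::transient)
    })
    .await;

    if let Err(e) = result {
        tracing::error!("relayer session gave up: {e:?}");
    }
}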
Ah, nice.
RUN apt update && apt install -y libssl-dev && apt clean all

COPY --from=builder /pyth-lazer-agent/target/release/pyth-lazer-agent /pyth-lazer-agent/
COPY --from=builder /pyth-lazer-agent/config/* /pyth-lazer-agent/config/
should we add this?
I followed what pyth-agent does, although I'm not sure the best way to manage config here. Also we'll need to add the key file.
}

fn default_publish_interval() -> Duration {
    Duration::from_micros(500)
isn't this too aggressive?
As per Darun's comments above, it sounded like half a millisecond was the target frequency. I don't have background on our requirements here.
The background on this is that the frequency needs to support 1ms feeds. Granted we don't have many currently. But the update frequency needs to be low enough to provide granularity here. The relayer side batching does 0.5ms batches itself currently.
let signing_key = SigningKey::from_keypair_bytes(&publish_keypair.to_bytes())
    .context("Failed to create signing key from keypair")?;

let mut publish_interval = interval(self.config.publish_interval_duration);
interval is pinned on its own internally (and implements Unpin).
)
.await;

self.pending_updates.clear();
Since you release the task a line before, you might clear some pending updates here that haven't been sent yet.
Really? pending_updates only gets pushed in the other arm of the select!; I'd have thought this arm and batch_transaction were atomic with respect to that.
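For reference, the shape being discussed: because both arms run on the same task, a tick body and a push cannot interleave, which is the atomicity assumed here. Names are illustrative:

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::interval;

async fn publisher_loop(mut updates_rx: mpsc::Receiver<u64>) {
    let mut publish_interval = interval(Duration::from_micros(500));
    let mut pending_updates: Vec<u64> = Vec::new();
    loop {
        tokio::select! {
            // Arm 1: batch and clear. The other arm only runs once this
            // body has completed, so nothing can be pushed mid-arm.
            _ = publish_interval.tick() => {
                if !pending_updates.is_empty() {
                    // batch_transaction(&pending_updates).await;
                    pending_updates.clear();
                }
            }
            // Arm 2: accumulate incoming updates.
            Some(update) = updates_rx.recv() => {
                pending_updates.push(update);
            }
        }
    }
}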
@darunrs I've done ad hoc integration testing using example publisher -> lazer agent -> relayer in tilt, but yes, it's time for me to get into the habit of …
Summary
Based on the Lazer relayer, this binary receives incoming update streams from a publisher and sends signed transactions to the Lazer relayer itself.
Rationale
Managing transaction signatures on the publisher side.
How has this been tested?