Optimize trajectories integration and generation #74
Conversation
Force-pushed a7c6281 to 76ad504
I'll profile on Monday, though breaking these into separate files seems odd to me. For the noise generator, we already had some
removed redundant classes
Force-pushed 53d733c to bc9d591
I removed "prefer forward" from the defaults. We can just clip velocity from the bottom if we need more forward / forward-only moves.
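Clipping from the bottom can be sketched like this (a hypothetical stand-in, not the PR's actual noise-generator code; the function and parameter names are illustrative):

```cpp
#include <algorithm>
#include <vector>

// Instead of a "prefer forward" penalty, forward-only motion can be enforced
// by clamping sampled linear velocities from below. min_vx = 0.0 forbids
// reverse motion entirely; a small positive value biases further forward.
void clampForwardOnly(std::vector<double> &vx_samples, double min_vx = 0.0) {
    for (double &vx : vx_samples) {
        vx = std::max(vx, min_vx);  // clip from the bottom only
    }
}
```

The upside over a soft penalty is that no sampled trajectory wastes a rollout on motion we would never execute.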
Force-pushed f203df0 to 56a9562
It looks like this reduced the integrate-state-velocities CPU by about half (29% down to 15%) but didn't make much of a change in the generate-noised-controls function (~15-16%). Of that, about 11% is spent on the random sampling process plus related container work from xtensor to align things / stride stepping. I'm not sure there is much we can do about that? Most of the time I can account for is spent in xtensor. So if there's anywhere we can minimize copies / strides, hook up the GPU, or generally optimize the use of xtensor, this would be the time to do it so we can push up our batch sizes.

We now spend about 1/3 of the time evaluating the trajectories and about 2/3 of the time generating them. Of the controller time (i.e. where 100% is made up of the components within MPPIC, not other things), 22% is on generating noised controls and another 22% is on integrating state velocities. There are a number of < 8% tasks that I don't think are the big fish to fry. The critics change on a per-run basis depending on what's going on in a particular test run, but right now none of them looks like a comparative bottleneck - so there's no real need to try to minimize the number of critics at all. That's a drop in the bucket of the compute time.

Anyway, this tells us our runtime scales with the number of points and batches, not the critics or the complexity of the critics. The number of batches, and points within those batches, are the important things for us to push up to improve performance. With a set of parameters I am playing with (it doesn't really matter what they are, just for comparison of runtime improvement):
I don't think that is the best thing to remove.
In pushing up the numbers, I can get to about ~500 samples with a 3 s forward simulation time (currently 30 timesteps @ 0.1 s increments, but I'm still tuning and playing with those numbers wildly) at 40 Hz. I could do more, but that's as much as I would feel confident in at steady state with the current performance. If we can improve things a bit, I'd love to have ~1000, which doesn't seem to need that much more optimization (10-20%).

The AutoRally folks used 2560 batches over a 2 s horizon (100 timesteps of 0.02 s). I don't think we need to go to that level, but even with my 30 timesteps at 0.1 s, using 2560 the behavior is really nice. The difference between 500, 1000, and 2560 is hard to gauge since I can't visualize the trajectories without making rviz get choppy. That's a hardware-testing need. Though given that this is random-sampling-based optimization, we need as many samples as possible so we can model our system as best we can - especially if we can't iterate multiple times per control frequency.
Assuming the issues are fixed, this is good to go for me.
I started on some tuning this afternoon but couldn't finish due to the visualization issue and how many batches / how much frequency I could push in. I can pick it up tomorrow and propose some new parameters that I think help a number of the behavior concerns.
I got down to a tuned value for the temperature. In terms of batches, a minimum of 500 is necessary for smoother behavior. With the visualization stuff fixed, I'm trying to understand how much 1000 or 2000 batches help / don't, to find the sweet spot.

One major reason I want to squeeze a bit more performance out of this, if possible, is to support a 1,000-batch system at 40 Hz on a "regular" Intel CPU (Raspberry Pis or similar will obviously need less). Anything above 50 Hz I think is just excessive, so we're closing in on that number and I couldn't be happier about it. If we can support 50 Hz @ 2000 batches with the previous settings, I can't imagine we'll ever need more. Though realistically I'd make the defaults less than that.
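For context on what the temperature parameter does, here is the standard MPPI softmax-style trajectory weighting it enters into (a generic sketch, not this controller's implementation; names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert per-trajectory costs into normalized weights. Lower temperature
// concentrates weight on the lowest-cost trajectories (greedier updates);
// higher temperature flattens toward a uniform average (smoother, but less
// decisive). Subtracting the min cost keeps exp() numerically stable.
std::vector<double> softmaxWeights(const std::vector<double> &costs,
                                   double temperature) {
    double min_cost = *std::min_element(costs.begin(), costs.end());
    std::vector<double> w(costs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < costs.size(); ++i) {
        w[i] = std::exp(-(costs[i] - min_cost) / temperature);
        sum += w[i];
    }
    for (double &wi : w) wi /= sum;  // normalize so weights sum to 1
    return w;
}
```

This is why tuning temperature trades off smoothness against responsiveness, and why it interacts with batch count: with more samples, a lower temperature can still average over enough good trajectories.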