- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 393
Start blame from cache #1852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Start blame from cache #1852
Conversation
0173783    to
    68abfa1      
    Compare
  
    | Thanks a lot for contributing! It's great to see what can be done with a cache and I'd love to see this go further. What if there was enough tooling around it so that  This PR should of course remain minimal, it would just be interesting to make these changes with user-value in mind. @cruessler has been working towards a first integration into  Speaking of, all I did was read the PR text and make CI pass, and I wonder if @cruessler would like to take a first closer look? Thanks everyone 🙏 | 
| I’ll be happy to have a look! (Just FYI, I’ll be travelling for a week starting next Tuesday, so it might take some time until I finally get to it.) | 
68abfa1    to
    904285c      
    Compare
  
    | @cruessler Is this something you'd like to see merged, or need changes to have it merged? Thanks again. | 
| 
 I’ll respond in depth tomorrow! | 
| @holodorum @jtwaleson First: I absolutely love the idea of speeding up  Recently, I’ve added  This is all very high-level and I’m potentially missing important counter-arguments, but before diving deeper into an actual review, I wanted to make sure we’re choosing the best approach possible. | 
| Thanks a lot for your review, @cruessler, it's much appreciated!  I see this as real opportunity, and think that together with @holodorum we have yet another valid perspective that can help to make this implementation an incredibly versatile one. Think about the recent  Let's hope we can figure this out, I'd really want to avoid someone having to fork the codebase, and rather want to keep it all here and in the open to maximise the benefit of what seems to be a growing community of users. | 
| @cruessler , fyi, I've hired @holodorum to build the caching feature for my company, but now we have to rely on his voluntary time to chime in. I myself have not looked into the gitoxide changes that closely and am not an expert in the code. However, I understand the problem fundamentally as follows: a git blame is a File (with a Path) for a specific Commit, with for every line in the File, a pointer to the commit where this line originated. To store it a bit more efficiently, we can use a  Now, I would like to do three things with this: 
 That being said, I don't know if the current implementation that works on  | 
| Thanks for chiming in @jtwaleson. My response is probably only partial and maybe not very satisfying, but here we go. Right - previously I thought of  This would turn  That's how I can imagine it to be incremental, and inherently interruptible while naturally supporting caching without having to know the cache. There is probably no reason to not eventually have an own cache implementation, it's just that the place to store it and the exact format would probably be application defined (this is a plumbing crate after all). If nothing else then at least food for thought, and I am confident that we will eventually get there. @cruessler I wonder if this PR can be merged and refactored into its final shape, or if it should be closed instead, while making use of it as reference when considering to make the current implementation more incremental (eventually). (I am saying all of this without having looked at the implementation in detail) | 
| Thanks everyone for the input, and @cruessler — hope you had a good trip! 😉 Even though I was hired by @jtwaleson, I’d be more than happy to volunteer some time to help integrate this properly into gitoxide. Currently,  @cruessler — if I understand you correctly, you're proposing to pull that cache-handling logic out of  Regarding  @Byron — is this roughly what you meant by “incremental” and “cache-friendly without the cache being built in”? Let me know if that matches your thinking, or if I’m misunderstanding something. If you decide what the best way forward is, I’ll try to implement it ASAP. | 
| It's great to have you here, @holodorum! Thanks to your summary I also realise that I might have added salt to the cake by sharing an idea that might be misaligned with what @cruessler was suggesting. | 
| Sorry for the long delay! There’s quite a few things I’m working on simultaneously that it becomes harder to do each of them justice. :-) My suggestion/idea was related to this comment #1848 (comment) that mentioned creating full blames for certain “checkpoints” and only having to do a blame until it gets to one of the checkpoints. In the example above, let’s say there are three checkpoints,  For a commit between  What I was proposing was a separate function that would call  I hope this makes more sense now! (Also, we can hop on a short call if it still doen’t. 😄) And definitely let me know when there are major downsides to this approach or constraints it doesn’t meet. | 
| It appears this PR is degenerating and I wonder if @cruessler would be able to make the changes that make it mergeable? | 
| @cruessler and me had a good call about this PR last week and it seems clear what needs to happen. First step is to add a feature that allows for the blaming of multiple ranges, similar to git's functionality e.g. 
 This should have minimal impact on the existing API and still keep the existing functionality or easily improve it over time. Hope this sounds good to you too @Byron. I've started with it and hope to finish it later this week. | 
| Thanks so much for the update, I am excited to see this PR merged :)! | 
This commit includes a major overhaul of the blame operation logic. A new `BlameState` struct and a `BlameProcessor` struct were introduced to cleanly encapsulate different parts of the operation. The `BlameState` struct collects information about the blame operation as it is happening, while the `BlameProcessor` struct carries the bulk of the blame operation logic.
In this commit, we introduce the ability to execute a blame operation from a previously generated checkpoint. This functionality makes the computation of incremental blame operations more efficient by reusing already computed information. The refactoring performed in commit id d22965f allowed effortless integration of this feature into the existing blame processing algorithm. The fundamental distinction between a regular blame and a blame operation from a checkpoint is that in the latter, we can pre-fill the `BlameState` with some `BlameEntry` instances, thereby reducing the number of `UnblamedHunk` instances we need to process. This update’s algorithm for incorporating the detected modifications since the checkpoint into the blame entries is encapsulated in the `update_checkpoint_blames_with_changes` method. The newly introduced `BlameCheckpoint` type is a public type that users can utilize to establish a checkpoint state for their blame operations.
904285c    to
    cba93ab      
    Compare
  
    | Finally got around to implement it in the way that @cruessler and me discussed. Without the refactoring there would be quite some duplication in the code and actual operations performed. Since we would first do the work to calculate which blames can be updated and which ranges have to be passed to the  Another discussion point might be what we want to call this feature. We are continuing work from a previous point in time, so that's why I thought checkpoint might be appropriate. At the same time, the  Curious to hear your thoughts! Edit: I did make some tests that Pass, but couldn't fully figure out how the tests are structured and how to maximally leverage the existing structures. | 
| Thanks a lot for making this happen! I will leave the first review to @cruessler entirely, but can roughly say that the changes to  | 
As discussed in #1848, @jtwaleson and I have been working on a way to speed up git blame by introducing a caching mechanism. This allows us to start a blame operation from a checkpoint instead of computing it from scratch, significantly reducing computation time.
Proposed Changes
The function
function::filenow accepts aBlameCacheObject, which stores:Commit ID at which the blame was previously computed.
Blame entries corresponding to that commit.
Using the cached data, we compute the differences between the cached blob and the new target blob at the suspect commit.
If the file has been rewritten, this will probably error, so the BlameCacheObject might need to store the file path as well.
Cached blame entries are updated based on detected changes.
Only
UnblamedHunks(caused byAddedOrReplacechanges) are recomputed using the standard blame algorithm.Previously, the entire file or a range was marked as
UnblamedHunk, but now this only happens when necessary.So far the results show significant speed-ups. These are results for the README file in the linux repo starting with a blame at commit
bf4401f3ec700e1a7376a4cbf05ef40c7ffce064.Next Step's
BlameCacheObjectCurious to hear what you think!