Skip to content

Finetune CLI #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
7d9843b
test
oleander Feb 7, 2025
fe94188
Add profiling feature with macro for measuring execution time
oleander Feb 7, 2025
462b75d
Refactor diff processing, optimize token handling and storage
oleander Feb 7, 2025
2ea378d
Update test signatures and add profiling tests
oleander Feb 7, 2025
37e5b2e
Refactor GitFile signatures and commit messages
oleander Feb 7, 2025
c87b0d2
Update Cargo.lock dependencies and checksums
oleander Feb 7, 2025
7eb1eb7
Update dependencies in Cargo.toml and Cargo.lock
oleander Feb 7, 2025
e2df5f4
Add StringPool for efficient memory use in PatchDiff
oleander Feb 7, 2025
bf3110f
Update dependencies in Cargo.toml and Cargo.lock
oleander Feb 7, 2025
63a498c
Add `num_cpus` crate and parallelize file processing
oleander Feb 7, 2025
6a5051b
Refactor file processing to use parallel chunks and atomic tokens
oleander Feb 7, 2025
7ed087b
Remove redundant import of `bail` from anyhow
oleander Feb 7, 2025
51f9609
Sort files by token count in `PatchDiff` implementation.
oleander Feb 7, 2025
5e50a25
Delete test.txt file
oleander Feb 7, 2025
d49f534
Improve error handling and path management in config and style modules
oleander Feb 7, 2025
600e5fd
Add tests for StringPool functionality in hook.rs
oleander Feb 7, 2025
450381d
Update default model and add profiling to model and commit functions
oleander Feb 7, 2025
00faa02
Add profiling to filesystem module functions
oleander Feb 7, 2025
4a7d5d8
Implement token counting and generation for commit messages
oleander Feb 7, 2025
a5833c8
Add documentation for Filesystem, File, and Dir structs in filesystem.rs
oleander Feb 7, 2025
5384690
Refactor commit message generation methods and file handling logic
oleander Feb 7, 2025
fa47b5b
Implement configuration file management and update functions in App
oleander Feb 7, 2025
c7778e6
Implement parallel processing of diff data in PatchDiff trait
oleander Feb 7, 2025
da74cd4
```
Feb 8, 2025
7b9aa2f
```
Feb 8, 2025
3e34fed
Merge remote-tracking branch 'origin/main' into feature/finetune
oleander Feb 8, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -31,15 +31,16 @@ thiserror = "2.0.11"
tokio = { version = "1.43", features = ["full"] }
futures = "0.3"
parking_lot = "0.12.3"
tracing = "0.1"
maplit = "1.0.2"

# CLI and UI

structopt = "0.3.26"
colored = "3.0.0"
console = { version = "0.15.10", default-features = false }
indicatif = { version = "0.17.11", default-features = false }
log = "0.4.25"
env_logger = { version = "0.11.6", default-features = false }
tracing = "0.1"

# Git integration
git2 = { version = "0.20.0", default-features = false }
Expand Down Expand Up @@ -74,9 +75,7 @@ syntect = { version = "5.2", default-features = false, features = [
pulldown-cmark = "0.12"
comrak = "0.35"
textwrap = "0.16"
structopt = "0.3.26"
mustache = "0.9.0"
maplit = "1.0.2"

[dev-dependencies]
tempfile = "3.16.0"
Expand Down
85 changes: 85 additions & 0 deletions finetune.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
# Finetune.rs Workflow

Here's a summary of the workflow in `finetune.rs`:

- Uses GPT4o-mini model for OpenAI
- Generates training data in JSONL format for fine-tuning
- Splits data into training and verification sets

1. **Initialize and Setup**

- Creates empty train and verify files
- Sets up thread pool for parallel processing
- Initializes progress bars and counters
- Loads system prompt from `resources/prompt.md`

2. **Collect Commit History**

- Opens local git repository
- Walks through commit history
- Filters commits based on:
- Message length (20-500 chars)
- Non-merge commits only
- Diff size within limits (default 5000 chars)
- Collects valid commits up to 3x target number
- Shuffles commits for randomization

3. **Process Commits in Parallel**

- Spawns worker threads based on CPU count or user setting
- Each worker processes a subset of commits
- For each commit:
- Checks for duplicate messages
- Rates commit quality (0.0-1.0)
- Cleans up commit message
- Tracks approved commits with progress bar
- Stops when target number reached

4. **Clean and Rate Commit Messages**

- Cleanup process:
- Takes first line only
- Removes ticket references and tags
- Ensures proper capitalization
- Drops type prefixes
- Keeps messages short and meaningful
- Quality rating based on:
- Message format and clarity
- Diff alignment
- Present tense and active voice
- Description accuracy

5. **Generate Training Data**

- Creates JSONL entries with:
- System prompt
- Diff as user input
- Cleaned message as assistant output
- Splits data:
- 50% for training
- 50% for verification
- Prevents duplicate messages
- Validates cleaned messages

6. **Track Progress and Results**
- Shows real-time progress:
- Commit collection progress
- Message cleaning progress
- Approval status
- Reports final statistics:
- Total commits processed
- Training examples count
- Verification examples count
- Distribution between files

Key Features:

- Parallel processing for better performance
- Double quality check (original and cleaned messages)
- Duplicate prevention at multiple stages
- Progress visualization with spinners and bars
- Verbose mode for detailed logging

The key difference from optimize.rs is that finetune.rs focuses on generating high-quality training data for fine-tuning, while optimize.rs focuses on improving the system prompt itself.

Note: Run sync, not async
39 changes: 23 additions & 16 deletions resources/prompt.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,17 +6,24 @@ The character limit for the commit message is:
{{max_length}}
</max_length>

Please follow these guidelines when generating the commit message:

1. Analyze the diff carefully, focusing on lines marked with + or -.
2. Identify the files changed and the nature of the changes (added, modified, or deleted).
3. Determine the most significant change if multiple changes are present.
4. Create a clear, present-tense summary of the change in the imperative mood.
5. Ensure the commit message is within the specified character limit.
6. For binary files or unreadable diffs:
- Use the format "Add/Update/Delete binary file <filename>"
- Include file size in parentheses if available
- For multiple binary files, list them separated by commas
Please adhere to the following enhanced guidelines:

- **Structure**: Begin with a clear, present-tense summary of the change in the non-conventional commit format. Use a single-line summary for the change, followed by a blank line. As a best practice, consider including only one bullet point detailing context if essential, but refrain from excessive elaboration.

- **Content**: Commit messages must strictly describe the lines marked with + or - in the diff. Avoid including surrounding context, unmarked lines, or irrelevant details. Explicitly refrain from mentioning implications, reasoning, motivations, or any external context not explicitly reflected in the diff. Make sure to avoid any interpretations or assumptions beyond what is clearly stated.

- **Changes**: Clearly articulate what was added, removed, or modified based solely on what is visible in the diff. Use phrases such as "Based only on the changes visible in the diff, this commit..." to emphasize an evidence-based approach while outlining changes directly.

- **Consistency**: Ensure uniformity in tense, punctuation, and capitalization throughout the message. Use present tense and imperative form, such as "Add x to y" instead of "Added x to y".

- **Clarity & Brevity**: Craft messages that are clear and easy to understand, succinctly capturing the essence of the changes. Limit the message to the specified character limit while ensuring enough detail is provided on the primary action taken. Avoid jargon; provide plain definitions for any necessary technical terms.

- **Binary Files**: For binary files or unreadable diffs:
- Use the format "Add/Update/Delete binary file <filename>"
- Include file size in parentheses if available
- For multiple binary files, list them separated by commas

- **Accuracy & Hallucination Prevention**: Rigorously reflect only the changes visible in the diff. Avoid any speculation or inclusion of content not substantiated by the diff. Restate the necessity for messages to focus exclusively on aspects evident in the diff and to completely avoid extrapolation or assumptions about motivations or implications.

Before generating the final commit message, please analyze the diff and but keep your thought process to your self:

Expand All @@ -25,11 +32,11 @@ Before generating the final commit message, please analyze the diff and but keep
3. Identify any binary files or unreadable diffs separately.
4. Determine the most significant change if multiple changes are present.
5. Consider the impact of each change and its relevance to the overall commit message.
6. Brainstorm keywords that could be used in the commit message.
7. Propose three potential single-line summaries based on the breakdown.
8. Count the characters in each proposed summary, ensuring they meet the specified character limit.
9. Select the best summary that accurately reflects the most significant change and meets the character limit.
10. Prefixes such as `refactor:`, `fix` should be removed
6. Review the message to ensure it:
- Accurately reflects only the changes in the diff
- Follows the structure and formatting guidelines
- Contains no external context or assumptions
- Is clear and understandable to other developers

After your analysis, provide only the final commit message as output. Ensure it is clear, concise, and accurately reflects the content of the diff while adhering to the character limit. Do not include any additional text or explanations in your final output.

Expand Down
20 changes: 5 additions & 15 deletions src/bin/hook.rs
Original file line number Diff line number Diff line change
Expand Up @@ -120,7 +120,7 @@ impl Args {
.context("Failed to get patch")?;

let response = commit::generate(patch.to_string(), remaining_tokens, model).await?;
std::fs::write(&self.commit_msg_file, response.response.trim())?;
std::fs::write(&self.commit_msg_file, response.trim())?;

pb.finish_and_clear();

Expand Down Expand Up @@ -193,7 +193,7 @@ impl Args {
.context("Failed to get patch")?;

let response = commit::generate(patch.to_string(), remaining_tokens, model).await?;
std::fs::write(&self.commit_msg_file, response.response.trim())?;
std::fs::write(&self.commit_msg_file, response.trim())?;

pb.finish_and_clear();

Expand All @@ -205,22 +205,12 @@ impl Args {

#[tokio::main]
async fn main() -> Result<()> {
if std::env::var("RUST_LOG").is_ok() {
env_logger::init();
}
env_logger::init();

let time = std::time::Instant::now();
let args = Args::from_args();

if log::log_enabled!(log::Level::Debug) {
log::debug!("Arguments: {:?}", args);
}

if let Err(err) = args.execute().await {
eprintln!("{} ({:?})", err, time.elapsed());
if let Err(e) = args.execute().await {
eprintln!("Error: {}", e);
exit(1);
} else if log::log_enabled!(log::Level::Debug) {
log::debug!("Completed in {:?}", time.elapsed());
}

Ok(())
Expand Down
32 changes: 13 additions & 19 deletions src/commit.rs
Original file line number Diff line number Diff line change
Expand Up @@ -31,16 +31,15 @@ fn get_instruction_template() -> Result<String> {
/// * `Result<usize>` - The number of tokens used or an error
pub fn get_instruction_token_count(model: &Model) -> Result<usize> {
profile!("Calculate instruction tokens");
let template = get_instruction_template()?;
model.count_tokens(&template)
model.count_tokens(&get_instruction_template()?)
}

/// Creates an OpenAI request for commit message generation.
/// Creates a commit request for the OpenAI API.
///
/// # Arguments
/// * `diff` - The git diff to generate a commit message for
/// * `max_tokens` - Maximum number of tokens allowed for the response
/// * `model` - The AI model to use for generation
/// * `diff` - The diff to generate a commit message for
/// * `max_tokens` - The maximum number of tokens to use
/// * `model` - The model to use for generation
///
/// # Returns
/// * `Result<openai::Request>` - The prepared request
Expand All @@ -55,25 +54,20 @@ fn create_commit_request(diff: String, max_tokens: usize, model: Model) -> Resul
})
}

/// Generates a commit message using the AI model.
/// Generates a commit message for the given patch.
///
/// # Arguments
/// * `diff` - The git diff to generate a commit message for
/// * `max_tokens` - Maximum number of tokens allowed for the response
/// * `model` - The AI model to use for generation
/// * `patch` - The patch to generate a commit message for
/// * `remaining_tokens` - The maximum number of tokens to use
/// * `model` - The model to use for generation
///
/// # Returns
/// * `Result<openai::Response>` - The generated commit message or an error
///
/// # Errors
/// Returns an error if:
/// - max_tokens is 0
/// - OpenAI API call fails
pub async fn generate(patch: String, remaining_tokens: usize, model: Model) -> Result<openai::Response> {
/// * `Result<String>` - The generated commit message or an error
pub async fn generate(patch: String, remaining_tokens: usize, model: Model) -> Result<String> {
profile!("Generate commit message");

if remaining_tokens == 0 {
bail!("Maximum token count must be greater than zero")
if patch.is_empty() {
bail!("No changes to commit");
}

let request = create_commit_request(patch, remaining_tokens, model)?;
Expand Down
10 changes: 6 additions & 4 deletions src/filesystem.rs
Original file line number Diff line number Diff line change
Expand Up @@ -173,14 +173,16 @@ impl Filesystem {
/// * `Result<Self>` - The initialized filesystem or an error
pub fn new() -> Result<Self> {
// Get current directory
let current_dir = env::current_dir().context(ERR_CURRENT_DIR)?;
let current_dir = { env::current_dir().context(ERR_CURRENT_DIR)? };

// Get executable path
let git_ai_bin_path = env::current_exe().context("Failed to get current executable")?;
let git_ai_bin_path = { env::current_exe().context("Failed to get current executable")? };

// Open git repository
let repo = Repository::open_ext(&current_dir, Flags::empty(), Vec::<&Path>::new())
.with_context(|| format!("Failed to open repository at {}", current_dir.display()))?;
let repo = {
Repository::open_ext(&current_dir, Flags::empty(), Vec::<&Path>::new())
.with_context(|| format!("Failed to open repository at {}", current_dir.display()))?
};

// Get git path and ensure it's absolute
let git_path = {
Expand Down
Loading
Loading