Proposal: Gitea git cat-file Subprocess Management Optimization #33952

Open
@lunny

Feature Description

Background

Currently, Gitea creates a new git cat-file --batch subprocess for each request when handling git operations. While this approach is straightforward, it leads to the following issues in high-concurrency scenarios:

  • High system overhead due to frequent subprocess creation/destruction
  • Increased response latency as each request requires subprocess initialization
  • Potential overconsumption of system resources (such as file descriptors)
  • The non-gogit build of Gitea is slow for Git-related operations on Windows

Proposal Overview

Design and implement a lightweight git cat-file --batch subprocess manager to improve performance and resource utilization through subprocess reuse. Key features include:

  • Maintaining subprocess pools organized by repository path
  • Dynamically allocating idle subprocesses to handle requests
  • Automatically recycling long-idle subprocesses
  • Gracefully handling high-load situations

Detailed Design

Subprocess Manager Structure

// Imports used by the snippets in this proposal.
import (
    "errors"
    "io"
    "os/exec"
    "sync"
    "time"
)

type GitCatFileManager struct {
    // Subprocess pools indexed by repository path
    procPools       map[string]*ProcPool
    mutex           sync.RWMutex  // Guards procPools
    maxProcsPerRepo int           // Maximum number of subprocesses per repository
    idleTimeout     time.Duration // Idle timeout period
}

type ProcPool struct {
    repoPath  string
    processes []*GitCatFileProcess
    mutex     sync.Mutex // Guards processes
}

type GitCatFileProcess struct {
    cmd      *exec.Cmd
    stdin    io.WriteCloser
    stdout   io.ReadCloser
    lastUsed time.Time
    inUse    bool
    mutex    sync.Mutex // Guards lastUsed and inUse
}

Core Functionality Implementation

Acquiring a Subprocess

func (m *GitCatFileManager) Get(repoPath string) (*GitCatFileProcess, error) {
    m.mutex.RLock()
    pool, exists := m.procPools[repoPath]
    m.mutex.RUnlock()
    
    if !exists {
        m.mutex.Lock()
        // Double-check to avoid race conditions
        pool, exists = m.procPools[repoPath]
        if !exists {
            pool = &ProcPool{repoPath: repoPath}
            m.procPools[repoPath] = pool
        }
        m.mutex.Unlock()
    }
    
    // The per-repository limit lives on the manager, so pass it to the pool.
    return pool.getProcess(m.maxProcsPerRepo)
}

func (p *ProcPool) getProcess(maxProcs int) (*GitCatFileProcess, error) {
    p.mutex.Lock()
    defer p.mutex.Unlock()

    // Look for an idle process. Release updates inUse under proc.mutex,
    // so take the same lock here to avoid a data race.
    for _, proc := range p.processes {
        proc.mutex.Lock()
        if !proc.inUse {
            proc.inUse = true
            proc.lastUsed = time.Now()
            proc.mutex.Unlock()
            return proc, nil
        }
        proc.mutex.Unlock()
    }

    // Check if maximum limit has been reached
    if len(p.processes) >= maxProcs {
        return nil, errors.New("reached max processes limit for repository")
    }

    // Create a new process (returned already marked as in use)
    proc, err := newGitCatFileProcess(p.repoPath)
    if err != nil {
        return nil, err
    }

    p.processes = append(p.processes, proc)
    return proc, nil
}

Creating a New Subprocess

func newGitCatFileProcess(repoPath string) (*GitCatFileProcess, error) {
    cmd := exec.Command("git", "-C", repoPath, "cat-file", "--batch")
    
    stdin, err := cmd.StdinPipe()
    if err != nil {
        return nil, err
    }
    
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        stdin.Close()
        return nil, err
    }
    
    if err := cmd.Start(); err != nil {
        stdin.Close()
        stdout.Close()
        return nil, err
    }
    
    return &GitCatFileProcess{
        cmd:      cmd,
        stdin:    stdin,
        stdout:   stdout,
        lastUsed: time.Now(),
        inUse:    true,
    }, nil
}

Releasing a Subprocess

func (m *GitCatFileManager) Release(proc *GitCatFileProcess) {
    proc.mutex.Lock()
    proc.inUse = false
    proc.lastUsed = time.Now()
    proc.mutex.Unlock()
}
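
A reused process is only safe to hand to the next caller if the previous reply was consumed completely, otherwise the next reader starts in the middle of an old object. As a rough sketch of what "consume one reply" could look like: git cat-file --batch answers each request with a header line "<oid> <type> <size>", then <size> bytes of content and a trailing newline, or "<name> missing" for unknown objects. The names batchObject, readBatchObject and parseAll are made up for illustration, and the object IDs in main are placeholders, not real hashes.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// batchObject is a hypothetical container for one `git cat-file --batch` reply.
type batchObject struct {
	ID      string
	Type    string
	Content []byte
	Missing bool
}

// readBatchObject consumes exactly one reply from r: the header line,
// the content bytes, and the newline that follows the content.
func readBatchObject(r *bufio.Reader) (batchObject, error) {
	header, err := r.ReadString('\n')
	if err != nil {
		return batchObject{}, err
	}
	fields := strings.Fields(header)
	if len(fields) == 2 && fields[1] == "missing" {
		return batchObject{ID: fields[0], Missing: true}, nil
	}
	if len(fields) != 3 {
		return batchObject{}, fmt.Errorf("unexpected header %q", header)
	}
	size, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil || size < 0 {
		return batchObject{}, fmt.Errorf("bad size in header %q", header)
	}
	content := make([]byte, size)
	if _, err := io.ReadFull(r, content); err != nil {
		return batchObject{}, err
	}
	// Consume the trailing newline so the next reply starts at a header.
	if b, err := r.ReadByte(); err != nil {
		return batchObject{}, err
	} else if b != '\n' {
		return batchObject{}, fmt.Errorf("expected newline after content, got %q", b)
	}
	return batchObject{ID: fields[0], Type: fields[1], Content: content}, nil
}

// parseAll reads replies from an in-memory stream until EOF. It stands in
// for the stdout pipe of a pooled process.
func parseAll(stream string) ([]batchObject, error) {
	r := bufio.NewReader(strings.NewReader(stream))
	var out []batchObject
	for {
		obj, err := readBatchObject(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, obj)
	}
}

func main() {
	objs, err := parseAll("abc123 blob 5\nhello\ndeadbeef missing\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, o := range objs {
		fmt.Printf("%s type=%q missing=%v content=%q\n", o.ID, o.Type, o.Missing, o.Content)
	}
}
```

In a real pool the bufio.Reader would probably have to be created once per process and stored next to stdout, because a buffered reader may read ahead and a fresh reader per request would lose those bytes.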

Periodic Cleanup

func (m *GitCatFileManager) StartCleaner(interval time.Duration) {
    ticker := time.NewTicker(interval)
    
    go func() {
        for range ticker.C {
            m.cleanIdleProcesses()
        }
    }()
}

func (m *GitCatFileManager) cleanIdleProcesses() {
    now := time.Now()
    m.mutex.Lock()
    defer m.mutex.Unlock()

    // Note: Get may still hold a pointer to a pool that is deleted below,
    // because it drops the manager lock before calling getProcess.
    for repoPath, pool := range m.procPools {
        pool.mutex.Lock()

        activeProcs := make([]*GitCatFileProcess, 0, len(pool.processes))
        for _, proc := range pool.processes {
            proc.mutex.Lock()
            if !proc.inUse && now.Sub(proc.lastUsed) > m.idleTimeout {
                // Close long-idle processes. Closing stdin lets git exit
                // on EOF; Kill covers a process that does not.
                _ = proc.stdin.Close()
                _ = proc.cmd.Process.Kill()
                _ = proc.cmd.Wait() // Reap the child to avoid a zombie process
                proc.mutex.Unlock()
            } else {
                proc.mutex.Unlock()
                activeProcs = append(activeProcs, proc)
            }
        }

        pool.processes = activeProcs
        // Decide while the pool lock is still held
        empty := len(activeProcs) == 0
        pool.mutex.Unlock()

        // Remove empty process pools
        if empty {
            delete(m.procPools, repoPath)
        }
    }
}

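Starting the Manager

// Global instance. NewGitCatFileManager is not defined in this proposal;
// it is assumed to initialize procPools and store both settings.
var gitCatFileManager = NewGitCatFileManager(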
    10,               // Maximum subprocesses per repository
    5*time.Minute,    // Idle timeout period
)

func init() {
    // Start the cleanup goroutine, checking once per minute
    gitCatFileManager.StartCleaner(1 * time.Minute)
}
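
The snippet above calls NewGitCatFileManager, which this proposal does not define. A minimal sketch of what it might look like follows. The important part is creating the map, since Get writes to procPools and writing to a nil map panics in Go. The sketch also shows a variant of StartCleaner that returns a stop function so the ticker and goroutine can be shut down; that return value, the cleanRuns counter, the stubbed cleanIdleProcesses and runDemo are all assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ProcPool is left empty here; the proposal defines its real fields.
type ProcPool struct{}

// GitCatFileManager is a trimmed copy of the struct from the proposal.
type GitCatFileManager struct {
	procPools       map[string]*ProcPool
	mutex           sync.RWMutex
	maxProcsPerRepo int
	idleTimeout     time.Duration
	cleanRuns       int // hypothetical counter, only used by this demo
}

// NewGitCatFileManager is an assumed constructor.
func NewGitCatFileManager(maxProcsPerRepo int, idleTimeout time.Duration) *GitCatFileManager {
	return &GitCatFileManager{
		procPools:       make(map[string]*ProcPool), // must not stay nil
		maxProcsPerRepo: maxProcsPerRepo,
		idleTimeout:     idleTimeout,
	}
}

// cleanIdleProcesses is a stub that only counts how often it ran.
func (m *GitCatFileManager) cleanIdleProcesses() {
	m.mutex.Lock()
	m.cleanRuns++
	m.mutex.Unlock()
}

// StartCleaner runs cleanIdleProcesses every interval until the returned
// stop function is called. Calling stop more than once is harmless.
func (m *GitCatFileManager) StartCleaner(interval time.Duration) (stop func()) {
	ticker := time.NewTicker(interval)
	done := make(chan struct{})
	var once sync.Once
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				m.cleanIdleProcesses()
			case <-done:
				return
			}
		}
	}()
	return func() { once.Do(func() { close(done) }) }
}

// runDemo starts the cleaner with a short interval and reports how many
// cleanup passes ran before it was stopped.
func runDemo() int {
	m := NewGitCatFileManager(10, 5*time.Minute)
	stop := m.StartCleaner(5 * time.Millisecond)
	time.Sleep(60 * time.Millisecond)
	stop()
	m.mutex.RLock()
	defer m.mutex.RUnlock()
	return m.cleanRuns
}

func main() {
	fmt.Println("cleanup passes:", runDemo())
}
```

Being able to stop the cleaner would mostly matter for tests and for a graceful shutdown; for a process-wide singleton that lives as long as Gitea, the simpler version may be enough.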

Implementation Considerations

  • Error Handling: Detect and handle abnormal subprocess exits
  • Thread Safety: Use appropriate mutex locks to ensure concurrency safety
  • Resource Limits: Add a global maximum process limit to prevent resource exhaustion
  • Monitoring Metrics: Add monitoring for subprocess pool usage to facilitate troubleshooting
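
For the global limit, one possible approach (not part of the design above) is a counting semaphore built on a buffered channel: a slot is reserved before a process is spawned and given back once it has exited. The names procLimiter, tryAcquire, release and errGlobalLimit are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errGlobalLimit is a hypothetical error for a full limiter.
var errGlobalLimit = errors.New("reached global subprocess limit")

// procLimiter is a counting semaphore: the channel capacity is the
// maximum number of subprocesses across all repositories.
type procLimiter struct {
	slots chan struct{}
}

func newProcLimiter(max int) *procLimiter {
	return &procLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire reserves one slot without blocking.
func (l *procLimiter) tryAcquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return errGlobalLimit
	}
}

// release frees one slot. A release without a matching acquire is ignored.
func (l *procLimiter) release() {
	select {
	case <-l.slots:
	default:
	}
}

func main() {
	limiter := newProcLimiter(2)
	fmt.Println(limiter.tryAcquire()) // first slot
	fmt.Println(limiter.tryAcquire()) // second slot
	fmt.Println(limiter.tryAcquire()) // limit reached
	limiter.release()
	fmt.Println(limiter.tryAcquire()) // slot is free again
}
```

In the manager this would presumably wrap newGitCatFileProcess (acquire before exec.Command, release after cmd.Wait in the cleaner). Whether callers should fail fast as shown here or block until a slot frees up is an open design choice.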

Performance Expectations

  • Reduced Latency: Most requests use already-initialized subprocesses, avoiding startup overhead
  • Increased Throughput: Reduced system-level call overhead in high-concurrency scenarios
  • Lowered Resource Consumption: Control of total subprocess count prevents excessive resource usage

Drawbacks

  • Increased Complexity: The solution adds complexity to Gitea's codebase with new data structures, synchronization mechanisms, and lifecycle management that will need to be maintained.
  • Memory Footprint: Long-running subprocesses will consume more memory over time compared to short-lived ones. Each cached subprocess maintains open file handles and memory buffers.
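  • Stuck Sessions: A subprocess that is never released may stay marked as in use forever, so each subprocess session should have a timeout.
  • Stale Repository State: It is unclear how a long-running git cat-file --batch process behaves when the repository is updated while it is running. This needs to be investigated.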
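
The session timeout could be sketched with time.AfterFunc: a timer is armed when a process is handed out, and if Release has not been called by the deadline, a callback fires that would kill the process. Everything here (session, startSession, the onExpire callback, runDemo) is a hypothetical illustration, and the callback only bumps a counter instead of killing anything.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// session is a hypothetical lease on a pooled subprocess.
type session struct {
	mu      sync.Mutex
	done    bool // released or expired
	expired bool // the deadline fired before Release
	timer   *time.Timer
}

// startSession arms a timer that calls onExpire if Release is not
// called within limit.
func startSession(limit time.Duration, onExpire func()) *session {
	s := &session{}
	s.timer = time.AfterFunc(limit, func() {
		s.mu.Lock()
		if s.done {
			s.mu.Unlock()
			return
		}
		s.done = true
		s.expired = true
		s.mu.Unlock()
		onExpire()
	})
	return s
}

// Release ends the session. It reports false if the session had already
// expired, in which case the process must not go back into the pool.
func (s *session) Release() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.done {
		return !s.expired
	}
	s.done = true
	s.timer.Stop()
	return true
}

// runDemo releases one session in time and lets another one expire.
func runDemo() (onTime, late bool, kills int32) {
	var killed int32
	kill := func() { atomic.AddInt32(&killed, 1) } // stand-in for killing the process

	fast := startSession(time.Second, kill)
	onTime = fast.Release()

	slow := startSession(10*time.Millisecond, kill)
	time.Sleep(100 * time.Millisecond)
	late = slow.Release()

	return onTime, late, atomic.LoadInt32(&killed)
}

func main() {
	onTime, late, kills := runDemo()
	fmt.Println("released in time:", onTime)
	fmt.Println("late release accepted:", late)
	fmt.Println("expiry callbacks:", kills)
}
```

In the manager, onExpire would likely call proc.cmd.Process.Kill() and drop the process from its pool, since a half-read reply makes the process unsafe to reuse anyway.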

TODO

  • Implement a basic version and conduct performance benchmark tests
  • Consider adding subprocess health check mechanisms
  • Integrate with Gitea's monitoring and trace system

Labels: type/proposal (The new feature has not been accepted yet but needs to be discussed first.)