@roomote roomote bot commented Nov 19, 2025

This PR attempts to address Issue #9403. Feedback and guidance are welcome.

Summary

This enhancement revamps the codebase indexing system to provide better understanding of code relationships and more intelligent chunking. The implementation introduces graph-based indexing alongside the existing vector store to enable hierarchical code understanding.

Key Improvements

1. Graph-Based Indexing

  • Implemented GraphIndexStore using Qdrant collections for nodes and edges
  • Support for hierarchical relationships between code elements (classes, functions, imports, etc.)
  • Enables traversal of code relationships for better context understanding

2. Relationship Extraction

  • Created RelationshipExtractor to analyze AST and extract code relationships
  • Support for multiple languages (TypeScript, JavaScript, Python, etc.)
  • Identifies various relationship types: contains, imports, extends, implements, calls, references
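The relationship types listed above could be modeled roughly as follows. This is only a sketch pieced together from the edge fields visible in the code quoted in this PR's review comments (id, source, target, type, weight, metadata). The enum string values, the makeEdge helper, and the ID format produced by generateEdgeId are assumptions for illustration; the real definitions in graph-index.ts may differ.

```typescript
// Hypothetical reconstruction; not the PR's actual definitions.
enum EdgeType {
  CONTAINS = "contains",
  IMPORTS = "imports",
  EXTENDS = "extends",
  IMPLEMENTS = "implements",
  CALLS = "calls",
  REFERENCES = "references",
}

interface CodeGraphEdge {
  id: string
  source: string // node ID of the element the relationship starts from
  target: string // node ID of the element the relationship points to
  type: EdgeType
  weight: number
  metadata: Record<string, unknown>
}

// Stand-in for GraphIndexStore.generateEdgeId: the same inputs always
// yield the same ID, so re-indexing overwrites an edge instead of duplicating it.
function generateEdgeId(source: string, target: string, type: EdgeType): string {
  return `${source}->${target}:${type}`
}

// Illustrative helper (not in the PR) that builds an edge with a default weight.
function makeEdge(source: string, target: string, type: EdgeType, weight = 1.0): CodeGraphEdge {
  return { id: generateEdgeId(source, target, type), source, target, type, weight, metadata: {} }
}
```

For example, makeEdge("fileA#Foo", "fileB#Bar", EdgeType.EXTENDS) would describe class Foo extending class Bar, assuming both strings are real node IDs.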

3. Semantic Chunking

  • Implemented SemanticParser for intelligent code chunking based on AST analysis
  • Respects code structure and logical boundaries
  • Handles large code blocks by splitting them semantically (e.g., by methods in classes)
  • Maintains context through scope tracking and dependency analysis

4. Context-Aware Search

  • Created ContextAwareSearchService for enhanced retrieval
  • Combines vector similarity with graph relationships
  • Provides call chains, dependency trees, and related code in search results
  • Enables location-based context retrieval for better code understanding
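One way to picture "vector similarity combined with graph relationships" is a bounded breadth-first expansion: the vector search supplies seed node IDs, and the graph supplies their neighbors up to a depth limit. The sketch below is an illustration under that assumption, not the logic of ContextAwareSearchService; the Edge shape and the function name expandWithNeighbors are hypothetical, and edges are treated as undirected for simplicity.

```typescript
interface Edge {
  source: string
  target: string
}

// Returns every node reachable from the seeds within maxDepth hops,
// mapped to its hop distance (seeds have distance 0).
function expandWithNeighbors(seedIds: string[], edges: Edge[], maxDepth: number): Map<string, number> {
  // Build an undirected adjacency list once, so each hop is a map lookup.
  const adjacency = new Map<string, string[]>()
  for (const edge of edges) {
    if (!adjacency.has(edge.source)) adjacency.set(edge.source, [])
    if (!adjacency.has(edge.target)) adjacency.set(edge.target, [])
    adjacency.get(edge.source)!.push(edge.target)
    adjacency.get(edge.target)!.push(edge.source)
  }

  const depth = new Map<string, number>()
  let frontier: string[] = []
  for (const id of seedIds) {
    if (!depth.has(id)) {
      depth.set(id, 0)
      frontier.push(id)
    }
  }

  for (let d = 1; d <= maxDepth && frontier.length > 0; d++) {
    const next: string[] = []
    for (const id of frontier) {
      for (const neighbor of adjacency.get(id) ?? []) {
        if (!depth.has(neighbor)) {
          depth.set(neighbor, d)
          next.push(neighbor)
        }
      }
    }
    frontier = next
  }
  return depth
}
```

The hop distance could then be used to down-weight related code relative to the direct vector hits, though how the PR actually ranks results is not stated in this description.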

Technical Details

The implementation works alongside the existing Qdrant vector store without breaking changes. The graph-based index uses separate collections for nodes and edges, enabling efficient graph traversal while maintaining backward compatibility.

Benefits

  • Better Code Understanding: The system now understands relationships between code elements
  • Improved Search Results: Context-aware search provides more relevant results with related code
  • Intelligent Chunking: Semantic chunking respects code structure for better retrieval
  • Scalability: Graph-based approach scales well with large codebases

Testing

The implementation maintains compatibility with existing code and passes all linting and type checks. The graph-based indexing is designed to work alongside the existing vector store.

Next Steps

Future enhancements could include:

  • Caching layer for frequently accessed code patterns
  • Incremental re-indexing optimization
  • Support for more programming languages
  • Advanced graph algorithms for code analysis

Closes #9403


Important

Revamp codebase indexing with graph-based relationships, semantic chunking, and context-aware search.

  • Graph-Based Indexing:
    • Implement GraphIndexStore using Qdrant collections for nodes and edges.
    • Supports hierarchical relationships between code elements.
  • Relationship Extraction:
    • Add RelationshipExtractor to analyze AST and extract code relationships.
    • Supports multiple languages (TypeScript, JavaScript, Python, etc.).
  • Semantic Chunking:
    • Implement SemanticParser for intelligent code chunking based on AST analysis.
    • Handles large code blocks by splitting them semantically.
  • Context-Aware Search:
    • Create ContextAwareSearchService for enhanced retrieval.
    • Combines vector similarity with graph relationships.
  • Interfaces:
    • Update graph-index.ts to include new node and edge types.
    • Define interfaces for graph-based indexing and context-aware search.

This description was created by Ellipsis for c7e0475.

- Add graph-based index interfaces for hierarchical code relationships
- Implement GraphIndexStore using Qdrant for node and edge storage
- Create RelationshipExtractor to analyze AST and extract code relationships
- Add ContextAwareSearchService for enhanced retrieval with relationship traversal
- Implement SemanticParser for intelligent code chunking based on AST analysis
- Support multiple languages (TypeScript, JavaScript, Python, etc.)
- Enable context-aware search with call chains and dependency trees

This enhancement provides better codebase understanding through:
- Hierarchical indexing of code elements (classes, functions, imports, etc.)
- Graph-based relationships between code components
- Semantic chunking that respects code structure
- Context-aware retrieval that includes related code
@roomote roomote bot requested review from cte, jr and mrubens as code owners November 19, 2025 20:23
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels Nov 19, 2025
@roomote roomote bot mentioned this pull request Nov 19, 2025
2 tasks
roomote bot commented Nov 19, 2025

Review completed. Found 5 issues that need attention:

  • Unresolved relationship targets create broken edges in the graph (relationship-extractor.ts:156-165)
  • Hard-coded embedding dimension will cause failures with different models (context-aware-search.ts:256)
  • N+1 query problem in getSubgraph causes performance degradation (graph-index-store.ts:286-293)
  • Type safety issue with hard-coded node type and unvalidated payload (context-aware-search.ts:268-280)
  • O(n²) complexity in relationship establishment causes performance issues (semantic-parser.ts:499-511)

Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.

Comment on lines +156 to +165
if (superClass && currentNodeId) {
  edges.push({
    id: GraphIndexStore.generateEdgeId(currentNodeId, superClass, EdgeType.EXTENDS),
    source: currentNodeId,
    target: superClass, // This would need to be resolved to actual node ID
    type: EdgeType.EXTENDS,
    weight: 1.0,
    metadata: { unresolved: true },
  })
}

Unresolved relationship targets will create broken edges in the graph. When extracting inheritance/implementation relationships, edges are created with string identifiers (e.g., superClass, interfaceName) as targets instead of actual node IDs, marked as unresolved: true. However, there's no code to resolve these strings to node IDs later, leaving the graph with edges pointing to non-existent nodes. This breaks graph traversal features like call chains and dependency trees whenever inheritance is involved.
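A resolution pass of the kind this comment asks for might look like the sketch below. It assumes nodes carry a name field that can be matched against the unresolved target string; the GraphNode and GraphEdge shapes and the function resolveEdgeTargets are hypothetical simplifications, not code from the PR.

```typescript
interface GraphNode {
  id: string
  name: string
}

interface GraphEdge {
  id: string
  source: string
  target: string
  type: string
  weight: number
  metadata: Record<string, unknown>
}

// Second pass over the extracted graph: replace name-based targets with node IDs.
// Edges whose target name matches zero or several nodes are returned separately
// so the caller can drop them or keep them flagged instead of storing broken links.
function resolveEdgeTargets(
  nodes: GraphNode[],
  edges: GraphEdge[],
): { resolved: GraphEdge[]; dangling: GraphEdge[] } {
  const idsByName = new Map<string, string[]>()
  for (const node of nodes) {
    const ids = idsByName.get(node.name) ?? []
    ids.push(node.id)
    idsByName.set(node.name, ids)
  }

  const resolved: GraphEdge[] = []
  const dangling: GraphEdge[] = []
  for (const edge of edges) {
    if (edge.metadata.unresolved !== true) {
      resolved.push(edge) // target is already a node ID
      continue
    }
    const candidates = idsByName.get(edge.target) ?? []
    if (candidates.length === 1) {
      resolved.push({
        ...edge,
        target: candidates[0],
        metadata: { ...edge.metadata, unresolved: false },
      })
    } else {
      dangling.push(edge) // unknown or ambiguous name
    }
  }
  return { resolved, dangling }
}
```

Two caveats apply to this sketch: the edge ID was generated from the unresolved name and would likely need regenerating after resolution, and matching by bare name ignores imports and scoping, so a real fix would probably need to consult the file's import edges to disambiguate.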

Fix it with Roo Code or mention @roomote and request a fix.

@hannesrudolph hannesrudolph added the Issue/PR - Triage (New issue. Needs quick review to confirm validity and assign labels.) label Nov 19, 2025

// Search through vector store for nodes in this file
const results = await this.vectorStore.search(
  new Array(768).fill(0), // Dummy embedding

Hard-coded embedding dimension will cause failures if the system uses a different embedding model. This creates a dummy 768-dimensional vector, which only works with embedding models that output 768 dimensions (OpenAI's text-embedding-ada-002, for example, outputs 1536), but the codebase supports multiple embedding providers (OpenAI, Ollama, Gemini, Mistral, etc.) with varying dimensions. If the system is configured with a model of a different size, Qdrant queries will fail with dimension mismatch errors. The vector size should be obtained from the embedder configuration or passed as a parameter.
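Deriving the placeholder size from configuration could be as simple as the sketch below. The EmbedderConfig shape and the function makePlaceholderVector are hypothetical; how the codebase actually exposes the embedder's dimension is not shown in this PR.

```typescript
// Hypothetical config shape; the real embedder settings may be structured differently.
interface EmbedderConfig {
  provider: string
  modelId: string
  dimension: number
}

// Builds a zero vector whose length follows the configured model
// instead of a hard-coded 768.
function makePlaceholderVector(config: EmbedderConfig): number[] {
  if (!Number.isInteger(config.dimension) || config.dimension <= 0) {
    throw new Error(`Invalid embedding dimension for ${config.provider}/${config.modelId}: ${config.dimension}`)
  }
  return new Array<number>(config.dimension).fill(0)
}
```

If the goal is only to list the points that belong to a file, a filter-only request (Qdrant's scroll API accepts a payload filter without a query vector) might avoid the placeholder vector altogether.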

Comment on lines +286 to +293
const allEdges: CodeGraphEdge[] = []
for (const id of nodeIds) {
  const edges = await this.getEdges(id)
  for (const edge of edges) {
    if (nodeIds.has(edge.source) && nodeIds.has(edge.target)) {
      allEdges.push(edge)
    }
  }

N+1 query problem in getSubgraph will cause severe performance degradation. The method iterates through all connected node IDs and calls getEdges(id) for each one individually, resulting in N separate database queries where N is the number of nodes in the subgraph. For a subgraph with 100 nodes, this executes 100 separate queries. This should be optimized to fetch all edges in a single query using a batch operation or a filter that matches multiple node IDs at once.
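A single-query version could push the "both endpoints are in the subgraph" check into the database filter. The sketch below only builds such a filter, using Qdrant's must / match-any filter syntax. It assumes edge points store their endpoints under payload keys named source and target, which this PR does not confirm, and buildSubgraphEdgeFilter is a hypothetical helper.

```typescript
// Minimal local types describing the subset of Qdrant's filter syntax used here.
interface MatchAnyCondition {
  key: string
  match: { any: string[] }
}

interface EdgeFilter {
  must: MatchAnyCondition[]
}

// One filter that selects every edge whose source AND target are both in the
// subgraph, replacing the per-node getEdges(id) calls and the in-memory check.
function buildSubgraphEdgeFilter(nodeIds: Set<string>): EdgeFilter {
  const ids = Array.from(nodeIds)
  return {
    must: [
      { key: "source", match: { any: ids } }, // assumed payload key
      { key: "target", match: { any: ids } }, // assumed payload key
    ],
  }
}
```

The resulting object would be passed as the filter of one paginated scroll request against the edges collection. For very large subgraphs the ID list may need to be split into batches, which still keeps the query count far below one per node.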

Comment on lines +268 to +280
  // Convert to graph node
  const node: CodeGraphNode = {
    id: result.id as string,
    type: CodeNodeType.FUNCTION, // Would need proper type detection
    name: `${filePath}:${line}`,
    filePath,
    startLine: result.payload.startLine,
    endLine: result.payload.endLine,
    content: result.payload.codeChunk,
    metadata: {},
  }
  allNodes.push(node)
}

Type safety issue with hard-coded node type and unvalidated payload structure. The code unconditionally assigns type: CodeNodeType.FUNCTION to all nodes regardless of their actual type, and assumes result.payload contains specific fields (filePath, startLine, endLine, codeChunk) without validation. If the vector store returns results with a different payload structure or if nodes of other types (classes, interfaces, etc.) are present at the location, this will create incorrectly typed graph nodes or throw runtime errors when accessing undefined properties.
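Validating the payload before building a node could be done with a type guard along these lines. This is a sketch under assumptions: the payload field names come from the snippet above, but the optional nodeType field, the UNKNOWN enum member, the enum string values, and both helper functions are hypothetical.

```typescript
// Only FUNCTION appears in the PR snippet; the other members are assumed.
enum CodeNodeType {
  FUNCTION = "function",
  CLASS = "class",
  INTERFACE = "interface",
  UNKNOWN = "unknown",
}

interface ChunkPayload {
  filePath: string
  startLine: number
  endLine: number
  codeChunk: string
  nodeType?: string // assumed: the indexer would need to store this
}

// Runtime check that a search result's payload has the fields the conversion reads.
function isChunkPayload(value: unknown): value is ChunkPayload {
  if (typeof value !== "object" || value === null) return false
  const p = value as Record<string, unknown>
  return (
    typeof p.filePath === "string" &&
    typeof p.startLine === "number" &&
    typeof p.endLine === "number" &&
    typeof p.codeChunk === "string"
  )
}

const KNOWN_TYPES: CodeNodeType[] = [CodeNodeType.FUNCTION, CodeNodeType.CLASS, CodeNodeType.INTERFACE]

// Maps a stored type string to the enum, falling back to UNKNOWN
// rather than labeling everything as a function.
function toNodeType(raw: unknown): CodeNodeType {
  for (const known of KNOWN_TYPES) {
    if (raw === known) return known
  }
  return CodeNodeType.UNKNOWN
}
```

In the conversion loop, results failing isChunkPayload(result.payload) would be skipped (or logged), and the node type would come from toNodeType(result.payload.nodeType) instead of the hard-coded CodeNodeType.FUNCTION.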

Comment on lines +499 to +511
private establishBlockRelationships(blocks: SemanticCodeBlock[]): void {
  // Find related blocks based on references
  for (const block of blocks) {
    for (const otherBlock of blocks) {
      if (block === otherBlock) continue

      // Check if block references the other
      if (block.content.includes(otherBlock.identifier || "")) {
        block.relatedBlockIds.push(otherBlock.segmentHash)
      }
    }
  }
}

O(n²) complexity in relationship establishment will cause severe performance issues with large files. The nested loop compares every block with every other block using simple string inclusion (block.content.includes(otherBlock.identifier)), resulting in O(n²) time complexity. For a file with 100 blocks, this performs roughly 10,000 substring searches (9,900 once self-comparisons are skipped), and each search scans the block's full content. This will cause noticeable slowdowns when indexing large files. Consider using more efficient data structures like a trie or hash map for identifier lookups, or limiting relationship detection to adjacent scopes.
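The hash map suggestion could be sketched as follows: index the declared identifiers once, then tokenize each block's content and look every token up in the index. The Block interface is a simplified stand-in for SemanticCodeBlock, and the identifier regex is an assumption that fits JavaScript-like languages only. The sketch also skips blocks without an identifier; in the quoted code, a missing identifier falls back to "", and includes("") is always true, so such blocks end up related to every other block.

```typescript
// Simplified stand-in for SemanticCodeBlock (only the fields used here).
interface Block {
  identifier: string | null
  content: string
  segmentHash: string
  relatedBlockIds: string[]
}

function establishBlockRelationships(blocks: Block[]): void {
  // Pass 1: map each declared identifier to the hashes of the blocks declaring it.
  const declaredBy = new Map<string, string[]>()
  for (const block of blocks) {
    if (!block.identifier) continue // anonymous blocks cannot be referenced by name
    const hashes = declaredBy.get(block.identifier) ?? []
    hashes.push(block.segmentHash)
    declaredBy.set(block.identifier, hashes)
  }

  // Pass 2: scan each block's content once and look up its distinct tokens.
  for (const block of blocks) {
    const tokens = new Set<string>(block.content.match(/[A-Za-z_$][\w$]*/g) ?? [])
    const related = new Set<string>()
    tokens.forEach((token) => {
      for (const hash of declaredBy.get(token) ?? []) {
        if (hash !== block.segmentHash) related.add(hash)
      }
    })
    related.forEach((hash) => block.relatedBlockIds.push(hash))
  }
}
```

The work is then proportional to the total content size plus the number of matches, rather than to the square of the block count. Note that this changes behavior slightly: whole-token matching no longer treats foobar as a reference to foo, which substring matching does.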

Projects

Status: Triage

Development

Successfully merging this pull request may close these issues.

[ENHANCEMENT] Revamp codebase index

3 participants