Skip to content

Fix/cg pagination#431

Open
ZohaibHassan16 wants to merge 2 commits intoHawksight-AI:mainfrom
ZohaibHassan16:fix/cg-pagination
Open

Fix/cg pagination#431
ZohaibHassan16 wants to merge 2 commits intoHawksight-AI:mainfrom
ZohaibHassan16:fix/cg-pagination

Conversation

@ZohaibHassan16
Copy link
Copy Markdown
Collaborator

Summary

Fixes critical data pipeline bug that caused valid edges to be dropped during the load_from_file phase and patches a vulnerability that caused the frontend layout engine to crash due to ghost nodes.

Changes Made:

  • Updated the add_edges to gracefully accept both naming conventions ("source" OR "source_id")
  • Added if not source_id or not target_id: continue in add_edges to prevent structurally broken data from entering RAM.
  • Added if n.node_id and n.is_active(now):) which guarantees mathematically sound data for UI physical engine.

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Optimize ContextGraph pagination with lazy evaluation and edge ID mapping

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
β€’ Optimize pagination with lazy evaluation using generators
β€’ Accept both naming conventions for edge source/target IDs
β€’ Filter invalid edges and nodes during data loading
β€’ Ensure deterministic pagination with sorted node IDs
Diagram
flowchart LR
  A["Edge data with mixed naming"] -->|Accept source/target OR source_id/target_id| B["Normalize edge IDs"]
  B -->|Filter invalid edges| C["Valid edges only"]
  D["Node queries"] -->|Lazy generator evaluation| E["Deterministic sorted results"]
  E -->|Pagination with islice| F["Paginated output"]
  G["Active node filtering"] -->|Generator-based| H["Memory efficient results"]
Loading

Grey Divider

File Changes

1. semantica/context/context_graph.py Enhancement, bug fix +53/-69

Lazy pagination and edge ID mapping normalization

β€’ Added itertools import for lazy pagination evaluation
β€’ Enhanced add_edges() to accept both "source"/"target" and "source_id"/"target_id" naming
 conventions
β€’ Added validation to skip edges with missing source_id or target_id
β€’ Refactored find_nodes() to use lazy generators with sorted deterministic ordering
β€’ Refactored find_active_nodes() to use generator-based filtering with node_id validation
β€’ Refactored find_edges() to use lazy generators with null-safety defaults
β€’ Replaced list slicing with itertools.islice() for memory-efficient pagination

semantica/context/context_graph.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Apr 1, 2026

Code Review by Qodo

🐞 Bugs (3) πŸ“˜ Rule violations (0) πŸ“Ž Requirement gaps (0)

Grey Divider


Action required

1. Sorted IDs TypeError 🐞 Bug ≑ Correctness
Description
find_nodes()/find_active_nodes() call sorted() on node_type_index IDs; if malformed node_ids (e.g.,
None/int) exist alongside strings, sorting raises TypeError and crashes node-type pagination.
Malformed node_ids can enter via add_nodes() and are indexed verbatim in _add_internal_node().
Code

semantica/context/context_graph.py[R789-791]

+                # Sets are unordered, sort IDs for deterministic pagination
+                raw_ids = sorted(self.node_type_index.get(node_type, set()))
+                source = (self.nodes[nid] for nid in raw_ids if nid in self.nodes)
Evidence
add_nodes() forwards node.get("id") into ContextNode without validating type/emptiness, and
_add_internal_node() inserts that node_id into node_type_index. find_nodes()/find_active_nodes()
then sort() that set, which will raise at runtime if it contains mixed/non-comparable types (common
with malformed imports).

semantica/context/context_graph.py[349-392]
semantica/context/context_graph.py[1058-1067]
semantica/context/context_graph.py[783-806]
semantica/context/context_graph.py[808-821]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`find_nodes()`/`find_active_nodes()` sort IDs from `node_type_index` without validating ID types. If malformed node IDs (e.g., `None`, `int`) are present, `sorted()` raises `TypeError` and breaks pagination.

### Issue Context
Malformed IDs can enter through `add_nodes()` (no validation) and are indexed verbatim in `_add_internal_node()`.

### Fix Focus Areas
- semantica/context/context_graph.py[783-821]
- semantica/context/context_graph.py[349-392]
- semantica/context/context_graph.py[1058-1067]

### Implementation notes
- When building `raw_ids`, filter to non-empty strings only (e.g., `isinstance(nid, str) and nid`) before sorting.
- Preferably also prevent bad IDs at ingestion time: in `add_nodes()`, skip/reject nodes with missing/invalid `id` (and optionally log/count them).

β“˜ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Totals mismatch pagination 🐞 Bug ≑ Correctness
Description
find_nodes()/find_edges() now filter out invalid items (missing node_id / missing endpoints), but
stats() still counts all stored nodes/edges and index entries. Explorer pagination uses stats()
totals, so the API can report totals larger than the number of items find_* can ever return
(empty/incorrect pages).
Code

semantica/context/context_graph.py[R985-988]

+                for e in source if e.source_id and e.target_id
+            )
+            stop = skip + limit if limit is not None else None
+            return list(itertools.islice(gen, skip, stop))
Evidence
find_nodes filters out nodes with falsy node_id and find_edges filters out edges missing endpoints,
while stats() counts len(self.nodes)/len(self.edges) and raw index sizes. ExplorerSession uses those
stats totals for pagination when no in-memory filtering is applied, so totals can become
inconsistent with the actual page source.

semantica/context/context_graph.py[783-806]
semantica/context/context_graph.py[970-999]
semantica/explorer/session.py[150-194]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`stats()` totals can disagree with what `find_nodes()`/`find_edges()` return because `find_*` filters invalid entities but `stats()` counts everything. Explorer pagination uses `stats()` for totals, producing incorrect totals/empty pages.

### Issue Context
- `find_nodes()` filters `if n.node_id`
- `find_edges()` filters `if e.source_id and e.target_id`
- `stats()` counts `len(self.nodes)` / `len(self.edges)` and raw index sizes
- ExplorerSession uses `stats()` totals with `find_*` paging.

### Fix Focus Areas
- semantica/context/context_graph.py[783-807]
- semantica/context/context_graph.py[970-999]
- semantica/explorer/session.py[150-194]

### Implementation notes
Pick one consistent approach:
1) **Preferred:** prevent invalid nodes/edges from being stored/indexed (validate in `add_nodes()` similar to `add_edges()`), and optionally clean existing invalid entries on load/import.
2) Or compute `stats()` counts using the same validity criteria as `find_*` (and ensure `node_types`/`edge_types` reflect only valid IDs/edges).

β“˜ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. Skipped edges unreported 🐞 Bug ✧ Quality
Description
add_edges() now silently skips edges missing source/target IDs with no logging or accounting, making
malformed imports/load_from_file partially apply without any diagnostics. This reduces debuggability
when edges_added is unexpectedly lower than input size.
Code

semantica/context/context_graph.py[R412-416]

+                source_id = edge.get("source_id") or edge.get("source")
+                target_id = edge.get("target_id") or edge.get("target")
+
+                if not source_id or not target_id:
+                    continue
Evidence
add_edges() drops edges via continue when endpoints are missing, and load_from_file() feeds file
edges directly into add_edges(); no warning is emitted identifying rejected edges or counts.

semantica/context/context_graph.py[394-430]
semantica/context/context_graph.py[722-758]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Edges with missing endpoints are skipped silently in `add_edges()`. When ingesting data (e.g., via `load_from_file()` or Explorer import), this produces partial graphs without any diagnostic output.

### Issue Context
Skipping invalid edges is reasonable, but callers need visibility into how many/which edges were rejected.

### Fix Focus Areas
- semantica/context/context_graph.py[394-430]
- semantica/context/context_graph.py[722-758]

### Implementation notes
- Track `skipped` count inside `add_edges()`.
- After processing the batch, emit a single `logger.warning` when `skipped > 0` (optionally include a small sample of rejected edges, redacting large metadata).
- Consider returning only `count` (to keep API stable) but logging the skipped count; or expose skipped count via an optional callback/diagnostics hook.

β“˜ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

β“˜ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment on lines +789 to +791
# Sets are unordered, sort IDs for deterministic pagination
raw_ids = sorted(self.node_type_index.get(node_type, set()))
source = (self.nodes[nid] for nid in raw_ids if nid in self.nodes)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Sorted ids typeerror 🐞 Bug ≑ Correctness

find_nodes()/find_active_nodes() call sorted() on node_type_index IDs; if malformed node_ids (e.g.,
None/int) exist alongside strings, sorting raises TypeError and crashes node-type pagination.
Malformed node_ids can enter via add_nodes() and are indexed verbatim in _add_internal_node().
Agent Prompt
### Issue description
`find_nodes()`/`find_active_nodes()` sort IDs from `node_type_index` without validating ID types. If malformed node IDs (e.g., `None`, `int`) are present, `sorted()` raises `TypeError` and breaks pagination.

### Issue Context
Malformed IDs can enter through `add_nodes()` (no validation) and are indexed verbatim in `_add_internal_node()`.

### Fix Focus Areas
- semantica/context/context_graph.py[783-821]
- semantica/context/context_graph.py[349-392]
- semantica/context/context_graph.py[1058-1067]

### Implementation notes
- When building `raw_ids`, filter to non-empty strings only (e.g., `isinstance(nid, str) and nid`) before sorting.
- Preferably also prevent bad IDs at ingestion time: in `add_nodes()`, skip/reject nodes with missing/invalid `id` (and optionally log/count them).

β“˜ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +985 to +988
for e in source if e.source_id and e.target_id
)
stop = skip + limit if limit is not None else None
return list(itertools.islice(gen, skip, stop))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Totals mismatch pagination 🐞 Bug ≑ Correctness

find_nodes()/find_edges() now filter out invalid items (missing node_id / missing endpoints), but
stats() still counts all stored nodes/edges and index entries. Explorer pagination uses stats()
totals, so the API can report totals larger than the number of items find_* can ever return
(empty/incorrect pages).
Agent Prompt
### Issue description
`stats()` totals can disagree with what `find_nodes()`/`find_edges()` return because `find_*` filters invalid entities but `stats()` counts everything. Explorer pagination uses `stats()` for totals, producing incorrect totals/empty pages.

### Issue Context
- `find_nodes()` filters `if n.node_id`
- `find_edges()` filters `if e.source_id and e.target_id`
- `stats()` counts `len(self.nodes)` / `len(self.edges)` and raw index sizes
- ExplorerSession uses `stats()` totals with `find_*` paging.

### Fix Focus Areas
- semantica/context/context_graph.py[783-807]
- semantica/context/context_graph.py[970-999]
- semantica/explorer/session.py[150-194]

### Implementation notes
Pick one consistent approach:
1) **Preferred:** prevent invalid nodes/edges from being stored/indexed (validate in `add_nodes()` similar to `add_edges()`), and optionally clean existing invalid entries on load/import.
2) Or compute `stats()` counts using the same validity criteria as `find_*` (and ensure `node_types`/`edge_types` reflect only valid IDs/edges).

β“˜ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant