feat: add NeuG as parallel graph storage engine with Cypher query support #1056

Open: BingqingLyu wants to merge 9 commits into safishamsi:v8 from BingqingLyu:neug-integration

@BingqingLyu

Summary

  • Add NeuG as an optional parallel graph storage engine alongside NetworkX
  • When installed, NeuG automatically generates a graph.db during extraction, enabling Cypher queries via CLI (graphify cypher) and MCP server (cypher_query tool)
  • Native incremental update via Cypher MERGE — O(delta) vs NetworkX's O(full graph) rebuild
  • Fix pre-existing id_remap bug in incremental extraction that caused unstable file node IDs

Motivation

Graphify currently uses NetworkX + graph.json as its core graph storage. This architecture has bottlenecks:

  • Limited query capability: No declarative graph query language — only Python API traversal
  • Inefficient incremental updates: Every update requires loading full graph.json → merge → rebuild → re-serialize (O(full graph) even for single-file changes)
  • Performance ceiling at scale: Entire graph must be loaded into memory; NetworkX's pure-Python execution becomes a bottleneck on large graphs
  • Limited graph algorithm extensibility: Adding custom graph algorithms requires Python-level implementation with no native acceleration path

Why NeuG?

NeuG is a lightweight embedded graph database (C++ core, Python bindings):

  1. Native Cypher support — Declarative graph query language; AI agents can query the knowledge graph directly without custom Python code
  2. Native incremental updates — Cypher MERGE enables O(delta) upserts in-place, no full-graph reload needed; for 10K+ node graphs, single-file updates are near-instantaneous
  3. Battle-tested performance — LDBC benchmark world record holder; lightweight & embeddable (no standalone server, pip install neug is all it takes)
  4. Extensible graph algorithms — Native C++ extension framework for custom graph algorithms; Louvain community detection already available, with more algorithms (Leiden, PageRank, etc.) in development — can replace the current Python-based algorithm layer with significant performance gains

Architecture

Dual-engine coexistence, each independently consuming extraction data:

extraction dict ──┬──> NetworkX (build.py)  → graph.json  (existing)
                  └──> NeuG (storage.py)    → graph.db    (new)
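
The fan-out in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `fan_out`, `write_json_stub`, `write_db_stub`, and the `neug_installed` flag are hypothetical names, and the stubs only stand in for the real writers in build.py and storage.py.

```python
from typing import Callable

Extraction = dict  # assumed shape for this sketch: {"nodes": [...], "edges": [...]}


def write_json_stub(extraction: Extraction) -> str:
    # Stand-in for the existing NetworkX path (build.py -> graph.json).
    return f"graph.json: {len(extraction['nodes'])} nodes"


def write_db_stub(extraction: Extraction) -> str:
    # Stand-in for the new NeuG path (storage.py -> graph.db).
    return f"graph.db: {len(extraction['nodes'])} nodes"


def fan_out(extraction: Extraction, neug_installed: bool) -> list[str]:
    # Each engine consumes the same extraction dict independently.
    writers: list[Callable[[Extraction], str]] = [write_json_stub]
    if neug_installed:
        # The second engine is only added when the optional dependency exists.
        writers.append(write_db_stub)
    return [write(extraction) for write in writers]


extraction = {"nodes": [{"id": "a"}, {"id": "b"}], "edges": []}
results_with = fan_out(extraction, neug_installed=True)
results_without = fan_out(extraction, neug_installed=False)
```

Because neither writer reads the other's output, removing the optional engine leaves the existing graph.json path unchanged.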

Changes

File                  Description
graphify/storage.py   New: NeuG adapter layer (init, schema, ingest via MERGE, query, close)
graphify/__main__.py  NeuG ingest during extract + graphify cypher CLI command
graphify/serve.py     cypher_query MCP tool for AI agents
graphify/extract.py   Fix id_remap bug in incremental extraction
pyproject.toml        Add neug>=0.1.2 optional dependency
tests/                Unit tests (13 cases) + e2e integration script

Usage

# Install
pip install graphify[neug]

# Extract (automatically generates graph.db)
graphify extract /path/to/project

# Cypher query
graphify cypher "MATCH (n:code) RETURN n.label, n.source_file LIMIT 10"
graphify cypher "MATCH (a:code)-[e:edge_code_code_calls]->(b:code) RETURN a.label, b.label LIMIT 10"

# MCP server (AI agents query via cypher_query tool)
python -m graphify.serve graphify-out/graph.json

Bugfix: incremental extraction id_remap

The id_remap step uses an auto-inferred root (resolves to path.parent for single-file extraction), inconsistent with the project root used during full extraction. This causes file node IDs to be unstable, producing duplicate nodes on each incremental update.

Fix: use cache_root (the project target directory) for relative_to() in the id_remap step.
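
A small sketch of why the root matters, assuming file node IDs are derived from the path relative to some root with non-alphanumeric characters replaced by underscores. The helper `file_node_id` and the `/repo` layout are hypothetical; the actual derivation in extract.py may differ.

```python
import re
from pathlib import PurePosixPath


def file_node_id(path: PurePosixPath, root: PurePosixPath) -> str:
    # Hypothetical ID scheme: relative path with separators and dots
    # collapsed to underscores.
    rel = path.relative_to(root)
    return re.sub(r"[^0-9A-Za-z]+", "_", rel.as_posix())


project_root = PurePosixPath("/repo")               # root used by full extraction
changed = PurePosixPath("/repo/graphify/build.py")  # single file re-extracted

full_id = file_node_id(changed, project_root)     # ID written by the full run
buggy_id = file_node_id(changed, changed.parent)  # auto-inferred root (path.parent)
fixed_id = file_node_id(changed, project_root)    # cache_root passed explicitly
```

Here `buggy_id` differs from `full_id`, so an upsert keyed on the ID would create a second file node, while `fixed_id` matches the ID from the full run.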

Note: deduplicate_entities() incorrectly merges AST nodes

During testing, we found that deduplicate_entities() merges functions from different files that share similar names (e.g., hooks.py:install() and __main__.py:install()). These functions have distinct IDs and different source_files — only their labels are similar. For pure AST extraction, node IDs are inherently unique, making fuzzy dedup harmful.

The NeuG engine writes raw extraction data directly (skipping dedup), preserving full precision. We suggest discussing dedup strategy optimization separately (e.g., applying fuzzy dedup only to LLM-extracted concept nodes).
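
To make the failure mode concrete, here is a sketch of label-only fuzzy dedup next to a variant restricted to LLM-extracted nodes. It does not reproduce deduplicate_entities(); the `origin` field, the threshold, and the use of `difflib` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_dedup(nodes, threshold=0.9, only_origin=None):
    """Drop nodes whose label closely matches an already-kept node.

    If only_origin is set, only nodes with that origin are eligible for
    merging; all other nodes are always kept.
    """
    kept = []
    for node in nodes:
        eligible = only_origin is None or node["origin"] == only_origin
        duplicate = eligible and any(
            (only_origin is None or other["origin"] == only_origin)
            and SequenceMatcher(
                None, node["label"].lower(), other["label"].lower()
            ).ratio() >= threshold
            for other in kept
        )
        if not duplicate:
            kept.append(node)
    return kept


nodes = [
    {"id": "hooks_install", "label": "install()", "source_file": "hooks.py", "origin": "ast"},
    {"id": "main_install", "label": "install()", "source_file": "__main__.py", "origin": "ast"},
    {"id": "concept_1", "label": "Graph storage", "source_file": "README.md", "origin": "llm"},
    {"id": "concept_2", "label": "Graph Storage", "source_file": "docs.md", "origin": "llm"},
]

label_only = [n["id"] for n in fuzzy_dedup(nodes)]
llm_only = [n["id"] for n in fuzzy_dedup(nodes, only_origin="llm")]
```

With label-only matching the two `install()` functions collapse into one node even though their IDs and source files differ; restricting the pass to concept nodes keeps both.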

Test Plan

  • pytest tests/test_storage.py tests/test_cypher_cli.py -v — 13 tests passed
  • MCP server cypher_query tool end-to-end verified
  • Full extract → cypher count matches expected
  • Incremental extract (add one function) → MERGE upsert correct
  • Uninstall neug → graphify extract . runs normally (silent skip)

🤖 Generated with Claude Code

BingqingLyu and others added 6 commits May 28, 2026 10:24
…port

Integrate NeuG as an optional storage backend alongside NetworkX.
During extract, graph data is written to both graph.json (NetworkX)
and graph.db (NeuG) when neug is installed. Adds `graphify cypher`
CLI command and `cypher_query` MCP tool for direct Cypher queries.

- New graphify/storage.py: NeuG adapter (init, ingest, query, close)
- __main__.py: NeuG write in extract flow + `cypher` CLI command
- serve.py: NeuG connection init + `cypher_query` MCP tool
- pyproject.toml: neug optional dependency
- Tests: unit tests, CLI tests, E2E integration script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Edge tables are now named edge_{src}_{tgt}_{relation} instead of
edge_{src}_{tgt}. This keeps each table well under NeuG 0.1.0's
4096-row-per-table limit (max single table ~2475 rows for calls).

Removes EXTRACTED/INFERRED routing distinction -- all edges are
uniformly routed by their relation field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
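
A sketch of the naming scheme from the commit above, with a helper that counts rows per table so the limit can be checked before ingest. The edge fields (`source`, `target`, `relation`) and the `node_types` mapping are assumed shapes, not confirmed ones.

```python
from collections import Counter


def edge_table_name(src_type: str, tgt_type: str, relation: str) -> str:
    # One table per (source type, target type, relation) triple.
    return f"edge_{src_type}_{tgt_type}_{relation}"


def rows_per_table(edges, node_types) -> Counter:
    # Count how many edges would land in each relationship table.
    counts = Counter()
    for edge in edges:
        table = edge_table_name(
            node_types[edge["source"]],
            node_types[edge["target"]],
            edge["relation"],
        )
        counts[table] += 1
    return counts


node_types = {"a": "code", "b": "code", "c": "code"}
edges = [
    {"source": "a", "target": "b", "relation": "calls"},
    {"source": "b", "target": "c", "relation": "calls"},
    {"source": "a", "target": "c", "relation": "imports"},
]
counts = rows_per_table(edges, node_types)
```

Splitting by relation means a heavily used relation such as calls no longer shares its row budget with every other edge between the same two node types.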
…table file node IDs

When extracting a single file incrementally, the auto-inferred root
(paths[0].parent) differs from the project root used during full extraction,
causing file node IDs to mismatch (e.g. "build_py" vs "graphify_build_py").
This created duplicate file nodes on each incremental update (+2 instead of +1).

Fix: use cache_root (the project target directory passed by __main__.py)
for relative_to() in the id_remap step, ensuring file node IDs are consistent
regardless of whether extraction is full or incremental.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ation

Separates database opening (init_db) from schema DDL (ensure_schema) so
read-only consumers (CLI cypher, MCP server) can open an existing graph.db
without re-running CREATE TABLE statements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace MATCH+check+SET workaround with standard Cypher MERGE ON
  CREATE/ON MATCH syntax (now supported in NeuG 0.1.2)
- ensure_schema(create_tables=False) skips DDL on incremental runs,
  avoiding "table already exists" noise from the C++ layer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@safishamsi
Owner

Great idea and architecturally sound — soft-import pattern is correct, the neug optional extra is right, and the bundled id_remap bugfix is a nice bonus. Six things to fix before merge:

  1. Module-level hard import: import neug at the top of storage.py means import graphify.storage raises ImportError if neug is missing. Move it inside init_db() or the first function that uses it.

  2. Cypher injection: relation, label, source_file, and source_location come from extraction dicts (including LLM output) and are interpolated directly into Cypher strings. Use parameterised queries (conn.execute(query, params)) if NeuG supports them, or at minimum document the trust boundary explicitly.

  3. Process-global _created_rel_tables: This module-level set means a second database opened in the same process won't re-issue CREATE statements. Move it into a per-connection registry.

  4. ingest_communities is O(nodes × 6 tables): On a 10k-node graph that's 60k Cypher round-trips. Pass the node_types dict (already populated in ingest_extraction) through so you can look up each node's label directly instead of probing all 6 tables.

  5. No upper version pin: neug>=0.1.2 — 0.x packages can break wire formats between minor versions. Pin <0.2 or equivalent.

  6. Bash e2e test: tests/test_neug_e2e.sh won't be picked up by CI (pytest). Convert to a pytest test or remove — the existing tests/test_storage.py is the right place.

Fix those and this is ready to land.
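
For item 1, the deferred import could look something like the sketch below. `load_optional` and `init_db_sketch` are hypothetical names, and the real init_db() would go on to open the database; here it only returns the module and path.

```python
import importlib


def load_optional(module_name: str):
    # Import at call time so that importing this file never fails
    # just because an optional dependency is missing.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is not installed; install the optional extra "
            "to enable this feature"
        ) from exc


def init_db_sketch(db_path: str, module_name: str = "neug"):
    # The optional backend is only touched when a database is opened.
    backend = load_optional(module_name)
    return backend, db_path


# A stdlib module stands in for an installed backend; a made-up name
# stands in for a missing one.
present = load_optional("json")
try:
    init_db_sketch("graph.db", module_name="no_such_backend_for_this_sketch")
    missing_raised = False
except RuntimeError:
    missing_raised = True
```

Callers that never open a graph.db can import the module freely, and callers that do get one clear error instead of an ImportError at import time.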

BingqingLyu and others added 2 commits May 29, 2026 10:49
…, per-conn registry

- Replace all _cesc() string interpolation with NeuG native $param syntax
  to prevent Cypher injection (community SET uses int literal due to NeuG
  limitation on parameterised SET)
- Move `import neug` into init_db() for lazy loading
- Make rel table registry per-connection via ensure_schema() return value
- ingest_extraction() returns node_types dict for O(n) community writes
- Require neug>=0.1.2,<0.2 for MERGE support
- Remove tests/test_neug_e2e.sh (manual script, not automated)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
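
The difference between interpolation and parameters can be sketched with a stub connection that only records calls. `RecordingConnection` is a stand-in, and the `execute(query, parameters=...)` signature is assumed from the "$param syntax with parameters={} dict" description in this thread rather than checked against NeuG's documentation.

```python
class RecordingConnection:
    """Stand-in connection that records what would be sent to the database."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, dict(parameters or {})))


def unsafe_find(conn, label):
    # Interpolation: the value becomes part of the query text itself.
    conn.execute(f"MATCH (n:code) WHERE n.label = '{label}' RETURN n.id")


SAFE_QUERY = "MATCH (n:code) WHERE n.label = $label RETURN n.id"


def safe_find(conn, label):
    # Parameterised: the query text is constant, the value travels separately.
    conn.execute(SAFE_QUERY, parameters={"label": label})


hostile = "x' OR '1'='1"  # e.g. a label produced by LLM extraction
conn = RecordingConnection()
unsafe_find(conn, hostile)
safe_find(conn, hostile)
```

In the first recorded call the hostile label has changed the structure of the WHERE clause; in the second the query text is unchanged and the label is plain data.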
…adopt faster-whisper version guard)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@BingqingLyu
Author

BingqingLyu commented May 29, 2026

Thanks for the detailed review! All 6 items are addressed in the latest push:

  1. Module-level hard import: Fixed. import neug is now inside init_db() only. The rest of storage.py imports nothing from neug at module level.
  2. Cypher injection / parameterised queries: Fixed. Replaced all string interpolation (_cesc()) with NeuG's native parameterised queries ($param syntax with parameters={} dict). _cesc() has been removed entirely.
  3. Process-global _created_rel_tables: Fixed. ensure_schema() now returns a set[str] (per-connection registry), which is threaded through to ingest_extraction() and _ensure_rel_table(). The module-level global has been removed.
  4. ingest_communities O(nodes × 6): Fixed. ingest_extraction() now returns a node_types: dict[str, str] (node ID → file_type), which is passed to ingest_communities() for direct table lookup. Falls back to probing all 6 tables only when node_types is not provided.
  5. No upper version pin: Fixed. Changed to neug>=0.1.2,<0.2 in both the neug extra and the all extra.
  6. Bash e2e test: Removed. tests/test_neug_e2e.sh has been deleted. Coverage is handled by tests/test_storage.py and tests/test_cypher_cli.py.
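
The effect of item 4 can be estimated with a sketch that counts how many tables would be queried per node. The table names are placeholders (the thread only says there are six), and `tables_to_query` and `round_trips` are hypothetical helpers, not functions from storage.py.

```python
# Placeholder names; the actual six node tables are not listed in this thread.
NODE_TABLES = [f"node_table_{i}" for i in range(6)]


def tables_to_query(node_id, node_types=None):
    # With a node_types mapping, go straight to the node's table;
    # otherwise fall back to probing every table.
    if node_types is not None and node_id in node_types:
        return [node_types[node_id]]
    return list(NODE_TABLES)


def round_trips(communities, node_types=None):
    # One query per (node, table) pair that has to be tried.
    return sum(len(tables_to_query(node_id, node_types)) for node_id in communities)


communities = {f"n{i}": i % 3 for i in range(10_000)}  # node ID -> community
node_types = {node_id: "node_table_0" for node_id in communities}

probing = round_trips(communities)
direct = round_trips(communities, node_types)
```

For the 10k-node case from the review, probing costs 60,000 queries and the direct lookup costs 10,000, one per node.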

Looking ahead, we're happy to keep contributing on the NeuG integration. A few directions we have in mind:

  • Community detection on NeuG: NeuG has an extension architecture that allows plugging in custom graph algorithms. We're planning to develop community detection algorithms (e.g., Louvain/Leiden) as NeuG extensions, which would enable running community detection directly on NeuG instead of the current NetworkX/graspologic path.
  • Promoting NeuG to default storage: If through further testing NeuG proves its advantages in query flexibility, incremental updates, and large-scale corpora, we'd be glad to take on the work of making NeuG the primary storage engine (replacing NetworkX + graph.json).

We're very excited about this collaboration and would love to keep working on it together.

…storage.py module to ARCHITECTURE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>