feat: add NeuG as parallel graph storage engine with Cypher query support by BingqingLyu · Pull Request #1056 · safishamsi/graphify

BingqingLyu · 2026-05-28T08:07:38Z

Summary

Add NeuG as an optional parallel graph storage engine alongside NetworkX
When installed, NeuG automatically generates a graph.db during extraction, enabling Cypher queries via CLI (graphify cypher) and MCP server (cypher_query tool)
Native incremental update via Cypher MERGE — O(delta) vs NetworkX's O(full graph) rebuild
Fix pre-existing id_remap bug in incremental extraction that caused unstable file node IDs

Motivation

Graphify currently uses NetworkX + graph.json as its core graph storage. This architecture has bottlenecks:

Limited query capability: No declarative graph query language — only Python API traversal
Inefficient incremental updates: Every update requires loading full graph.json → merge → rebuild → re-serialize (O(full graph) even for single-file changes)
Performance ceiling at scale: Entire graph must be loaded into memory; NetworkX's pure-Python execution becomes a bottleneck on large graphs
Limited graph algorithm extensibility: Adding custom graph algorithms requires Python-level implementation with no native acceleration path

Why NeuG?

NeuG is a lightweight embedded graph database (C++ core, Python bindings):

Native Cypher support — Declarative graph query language; AI agents can query the knowledge graph directly without custom Python code
Native incremental updates — Cypher MERGE enables O(delta) upserts in-place, no full-graph reload needed; for 10K+ node graphs, single-file updates are near-instantaneous
Battle-tested performance — LDBC benchmark world record holder; lightweight & embeddable (no standalone server, pip install neug is all it takes)
Extensible graph algorithms — Native C++ extension framework for custom graph algorithms; Louvain community detection already available, with more algorithms (Leiden, PageRank, etc.) in development — can replace the current Python-based algorithm layer with significant performance gains

Architecture

Dual-engine coexistence, each independently consuming extraction data:

extraction dict ──┬──> NetworkX (build.py)  → graph.json  (existing)
                  └──> NeuG (storage.py)    → graph.db    (new)
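
The coexistence above depends on NeuG being optional: the NetworkX path always runs, and the NeuG path runs only when the package is importable. A minimal sketch of that soft-import pattern follows. It is not the PR's code: `load_optional` and `plan_outputs` are hypothetical helpers, and only the module name `neug` and the two output file names are taken from the description.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


def plan_outputs(backend: Optional[ModuleType]) -> list[str]:
    """List the artifacts an extract run would write (hypothetical helper)."""
    outputs = ["graph.json"]        # NetworkX path always runs
    if backend is not None:
        outputs.append("graph.db")  # NeuG path runs only when available
    return outputs


if __name__ == "__main__":
    print(plan_outputs(load_optional("neug")))
```

Checking availability before importing keeps the missing-dependency case a silent skip rather than an exception, which matches the last item of the test plan.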

Changes

| File | Description |
| --- | --- |
| `graphify/storage.py` | New: NeuG adapter layer (init, schema, ingest via MERGE, query, close) |
| `graphify/__main__.py` | NeuG ingest during extract + `graphify cypher` CLI command |
| `graphify/serve.py` | `cypher_query` MCP tool for AI agents |
| `graphify/extract.py` | Fix id_remap bug in incremental extraction |
| `pyproject.toml` | Add `neug>=0.1.2` optional dependency |
| `tests/` | Unit tests (13 cases) + e2e integration script |

Usage

# Install
pip install "graphify[neug]"

# Extract (automatically generates graph.db)
graphify extract /path/to/project

# Cypher query
graphify cypher "MATCH (n:code) RETURN n.label, n.source_file LIMIT 10"
graphify cypher "MATCH (a:code)-[e:edge_code_code_calls]->(b:code) RETURN a.label, b.label LIMIT 10"

# MCP server (AI agents query via cypher_query tool)
python -m graphify.serve graphify-out/graph.json

Bugfix: incremental extraction id_remap

The id_remap step uses an auto-inferred root (resolves to path.parent for single-file extraction), inconsistent with the project root used during full extraction. This causes file node IDs to be unstable, producing duplicate nodes on each incremental update.

Fix: use cache_root (the project target directory) for relative_to() in the id_remap step.
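
To illustrate why the choice of root matters, here is a small sketch of the mismatch. The ID scheme in `file_node_id` is an assumption made for illustration (root-relative path with separators and dots turned into underscores), and the `/work/project` path is invented; the resulting IDs do line up with the "build_py" vs "graphify_build_py" example given in the commit message for this fix.

```python
from pathlib import PurePosixPath


def file_node_id(path: PurePosixPath, root: PurePosixPath) -> str:
    """Hypothetical ID scheme: root-relative path, '/' and '.' become '_'."""
    rel = path.relative_to(root)
    return rel.as_posix().replace("/", "_").replace(".", "_")


project_root = PurePosixPath("/work/project")  # invented path
changed_file = project_root / "graphify" / "build.py"

# Full extraction: IDs are computed relative to the project root.
full_id = file_node_id(changed_file, project_root)

# Incremental extraction before the fix: root inferred as the file's parent.
buggy_id = file_node_id(changed_file, changed_file.parent)

# After the fix: the same project root (cache_root) is used in both modes.
fixed_id = file_node_id(changed_file, project_root)

print(full_id, buggy_id, fixed_id)
```

Because `buggy_id` differs from `full_id`, an upsert keyed on the ID cannot match the existing file node and creates a second one, which is the duplicate the bugfix removes.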

Note: `deduplicate_entities()` incorrectly merges AST nodes

During testing, we found that deduplicate_entities() merges functions from different files that share similar names (e.g., hooks.py:install() and __main__.py:install()). These functions have distinct IDs and different source_files — only their labels are similar. For pure AST extraction, node IDs are inherently unique, making fuzzy dedup harmful.

The NeuG engine writes raw extraction data directly (skipping dedup), preserving full precision. We suggest discussing dedup strategy optimization separately (e.g., applying fuzzy dedup only to LLM-extracted concept nodes).
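
A rough sketch of the suggested direction, for discussion only. The node dicts, the `origin` field, and both function names are hypothetical, and exact matching on a normalised label stands in for whatever fuzzy matching `deduplicate_entities()` actually performs.

```python
nodes = [
    {"id": "hooks_install", "label": "install()", "source_file": "hooks.py", "origin": "ast"},
    {"id": "main_install", "label": "install()", "source_file": "__main__.py", "origin": "ast"},
    {"id": "c1", "label": "Graph Storage", "source_file": "README.md", "origin": "llm"},
    {"id": "c2", "label": "graph storage", "source_file": "docs.md", "origin": "llm"},
]


def dedup_by_label(items):
    """Keep the first node per normalised label (stand-in for fuzzy matching)."""
    seen = {}
    for node in items:
        seen.setdefault(node["label"].strip().lower(), node)
    return list(seen.values())


def dedup_concepts_only(items):
    """Leave AST nodes untouched; deduplicate only LLM-extracted nodes."""
    ast_nodes = [n for n in items if n["origin"] == "ast"]
    llm_nodes = [n for n in items if n["origin"] != "ast"]
    return ast_nodes + dedup_by_label(llm_nodes)


print(len(dedup_by_label(nodes)), len(dedup_concepts_only(nodes)))
```

Label-only dedup collapses the two `install()` functions into one node even though their IDs and source files differ, while the restricted variant keeps both and still merges the two concept nodes.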

Test Plan

pytest tests/test_storage.py tests/test_cypher_cli.py -v — 13 tests passed
MCP server cypher_query tool end-to-end verified
Full extract → cypher count matches expected
Incremental extract (add one function) → MERGE upsert correct
Uninstall neug → graphify extract . runs normally (silent skip)

🤖 Generated with Claude Code

…port

Integrate NeuG as an optional storage backend alongside NetworkX. During extract, graph data is written to both graph.json (NetworkX) and graph.db (NeuG) when neug is installed. Adds `graphify cypher` CLI command and `cypher_query` MCP tool for direct Cypher queries.

- New graphify/storage.py: NeuG adapter (init, ingest, query, close)
- __main__.py: NeuG write in extract flow + `cypher` CLI command
- serve.py: NeuG connection init + `cypher_query` MCP tool
- pyproject.toml: neug optional dependency
- Tests: unit tests, CLI tests, E2E integration script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Edge tables are now named edge_{src}_{tgt}_{relation} instead of edge_{src}_{tgt}. This keeps each table well under NeuG 0.1.0's 4096-row-per-table limit (max single table ~2475 rows for calls). Removes the EXTRACTED/INFERRED routing distinction -- all edges are uniformly routed by their relation field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
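
The naming rule can be sketched as below. The function name and the identifier check are assumptions rather than the PR's code; the check is included because table names generally cannot be passed as query parameters, so validating each part is a common precaution when the relation string comes from extraction output.

```python
import re

# Plain ASCII identifier; fullmatch() avoids accepting a trailing newline.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def edge_table_name(src_type: str, tgt_type: str, relation: str) -> str:
    """Build edge_{src}_{tgt}_{relation}, rejecting non-identifier parts."""
    for part in (src_type, tgt_type, relation):
        if not _IDENT.fullmatch(part):
            raise ValueError(f"unsafe identifier in edge table name: {part!r}")
    return f"edge_{src_type}_{tgt_type}_{relation}"


print(edge_table_name("code", "code", "calls"))
```

For a `calls` edge between two `code` nodes this yields the `edge_code_code_calls` table used in the Usage example.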

…table file node IDs

When extracting a single file incrementally, the auto-inferred root (paths[0].parent) differs from the project root used during full extraction, causing file node IDs to mismatch (e.g. "build_py" vs "graphify_build_py"). This created duplicate file nodes on each incremental update (+2 instead of +1).

Fix: use cache_root (the project target directory passed by __main__.py) for relative_to() in the id_remap step, ensuring file node IDs are consistent regardless of whether extraction is full or incremental.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ation

Separates database opening (init_db) from schema DDL (ensure_schema) so read-only consumers (CLI cypher, MCP server) can open an existing graph.db without re-running CREATE TABLE statements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Replace MATCH+check+SET workaround with standard Cypher MERGE ON CREATE/ON MATCH syntax (now supported in NeuG 0.1.2)
- ensure_schema(create_tables=False) skips DDL on incremental runs, avoiding "table already exists" noise from the C++ layer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

safishamsi · 2026-05-28T13:46:33Z

Great idea and architecturally sound — soft-import pattern is correct, the neug optional extra is right, and the bundled id_remap bugfix is a nice bonus. Six things to fix before merge:

1. Module-level hard import: import neug at the top of storage.py means import graphify.storage raises ImportError if neug is missing. Move it inside init_db() or the first function that uses it.
2. Cypher injection: relation, label, source_file, and source_location come from extraction dicts (including LLM output) and are interpolated directly into Cypher strings. Use parameterised queries (conn.execute(query, params)) if NeuG supports them, or at minimum document the trust boundary explicitly.
3. Process-global _created_rel_tables: This module-level set means a second database opened in the same process won't re-issue CREATE statements. Move it into a per-connection registry.
4. ingest_communities is O(nodes × 6 tables): On a 10k-node graph that's 60k Cypher round-trips. Pass the node_types dict (already populated in ingest_extraction) through so you can look up each node's label directly instead of probing all 6 tables.
5. No upper version pin: the constraint is neug>=0.1.2, and 0.x packages can break wire formats between minor versions. Pin <0.2 or equivalent.
6. Bash e2e test: tests/test_neug_e2e.sh won't be picked up by CI (pytest). Convert to a pytest test or remove it; the existing tests/test_storage.py is the right place.

Fix those and this is ready to land.
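
For the Cypher injection item, a minimal sketch of what a parameterised upsert might look like. `build_node_upsert` is a hypothetical helper, the property names follow the fields mentioned in the review, and the query uses standard Cypher MERGE with ON CREATE/ON MATCH. It only builds the query and the parameter dict; how NeuG's `execute` accepts parameters is not verified here.

```python
from typing import Any


def build_node_upsert(table: str, node: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Return (query, parameters); node values never enter the query text."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    query = (
        f"MERGE (n:{table} {{id: $id}}) "
        "ON CREATE SET n.label = $label, n.source_file = $source_file "
        "ON MATCH SET n.label = $label, n.source_file = $source_file"
    )
    params = {
        "id": node["id"],
        "label": node["label"],
        "source_file": node["source_file"],
    }
    return query, params


hostile = {"id": "n1", "label": "x'}) DETACH DELETE n //", "source_file": "a.py"}
query, params = build_node_upsert("code", hostile)
print(query)
```

The hostile label ends up only in `params`, so the query text is identical for every node of a given table. The table name still has to be interpolated, which is why it is checked separately.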

…, per-conn registry

- Replace all _cesc() string interpolation with NeuG native $param syntax to prevent Cypher injection (community SET uses int literal due to NeuG limitation on parameterised SET)
- Move `import neug` into init_db() for lazy loading
- Make rel table registry per-connection via ensure_schema() return value
- ingest_extraction() returns node_types dict for O(n) community writes
- Require neug>=0.1.2,<0.2 for MERGE support
- Remove tests/test_neug_e2e.sh (manual script, not automated)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
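
The per-connection registry change can be illustrated with a stand-in connection. Everything here is a sketch under assumptions: `RecordingConn` only records statements, `ensure_rel_table` is a simplified version of the idea rather than the PR's `_ensure_rel_table()`, and the DDL string is a placeholder, not NeuG syntax.

```python
class RecordingConn:
    """Stand-in for a database connection; it only records statements."""

    def __init__(self) -> None:
        self.statements: list[str] = []

    def execute(self, statement: str) -> None:
        self.statements.append(statement)


def ensure_rel_table(conn: RecordingConn, created: set[str], table: str) -> None:
    """Issue the DDL once per registry, i.e. once per connection."""
    if table in created:
        return
    conn.execute(f"<DDL for {table}>")  # placeholder, not real NeuG DDL
    created.add(table)


# Each connection carries its own registry, so a second database opened in
# the same process still receives its CREATE statement.
conn_a, conn_b = RecordingConn(), RecordingConn()
registry_a: set[str] = set()
registry_b: set[str] = set()

ensure_rel_table(conn_a, registry_a, "edge_code_code_calls")
ensure_rel_table(conn_a, registry_a, "edge_code_code_calls")  # no-op
ensure_rel_table(conn_b, registry_b, "edge_code_code_calls")

print(len(conn_a.statements), len(conn_b.statements))
```

With a single module-level set shared by both connections, the third call would have been skipped and the second database would be missing the table.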

…adopt faster-whisper version guard) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BingqingLyu · 2026-05-29T03:07:09Z

Thanks for the detailed review! All 6 items are addressed in the latest push:

1. Module-level hard import: Fixed. import neug is now inside init_db() only. The rest of storage.py imports nothing from neug at module level.
2. Cypher injection / parameterised queries: Fixed. Replaced all string interpolation (_cesc()) with NeuG's native parameterised queries ($param syntax with parameters={} dict). _cesc() has been removed entirely.
3. Process-global _created_rel_tables: Fixed. ensure_schema() now returns a set[str] (per-connection registry), which is threaded through to ingest_extraction() and _ensure_rel_table(). The module-level global has been removed.
4. ingest_communities O(nodes × 6): Fixed. ingest_extraction() now returns a node_types: dict[str, str] (node ID → file_type), which is passed to ingest_communities() for direct table lookup. Falls back to probing all 6 tables only when node_types is not provided.
5. No upper version pin: Fixed. Changed to neug>=0.1.2,<0.2 in both the neug extra and the all extra.
6. Bash e2e test: Removed. tests/test_neug_e2e.sh has been deleted. Coverage is handled by tests/test_storage.py and tests/test_cypher_cli.py.
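
To make the round-trip saving concrete, here is a small planning sketch that counts the queries each strategy would issue. `plan_community_writes` and the table list are hypothetical: only the `code` table name and the count of six tables come from this thread, and nothing is executed against NeuG.

```python
from typing import Optional

# Only "code" is a table name taken from the PR; the others are placeholders.
NODE_TABLES = ["code", "other_1", "other_2", "other_3", "other_4", "other_5"]


def plan_community_writes(
    communities: dict[str, int],
    node_types: Optional[dict[str, str]] = None,
) -> list[tuple[str, str, int]]:
    """Return one (table, node_id, community) entry per query to be issued."""
    writes = []
    for node_id, community in communities.items():
        if node_types is not None and node_id in node_types:
            tables = [node_types[node_id]]  # direct lookup: one query
        else:
            tables = NODE_TABLES            # fallback: probe every table
        writes.extend((table, node_id, community) for table in tables)
    return writes


communities = {f"n{i}": i % 3 for i in range(10_000)}
node_types = {node_id: "code" for node_id in communities}

print(len(plan_community_writes(communities)))
print(len(plan_community_writes(communities, node_types)))
```

For the 10k-node case raised in the review, the fallback plans 60,000 queries and the direct lookup plans 10,000.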

Looking ahead, we're happy to keep contributing on the NeuG integration. A few directions we have in mind:

Community detection on NeuG: NeuG has an extension architecture that allows plugging in custom graph algorithms. We're planning to develop community detection algorithms (e.g., Louvain/Leiden) as NeuG extensions, which would enable running community detection directly on NeuG instead of the current NetworkX/graspologic path.
Promoting NeuG to default storage: If further testing shows that NeuG has clear advantages in query flexibility, incremental updates, and large-scale corpora, we'd be glad to take on the work of making NeuG the primary storage engine (replacing NetworkX + graph.json).

We're very excited about this collaboration and would love to keep working on it together.

…storage.py module to ARCHITECTURE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n__, extract

- pyproject.toml: keep neug extra, adopt dm extra and tree-sitter-dm in all
- README: adopt uv tool install format, keep neug row, add dm row
- __main__.py: keep both upstream --no-label/label help and our cypher help
- extract.py: adopt upstream remap logic (already handles cache_root via root)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BingqingLyu · 2026-06-02T03:03:19Z

Hi @safishamsi, all 6 review items are addressed and I've merged the latest v8 to resolve conflicts. Let me know if there's anything else you'd like us to adjust; happy to iterate. Otherwise this should be ready to merge whenever you get a chance to take another look.

BingqingLyu and others added 6 commits May 28, 2026 10:24

chore: require neug>=0.1.2 for MERGE support

f18fa2c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BingqingLyu and others added 2 commits May 29, 2026 10:49

merge upstream/v8: resolve pyproject.toml conflict (keep neug extra, …

38c8a24

…adopt faster-whisper version guard) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BingqingLyu and others added 2 commits May 29, 2026 11:57

docs: add NeuG optional extra and cypher CLI examples to README; add …

d5c0334

…storage.py module to ARCHITECTURE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BingqingLyu wants to merge 10 commits into safishamsi:v8 from BingqingLyu:neug-integration
