feat: add NeuG as parallel graph storage engine with Cypher query support #1056
BingqingLyu wants to merge 9 commits into
…port
Integrate NeuG as an optional storage backend alongside NetworkX. During extract, graph data is written to both graph.json (NetworkX) and graph.db (NeuG) when neug is installed. Adds `graphify cypher` CLI command and `cypher_query` MCP tool for direct Cypher queries.
- New graphify/storage.py: NeuG adapter (init, ingest, query, close)
- __main__.py: NeuG write in extract flow + `cypher` CLI command
- serve.py: NeuG connection init + `cypher_query` MCP tool
- pyproject.toml: neug optional dependency
- Tests: unit tests, CLI tests, E2E integration script
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Edge tables are now named edge_{src}_{tgt}_{relation} instead of
edge_{src}_{tgt}. This keeps each table well under NeuG 0.1.0's
4096-row-per-table limit (max single table ~2475 rows for calls).
Removes EXTRACTED/INFERRED routing distinction -- all edges are
uniformly routed by their relation field.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
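The routing described in this commit can be sketched roughly as follows. This is an illustration only: it assumes `{src}` and `{tgt}` are node type names, and the edge dictionary fields (`source`, `target`, `relation`) and the helper names are hypothetical rather than taken from graphify/storage.py.

```python
from collections import defaultdict


def edge_table_name(src_type: str, tgt_type: str, relation: str) -> str:
    """Build a table name of the form edge_{src}_{tgt}_{relation}."""
    return f"edge_{src_type}_{tgt_type}_{relation}"


def route_edges(edges, node_types):
    """Group edges by destination table (hypothetical edge/node layout)."""
    tables = defaultdict(list)
    for edge in edges:
        name = edge_table_name(
            node_types[edge["source"]],
            node_types[edge["target"]],
            edge["relation"],
        )
        tables[name].append(edge)
    return dict(tables)


# Made-up sample data: two functions and one file.
node_types = {"a": "function", "b": "function", "m": "file"}
edges = [
    {"source": "a", "target": "b", "relation": "calls"},
    {"source": "m", "target": "a", "relation": "contains"},
    {"source": "b", "target": "a", "relation": "calls"},
]
tables = route_edges(edges, node_types)
```

Splitting by relation means a busy relation such as calls gets its own tables instead of sharing one table with every other edge between the same two node types.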
…table file node IDs
When extracting a single file incrementally, the auto-inferred root (paths[0].parent) differs from the project root used during full extraction, causing file node IDs to mismatch (e.g. "build_py" vs "graphify_build_py"). This created duplicate file nodes on each incremental update (+2 instead of +1).
Fix: use cache_root (the project target directory passed by __main__.py) for relative_to() in the id_remap step, ensuring file node IDs are consistent regardless of whether extraction is full or incremental.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
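To illustrate why the choice of root matters, here is a minimal sketch. The ID scheme (relative path with non-alphanumeric runs replaced by underscores) is an assumption inferred from the example IDs in the commit message, and `file_node_id` is a hypothetical helper, not the function in graphify/extract.py.

```python
import re
from pathlib import PurePosixPath


def file_node_id(path: PurePosixPath, root: PurePosixPath) -> str:
    """Derive a node ID from the path relative to root (assumed scheme)."""
    rel = path.relative_to(root)
    return re.sub(r"[^0-9A-Za-z]+", "_", rel.as_posix()).strip("_").lower()


project_root = PurePosixPath("/proj")  # stands in for cache_root
source = PurePosixPath("/proj/graphify/build.py")

# Full extraction: IDs are relative to the project root.
full_id = file_node_id(source, project_root)

# Incremental extraction before the fix: root inferred as the file's parent.
inferred_id = file_node_id(source, source.parent)

# Incremental extraction after the fix: always relative to cache_root.
fixed_id = file_node_id(source, project_root)
```

With the inferred root, the same file gets a different ID than it received during full extraction, so the update adds a second file node instead of matching the existing one.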
…ation
Separates database opening (init_db) from schema DDL (ensure_schema) so read-only consumers (CLI cypher, MCP server) can open an existing graph.db without re-running CREATE TABLE statements.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace MATCH+check+SET workaround with standard Cypher MERGE ON CREATE/ON MATCH syntax (now supported in NeuG 0.1.2)
- ensure_schema(create_tables=False) skips DDL on incremental runs, avoiding "table already exists" noise from the C++ layer
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
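The split between opening a database and creating its schema can be sketched as below. This is not NeuG code: the standard-library `sqlite3` module is used purely as a stand-in so the example runs anywhere, and the table names and function bodies are hypothetical. Only the function names and the `create_tables` flag come from the commits above.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema; the real tables are defined in graphify/storage.py.
SCHEMA = {
    "node_file": "CREATE TABLE node_file (id TEXT PRIMARY KEY, label TEXT)",
    "node_function": "CREATE TABLE node_function (id TEXT PRIMARY KEY, label TEXT)",
}


def init_db(path):
    """Open the database only; never issues DDL."""
    return sqlite3.connect(path)


def ensure_schema(conn, create_tables=True):
    """Run DDL on first use; otherwise just return the known table names."""
    if create_tables:
        for ddl in SCHEMA.values():
            conn.execute(ddl)
        conn.commit()
    return set(SCHEMA)


db_path = os.path.join(tempfile.mkdtemp(), "standin.db")

# First (full) run: create the tables.
writer = init_db(db_path)
ensure_schema(writer, create_tables=True)
writer.close()

# Later run or read-only consumer: open the existing file and skip DDL.
reader = init_db(db_path)
known = ensure_schema(reader, create_tables=False)
found = {
    row[0]
    for row in reader.execute("SELECT name FROM sqlite_master WHERE type='table'")
}
reader.close()
```

Returning the table names from `ensure_schema` keeps that registry tied to the connection that requested it rather than to module-level state.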
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Great idea and architecturally sound — soft-import pattern is correct, the
Fix those and this is ready to land.
…, per-conn registry
- Replace all _cesc() string interpolation with NeuG native $param syntax to prevent Cypher injection (community SET uses int literal due to NeuG limitation on parameterised SET)
- Move `import neug` into init_db() for lazy loading
- Make rel table registry per-connection via ensure_schema() return value
- ingest_extraction() returns node_types dict for O(n) community writes
- Require neug>=0.1.2,<0.2 for MERGE support
- Remove tests/test_neug_e2e.sh (manual script, not automated)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
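The difference between interpolating values into query text and passing them as `$param` placeholders can be sketched as follows. The helpers only build a query string and a parameter dictionary; how those are handed to NeuG is not shown, and the function names and the `id`, `label` and `community` property names are assumptions for illustration.

```python
def find_node_query(node_id):
    """Return (query, params); the value never enters the query text."""
    return "MATCH (n {id: $id}) RETURN n.label", {"id": node_id}


def community_set_query(node_id, community):
    """SET with an integer literal, validated before it is formatted in."""
    if isinstance(community, bool) or not isinstance(community, int):
        raise TypeError("community must be an int")
    query = f"MATCH (n {{id: $id}}) SET n.community = {community}"
    return query, {"id": node_id}


# A hostile-looking ID stays inside params and cannot alter the query.
hostile = "x'}) DETACH DELETE n //"
query, params = find_node_query(hostile)

community_query, community_params = community_set_query("graphify_build_py", 3)
```

Where a literal has to be formatted into the text, as with the community value, restricting it to a checked integer leaves nothing for an attacker to inject.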
…adopt faster-whisper version guard) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the detailed review! All 6 items are addressed in the latest push:
Looking ahead, we're happy to keep contributing on the NeuG integration. A few directions we have in mind:
We're very excited about this collaboration and would love to keep working on it together.
…storage.py module to ARCHITECTURE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- … `graph.db` during extraction, enabling Cypher queries via CLI (`graphify cypher`) and MCP server (`cypher_query` tool)
- Fix `id_remap` bug in incremental extraction that caused unstable file node IDs

Motivation
Graphify currently uses NetworkX + graph.json as its core graph storage. This architecture has bottlenecks:
Why NeuG?
NeuG is a lightweight embedded graph database (C++ core, Python bindings):
- … `pip install neug` is all it takes)

Architecture
Dual-engine coexistence, each independently consuming extraction data:
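A minimal sketch of this coexistence, assuming the NeuG side is supplied as a callable: `write_graph`, `neug_available` and the `neug_writer` parameter are hypothetical names, and no NeuG API is called here. The JSON side is always written, and the NeuG side is skipped silently when the package is absent.

```python
import importlib.util
import json
from pathlib import Path


def neug_available() -> bool:
    """Soft check: True only if the optional neug package can be imported."""
    return importlib.util.find_spec("neug") is not None


def write_graph(extraction: dict, out_dir: Path, neug_writer=None) -> list[str]:
    """Write graph.json always; write graph.db only when NeuG is usable."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []

    # Engine 1: the existing JSON output, produced unconditionally.
    (out_dir / "graph.json").write_text(json.dumps(extraction))
    written.append("graph.json")

    # Engine 2: optional; consumes the same extraction data independently.
    if neug_writer is not None and neug_available():
        neug_writer(extraction, out_dir / "graph.db")
        written.append("graph.db")

    return written
```

Because each engine reads the extraction data directly, a failure or absence on the NeuG side leaves the `graph.json` path unchanged.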
Changes
- `graphify/storage.py`
- `graphify/__main__.py`: `graphify cypher` CLI command
- `graphify/serve.py`: `cypher_query` MCP tool for AI agents
- `graphify/extract.py`
- `pyproject.toml`: `neug>=0.1.2` optional dependency
- `tests/`

Usage
Bugfix: incremental extraction id_remap
The `id_remap` step uses an auto-inferred `root` (resolves to `path.parent` for single-file extraction), inconsistent with the project root used during full extraction. This causes file node IDs to be unstable, producing duplicate nodes on each incremental update.

Fix: use `cache_root` (the project target directory) for `relative_to()` in the id_remap step.

Note: `deduplicate_entities()` incorrectly merges AST nodes

During testing, we found that `deduplicate_entities()` merges functions from different files that share similar names (e.g., `hooks.py:install()` and `__main__.py:install()`). These functions have distinct IDs and different source_files; only their labels are similar. For pure AST extraction, node IDs are inherently unique, making fuzzy dedup harmful.

The NeuG engine writes raw extraction data directly (skipping dedup), preserving full precision. We suggest discussing dedup strategy optimization separately (e.g., applying fuzzy dedup only to LLM-extracted concept nodes).
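One possible shape for the suggested strategy is sketched below. It is not the existing `deduplicate_entities()` implementation: the `origin` field, the similarity threshold, and the use of `difflib` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def dedup_concepts_only(nodes, threshold=0.9):
    """Fuzzy-merge only LLM-extracted nodes; AST nodes pass through untouched."""
    kept = []
    for node in nodes:
        if node.get("origin") != "llm":
            kept.append(node)  # AST node: its ID is already unique
            continue
        label = node["label"].lower()
        is_duplicate = any(
            other.get("origin") == "llm"
            and SequenceMatcher(None, other["label"].lower(), label).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(node)
    return kept


# Made-up nodes: two same-named AST functions and two near-identical concepts.
nodes = [
    {"id": "hooks_install", "label": "install()", "origin": "ast"},
    {"id": "main_install", "label": "install()", "origin": "ast"},
    {"id": "c1", "label": "Graph Storage", "origin": "llm"},
    {"id": "c2", "label": "graph storage", "origin": "llm"},
]
result = dedup_concepts_only(nodes)
```

In this sketch both `install()` functions survive because they never enter the fuzzy comparison, while the two concept nodes collapse into the first one seen.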
Test Plan
- `pytest tests/test_storage.py tests/test_cypher_cli.py -v`: 13 tests passed
- `cypher_query` tool end-to-end verified
- `graphify extract .` runs normally (silent skip)

🤖 Generated with Claude Code