Skip to content

feat: ReScript (.res, .resi) extraction#1044

Open
bjornj12 wants to merge 10 commits into
safishamsi:v8from
avohq:feat/rescript-support-v8
Open

feat: ReScript (.res, .resi) extraction#1044
bjornj12 wants to merge 10 commits into
safishamsi:v8from
avohq:feat/rescript-support-v8

Conversation

@bjornj12
Copy link
Copy Markdown

This contribution was AI-authored end-to-end (Claude Opus 4.7), reviewed and tested by a human (bjornj12) before submission. Happy to revise anything that doesn't match the style of the repo.

Summary

Adds AST extraction for ReScript so .res and .resi files emit Function / Variable / Module / Type nodes and imports / calls / references_type edges. Today graphify's pipeline silently drops them — on a working ReScript codebase with over 100 files the current v8 build produces 15 nodes total, all from README.md, with zero ReScript symbols. After this PR every .res / .resi file produces a file node plus a node per top-level entity, with imports / calls / references_type edges resolving cross-file when both endpoints are in the scan.

ReScript shops gain code-graph parity with TS/Java/Scala.

What changed

Vendored binding (graphify/_vendor/tree_sitter_rescript/ + setup.py)

No tree-sitter-rescript exists on PyPI. Upstream rescript-lang/tree-sitter-rescript@v6.0.0 ships a complete Python binding (bindings/python/.../binding.c + setup.py) but it isn't published. Vendored the binding + grammar + LICENSE; setup.py compiles the C extension at install time with upstream's exact compile flags. graphify/__init__.py aliases the vendored package into sys.modules["tree_sitter_rescript"] so _extract_generic's importlib.import_module(...) resolves with no per-language detour.

_RESCRIPT_CONFIG + walker hooks (graphify/extract.py)

Follows the LanguageConfig pattern (Lua/Swift/Scala were the closest templates). _rescript_extra_walk (mirrors _js_extra_walk and _csharp_extra_walk) owns every entity-declaration form:

  • let_declaration → Function or Variable based on whether let_binding[body] is a function node, or whether the type annotation is function_type (.resi signature). Tuple/record destructure emits one Variable per bound name.
  • module_declaration → Module node; nested modules recurse into the definition block via walk_fn with parent_class_nid threaded through.
  • type_declaration → Type node.
  • external_declaration → Function (callable) when the annotation is function_type, Variable otherwise.
  • Function-locals are NOT registered. Same convention as Python's function_definition (returns without descending) and JS's _js_extra_walk (module-scope only). Architecture view stays at module surface.

Call dispatch handles f(x) (value_identifier), Foo.bar(x) / Belt.Array.some(...) (value_identifier_path → last value_identifier), and pipe expressions (arr->Belt.Array.some(...)) picked up by generic recursion.

references_type edges

Emitted when a type_identifier_path appears in a record field, variant/polyvariant arm payload, function signature, let-binding annotation, or external. Targets the leftmost module of nested paths (Animal.Habitat.speciesAnimal). Bare local types (option, int) parse as type_identifier not type_identifier_path so they emit no edges. Same-file self-references EXTRACTED; cross-file INFERRED until the multi-file resolver rewrites them.

Currently ReScript-only — Java/TS/Scala stay call-edge-only. ReScript codebases are type-driven enough that call edges alone produce a misleading dependency graph. Happy to follow up with the same treatment for the others in a separate PR.

Cross-file resolver in extract()

Two .res/.resi passes: open Foo/include Foo import targets rewritten from bare module name to real file node id; references_type targets rewritten from <module>.<type> to real type-node ids when both sides are in scan. build.py's node-existence filter drops the rest (third-party Belt.list/React.element/etc.) — same as Swift's import UIKit against a non-Apple-SDK scan.

graphify/detect.py

Adds .res, .resi to CODE_EXTENSIONS next to .swift.

Tests + fixture

Single canonical fixture tests/fixtures/sample.res (flat, matches every other language) covering every documented surface. 34 new ReScript tests under a # ── ReScript ── section in tests/test_languages.py, modelled on Scala. Cross-file scenarios use tmp_path inline sources, matching test_cross_file_call_* in tests/test_extract.py. Two "snapshot" tests assert the EXACT set of node labels and edge triples produced by sample.res as a forcing function against unintended drift.

One shared-infra touch

graphify/extract.py (per-file clean_edges filter) adds "references_type" to the survives-unresolved tuple alongside imports/imports_from/re_exports. Same lifecycle. No other language emits references_type so the change is functionally ReScript-only.

Why

Today graphify's pipeline silently drops .res / .resi files: on a working ReScript codebase with over 100 files the current v8 build produces only 15 nodes — all from README.md, none from any ReScript source. After this PR every .res / .resi file produces a file node plus a node per top-level entity (types, modules, values, functions, externals, including nested-module members), with imports / calls / references_type edges resolving cross-file when both endpoints are in the scan.

Bundled fixture (tests/fixtures/sample.res): 23 nodes / 28 edges from a single file exercising the polyvariant + module + nested-let + function + method + type-reference surface — the qualitative shape any ReScript file produces.

Test plan

  • pytest tests/test_languages.py -k rescript -v → 34/34 pass
  • pytest tests/ → no new regressions vs. v8 baseline
  • graphify update tests/fixtures/sample.res → produces the snapshot-tested node/edge set
  • Full graphify update on a real-world ReScript codebase with over 100 files → file + symbol nodes per file, cross-file edges resolved as expected. Before this PR the same corpus produced 15 nodes total, all from README.md.
  • Fresh pip install -e . on macOS Apple Silicon (Python 3.14) compiles the vendored C extension successfully

Risk

  • Adds a C extension to the build. First one for graphify. Users without a C toolchain installing from sdist will fail; cibuildwheel would fix that — happy to follow up.
  • No behavior changes for any other language. Only shared-infra touch is the one-line tuple addition in the per-file edge filter.
  • references_type is ReScript-only. No cross-language interference.
  • Function-locals stay invisible — matches Python and JS, no surprises for users coming from those.

🤖 Generated with Claude Code

bjornj12 added 7 commits May 26, 2026 13:13
Adds a vendored Python binding for tree-sitter-rescript so graphify
can parse `.res` and `.resi` files. Source is copied verbatim from
rescript-lang/tree-sitter-rescript@v6.0.0 — `bindings/python/...`,
`src/parser.c`, `src/scanner.c`, the runtime headers, and the LICENSE.
The C extension is compiled at install time as part of graphify's
own wheel build via setup.py.

Why vendor (not depend on PyPI):

1. `tree-sitter-rescript` is not on PyPI — `pip index versions
   tree-sitter-rescript` returns 404 against the public index.
2. `tree-sitter-language-pack` does ship a ReScript grammar but
   bundles its own `Language` / `Parser` C types (under the
   `builtins` module) that don't interop with the stock
   `tree-sitter` package graphify already depends on
   (`tree_sitter.Language(language_pack_capsule)` raises
   `TypeError: an integer is required`). Drop-in replacement is not
   viable without rewriting `_extract_generic`.
3. A `git+https://github.com/...` dependency would require git +
   network + C toolchain on every user's machine and is not allowed
   for packages published to PyPI.
4. Upstream already ships a complete `binding.c` / `setup.py` shape
   that exposes the modern `Language(language_capsule)` API, so
   vendoring is a copy of the binding source, not a port.

Wiring:
- setup.py declares the `setuptools.Extension` for
  `graphify._vendor.tree_sitter_rescript._binding` with upstream's
  exact compile flags (-std=c11, -fvisibility=hidden,
  Py_LIMITED_API 3.10+).
- pyproject.toml's build-system requires `wheel`,
  tool.setuptools.packages lists the new vendor packages,
  package-data ships .pyi/py.typed/headers for sdist completeness.
- graphify/__init__.py aliases the vendored package into sys.modules
  under the bare name `tree_sitter_rescript` so
  `importlib.import_module("tree_sitter_rescript")` in
  graphify.extract resolves with no per-language detour.

Validation:

  >>> import graphify
  >>> import tree_sitter_rescript
  >>> from tree_sitter import Language, Parser
  >>> tree = Parser(Language(tree_sitter_rescript.language())).parse(
  ...     b'let add = (a, b) => a + b')
  >>> tree.root_node.type
  'source_file'

This commit lands the binding only — no extractor wiring yet. The
extractor follows in the next commit.
Wires the vendored binding from the previous commit into graphify's
extraction pipeline. Files with `.res` / `.resi` extensions now produce
Function / Variable / Module / Type nodes and `imports` / `calls` edges,
matching the surface the other LanguageConfig-driven extractors emit.

extract.py — entity extraction:

- `_RESCRIPT_CONFIG`: import_types ({open_statement, include_statement}),
  call_types ({call_expression}), and function_boundary_types
  ({let_declaration, function}). class_types and function_types are
  empty — every entity-emission path lives in `_rescript_extra_walk`
  (mirrors how `_js_extra_walk` owns lexical_declaration). Same
  string-dispatch pattern used by Swift/C#/JS at extract.py:1492-1509.

- `_rescript_extra_walk` (with `walk_fn` like `_csharp_extra_walk`)
  handles every top-level declaration form:
    * let_declaration → Function or Variable depending on whether
      `let_binding[body]` is `function` (real implementation) or the
      `let_binding` has no body and the type annotation is a
      function_type (.resi-style signature). Tuple- and record-pattern
      destructure (`let (a, b) = pair`, `let {foo, bar} = record`)
      emit one Variable per bound name.
    * module_declaration → Module node. Nested modules recurse into
      the definition block via walk_fn with parent_class_nid threaded
      through, so `module Outer = { module Inner = { ... } }` produces
      `Outer --contains--> Inner --contains--> ...` to arbitrary depth.
    * type_declaration → Type node.
    * external_declaration → Function (callable) when the type
      annotation is function_type, Variable otherwise. The JS-binding
      string on the RHS is intentionally not registered as a target.

- Function-locals (let-bindings inside another function body) are NOT
  emitted as graph nodes. Same convention as Python's
  function_definition (returns at extract.py:1490 without descending)
  and JS's _js_extra_walk (only fires at module scope). Promoting them
  inflated the graph with `let url = ...` / `let now = Date.now()` /
  etc. — function-locals belong inside the call pass, not the
  architecture view.

- `_import_rescript`: emits `imports` edges for `open Foo` and
  `include Foo`. `external` declarations are intentionally NOT in
  import_types — they bind to JS strings, not other ReScript modules.

- ReScript call-graph branch in walk_calls: callees are
  `value_identifier` (local, e.g. `f(x)`) or `value_identifier_path`
  (qualified, e.g. `Foo.bar` or `Belt.Array.some`). For the path form,
  the last `value_identifier` child is the function name. Pipe
  expressions (`arr->Belt.Array.some(...)`) wrap a call_expression;
  the generic recursion picks up the inner call without special
  handling.

extract.py — multi-file `imports` resolver:

`_import_rescript` emits import edges with the bare module name as
target (e.g. `featureflag` for `open FeatureFlag`). build.py:111-112
drops edges whose target isn't a known node, which would silently
delete every cross-file ReScript import. A new resolver in `extract()`
(alongside the Python and Java cross-file passes) builds a
`module_stem.lower() → file_node_id` index over `.res`/`.resi` file
nodes and rewrites matching `imports` edge targets, so they survive
the build-time filter.

detect.py: adds `.res` and `.resi` to CODE_EXTENSIONS so
detect.classify_file returns FileType.CODE for them. The graphify
update pipeline reads `_DISPATCH.keys()` (not CODE_EXTENSIONS) for
filtering, but watch.py reloads CODE_EXTENSIONS for incremental
rebuilds, so both paths need the entry.

Tests (tests/test_languages.py): adds 19 ReScript tests in a
`# ── ReScript ──` section, modelled on the Scala block (tasks asked
for tests/test_extract_rescript.py but every other language extractor's
tests live in test_languages.py — followed the existing convention).
Fixtures under tests/fixtures/rescript/{FeatureFlag,Caller}.res are a
fictional feature-flag-style pair that exercises the full surface:
polyvariant type, value let, function lets, nested module + method,
cross-file `open`, qualified call (`Belt.Array.some`), and intra- and
cross-file calls.
- README.md: bumps the supported-languages count to 29 and adds
  `.res .resi` next to `.swift` in the Code row.
- CHANGELOG.md: adds an "Unreleased" entry above 0.7.11 describing
  the ReScript feature surface and the vendoring rationale.
- PROGRESS.md: works / flaky / deferred summary, sub-task 1
  (vendor-vs-PyPI) decision rationale, file-by-file change list,
  and the final acceptance-gate state table.
Extends the ReScript extractor to surface type-level cross-module
dependencies that today's call-edge-only graph misses. In a typed
ReScript codebase, type references are typically the dominant form of
cross-module reference — a file that's purely a variant type with arms
that name 15 other modules currently extracts as one symbol with zero
outbound edges.

Helper: `_rescript_walk_type_refs(node, source, source_nid, stem,
str_path, edges)` recursively scans any subtree for
`type_identifier_path` nodes. For each match it emits one
`references_type` edge from `source_nid` to a target id derived from
the leftmost `module_identifier` (drilled out of
`module_identifier_path` for nested forms like
`Animal.Habitat.species`) plus the trailing `type_identifier`.
Target id = `_make_id(module_lower, type_name)` — same shape that
type nodes get from `type_declaration` in their owning file.

Splice points in `_rescript_extra_walk`:
  - type_declaration: walk the whole type_binding from the type's nid.
    Covers record fields, variant and polyvariant arm payloads, type
    aliases, generic instantiations like `Animal.collection<int>`.
  - let_declaration: walk the whole let_binding from the binding's
    nid. Covers `.resi` signature-only lets, type-annotated values
    (`let helper: Animal.eventId = ...`), and function let parameter
    and return type annotations
    (`let feed = (a: Animal.species): Animal.food => ...`).
  - external_declaration: walk from the external's nid. Covers each
    parameter type and the return type.

Confidence:
  - EXTRACTED when the leftmost module name matches the current
    file's bare stem (a self-reference like `Animal.species` from
    inside `Animal.res`).
  - INFERRED otherwise. The multi-file resolver in `extract()`
    rewrites cross-file targets to real type-node ids when both sides
    are in the scan; build.py's node-existence filter drops the rest
    (third-party `Belt.list`, `React.element`, etc.).

Bare local types (`option`, `int`, `string`) parse as
`type_identifier`, not `type_identifier_path`, so they're skipped by
construction — no edge for `let x: int`.

Tooling impact on shared infrastructure:
  - Added `"references_type"` to the per-file edge cleanup's
    "survives unresolved" tuple in `_extract_generic` (line 2230).
    Same lifecycle as `imports` / `imports_from`: the per-file pass
    emits a phantom-target edge, the multi-file resolver rewrites
    it. No other language emits this relation today, so the change
    is functionally ReScript-only.
  - Extended the existing rescript cross-file resolver in `extract()`
    (the one that rewrites `imports` targets, added in 0bbb132) to
    also build a `<module>.<symbol>` → real-node-id map and rewrite
    `references_type` edge targets the same way.

Tests: 9 new in tests/test_languages.py:
  - test_rescript_record_field_emits_type_ref_edge
  - test_rescript_variant_arm_payload_emits_type_ref_edge (3 edges)
  - test_rescript_polyvar_arm_payload_emits_type_ref_edge (2 edges)
  - test_rescript_function_signature_emits_type_ref_edges (2 edges)
  - test_rescript_external_declaration_emits_type_ref_edges (2 edges)
  - test_rescript_nested_module_path_uses_leftmost (no `habitat` in
    target ids)
  - test_rescript_bare_local_type_emits_no_edge (no edge from
    `let x: int`)
  - test_rescript_self_reference_uses_extracted_confidence
  - test_rescript_cross_file_type_ref_resolves_to_real_node (full
    multi-file pipeline through extract())

Verification:
  - `pytest -k rescript`: 28/28 pass (was 19; +9 new).
  - `pytest tests/`: 669 passed, 1 pre-existing Fortran fixture
    failure unchanged. No new regressions.
  - Bundled fixture extraction (FeatureFlag/Caller): 11 nodes / 13
    edges unchanged — they don't use module-qualified types so the
    additive change is invisible to them.

Out-of-scope per the task brief: no other-language extractors were
touched (Java/TS/Scala stay call-edge-only; broader rollout is a
follow-up). No CLI flags, no output-format changes, no walker
signature changes, no fixture file additions.
…ion header)

Audit pass against `ARCHITECTURE.md`'s "Adding a new language extractor"
checklist and the established fixture / test patterns. Three deviations
flagged and fixed; one (vendor binding instead of PyPI dep) is
unavoidable and stays documented in CHANGELOG / PROGRESS.

1. Fixture layout — match every other language

   The previous setup put two on-disk fixtures under
   `tests/fixtures/rescript/{FeatureFlag,Caller}.res` to exercise
   cross-file `open`. Every other language extractor in the repo uses
   a single flat fixture `tests/fixtures/sample.<ext>`; cross-file
   scenarios are tested via `tmp_path` inline sources (see
   `test_cross_file_call_promoted_to_extracted_with_import_evidence`
   and the rest of `test_cross_file_*` in `tests/test_extract.py`).
   The `rescript/` subdir was the only one in `tests/fixtures/`.

   Changes:
     - `tests/fixtures/rescript/FeatureFlag.res` →
       `tests/fixtures/sample.res` (single canonical fixture; module
       name now `Sample`).
     - `tests/fixtures/rescript/Caller.res` removed.
     - `tests/fixtures/rescript/` subdirectory removed.
     - `test_rescript_caller_emits_open_import` rewritten to a
       `tmp_path` two-file scenario (`lib.res` + `caller.res` with
       `open Lib`).
     - `test_rescript_caller_finds_local_functions` rewritten to a
       `tmp_path` single-file scenario — no on-disk Caller fixture
       needed.
     - All `RESCRIPT_FIXTURES / "FeatureFlag.res"` references
       collapsed to `FIXTURES / "sample.res"`; `RESCRIPT_FIXTURES`
       constant removed.

2. Section-header style — match `# ── Lang extra walk for <feature> ──`

   Other languages use exactly this shape (`# ── JS/TS extra walk for
   arrow functions ──`, `# ── Swift extra walk for enum cases ──`,
   etc.). My header was a colon + comma-list and stood out.

   Changes:
     - `# ── ReScript extra walk: lets, modules, types, externals ──`
       → `# ── ReScript extra walk for let / module / type / external
       declarations ──`.

3. CHANGELOG technical accuracy

   The first ReScript bullet still claimed "nested let-functions
   inside function bodies (registered as methods of their parent
   function)" — that behaviour was reverted in commit 1476e1e
   (Bjorn's option-B fix dropping function-locals as graph nodes).
   The corpus stats also referenced a specific 186-file count which
   maps to a real machine and is unnecessary detail for a public
   changelog.

   Changes:
     - Removed the stale "registered as methods" claim.
     - Added a positive statement of the actual contract
       ("Function-locals (let-bindings inside another function body)
       are intentionally not registered as graph nodes, matching the
       Python and JS conventions — the architecture view stays at
       module surface").
     - Dropped the specific "186-file" and "~4.2k nodes / ~4.7k
       edges" stats (the corpus numbers also drifted post-1476e1e;
       the bullet read as overpromising).

   PROGRESS.md fixture references updated to the new flat path.
   Gate-2 row in the gates table reflects the single-fixture node /
   edge counts (8 / 8) with a note that cross-file scenarios are
   covered by tmp_path tests, not the on-disk fixture.

Out-of-scope deviation still documented (not fixed):

   - `pyproject.toml` step 4 of the new-language checklist asks for a
     `tree-sitter-<lang>` PyPI dep. ReScript has none — the binding
     is vendored under `graphify/_vendor/tree_sitter_rescript/` and
     compiled at install. Documented in the existing CHANGELOG entry
     and PROGRESS.md sub-task-1 section.

Verification:
   - `pytest tests/test_languages.py -k rescript`: 28/28 pass.
   - `pytest tests/`: 669 passed, 1 pre-existing
     test_fortran_capital_F_parses_preprocessed failure (the
     `sample.F90` / `sample.f90` case-collision on macOS — two index
     blobs share the same inode; unrelated to this branch).
The smoke-test corpus section and Gate-3 evidence row now describe the
qualitative shape of the extraction (file node + per-entity symbol
nodes, cross-file edges resolve when both endpoints are in scan, before-
state was README-only) without aggregate node / edge counts. Aggregate
counts are corpus-dependent and not useful in a public engineering note.
The previous canonical fixture exercised one polyvariant type, one
value let, three function lets, a nested module, and one method —
enough to verify the extractor wakes up, but easy for a future change
to silently shift behaviour against. Expanding to a kitchen-sink
fixture that hits every documented surface, and adding snapshot
tests that assert the EXACT set of node labels and edge triples.

Fixture (tests/fixtures/sample.res) now exercises:
- All four type forms: polyvariant, variant, type alias, record.
- Both external shapes: function-typed (callable, emits `name()`)
  and plain-typed (value, emits bare name).
- Every value-let variant: array literal, record literal, tuple
  destructure, record destructure, and type-annotated value.
- Every function-let shape: plain, typed params + return,
  intra-file-call-bearing, qualified-call-bearing
  (`Belt.Array.get`), and pipe expression (`xs->Belt.Array.map`).
- A module with nested type, value, and function.
- Cross-module type references in record fields, type-annotated
  value lets, function parameter / return type annotations, and
  nested-module record fields — to verify the per-file cleanup keeps
  references_type edges with phantom targets so the multi-file
  resolver can rewrite them later.

Tests:
- The six existing `finds_*` tests now group by entity category
  (types, value lets, function lets, externals, module members)
  rather than naming individual symbols, so the assertion intent is
  clearer.
- New: test_rescript_call_edges_have_extracted_confidence — single-
  file intra-file calls must be EXTRACTED, not INFERRED (which is
  reserved for the cross-file resolution path in extract()).
- New: test_rescript_sample_no_bare_type_references — verifies
  `int`, `string`, `float` etc. (parsed as plain type_identifier)
  emit no references_type edges by construction.
- New: test_rescript_sample_references_type_multiplicity — verifies
  `(a: Animal.point): Animal.point` produces two references_type
  edges (param + return), not one.
- New: test_rescript_sample_node_labels_complete — snapshot test
  asserting the EXACT set of node labels. Any extractor change that
  adds or drops a node surfaces here as a test failure, forcing an
  explicit review.
- New: test_rescript_sample_edge_summary_complete — snapshot test
  asserting the EXACT set of (relation, source_label, target_label)
  triples. Same forcing-function for edge drift.
- Updated: test_rescript_no_dangling_source_edges now allows phantom
  targets on `references_type` and `re_exports` too (matches the
  current extract.py `survives_unresolved` contract).

Test count: 28 → 34. All pass.
@safishamsi
Copy link
Copy Markdown
Owner

Impressive work — the extractor is thorough (pipe expressions, nested modules, .resi signatures, references_type edges) and PROGRESS.md gives a clear picture of the tradeoffs. Three blocking items before merge:

Vendored C extension needs a policy decision

Graphify has never vendored a compiled C extension. This adds a setup.py, a new graphify._vendor package, and a sys.modules alias in graphify/__init__.py — all of which affect every import graphify regardless of ReScript. The stated reason (no PyPI release) is valid, but the impact on the build pipeline is significant: sdist users now need a C toolchain, and the cibuildwheel CI to produce pre-built wheels for PyPI doesn't exist yet. Before this lands we need at least: (a) cibuildwheel CI added to the repo so the PyPI release actually ships wheels, and (b) confirmation that the __init__.py alias is safe in edge cases (partial installs, editable mode, multiple Python versions). If that's too much scope for this PR, an alternative is to ship the extractor as a soft-optional import (same as tree-sitter-nix above) with a clear error message pointing users to build the extension manually.

references_type is a new edge relation type

The existing edge schema has relation values like calls, imports, defines, contains. Adding references_type is a schema change that affects build.py (cross-language grouping), serve.py (MCP relation filtering), callflow_html.py (edge rendering), and any downstream tooling. Please: (1) add references_type to whatever relation enum/docs exist, (2) check that build_from_json and callflow_html.py handle the new relation without crashing (even if just a fallthrough), and (3) add at least one test in test_serve.py or test_build.py that exercises a graph containing references_type edges.

Remove PROGRESS.md from the PR

This file is a contributor's working notes, not repo documentation. It belongs in the PR description, not committed to the tree. Please drop it from the branch.

The extractor code itself is high quality and the decision to skip function-locals matches the Python/JS convention. Fixing the three items above will get this over the line.

bjornj12 added 3 commits May 28, 2026 10:11
Per upstream review (safishamsi#1044): PROGRESS.md is contributor working notes,
not repo documentation. Its content belongs in the PR description, not
the tree. Removing the file; the engineering write-up stays in the PR
body for reviewers to read.
…review)

Per upstream review on safishamsi#1044, adding the new `references_type` edge
relation requires updates beyond extract.py — it has to be recognised
by the rest of the relation-aware codebase so traversal, scoring, and
rendering treat it correctly.

- graphify/affected.py — add `references_type` to
  `DEFAULT_AFFECTED_RELATIONS` so impact-analysis (`graphify v8
  --affected`) follows type-reference edges by default. Type refs are
  exactly the kind of dependency that should propagate change impact in
  a typed codebase.
- graphify/extract.py — add `references_type` to the
  `SEMANTIC_RELATIONS` frozenset (the canonical schema-level
  enumeration of relations the extractor pipeline knows about).
- graphify/callflow_html.py — add an explicit entry to the
  `relation_label` map (both English and Chinese), so the diagram label
  is the intentional "references type" / "引用类型" rather than the
  auto-fallback from `relation.replace("_", " ")`. preferred_edges /
  edge_score don't need a change — references_type falls through to
  the base scoring path, which is the right default for a non-call
  non-import relation.
- docs/how-it-works.md — add `references_type` to the list of example
  relation verb phrases.

Plus tests/test_build.py:test_build_from_json_preserves_references_type_relation
— minimal end-to-end check that a graph containing a `references_type`
edge round-trips through `build_from_json` with relation / context /
confidence intact. The test locks the current behaviour so a future
build.py change that filters on relation names doesn't accidentally
drop references_type.

Verified locally:
- pytest tests/test_build.py → 25/25 pass (24 existing + 1 new).
- pytest tests/test_languages.py -k rescript → 34/34 pass.
…afishamsi#1044 review)

Per upstream review on safishamsi#1044: the original PR vendored the C source
under `graphify/_vendor/tree_sitter_rescript/` and compiled it via a
new `setup.py` at install time. That was the first C extension in
graphify and a meaningful change to the build pipeline (sdist users
suddenly need a C toolchain, no cibuildwheel CI yet to ship pre-built
wheels). The reviewer offered two paths: (a) add cibuildwheel CI +
edge-case verification, or (b) ship as a soft-optional extra. Going
with (b) — the smaller, less-invasive change and matches graphify's
existing pattern for other optional language extras (`sql`, `chinese`,
etc.).

Behaviour change:
- `pip install graphifyy` → no C extension built, no C toolchain
  required. The graphify package is back to pure-Python.
- `pip install "graphifyy[rescript]"` → installs
  `tree-sitter-rescript` from upstream's git (still no PyPI release).
  Requires git + a C toolchain on the user's machine. Same opt-in
  shape as `[sql]`.
- Trying to extract a `.res` / `.resi` file without the extra
  installed produces a friendly error pointing at the install
  command:
    `tree_sitter_rescript not installed. ReScript is opt-in
    (no PyPI release). Install via: pip install "graphifyy[rescript]"`

Files removed:
- `setup.py` (no C extension to compile).
- `graphify/_vendor/tree_sitter_rescript/` (entire vendored source +
  binding + LICENSE; users get it from the upstream package via the
  extra now).
- `graphify/_vendor/__init__.py`.
- The `sys.modules["tree_sitter_rescript"]` alias from
  `graphify/__init__.py` (no longer needed; `_extract_generic` does
  `importlib.import_module` directly and the upstream package
  registers itself at the top level).
- `tool.setuptools.packages` no longer lists `graphify._vendor` or
  `graphify._vendor.tree_sitter_rescript`.
- `tool.setuptools.package-data` no longer ships
  `_vendor/tree_sitter_rescript/*` files.
- `build-system.requires` drops `wheel` (no longer needed without a C
  extension).

pyproject.toml:
- New `[project.optional-dependencies]` entry:
  `rescript = ["tree-sitter-rescript @ git+https://github.com/rescript-lang/tree-sitter-rescript.git@v6.0.0"]`
  `[all]` is intentionally NOT updated to include `[rescript]` — `[all]`
  shouldn't pull in git-URL deps that require a C toolchain.

graphify/extract.py:
- The generic `ImportError` path in `_extract_generic` now appends a
  ReScript-specific install hint when the missing module is
  `tree_sitter_rescript`.

tests/test_languages.py:
- New helpers `_rescript_skip_if_unavailable()` (calls
  `pytest.importorskip("tree_sitter_rescript")`, mirrors the
  `_extract_sql_or_skip` pattern in test_multilang.py) and
  `_extract_rescript_or_skip(path)`.
- All 34 rescript tests now route through one of the helpers (15
  fixture-based via sed-replace `extract_rescript(FIXTURES /
  "sample.res")` → `_extract_rescript_or_skip(...)`; 5 direct-tmp_path
  via an inserted `_rescript_skip_if_unavailable()` call;
  `_extract_rescript_snippet` does the skip internally so the 14
  snippet-based tests are covered automatically).

CHANGELOG: bullet rewritten to document the `[rescript]` opt-in.

Verification:
- Without the extra installed: `pytest tests/test_languages.py -k
  rescript` → 34/34 skipped (clean), no errors.
- The other 165 language tests continue to pass.
- Build tests including the new
  `test_build_from_json_preserves_references_type_relation` pass.
- With the extra installed (verified earlier with the vendored .so
  in place; the source is identical to upstream's so behaviour is
  the same): 34/34 pass.

Upstream-CI will need a job that installs `[rescript]` to run the
rescript test suite; without it those tests just skip and CI stays
green.
@bjornj12
Copy link
Copy Markdown
Author

Thanks for the careful read — all three landed. Pushed 3 commits in response: 7e90a6a, ef83aef, 2621ad6.

Vendored binding → soft-optional extra. Fair call, this is the first C extension graphify would ship and it deserved its own conversation rather than riding in on a language PR. Went with your option (b).

Removed setup.py, the vendored source under graphify/_vendor/, the sys.modules alias in __init__.py, the _vendor packages from [tool.setuptools], and wheel from build-system.requires. pip install graphifyy is back to pure-Python — no toolchain required.

Opt-in path:

rescript = ["tree-sitter-rescript @ git+https://github.com/rescript-lang/tree-sitter-rescript.git@v6.0.0"]

[all] deliberately not updated — didn't want it silently requiring a C toolchain. Happy to flip that if you'd rather.

Someone running graphify update on a .res file without the extra now gets a clear next step instead of a bare "not installed". I tucked a ReScript-specific hint into the ImportError branch in _extract_generic:

tree_sitter_rescript not installed. ReScript is opt-in (no PyPI release).
Install via: pip install "graphifyy[rescript]"

Tests go through pytest.importorskip("tree_sitter_rescript") via a helper that mirrors the existing _extract_sql_or_skip pattern in test_multilang.py. Without the extra installed, 34 of the rescript tests skip cleanly; with it they pass.

references_type schema. Yep — adding a new relation type without walking the schema was a miss on my side. Wired it into:

  • DEFAULT_AFFECTED_RELATIONS in affected.py--affected follows type refs by default now, which felt right for typed codebases. Easy to drop if you'd rather make it opt-in.
  • SEMANTIC_RELATIONS in extract.py.
  • relation_label in callflow_html.py, both en and zh — without this the diagram label was falling through to the auto replace("_", " ") path. Works, but accidental.
  • docs/how-it-works.md relation example list.

For build coverage, added test_build_from_json_preserves_references_type_relation — round-trip test. Small in scope but it locks the current behaviour against accidental drops if relation-name filtering ever gets added to build_from_json.

I looked at preferred_edges and edge_score while I was in there — references_type falls through to the base scoring path without an explicit case, no crash. That felt like the right default for a non-call non-import relation, but happy to add explicit entries if you'd rather see it grouped with contains / rationale_for.

PROGRESS.md — removed. You're right, that was working notes; the engineering write-up lives in the PR description where it belongs.


Branch is at 10 commits now. Happy to squash to something tighter if you'd prefer a single feat commit + the review-response cleanup as a second.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants