Linking SMIR JSON data from different crates



Linking several SMIR JSON files together
----------------------------------------

SMIR JSON is emitted _per crate_. 
For a multi-crate program, this means we get several `*.smir.json` files. 
Each of these files has its own independent set of internalised `Ty` IDs
to refer to functions and types (same for `adt_def` IDs).

In order to call a function in another crate, the `functions` table contains
entries which do not have a corresponding item in the `items`. Instead, the
called code resides in the other crate and its item is another SMIR JSON file.

Within a single crate, all functions and types are referred to by their 
crate-local interned ID `Ty`. These references occur in various SMIR statements
and terminators. Also, the type metadata contains `Ty`s for structured types such
as tuples, arrays, and structs.

We can link together a number of SMIR JSON files (each one for a named crate) by
applying a positive per-crate _offset_ to all `Ty` IDs in the SMIR JSON.

The chosen offset value must be large enough to avoid any collisions between `Ty`s from different crates.
A suitable value may be obtained by computing the max of all `Ty` IDs for each crate.
Traversing items is not necessary for this because all `Ty`s are present in the `functions` or `types` tables.

**TODO** is this true for the types, or are we skipping some?
All `Ty`s should be present in `types` (except the function ones, of course), and if an item contains a `Ty` not present in `types` or `functions`, we aren't using it!

This approach will lead to duplicates in the `types`, several (offset-equipped) 
`Ty`s will point to the same actual type (primitive or structured). This is harmless 
in most cases, but the `Ty`s cannot be used for type checking at runtime.

Algorithm to link several SMIR JSON crate files together:

* Determine the maximum `Ty` used in each SMIR JSON (`functions`, `types`);
* Assign a suitable offset to each crate (in alphabetical order of crate names?);
* traverse all items of each crate, adding the `offset` of the current crate to each `Ty` contained within;
* traverse all `types` in each crate, adding the `offset` of the current crate to all `Ty` keys and each `Ty` contained within;
* add the `offset` of the crate to each `Ty` key of the `functions` in each crate
* concatenate `types`, `functions`, `items`

In the concatenated SMIR JSON, there should be _no_ `TerminatorKindCall` with a
`Ty` that is absent from the `functions` table. Again, we may have duplicate references to
(multiple offset-equipped `Ty`s referring to) the same function (when the calls come from
different crates) but all call targets are known in the `functions` table.
The fully-qualified name is then looked up in the `items` and should be present because
all `items` have been concatenated.

**TODO** Can there be multiple _items_ which are in fact the same?
(Same qualified name, items coming from different crates)
AFAICT this is possible for monomorphic instances of polymorphic functions (stdlib, specifically).
However, the items would in fact be _identical_ , it would not matter which one to use.

MIR's visitors cannot be used for this, since they are not able to  mutate the data they traverse.

**TODO** we have to _read / deserialise_ SMIR JSON, which is a problem for types that come directly from `rustc`.
* Workaround: write the `impl`s by hand (the code that would be derived)
* Future work: submit a PR to `rustc` to add `#[derive(Deserialize)]` where required
This will _hopefully_ only be necessary for `stable_mir` types (not `rustc_middle` ones).

Eliminating dead code from SMIR JSON
------------------------------------

Given a particular _start symbol_, we want to eliminate all items from `items` that 
aren't executed (transitively) by running the item referred to by the `start_symbol`.
The `functions` and `types` tables can remain unmodified, they only cause minor
overhead in lookups in K during execution.


To compute the required items:
* Traverse each item's terminators to determine which functions are called.
  - Resolve the `Ty` IDs of called functions using the `functions` table.
  - These are the edges of a graph of items, keyed by `symbol_name`.
* Compute the transitive closure of the `start_symbol`'s item in this graph.
  - A simple graph walk starting at `start_symbol` can be used.
  - This is the set of required `items` for execution of `start_symbol`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Linking SMIR JSON data from different crates #86

Linking several SMIR JSON files together

Eliminating dead code from SMIR JSON

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Linking SMIR JSON data from different crates #86

Description

Linking several SMIR JSON files together

Eliminating dead code from SMIR JSON

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions